Hi David,

Your code looks good to me. Since you are using INT64, min/max truncation does not apply. I think it should work fine.
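Two small notes: remember to close each ParquetFileReader (it is Closeable, so try-with-resources works), and read the result with stats.genericGetMax() after the loop. The merge itself just keeps a running min/max across chunks. Here is a rough plain-Java sketch of that pattern; LongStats is a made-up stand-in for Parquet's Statistics class, used only so the example is self-contained and does not need Parquet on the classpath (the real API exposes the analogous mergeStatistics(...) and genericGetMax() methods):

```java
// Hypothetical stand-in for org.apache.parquet.column.statistics.Statistics,
// mirroring only the merge/max behavior used in your snippet.
final class LongStats {
    private boolean hasValue;
    private long min;
    private long max;

    // Record a single value (what the writer does per row).
    void update(long v) {
        if (!hasValue) {
            min = v;
            max = v;
            hasValue = true;
        } else {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
    }

    // Fold another chunk's stats into this one -- same idea as
    // stats.mergeStatistics(column.getStatistics()).
    void mergeStatistics(LongStats other) {
        if (!other.hasValue) {
            return;
        }
        if (!hasValue) {
            min = other.min;
            max = other.max;
            hasValue = true;
        } else {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
        }
    }

    long genericGetMax() {
        if (!hasValue) {
            throw new IllegalStateException("no values seen");
        }
        return max;
    }
}

public class SeqMax {
    public static void main(String[] args) {
        // Pretend these arrays are the per-column-chunk values from three files.
        LongStats merged = new LongStats();
        for (long[] chunk : new long[][] { { 3, 7 }, { 42, 11 }, { 5, 5 } }) {
            LongStats chunkStats = new LongStats();
            for (long v : chunk) {
                chunkStats.update(v);
            }
            merged.mergeStatistics(chunkStats);
        }
        System.out.println("max seq = " + merged.genericGetMax()); // prints 42
    }
}
```

In your real code the same final call applies: after the directory loop finishes, `stats.genericGetMax()` gives you the maximum sequence number across all files.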
Cheers,
Gabor

On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> wrote:

> Hello Gang,
>
> I am trying to build an application. One function it has is to scan a
> directory of Parquet files and then determine the maximum "sequence number"
> (id) across all files. This is the solution I came up with, but is this
> correct? How would you do such a thing?
>
> I wrote the files with parquet-avro writer.
>
> try (DirectoryStream<java.nio.file.Path> directoryStream =
>         Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
>
>     PrimitiveType type =
>         Types.required(PrimitiveTypeName.INT64).named("seq");
>     Statistics<?> stats = Statistics.getBuilderForReading(type).build();
>
>     for (java.nio.file.Path path : directoryStream) {
>         ParquetFileReader reader = ParquetFileReader.open(
>             HadoopInputFile.fromPath(new Path(path.toUri()), new Configuration()));
>
>         for (final BlockMetaData rowGroup : reader.getRowGroups()) {
>             for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
>                 if ("seq".equals(column.getPath().toDotString())) {
>                     stats.mergeStatistics(column.getStatistics());
>                 }
>             }
>         }
>     }
>
> Thanks.
