Hello Gang,
I am building an application. One thing it needs to do is scan a
directory of Parquet files and determine the maximum "sequence number"
(the seq column) across all of them. The files were written with the
parquet-avro writer. Below is the solution I came up with, but is it
correct? How would you do such a thing?
try (DirectoryStream<java.nio.file.Path> directoryStream =
        Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
    PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("seq");
    Statistics<?> stats = Statistics.getBuilderForReading(type).build();
    for (java.nio.file.Path path : directoryStream) {
        // Only the footer metadata is read here, not the row data.
        // try-with-resources so each reader is closed before the next file.
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(path.toUri()), new Configuration()))) {
            for (final BlockMetaData rowGroup : reader.getRowGroups()) {
                for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
                    if ("seq".equals(column.getPath().toDotString())) {
                        stats.mergeStatistics(column.getStatistics());
                    }
                }
            }
        }
    }
    // Pull the merged maximum out; guard against files written without stats.
    if (stats.hasNonNullValue()) {
        long maxSeq = (Long) stats.genericGetMax();
        System.out.println("max seq = " + maxSeq);
    }
}
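Parquet specifics aside, the core of this is a fold over per-file maxima, which you can sanity-check independently of the parquet-hadoop API. Here is a stdlib-only sketch of that pattern; the line-of-longs file format and the `fileMax`/`directoryMax` names are my own invention, standing in for the footer-statistics read and the `mergeStatistics` calls:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;
import java.util.stream.Stream;

public class MaxSeqScan {
    // Hypothetical stand-in for reading one file's footer statistics:
    // each file here just holds long values, one per line.
    static OptionalLong fileMax(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.mapToLong(Long::parseLong).max();
        }
    }

    // Fold per-file maxima into a global max, mirroring the
    // repeated mergeStatistics() calls in the Parquet version.
    static OptionalLong directoryMax(Path dir) throws IOException {
        long best = Long.MIN_VALUE;
        boolean seen = false;
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : ds) {
                OptionalLong m = fileMax(p);
                if (m.isPresent()) {
                    seen = true;
                    best = Math.max(best, m.getAsLong());
                }
            }
        }
        // Empty optional means no values were seen, analogous to
        // hasNonNullValue() being false on the merged Statistics.
        return seen ? OptionalLong.of(best) : OptionalLong.empty();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("seq-test");
        Files.write(dir.resolve("a.txt"), java.util.List.of("1", "17", "5"));
        Files.write(dir.resolve("b.txt"), java.util.List.of("9", "3"));
        System.out.println(directoryMax(dir).getAsLong()); // prints 17
    }
}
```

The empty-optional case matters for the real code too: a directory with no matching files, or files written without column statistics, should not silently report a bogus maximum.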
Thanks.