Hello Gang,

I am trying to build an application. One of its functions is to scan a
directory of Parquet files and determine the maximum "sequence number"
(id) across all files. This is the solution I came up with, but is it
correct? How would you approach this?

I wrote the files with the parquet-avro writer.

try (DirectoryStream<java.nio.file.Path> directoryStream =
    Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {

  PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("seq");
  Statistics<?> stats = Statistics.getBuilderForReading(type).build();

  for (java.nio.file.Path path : directoryStream) {
    // try-with-resources so each file's reader is closed before moving on
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(path.toUri()), new Configuration()))) {

      for (BlockMetaData rowGroup : reader.getRowGroups()) {
        for (ColumnChunkMetaData column : rowGroup.getColumns()) {
          if ("seq".equals(column.getPath().toDotString())) {
            stats.mergeStatistics(column.getStatistics());
          }
        }
      }
    }
  }

  // the merged maximum across all row groups of all files
  long maxSeq = (long) stats.genericGetMax();
}
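In case it matters: the `filter` above is a DirectoryStream.Filter I define
elsewhere. A minimal stdlib-only sketch of what I mean (the class and method
names here are just for illustration, not from my real code) would be a glob
matcher that keeps only *.parquet entries and skips stray files such as
_SUCCESS markers:

```java
import java.io.IOException;
import java.nio.file.*;

public class ParquetFilterSketch {
    // Hypothetical stand-in for the "filter" variable in the snippet above:
    // accept only regular files whose name matches *.parquet.
    static DirectoryStream.Filter<Path> parquetOnly() {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:*.parquet");
        return entry -> Files.isRegularFile(entry)
            && matcher.matches(entry.getFileName());
    }

    public static void main(String[] args) throws IOException {
        // Demo against a throwaway directory with one match and one non-match.
        Path dir = Files.createTempDirectory("parq-files");
        Files.createFile(dir.resolve("part-0001.parquet"));
        Files.createFile(dir.resolve("_SUCCESS"));

        int matches = 0;
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, parquetOnly())) {
            for (Path p : stream) {
                matches++;
                System.out.println(p.getFileName());
            }
        }
        System.out.println("matched=" + matches);
    }
}
```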

Thanks.
