Hi David,

Your code looks good to me. Since you are using INT64, min/max truncation does
not apply. I think it should work fine.

Cheers,
Gabor
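
One small addition: after the loops finish you still need to read the merged
value back, e.g. with stats.genericGetMax(). Conceptually the merge is just a
running max over the per-chunk maxima; here is a plain-Java sketch of that
reduction (no Parquet dependencies, the class and method names are made up for
illustration):

```java
public class SeqMax {

    // Simplified stand-in for Statistics.mergeStatistics on an INT64 column:
    // the merged max is the running max over each column chunk's recorded max.
    static long mergeMax(long[] chunkMaxima) {
        long max = Long.MIN_VALUE;
        for (long m : chunkMaxima) {
            max = Math.max(max, m);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical per-chunk maxima collected from three row groups.
        long[] maxima = {42L, 7L, 99L};
        System.out.println(mergeMax(maxima)); // prints 99
    }
}
```

Since INT64 statistics are stored exactly (no truncation, unlike long BINARY
values), this running max over chunk statistics gives the true maximum without
reading any data pages.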

On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> wrote:

> Hello Gang,
>
> I am trying to build an application.  One function it has is to scan a
> directory of Parquet files and then determine the maximum "sequence number"
> (id) across all files.  This is the solution I came up with, but is this
> correct?  How would you do such a thing?
>
> I wrote the files with parquet-avro writer.
>
> try (DirectoryStream<java.nio.file.Path> directoryStream =
>     Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
>
>   PrimitiveType type =
>       Types.required(PrimitiveTypeName.INT64).named("seq");
>   Statistics<?> stats = Statistics.getBuilderForReading(type).build();
>
>   for (java.nio.file.Path path : directoryStream) {
>     // try-with-resources so each reader is closed before the next file
>     try (ParquetFileReader reader = ParquetFileReader.open(
>         HadoopInputFile.fromPath(new Path(path.toUri()),
>             new Configuration()))) {
>
>       for (final BlockMetaData rowGroup : reader.getRowGroups()) {
>         for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
>           if ("seq".equals(column.getPath().toDotString())) {
>             stats.mergeStatistics(column.getStatistics());
>           }
>         }
>       }
>     }
>   }
> }
>
> Thanks.
>
