Hi,

Statistics objects are mainly created for internal use. The check you
mentioned is to ensure that only the corresponding column statistics are
summarized.
The code you've written works properly because you create and use the
Statistics object as we use it internally. However, it is quite easy to
misuse it.
It is also worth mentioning that the code works properly because your type
is INT64. For some other types (e.g. FLOAT, DOUBLE, BINARY) it would not
always be that trivial.
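To make the floating-point case concrete, here is a minimal plain-Java
sketch. It is not parquet-mr code, and the class and method names are made up
for illustration. It assumes that NaN and signed zero are among the
complications, which this thread does not spell out. It shows that a naive
maximum over doubles depends on where NaN appears, while long values have no
such special cases:

```java
class FloatingPointMinMax {

    // Naive maximum using the '>' operator. Any comparison involving NaN is
    // false, so the result depends on the position of NaN in the input.
    static double naiveMax(double[] values) {
        double max = values[0];
        for (double v : values) {
            if (v > max) {
                max = v;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double[] nanFirst = {Double.NaN, 1.0, 2.0};
        double[] nanLast = {1.0, 2.0, Double.NaN};

        System.out.println(naiveMax(nanFirst));               // NaN
        System.out.println(naiveMax(nanLast));                // 2.0

        // Math.max propagates NaN instead of skipping it.
        System.out.println(Double.isNaN(Math.max(2.0, Double.NaN))); // true

        // Signed zeros compare equal with '==' but are ordered by Double.compare.
        System.out.println(0.0 == -0.0);                      // true
        System.out.println(Double.compare(0.0, -0.0) > 0);    // true

        // For long values every pair is ordered, so max is unambiguous.
        System.out.println(Math.max(3L, 7L));                 // 7
    }
}
```

Whichever rule is picked for NaN and for -0.0 versus +0.0 has to be applied
consistently on both the write and the read side, which is part of why a
generic merge is harder than the INT64 case.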
So, if this code works for your case you may use it, but I would not suggest
generalizing it to other cases, nor would I suggest extending the existing
code to support it.
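
If a true/false result is still useful in your own application, the column
check can stay on the caller's side instead of going into the library. A
minimal sketch of that idea follows. It uses a hypothetical stand-in class:
LongColumnStats and tryMerge are invented for this example and are not part
of parquet-mr.

```java
import java.util.Objects;

// Hypothetical stand-in for per-column INT64 min/max statistics.
// This is NOT the parquet-mr Statistics class.
final class LongColumnStats {
    final String columnPath;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    boolean hasValues = false;

    LongColumnStats(String columnPath) {
        this.columnPath = Objects.requireNonNull(columnPath);
    }

    // Convenience factory for statistics that already hold a min and a max.
    static LongColumnStats of(String columnPath, long min, long max) {
        LongColumnStats stats = new LongColumnStats(columnPath);
        stats.min = min;
        stats.max = max;
        stats.hasValues = true;
        return stats;
    }

    // Returns false (instead of throwing) when the other statistics are
    // missing, empty, or belong to a different column; otherwise merges
    // them into this object and returns true.
    boolean tryMerge(LongColumnStats other) {
        if (other == null || !other.hasValues
                || !columnPath.equals(other.columnPath)) {
            return false;
        }
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        hasValues = true;
        return true;
    }

    public static void main(String[] args) {
        LongColumnStats seq = new LongColumnStats("seq");
        System.out.println(seq.tryMerge(of("seq", 1, 10)));      // true
        System.out.println(seq.tryMerge(of("other", 100, 200))); // false
        System.out.println(seq.tryMerge(of("seq", 5, 42)));      // true
        System.out.println("min=" + seq.min + " max=" + seq.max); // min=1 max=42
    }
}
```

With the real classes the equivalent is the path comparison you already have
in front of the mergeStatistics call.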

Regards,
Gabor

On Mon, Mar 9, 2020 at 4:12 PM David Mollitor <[email protected]> wrote:

> Hello,
>
> One thing that would have made this even easier... the 'mergeStatistics'
> method throws an exception if the columns are not equal on the RHS/LHS of
> the method.  I had to add that toDotString check to avoid this scenario.  I
> could have just caught (and ignored) that exception to remove that extra
> check, but the overhead would have been heavy, and it would have added even
> more code.
>
> The 'mergeStatistics' method is already doing a comparison check internally
> (that's why it throws an exception).  Is there any interest in adding a new
> method signature that returns true/false to indicate whether the merge was
> successful, instead of throwing an exception?
>
> Then the code just becomes:
>
> for (final BlockMetaData rowGroup : reader.getRowGroups()) {
>   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
>     boolean success = stats.mergeStatistics(column.getStatistics());
>   }
> }
>
>
>
> On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky
> <[email protected]> wrote:
>
> > Hi David,
> >
> > Your code looks good to me. As you are using INT64, min/max truncation
> > does not apply, so I think it should work fine.
> >
> > Cheers,
> > Gabor
> >
> > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> wrote:
> >
> > > Hello Gang,
> > >
> > > I am trying to build an application.  One function it has is to scan a
> > > directory of Parquet files and then determine the maximum "sequence
> > > number" (id) across all files.  This is the solution I came up with, but
> > > is this correct?  How would you do such a thing?
> > >
> > > I wrote the files with the parquet-avro writer.
> > >
> > > try (DirectoryStream<java.nio.file.Path> directoryStream =
> > >     Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
> > >
> > >   PrimitiveType type =
> > >       Types.required(PrimitiveTypeName.INT64).named("seq");
> > >   Statistics<?> stats = Statistics.getBuilderForReading(type).build();
> > >
> > >   for (java.nio.file.Path path : directoryStream) {
> > >     // Close each reader once its footer metadata has been read.
> > >     try (ParquetFileReader reader = ParquetFileReader.open(
> > >         HadoopInputFile.fromPath(new Path(path.toUri()),
> > >             new Configuration()))) {
> > >       for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> > >         for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > >           // Merge only the statistics of the "seq" column.
> > >           if ("seq".equals(column.getPath().toDotString())) {
> > >             stats.mergeStatistics(column.getStatistics());
> > >           }
> > >         }
> > >       }
> > >     }
> > >   }
> > > }
> > >
> > > Thanks.
> > >
> >
>
