Re: Finding Max Value of Column

David Mollitor Tue, 10 Mar 2020 06:09:04 -0700

Hey Gabor,

I appreciate you sharing your knowledge with me.


As I understand it, my solution is acceptable but is not the generalized
solution.  What would that solution look like?

Thanks.

On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky
<[email protected]> wrote:

> Hi,
>
> Statistics objects are mainly created for internal use. The check you
> mentioned is to ensure that only the corresponding column statistics are
> summarized.
> The code you've written works properly because you create and use the
> Statistics object as we use it internally. However, it is quite easy to
> misuse it.
> It is also worth mentioning that the code works properly because your type
> is an INT64. In case of some other types (e.g. FLOAT, DOUBLE, BINARY) it
> would not always be that trivial.
> So, if this code works for your case you may use it, but I would not suggest
> generalizing it for other cases, nor would I suggest extending the
> existing code to support it.
>
> Regards,
> Gabor
>
> On Mon, Mar 9, 2020 at 4:12 PM David Mollitor <[email protected]> wrote:
>
> > Hello,
> >
> > One thing that would have made this even easier: the 'mergeStatistics'
> > method throws an exception if the columns on the LHS and RHS of the call
> > are not the same.  I had to add the toDotString check to avoid that
> > scenario.  I could have caught (and ignored) the exception instead of
> > adding the check, but the overhead would have been heavy, and it would
> > have added even more code.
> >
> > The 'mergeStatistics' method is already doing a comparison check
> > internally (that's why it throws an exception).  Is there any interest in
> > adding a new method signature that returns true/false to indicate whether
> > the merge was successful, instead of throwing an exception?
> >
> > Then the code just becomes:
> >
> > for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> >   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> >     boolean success = stats.mergeStatistics(column.getStatistics());
> >   }
> > }
> >
> >
> >
> > On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky
> > <[email protected]> wrote:
> >
> > > Hi David,
> > >
> > > Your code looks good to me. As you are using INT64, min/max truncation
> > > does not apply. I think it should work fine.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> wrote:
> > >
> > > > Hello Gang,
> > > >
> > > > I am trying to build an application.  One of its functions is to scan a
> > > > directory of Parquet files and determine the maximum "sequence number"
> > > > (id) across all files.  This is the solution I came up with, but is it
> > > > correct?  How would you do such a thing?
> > > >
> > > > I wrote the files with the parquet-avro writer.
> > > >
> > > > try (DirectoryStream<java.nio.file.Path> directoryStream =
> > > >     Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
> > > >
> > > >   PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("seq");
> > > >   Statistics<?> stats = Statistics.getBuilderForReading(type).build();
> > > >
> > > >   for (java.nio.file.Path path : directoryStream) {
> > > >     // try-with-resources so each reader is closed once its footer has been read
> > > >     try (ParquetFileReader reader = ParquetFileReader.open(
> > > >         HadoopInputFile.fromPath(new Path(path.toUri()), new Configuration()))) {
> > > >       for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> > > >         for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > > >           // merge only the statistics that belong to the "seq" column
> > > >           if ("seq".equals(column.getPath().toDotString())) {
> > > >             stats.mergeStatistics(column.getStatistics());
> > > >           }
> > > >         }
> > > >       }
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > Thanks.
> > > >
> > >
> >
>
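The boolean-returning merge proposed in the thread can be sketched without touching Parquet itself. The class below is a hypothetical stand-in for per-column INT64 statistics: LongColumnStats, update and tryMerge are names invented for illustration and are not part of the Parquet API. It only shows the shape of the idea, where a column mismatch is reported through the return value rather than an exception.

```java
import java.util.Objects;

// Hypothetical stand-in for per-column min/max statistics; NOT the Parquet Statistics class.
final class LongColumnStats {
    private final String columnPath;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private boolean hasValues = false;

    LongColumnStats(String columnPath) {
        this.columnPath = Objects.requireNonNull(columnPath);
    }

    // Record a single value, as a writer would while building a column chunk.
    void update(long value) {
        min = Math.min(min, value);
        max = Math.max(max, value);
        hasValues = true;
    }

    // Merge 'other' into this object only if it describes the same column.
    // Returns false (and leaves this object untouched) instead of throwing.
    boolean tryMerge(LongColumnStats other) {
        if (other == null || !columnPath.equals(other.columnPath)) {
            return false;
        }
        if (other.hasValues) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            hasValues = true;
        }
        return true;
    }

    boolean hasValues() { return hasValues; }
    long getMin() { return min; }
    long getMax() { return max; }

    public static void main(String[] args) {
        LongColumnStats total = new LongColumnStats("seq");

        LongColumnStats rowGroup1 = new LongColumnStats("seq");
        rowGroup1.update(1);
        rowGroup1.update(3);

        LongColumnStats rowGroup2 = new LongColumnStats("seq");
        rowGroup2.update(10);

        LongColumnStats unrelated = new LongColumnStats("other");
        unrelated.update(99);

        total.tryMerge(rowGroup1);
        total.tryMerge(rowGroup2);
        boolean merged = total.tryMerge(unrelated); // false: different column

        System.out.println(merged);         // false
        System.out.println(total.getMax()); // 10
    }
}
```

With a method of this shape, the caller's loop needs no toDotString check: statistics for other columns are simply rejected. Whether such a method belongs in the library is a separate question, given Gabor's point that Statistics objects are meant mainly for internal use.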

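Gabor's caveat that FLOAT and DOUBLE are "not always that trivial" can be illustrated with plain Java, no Parquet required. One likely reason (an assumption here, since the thread does not spell it out) is that NaN has no order relative to other values and that 0.0 and -0.0 compare differently depending on the comparison used. The helper below, with the hypothetical names DoubleMaxMerge and mergeMax, is a minimal sketch of a NaN-aware max merge and not how Parquet handles it internally.

```java
// Illustrative helper; the class and method names are hypothetical, not Parquet API.
final class DoubleMaxMerge {
    // Fold one candidate max into the running max, skipping NaN,
    // which has no defined order relative to other values.
    static double mergeMax(double runningMax, double candidate) {
        if (Double.isNaN(candidate)) {
            return runningMax;
        }
        if (Double.isNaN(runningMax)) {
            return candidate;
        }
        return Math.max(runningMax, candidate);
    }

    public static void main(String[] args) {
        // Math.max propagates NaN, so a single NaN poisons a naive merge.
        System.out.println(Math.max(3.5, Double.NaN));      // NaN
        System.out.println(mergeMax(3.5, Double.NaN));      // 3.5

        // 0.0 and -0.0 are equal under ==, yet Double.compare orders them.
        System.out.println(0.0 == -0.0);                    // true
        System.out.println(Double.compare(0.0, -0.0) > 0);  // true
    }
}
```

For INT64 none of this arises, which is consistent with the advice in the thread that the approach is fine for a sequence-number column but should not be generalized as-is.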