Re: Finding Max Value of Column

Gabor Szadovszky Tue, 10 Mar 2020 06:16:22 -0700

Hi David,

Unfortunately, I don't have a better solution. If you think finding a
global min/max value in a file would be frequently used by our clients, you
may create a Jira for this feature.


Regards,
Gabor

On Tue, Mar 10, 2020 at 2:08 PM David Mollitor <[email protected]> wrote:

> Hey Gabor,
>
> I appreciate you sharing your knowledge with me.
>
> As I understand it, my solution is acceptable but is not the generalized
> solution.  What would that solution look like?
>
> Thanks.
>
> On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky
> <[email protected]> wrote:
>
> > Hi,
> >
> > Statistics objects are mainly created for internal use. The check you
> > mentioned is to ensure that only the corresponding column statistics are
> > summarized.
> > The code you've written works properly because you create and use the
> > Statistics object as we use it internally. However, it is quite easy to
> > misuse it.
> > It is also worth mentioning that the code works properly because your
> > type is an INT64. In case of some other types (e.g. FLOAT, DOUBLE,
> > BINARY) it would not always be that trivial.
> > So, if this code works for your case you may use it, but I would not
> > suggest generalizing it for other cases, nor would I suggest extending
> > the existing code to support it.
> >
> > Regards,
> > Gabor
> >
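
The point about FLOAT and DOUBLE can be illustrated with a minimal
plain-Java sketch. It does not use Parquet at all: the class and method
names (FloatMinMaxSketch, naiveMax, nanSkippingMax) are hypothetical, and
the list stands in for per-chunk max values. It only shows one reason a
naive merge stops being trivial once NaN can appear; it is not a
description of how Parquet itself handles these types.

```java
import java.util.List;

public class FloatMinMaxSketch {
    // Naive merge: every comparison against NaN is false, so a NaN that
    // arrives first is never replaced by a real value.
    static double naiveMax(List<Double> chunkMaxes) {
        double max = chunkMaxes.get(0);
        for (double candidate : chunkMaxes) {
            if (candidate > max) {
                max = candidate;
            }
        }
        return max;
    }

    // Defensive merge: ignore NaN entries. Returns negative infinity when
    // no usable value is present.
    static double nanSkippingMax(List<Double> chunkMaxes) {
        double max = Double.NEGATIVE_INFINITY;
        for (double candidate : chunkMaxes) {
            if (!Double.isNaN(candidate) && candidate > max) {
                max = candidate;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Double> chunkMaxes = List.of(Double.NaN, 3.5, 7.25);
        System.out.println(naiveMax(chunkMaxes));       // prints "NaN"
        System.out.println(nanSkippingMax(chunkMaxes)); // prints "7.25"
    }
}
```

With INT64 there is no such special value, which is why the simple merge
is safe in that case.
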
> > On Mon, Mar 9, 2020 at 4:12 PM David Mollitor <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > One thing that would have made this even easier... the
> > > 'mergeStatistics' method throws an exception if the columns are not
> > > equal on the RHS/LHS of the method.  I had to add that toDotString
> > > check to avoid this scenario.  I could have just caught (and ignored)
> > > that exception to remove that extra check, but the overhead would have
> > > been heavy, and it would have added even more code.
> > >
> > > The 'mergeStatistics' method is already doing a comparison check
> > > internally (that's why it throws an exception).  Is there any interest
> > > in adding a new method signature that returns true/false if the merge
> > > was successful, instead of throwing an exception?
> > >
> > > Then the code just becomes:
> > >
> > > for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> > >   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > >     // Proposed signature: returns false instead of throwing.
> > >     boolean success = stats.mergeStatistics(column.getStatistics());
> > >   }
> > > }
> > >
> > >
> > >
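
The boolean-returning merge proposed above can be sketched in plain Java.
Everything here is hypothetical: LongColumnStats and tryMerge are
stand-ins invented for illustration and are not part of the Parquet API.
The sketch only shows the shape of a merge that reports a column mismatch
through its return value rather than an exception.

```java
import java.util.List;

public class TryMergeSketch {
    /** Hypothetical stand-in for the INT64 statistics of one column chunk. */
    static final class LongColumnStats {
        final String columnPath;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        boolean hasValues = false;

        LongColumnStats(String columnPath) {
            this.columnPath = columnPath;
        }

        static LongColumnStats of(String columnPath, long min, long max) {
            LongColumnStats stats = new LongColumnStats(columnPath);
            stats.min = min;
            stats.max = max;
            stats.hasValues = true;
            return stats;
        }

        // Returns false (and changes nothing) when the other statistics are
        // missing, empty, or belong to a different column.
        boolean tryMerge(LongColumnStats other) {
            if (other == null || !other.hasValues
                    || !columnPath.equals(other.columnPath)) {
                return false;
            }
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            hasValues = true;
            return true;
        }
    }

    public static void main(String[] args) {
        LongColumnStats seq = new LongColumnStats("seq");
        List<LongColumnStats> chunks = List.of(
                LongColumnStats.of("seq", 1, 100),
                LongColumnStats.of("name", 5, 9),
                LongColumnStats.of("seq", 101, 250));

        int merged = 0;
        for (LongColumnStats chunk : chunks) {
            if (seq.tryMerge(chunk)) {
                merged++;
            }
        }
        // prints "2 chunks merged, max = 250"
        System.out.println(merged + " chunks merged, max = " + seq.max);
    }
}
```

The caller no longer needs a separate path check before merging, which is
the simplification the proposal is after.
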
> > > On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky
> > > <[email protected]> wrote:
> > >
> > > > Hi David,
> > > >
> > > > Your code looks good to me. As you are using INT64, min/max
> > > > truncation does not apply. I think it should work fine.
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
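
Why truncation matters for a max value can be shown with a small sketch.
It assumes, for illustration only, that a binary max is truncated by
keeping a prefix and incrementing its last byte so that the result is
still an upper bound. truncateMax is a hypothetical helper written for
this example, not a Parquet method, and Parquet's real behaviour may
differ in detail.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncatedMaxSketch {
    // Shortens a max value to at most `length` bytes. The last byte that
    // can be incremented is bumped so that the result still compares at or
    // above the original under unsigned byte ordering.
    static byte[] truncateMax(byte[] max, int length) {
        if (max.length <= length) {
            return max.clone();
        }
        byte[] prefix = Arrays.copyOf(max, length);
        for (int i = length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                prefix[i]++;
                return Arrays.copyOf(prefix, i + 1);
            }
        }
        // Every prefix byte is 0xFF: no shorter upper bound exists.
        return max.clone();
    }

    public static void main(String[] args) {
        byte[] original = "sequence-000123".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = truncateMax(original, 4);
        // prints "seqv"
        System.out.println(new String(truncated, StandardCharsets.UTF_8));
        // prints "true": the truncated value is an upper bound
        System.out.println(Arrays.compareUnsigned(truncated, original) > 0);
    }
}
```

The truncated value bounds the data but never occurred in it, so it could
not be reported as the actual maximum. A fixed-width type such as INT64
has nothing to truncate.
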
> > > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> wrote:
> > > >
> > > > > Hello Gang,
> > > > >
> > > > > I am trying to build an application.  One function it has is to
> > > > > scan a directory of Parquet files and then determine the maximum
> > > > > "sequence number" (id) across all files.  This is the solution I
> > > > > came up with, but is it correct?  How would you do such a thing?
> > > > >
> > > > > I wrote the files with the parquet-avro writer.
> > > > >
> > > > > // `filter` is a DirectoryStream.Filter defined elsewhere.
> > > > > try (DirectoryStream<java.nio.file.Path> directoryStream =
> > > > >     Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
> > > > >
> > > > >   PrimitiveType type =
> > > > >       Types.required(PrimitiveTypeName.INT64).named("seq");
> > > > >   Statistics<?> stats = Statistics.getBuilderForReading(type).build();
> > > > >
> > > > >   for (java.nio.file.Path path : directoryStream) {
> > > > >     // Close each reader once its footer metadata has been read.
> > > > >     try (ParquetFileReader reader = ParquetFileReader.open(
> > > > >         HadoopInputFile.fromPath(
> > > > >             new Path(path.toUri()), new Configuration()))) {
> > > > >
> > > > >       for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> > > > >         for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > > > >           // Merge only the "seq" column; any other column throws.
> > > > >           if ("seq".equals(column.getPath().toDotString())) {
> > > > >             stats.mergeStatistics(column.getStatistics());
> > > > >           }
> > > > >         }
> > > > >       }
> > > > >     }
> > > > >   }
> > > > > }
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > >
> >
>
