Hi David, Unfortunately, I don't have a better solution. If you think, finding a global min/max value in a file would be frequently used by our clients you may create a jira for this feature.
Regards, Gabor On Tue, Mar 10, 2020 at 2:08 PM David Mollitor <[email protected]> wrote: > Hey Gabor, > > I appreciate you sharing your knowledge with me. > > As I understand it, my solution is acceptable but is not the generalized > solution. What would that solution look like? > > Thanks. > > On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky > <[email protected]> wrote: > > > Hi, > > > > Statistics objects are mainly created for internal use. The check you > > mentioned is to ensure that only the corresponding column statistics are > > summarized. > > The code you've written works properly because you create and use the > > Statistics object as we use it internally. However, it is quite easy to > > misuse it. > > It is also worth mentioning that the code works properly because your > type > > is an INT64. In case of some other types (e.g. FLOAT, DOUBLE, BINARY) it > > would not always be that trivial. > > So, if this code works for your case you may use it but I would not > suggest > > generalizing it for other cases and neither would suggest extending the > > existing code to support it. > > > > Regards, > > Gabor > > > > On Mon, Mar 9, 2020 at 4:12 PM David Mollitor <[email protected]> wrote: > > > > > Hello, > > > > > > One thing that would have made this even easier... the > 'mergeStatsistics' > > > method throws an exception if the columns are not equal on the RHS/LHS > of > > > the method. I had to add that toDotString check to avoid this > > scenario. I > > > could have just caught (and ignored) that exception to remove that > extra > > > check, but the overhead would have been heavy, and it would have added > > even > > > more code. > > > > > > The 'mergeStatistics' method is already doing a comparison check > > internally > > > (that's why it throws an exception), is there any interest in adding a > > new > > > method signature that returns true/false if the merge was successful, > > > instead of throwing an exception? > > > > > > Then the code just becomes: > > > > > > for (final BlockMetaData rowGroup : reader.getRowGroups()) { > > > for (final ColumnChunkMetaData column : rowGroup.getColumns()) { > > > boolean success = > > > stats.mergeStatistics(column.getStatistics()); > > > } > > > } > > > > > > > > > > > > On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky > > > <[email protected]> wrote: > > > > > > > Hi David, > > > > > > > > Your code looks good to me. As you are using INT64, min/max truncate > > does > > > > not apply. I think, it should work fine. > > > > > > > > Cheers, > > > > Gabor > > > > > > > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> > > wrote: > > > > > > > > > Hello Gang, > > > > > > > > > > I am trying to build an application. One function it has is to > scan > > a > > > > > directory of Parquet files and then determine the maximum "sequence > > > > number" > > > > > (id) across all files. This is the solution I came up with, but is > > > this > > > > > correct? How would you do such a thing? > > > > > > > > > > I wrote the files with parquet-avro writer. > > > > > > > > > > try (DirectoryStream<java.nio.file.Path> directoryStream = > > > > > Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) { > > > > > > > > > > PrimitiveType type = > > > > > Types.required(PrimitiveTypeName.INT64).named("seq"); > > > > > Statistics<?> stats = > > Statistics.getBuilderForReading(type).build(); > > > > > > > > > > for (java.nio.file.Path path : directoryStream) { > > > > > ParquetFileReader reader = > > > > > ParquetFileReader.open(HadoopInputFile.fromPath(new > > Path(path.toUri()), > > > > new > > > > > Configuration())); > > > > > > > > > > for (final BlockMetaData rowGroup : reader.getRowGroups()) { > > > > > for (final ColumnChunkMetaData column : > rowGroup.getColumns()) > > { > > > > > if ("seq".equals(column.getPath().toDotString())) { > > > > > stats.mergeStatistics(column.getStatistics()); > > > > > } > > > > > } > > > > > } > > > > > } > > > > > > > > > > Thanks. > > > > > > > > > > > > > > >
