Hey Gabor, I appreciate you sharing your knowledge with me.
As I understand it, my solution is acceptable but is not the generalized solution. What would that solution look like?

Thanks.

On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky <[email protected]> wrote:

> Hi,
>
> Statistics objects are mainly created for internal use. The check you
> mentioned is to ensure that only the corresponding column statistics are
> summarized.
> The code you've written works properly because you create and use the
> Statistics object as we use it internally. However, it is quite easy to
> misuse it.
> It is also worth mentioning that the code works properly because your type
> is an INT64. In case of some other types (e.g. FLOAT, DOUBLE, BINARY) it
> would not always be that trivial.
> So, if this code works for your case you may use it, but I would not suggest
> generalizing it for other cases, nor would I suggest extending the
> existing code to support it.
>
> Regards,
> Gabor
>
> On Mon, Mar 9, 2020 at 4:12 PM David Mollitor <[email protected]> wrote:
>
> > Hello,
> >
> > One thing that would have made this even easier... the 'mergeStatistics'
> > method throws an exception if the columns are not equal on the RHS/LHS of
> > the method. I had to add that toDotString check to avoid this scenario. I
> > could have just caught (and ignored) that exception to remove that extra
> > check, but the overhead would have been heavy, and it would have added
> > even more code.
> >
> > The 'mergeStatistics' method is already doing a comparison check
> > internally (that's why it throws an exception). Is there any interest in
> > adding a new method signature that returns true/false if the merge was
> > successful, instead of throwing an exception?
> >
> > Then the code just becomes:
> >
> > for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> >   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> >     boolean success = stats.mergeStatistics(column.getStatistics());
> >   }
> > }
> >
> > On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky <[email protected]> wrote:
> >
> > > Hi David,
> > >
> > > Your code looks good to me. As you are using INT64, min/max truncate
> > > does not apply. I think it should work fine.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <[email protected]> wrote:
> > >
> > > > Hello Gang,
> > > >
> > > > I am trying to build an application. One function it has is to scan a
> > > > directory of Parquet files and then determine the maximum "sequence
> > > > number" (id) across all files. This is the solution I came up with,
> > > > but is this correct? How would you do such a thing?
> > > >
> > > > I wrote the files with parquet-avro writer.
> > > >
> > > > try (DirectoryStream<java.nio.file.Path> directoryStream =
> > > >     Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
> > > >
> > > >   PrimitiveType type =
> > > >       Types.required(PrimitiveTypeName.INT64).named("seq");
> > > >   Statistics<?> stats = Statistics.getBuilderForReading(type).build();
> > > >
> > > >   for (java.nio.file.Path path : directoryStream) {
> > > >     ParquetFileReader reader =
> > > >         ParquetFileReader.open(HadoopInputFile.fromPath(
> > > >             new Path(path.toUri()), new Configuration()));
> > > >
> > > >     for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> > > >       for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > > >         if ("seq".equals(column.getPath().toDotString())) {
> > > >           stats.mergeStatistics(column.getStatistics());
> > > >         }
> > > >       }
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > Thanks.
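
For illustration only, the boolean-returning merge proposed in the quoted message can be sketched without Parquet at all. Everything below is an assumption made for the sketch: `LongColumnStats`, `of`, and `tryMerge` are hypothetical names, not part of Parquet's API, and the class only tracks a `long` min/max for a single column path. It shows the shape of the idea (report a mismatch with `false` rather than an exception), not how the library behaves.

```java
/**
 * Hypothetical stand-in for per-column min/max statistics.
 * This is NOT Parquet's Statistics class; all names are illustrative.
 */
final class LongColumnStats {
    private final String columnPath;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private boolean hasValues = false;

    LongColumnStats(String columnPath) {
        this.columnPath = columnPath;
    }

    /** Convenience factory for stats that already hold a min/max pair. */
    static LongColumnStats of(String columnPath, long min, long max) {
        LongColumnStats s = new LongColumnStats(columnPath);
        s.min = min;
        s.max = max;
        s.hasValues = true;
        return s;
    }

    /**
     * Merges other into this when both describe the same column.
     * Returns false, without throwing, when the columns differ.
     */
    boolean tryMerge(LongColumnStats other) {
        if (other == null || !columnPath.equals(other.columnPath)) {
            return false;
        }
        if (other.hasValues) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            hasValues = true;
        }
        return true;
    }

    boolean hasValues() { return hasValues; }
    long min() { return min; }
    long max() { return max; }

    public static void main(String[] args) {
        // Pretend these came from the column chunks of several row groups.
        LongColumnStats[] chunks = {
            LongColumnStats.of("seq", 1L, 10L),
            LongColumnStats.of("name", 0L, 0L),
            LongColumnStats.of("seq", 11L, 42L)
        };

        LongColumnStats total = new LongColumnStats("seq");
        for (LongColumnStats chunk : chunks) {
            // No column-name pre-check and no try/catch: a mismatch is just false.
            boolean merged = total.tryMerge(chunk);
            System.out.println(chunk.columnPath + (merged ? ": merged" : ": skipped"));
        }
        System.out.println("max seq = " + total.max());
    }
}
```

The trade-off is that a `false` return is easy to ignore silently, which fits the concern raised in the thread that such an API is easy to misuse; a caller that cares should check `hasValues()` before trusting `max()`.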
