yyanyy opened a new pull request #1641: URL: https://github.com/apache/iceberg/pull/1641
This change adds NaN counter in `Metrics` model, and update it during Parquet writing. I believe it only touches internal models and will not write the new attribute to output files. This change is the first step towards implementing spec change defined in https://github.com/apache/iceberg/pull/348 . Questions: - As mentioned in a few comments I highlighted, `SparkTableUtil` (essentially `importSparkTable()`) ([link]( https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L372)) reads metrics from Parquet footer directly for importing tables, and thus won't populate NaN metrics with this approach. If we don't want to accept this as a fact, we may need to switch `ParquetFileReader.readFooter()` to use internal parquet reader, and enable metric collection during reading, but this could be much more expensive than the current approach. Do people have better suggestions? + One thing that may worth noting is that `ParquetFileReader.readFooter` is on deprecation path (https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/deprecated-list.html) - The current change doesn't help with removing NaN from lower/upper bound, since parquet library doesn't treat NaN for its min/max stats specially. I'm thinking to use the same approach to populate upper and lower bounds, and wondering if people have better suggestions. Wanted to submit a draft to gather comments on the approach in general. Will add more tests later. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
