yyanyy opened a new pull request #1641:
URL: https://github.com/apache/iceberg/pull/1641


   This change adds a NaN counter to the `Metrics` model and updates it during Parquet writing. I believe it only touches internal models and will not write the new attribute to output files. This change is the first step towards implementing the spec change defined in https://github.com/apache/iceberg/pull/348.
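
   To make the idea concrete, here is a minimal sketch of the kind of per-column counting involved; the names (`NanCountingDoubleWriter`, `nanValueCount`) are hypothetical, and this is an outline of the approach rather than the PR's actual code.

```java
// Illustrative sketch only: count NaN values for a double column as records
// are written, so the count can later be surfaced through the Metrics model.
public class NanCountingDoubleWriter {
  private long nanValueCount = 0;

  public void write(double value) {
    if (Double.isNaN(value)) {
      nanValueCount++;
    }
    // delegate to the underlying Parquet value writer here
  }

  // exposed so the file writer can fold this count into per-column metrics
  public long nanValueCount() {
    return nanValueCount;
  }
}
```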
   
   Questions:
   - As mentioned in a few comments I highlighted, `SparkTableUtil` (essentially `importSparkTable()`) ([link](https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L372)) reads metrics directly from Parquet footers when importing tables, and thus won't populate NaN metrics with this approach (see the footer-reading sketch after this list). If we don't want to accept this as a limitation, we may need to switch from `ParquetFileReader.readFooter()` to the internal Parquet reader and enable metric collection during reading, but that could be much more expensive than the current approach. Do people have better suggestions?
        + One thing that may be worth noting is that `ParquetFileReader.readFooter` is on a deprecation path (https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/deprecated-list.html).
   - The current change doesn't help with removing NaN from the lower/upper bounds, since the Parquet library doesn't treat NaN specially in its min/max stats. I'm thinking of using the same approach to populate the upper and lower bounds (see the bounds sketch below), and wondering if people have better suggestions.
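
   To illustrate the first question, here is a rough, assumption-laden sketch of footer-based stats reading in the style of `SparkTableUtil` (the class name `FooterStatsExample` is made up). Parquet footer statistics expose min/max and null counts but carry no NaN count, which is why a footer-only import can't populate the new metric.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterStatsExample {
  public static void main(String[] args) throws Exception {
    // readFooter is on a deprecation path, as noted above
    ParquetMetadata footer = ParquetFileReader.readFooter(
        new Configuration(), new Path(args[0]), ParquetMetadataConverter.NO_FILTER);

    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        Statistics<?> stats = column.getStatistics();
        // Footer stats carry min/max and null counts only; there is no
        // NaN count here to import into the table's Metrics.
        System.out.printf("%s: nulls=%d, min=%s, max=%s%n",
            column.getPath(), stats.getNumNulls(), stats.minAsString(), stats.maxAsString());
      }
    }
  }
}
```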
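
   And a minimal sketch of the bounds idea from the second question: track min/max during writing while skipping NaN, mirroring the NaN counter above. The class name (`DoubleBoundsTracker`) is hypothetical.

```java
// Illustrative sketch only: NaN-aware lower/upper bound tracking for a
// double column, so NaN never leaks into the bounds (it is counted instead).
public class DoubleBoundsTracker {
  private Double lowerBound = null;
  private Double upperBound = null;

  public void update(double value) {
    if (Double.isNaN(value)) {
      return; // counted by the NaN counter, never used as a bound
    }
    if (lowerBound == null || value < lowerBound) {
      lowerBound = value;
    }
    if (upperBound == null || value > upperBound) {
      upperBound = value;
    }
  }

  public Double lowerBound() {
    return lowerBound;
  }

  public Double upperBound() {
    return upperBound;
  }
}
```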
   
   I wanted to submit a draft to gather comments on the approach in general; I will add more tests later.

