yyanyy opened a new pull request #1641:
URL: https://github.com/apache/iceberg/pull/1641


   This change adds a NaN counter to the `Metrics` model and updates it during Parquet writing. I believe it only touches internal models and will not write the new attribute to output files. This change is the first step towards implementing the spec change defined in https://github.com/apache/iceberg/pull/348.
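
   To make the idea concrete, here is a minimal sketch of the kind of per-column counting involved; the names (`NanCountingDoubleWriter`, `nanValueCount`) are hypothetical, and this is an outline of the approach rather than the PR's actual code.

```java
// Illustrative sketch only: count NaN values for a double column as records
// are written, so the count can later be surfaced through the Metrics model.
public class NanCountingDoubleWriter {
  private long nanValueCount = 0;

  public void write(double value) {
    if (Double.isNaN(value)) {
      nanValueCount++;
    }
    // delegate to the underlying Parquet value writer here
  }

  // exposed so the file writer can fold this count into per-column metrics
  public long nanValueCount() {
    return nanValueCount;
  }
}
```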
   
   Questions:
   - As mentioned in a few comments I highlighted, `SparkTableUtil` (essentially `importSparkTable()`) ([link](https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L372)) reads metrics directly from Parquet footers when importing tables, and thus won't populate NaN metrics with this approach (see the footer-reading sketch after this list). If we don't want to accept this as a limitation, we may need to switch from `ParquetFileReader.readFooter()` to the internal Parquet reader and enable metric collection during reading, but that could be much more expensive than the current approach. Do people have better suggestions?
        + One thing that may be worth noting is that `ParquetFileReader.readFooter` is on a deprecation path (https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/deprecated-list.html).
   - The current change doesn't help with removing NaN from the lower/upper bounds, since the Parquet library doesn't treat NaN specially in its min/max stats. I'm thinking of using the same approach to populate the upper and lower bounds (see the bounds sketch below), and wondering if people have better suggestions.
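
   To illustrate the first question, here is a rough, assumption-laden sketch of footer-based stats reading in the style of `SparkTableUtil` (the class name `FooterStatsExample` is made up). Parquet footer statistics expose min/max and null counts but carry no NaN count, which is why a footer-only import can't populate the new metric.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterStatsExample {
  public static void main(String[] args) throws Exception {
    // readFooter is on a deprecation path, as noted above
    ParquetMetadata footer = ParquetFileReader.readFooter(
        new Configuration(), new Path(args[0]), ParquetMetadataConverter.NO_FILTER);

    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        Statistics<?> stats = column.getStatistics();
        // Footer stats carry min/max and null counts only; there is no
        // NaN count here to import into the table's Metrics.
        System.out.printf("%s: nulls=%d, min=%s, max=%s%n",
            column.getPath(), stats.getNumNulls(), stats.minAsString(), stats.maxAsString());
      }
    }
  }
}
```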
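
   And a minimal sketch of the bounds idea from the second question: track min/max during writing while skipping NaN, mirroring the NaN counter above. The class name (`DoubleBoundsTracker`) is hypothetical.

```java
// Illustrative sketch only: NaN-aware lower/upper bound tracking for a
// double column, so NaN never leaks into the bounds (it is counted instead).
public class DoubleBoundsTracker {
  private Double lowerBound = null;
  private Double upperBound = null;

  public void update(double value) {
    if (Double.isNaN(value)) {
      return; // counted by the NaN counter, never used as a bound
    }
    if (lowerBound == null || value < lowerBound) {
      lowerBound = value;
    }
    if (upperBound == null || value > upperBound) {
      upperBound = value;
    }
  }

  public Double lowerBound() {
    return lowerBound;
  }

  public Double upperBound() {
    return upperBound;
  }
}
```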
   
   I wanted to submit a draft to gather comments on the approach in general; I will add more tests later.

