Count) push down for Parquet

GitBox Wed, 13 Oct 2021 09:39:01 -0700


timarmstrong commented on pull request #33639:
URL: https://github.com/apache/spark/pull/33639#issuecomment-942487858



   Hi @huaxingao @sunchao. I saw this change go past and I have concerns about 
this not gracefully handling missing stats - it's a bad user experience if they 
need to turn off a very effective optimisation globally because of a single 
empty stat column in a (valid!) Parquet file. 
   
   > If somehow the file doesn't have stats for the aggregate columns, Spark 
will throw Exception.
   There are many reasons why a file might not have min/max stats:
   * There's no requirement in the Parquet spec that writers have to include 
them, it's optional.
   * The min_value/max_value fields were only added in Parquet format 
2.4.0/parquet-mr 1.9.0: https://issues.apache.org/jira/browse/PARQUET-686
   * Readers might have to ignore min/max stats if there are buggy writers that 
produce invalid min/max values. This has happened historically.
   * Writers can choose not to include stats for various reasons, e.g. strings 
are too long. Some workloads do work with very large string data so it's not 
always practical to include as min/max stats.
   
   I've looked at implementing a similar optimisation to this before and I 
think in order for it to be robust, the reader has to fall back to scanning the 
column and computing min/max if the stats are not available or useable. 
Otherwise it's not something that's really robust or production-ready.
   
   I also think Ivan's concern is valid and there is a risk of incorrect 
results. The Parquet spec I think is unclear about whether truncation of 
min/max stats at the row group level is allowed. When the page index support 
was added, it was clarified that truncation *is* allowed in the page index, but 
it didn't address the row group level min/max stats - 
https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach.
 I don't know if writers do truncate in practice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] timarmstrong commented on pull request #33639: [SPARK-36645][SQL] Aggregate (Min/Max/Count) push down for Parquet

Reply via email to