[GitHub] [iceberg] rdblue commented on pull request #2182: Support for PartitionStatsFile in each snapshot

GitBox Wed, 03 Feb 2021 12:05:35 -0800


rdblue commented on pull request #2182:
URL: https://github.com/apache/iceberg/pull/2182#issuecomment-772786048



   Thanks, @vvellanki. I think we will need to take a closer look at the plan
for maintaining these files. I think we should track the latest stats file and
the snapshot it is based on, so we can apply diffs to it and update it
asynchronously.
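
   To make the maintenance idea concrete, here is a minimal sketch of rolling a stats file forward from the snapshot it was based on. It is an illustration only, not the design in this PR: `FileChange`, `PartitionStats`, and `apply_diffs` are hypothetical names, and it assumes each snapshot exposes the list of data files it added and removed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileChange:
    """One data file added or removed by a snapshot (hypothetical type)."""
    partition: str
    record_count: int
    size_bytes: int
    added: bool  # True for an added file, False for a removed one


@dataclass(frozen=True)
class PartitionStats:
    """Per-partition totals plus the snapshot they were computed from."""
    base_snapshot_id: int
    # partition -> (file_count, record_count, size_bytes)
    totals: Dict[str, Tuple[int, int, int]]


def apply_diffs(last: PartitionStats,
                history: List[Tuple[int, List[FileChange]]]) -> PartitionStats:
    """Roll `last` forward over the snapshots committed after its base.

    `history` holds (snapshot_id, changes) pairs in commit order, oldest first.
    """
    ids = [snapshot_id for snapshot_id, _ in history]
    if last.base_snapshot_id not in ids:
        # The base snapshot is not in the history (for example, it expired),
        # so an incremental update is not possible and a full recomputation
        # would be needed instead.
        raise ValueError("base snapshot not found in history")

    totals = dict(last.totals)  # copy so the previous stats stay untouched
    newest = last.base_snapshot_id
    for snapshot_id, changes in history[ids.index(last.base_snapshot_id) + 1:]:
        for change in changes:
            sign = 1 if change.added else -1
            files, records, size = totals.get(change.partition, (0, 0, 0))
            updated = (files + sign,
                       records + sign * change.record_count,
                       size + sign * change.size_bytes)
            if updated[0] == 0:
                totals.pop(change.partition, None)  # partition is now empty
            else:
                totals[change.partition] = updated
        newest = snapshot_id
    return PartitionStats(newest, totals)
```

   Because the result records the snapshot it now reflects, a background job could run this at any time after a commit and later pick up again from wherever the last stats file left off.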
   
   For the use case, I'm curious why you aren't using the actual files. In
Spark, we push filters down to Iceberg before producing stats. The stats are
then based on the actual files that will be scanned, so the estimated size can
be significantly smaller than what partition-level stats alone would give. That
allows a lot more joins to be converted to broadcast joins. Is it possible to
push filters earlier in your job planning?
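
   As a rough illustration of why file-level stats can help, the sketch below compares a partition-level size estimate with one computed after pruning files by a column range. It is not Spark or Iceberg code: `DataFile` and the helper functions are hypothetical, and the only real-world value used is Spark's default `spark.sql.autoBroadcastJoinThreshold` of 10 MB.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Spark's default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical file entry with min/max bounds for one filtered column."""
    partition: str
    size_bytes: int
    lower: int  # smallest value of the column in this file
    upper: int  # largest value of the column in this file


def partition_level_estimate(files: List[DataFile],
                             partitions: Iterable[str]) -> int:
    """Sum the sizes of all files in the selected partitions."""
    wanted = set(partitions)
    return sum(f.size_bytes for f in files if f.partition in wanted)


def file_level_estimate(files: List[DataFile], partitions: Iterable[str],
                        lo: int, hi: int) -> int:
    """Sum only files whose [lower, upper] range overlaps the filter [lo, hi]."""
    wanted = set(partitions)
    return sum(f.size_bytes for f in files
               if f.partition in wanted and f.upper >= lo and f.lower <= hi)


def can_broadcast(estimated_bytes: int) -> bool:
    """A join side is a broadcast candidate when it fits under the threshold."""
    return estimated_bytes <= BROADCAST_THRESHOLD_BYTES
```

   With two 8 MB files in one partition, the partition-level estimate is 16 MB and misses the threshold, while a filter that matches only one of the files yields 8 MB and qualifies for a broadcast join.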




