rdblue opened a new pull request #1397:
URL: https://github.com/apache/iceberg/pull/1397


   This replaces the use of the `all_data_files` metadata table in 
RemoveOrphanFilesAction and ExpireSnapshotsAction with a call that reads data 
file paths from manifest files in parallel. This avoids reading all of the 
manifest lists in the Spark driver to plan the `all_data_files` scan.
   
   On large tables, the new approach runs much faster with adaptive execution 
and broadcast joins disabled. Both optimizations rely on size estimates that 
are incorrect here: a table has far more data files than manifests, and Spark 
does not account for a single input row (a manifest file) producing thousands 
or millions of result rows (data files) in a stage.
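
   For reference, a minimal sketch of turning both optimizations off for a session. The two keys are standard Spark SQL settings; the helper name and the `RecordingConf` stand-in are hypothetical, and this description does not say whether the actions set these themselves or leave them to the caller.

```python
# Spark SQL settings that disable the two size-estimate-driven optimizations.
SIZE_ESTIMATE_OPTS_OFF = {
    "spark.sql.adaptive.enabled": "false",         # adaptive query execution
    "spark.sql.autoBroadcastJoinThreshold": "-1",  # -1 disables broadcast joins
}


def disable_size_estimate_optimizations(conf):
    """Apply the settings to any object with a set(key, value) method.

    With PySpark this would be the session's runtime config, `spark.conf`.
    """
    for key, value in SIZE_ESTIMATE_OPTS_OFF.items():
        conf.set(key, value)
    return conf


class RecordingConf:
    """Hypothetical stand-in for `spark.conf` so the sketch runs without Spark."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = disable_size_estimate_optimizations(RecordingConf())
```

   In a real session the call would be `disable_size_estimate_optimizations(spark.conf)`, made before running the action.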




