rdblue opened a new pull request #1397: URL: https://github.com/apache/iceberg/pull/1397
This replaces use of the `all_data_files` metadata table in RemoveOrphanFilesAction and ExpireSnapshotsAction with a call that reads data file paths from manifest files in parallel. This avoids reading all of the manifest lists in the Spark driver to plan the `all_data_files` scan.

On large tables, this runs much faster with adaptive execution and broadcast joins disabled. Both optimizations rely on size estimates that are incorrect here: the number of data files is much larger than the number of manifests in a table, and Spark does not account for a single row (a manifest file) producing thousands or millions of result rows (data files) in a stage.
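To illustrate the fan-out described above, here is a minimal plain-Java sketch, not the code in this PR. It assumes a hypothetical in-memory map from manifest path to the data file paths that manifest lists, standing in for actually reading manifest files. The class `ManifestFanOutSketch` and its methods are invented for illustration and are not Iceberg or Spark APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical sketch: every name here is an assumption, not an Iceberg or Spark API.
class ManifestFanOutSketch {

  // Expands each manifest into the data file paths it lists, working on manifests in parallel.
  // One input element (a manifest) can yield many output elements (data files).
  static Set<String> referencedDataFiles(Map<String, List<String>> manifestContents) {
    return manifestContents.keySet()
        .parallelStream()
        .flatMap(manifest -> manifestContents.get(manifest).stream())
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // Returns the files found in storage that no manifest references.
  static Set<String> orphanFiles(Set<String> filesInStorage, Set<String> referenced) {
    Set<String> orphans = new TreeSet<>(filesInStorage);
    orphans.removeAll(referenced);
    return orphans;
  }

  public static void main(String[] args) {
    Map<String, List<String>> manifests = new HashMap<>();
    manifests.put("m1.avro", Arrays.asList("a.parquet", "b.parquet"));
    manifests.put("m2.avro", Arrays.asList("b.parquet", "c.parquet"));

    Set<String> referenced = referencedDataFiles(manifests);
    Set<String> inStorage = new TreeSet<>(Arrays.asList("a.parquet", "c.parquet", "d.parquet"));

    System.out.println("referenced: " + referenced);
    System.out.println("orphans: " + orphanFiles(inStorage, referenced));
  }
}
```

The input has one entry per manifest while the output has one entry per data file, so any size estimate derived from the input side undercounts the result. That is the mismatch the description attributes to adaptive execution and broadcast joins.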
