vaultah opened a new issue, #13932: URL: https://github.com/apache/iceberg/issues/13932
### Feature Request / Improvement

This is a follow-up to the work done in PR #13720 to fix a correctness bug in the `RewriteTablePath` action (#13719).

**Problem:** The current implementation uses `.toLocalIterator()` to collect the results of the manifest rewrite job on the driver. While this is memory-safe, it puts the entire aggregation workload (processing O(N) records and building the final map) sequentially on the driver. This can become a performance bottleneck for tables with a very large number of manifests.

Additionally, as discussed in the original PR, the way the generic `RewriteResult` and `RewriteContentFileResult` classes are used is awkward for this bottom-up aggregation pattern.

**Proposed Solution:** The action should be refactored to use a dedicated, reducible result class to perform the aggregation in parallel on the Spark executors.

This refactoring will improve the scalability of the action and simplify the internal data flow, making the code cleaner and more maintainable.

### Query engine

Spark

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
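To make the proposed "reducible result class" more concrete, here is a minimal sketch in plain Java. It is only an illustration of the pattern, not the actual Iceberg design: the class name `RewriteSummary`, its methods, and the string-based copy-plan entries are all hypothetical. The key property is a `merge` operation that is associative and commutative with an empty identity, so partial results can be combined in any order. In Spark, such a function could presumably be passed to `Dataset.reduce`, whereas here a local stream reduce stands in for the distributed step.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class for illustration only; it is not part of Iceberg.
final class RewriteSummary implements Serializable {
  private static final long serialVersionUID = 1L;

  // Illustrative payload: "source -> target" copy-plan entries as plain strings.
  private final TreeSet<String> copyPlan;

  private RewriteSummary(TreeSet<String> copyPlan) {
    this.copyPlan = copyPlan;
  }

  static RewriteSummary empty() {
    return new RewriteSummary(new TreeSet<>());
  }

  static RewriteSummary of(String... entries) {
    return new RewriteSummary(new TreeSet<>(Arrays.asList(entries)));
  }

  // Associative and commutative, with empty() as the identity, so partial
  // results can be combined in any order. Returns a new object; neither
  // input is mutated.
  RewriteSummary merge(RewriteSummary other) {
    TreeSet<String> union = new TreeSet<>(copyPlan);
    union.addAll(other.copyPlan);
    return new RewriteSummary(union);
  }

  Set<String> copyPlan() {
    return Collections.unmodifiableSet(copyPlan);
  }

  // Local stand-in for a distributed reduce: fold a list of partial results.
  static RewriteSummary mergeAll(List<RewriteSummary> partials) {
    return partials.stream().reduce(empty(), RewriteSummary::merge);
  }

  public static void main(String[] args) {
    // Each element plays the role of one executor-side partial result.
    List<RewriteSummary> partials = Arrays.asList(
        of("old/m1.avro -> new/m1.avro"),
        of("old/m2.avro -> new/m2.avro", "old/m1.avro -> new/m1.avro"),
        empty());
    System.out.println(mergeAll(partials).copyPlan());
  }
}
```

With a shape like this, the driver would only receive one already-merged result instead of iterating over every per-manifest record. The immutable `merge` keeps the sketch easy to reason about; a real implementation might mutate in place to avoid copying.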
