vaultah opened a new issue, #13932:
URL: https://github.com/apache/iceberg/issues/13932

   ### Feature Request / Improvement
   
   This is a follow-up to the work done in PR #13720 to fix a correctness bug 
in the `RewriteTablePath` action (#13719).
   
   **Problem:**
   The current implementation uses `.toLocalIterator()` to collect the results 
of the manifest rewrite job on the driver. While this is memory-safe, it puts 
the entire aggregation workload (processing O(N) records and building the final 
map) sequentially on the driver. This can become a performance bottleneck for 
tables with a very large number of manifests.
   
   Additionally, as discussed in the original PR, the way the generic 
`RewriteResult` and `RewriteContentFileResult` classes are used is awkward for 
this bottom-up aggregation pattern.
   
   **Proposed Solution:**
   The action should be refactored to use a dedicated, reducible result class 
(one whose partial results can be merged pairwise) so that the aggregation runs 
in parallel on the Spark executors instead of sequentially on the driver.
   
   This refactoring will improve the scalability of the action and simplify the 
internal data flow, making the code cleaner and more maintainable.
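
To illustrate what a "reducible" result could look like, here is a minimal, 
Spark-free sketch. All names in it (`ReducibleResultSketch`, `CopyPlan`, 
`rewriteManifest`, `rewriteAll`) are hypothetical and are not the existing 
Iceberg classes. A parallel Java stream stands in for the executors; in the 
real action the same merge function would presumably be handed to a Spark 
reduce-style operation. The sketch assumes each source path appears at most 
once, so the merge order does not affect the result.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Iceberg code base.
final class ReducibleResultSketch {

  /** Immutable partial result: a map of source path to target path. */
  static final class CopyPlan implements Serializable {
    private final Map<String, String> entries;

    private CopyPlan(Map<String, String> entries) {
      this.entries = Collections.unmodifiableMap(entries);
    }

    static CopyPlan empty() {
      return new CopyPlan(new HashMap<>());
    }

    static CopyPlan of(String source, String target) {
      Map<String, String> single = new HashMap<>();
      single.put(source, target);
      return new CopyPlan(single);
    }

    /** Associative merge that returns a new object and leaves both inputs untouched. */
    CopyPlan merge(CopyPlan other) {
      Map<String, String> merged = new HashMap<>(entries);
      merged.putAll(other.entries);
      return new CopyPlan(merged);
    }

    Map<String, String> entries() {
      return entries;
    }
  }

  /** Stand-in for the per-manifest rewrite task: yields one partial result. */
  static CopyPlan rewriteManifest(String path, String sourcePrefix, String targetPrefix) {
    String target = path.startsWith(sourcePrefix)
        ? targetPrefix + path.substring(sourcePrefix.length())
        : path;
    return CopyPlan.of(path, target);
  }

  /** Partial results are merged pairwise in parallel rather than in one sequential loop. */
  static CopyPlan rewriteAll(List<String> manifests, String sourcePrefix, String targetPrefix) {
    return manifests.parallelStream()
        .map(path -> rewriteManifest(path, sourcePrefix, targetPrefix))
        .reduce(CopyPlan.empty(), CopyPlan::merge);
  }

  public static void main(String[] args) {
    List<String> manifests = List.of("s3://old/m1.avro", "s3://old/m2.avro");
    CopyPlan plan = rewriteAll(manifests, "s3://old", "s3://new");
    plan.entries().forEach((source, target) -> System.out.println(source + " -> " + target));
  }
}
```

Because `merge` is associative and has `empty()` as its identity, partial 
results can be combined in any grouping, which is the property a parallel 
reduce relies on.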
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [ ] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

