liziyan-lzy commented on PR #12254:
URL: https://github.com/apache/iceberg/pull/12254#issuecomment-3346393088

   > @liziyan-lzy hey, I am trying to use this feature, but when enabled on a table with a medium amount of files, I get the following error:
   > 
   > ```
   > Serialized task 466:0 was 1732246938 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
   > ```
   > 
   > as well as the following warning:
   > 
   > ```
   > WARN TaskSetManager: Stage 16 contains a task of very large size (1691637 KiB). The maximum recommended task size is 1000 KiB.
   > ```
   > 
   > I think the previous Hadoop operation doesn't run into this problem because it spreads the listings to the executors?
   
   Hi @JoeryH,
   
   Thank you for reporting this. I think you are right: the error suggests the full file listing is collected on the driver and then serialized into the tasks, which is what pushes the task size past spark.rpc.message.maxSize. I am looking into how we can leverage our existing framework to perform the file listing in a distributed manner, similar to how the Hadoop-based operation spreads the listings across executors. That should keep individual task sizes small and use cluster resources more effectively.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

