ZENOTME commented on issue #1604: URL: https://github.com/apache/iceberg-rust/issues/1604#issuecomment-3206448551

Thanks for the suggestions from @Fokko and @liurenjie1024, I learned a lot!

> That's true, but not always the case. For example, Spark leverages distributed planning. Each manifest file has a target size of 8MB, which could be dispatched for distributed planning.

This is close to my initial thought for the distributed case. We can push the row-group pruning down to the executors and, at the same time, cache the metadata they read there, so we don't have to pay the cost of opening the data file twice. However, the number of data files is much larger than the number of manifest files, so the cache size needs to be taken into account. This does make things more complex, and the benefit is still uncertain.

Let's stick to size-based planning for now. I'd like to implement it and benchmark it again to see what happens.
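To make the cache concern above concrete, here is a minimal sketch of the kind of bounded footer cache an executor could keep so that row-group pruning and the later read phase share one metadata fetch. The `FileMetadata` and `FooterCache` types, the FIFO eviction, and the example paths are all hypothetical placeholders for illustration; they are not iceberg-rust or parquet crate APIs.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a parsed Parquet footer (row-group stats, byte ranges, ...).
#[derive(Clone)]
struct FileMetadata {
    row_group_byte_ranges: Vec<(u64, u64)>,
}

/// Bounded cache keyed by data file path. An executor that prunes row groups
/// stores the footer here so the read phase does not re-open the data file.
struct FooterCache {
    capacity: usize,
    // Simple FIFO eviction for the sketch; a real cache would likely use LRU
    // and account for entry size, since data files vastly outnumber manifests.
    order: VecDeque<String>,
    entries: HashMap<String, Arc<FileMetadata>>,
}

impl FooterCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new(), entries: HashMap::new() }
    }

    fn get(&self, path: &str) -> Option<Arc<FileMetadata>> {
        self.entries.get(path).cloned()
    }

    fn insert(&mut self, path: String, meta: Arc<FileMetadata>) {
        if self.entries.contains_key(&path) {
            return;
        }
        if self.entries.len() >= self.capacity {
            if let Some(evicted) = self.order.pop_front() {
                self.entries.remove(&evicted);
            }
        }
        self.order.push_back(path.clone());
        self.entries.insert(path, meta);
    }
}

fn main() {
    let cache = Arc::new(Mutex::new(FooterCache::new(2)));

    // Pruning phase: read the footer once and remember it.
    let meta = Arc::new(FileMetadata { row_group_byte_ranges: vec![(4, 1024), (1028, 2048)] });
    cache.lock().unwrap().insert("s3://bucket/data/a.parquet".to_string(), meta);

    // Read phase on the same executor: reuse the cached footer instead of
    // opening the data file a second time.
    if let Some(hit) = cache.lock().unwrap().get("s3://bucket/data/a.parquet") {
        println!("reusing footer with {} row groups", hit.row_group_byte_ranges.len());
    }
}
```

Even with a bounded cache like this, sizing it well is the hard part given how many data files a scan can touch, which is part of why size-based planning seems like the safer first step.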
Thanks for the suggestions from @Fokko and @liurenjie1024, I learned a lot! > That's true, but not always the case. For exameple, Spark leverages distributed planning. Each manifest file has a target size of 8MB, which could be dispatched for distributed planning. It's close to my initial thought for the distributed case. Wee can push the row group pruning down to the executors, and at the same time cache the metadata they read there, so we don’t have to pay the cost of opening the data file twice. But number of data files is much more than manifest files so the cache size need to be consideration. This does make things more complex and the benefit is still uncertain. Let’s stick to the size-based planning for now. I'd like to implement it and bench it again to see what happen. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org