[PR] Core: Support Distributed Scan For Partitions Metadata Table [iceberg]

via GitHub Fri, 22 Aug 2025 14:09:52 -0700


dhruv-pratap opened a new pull request, #13903:
URL: https://github.com/apache/iceberg/pull/13903


   ## Summary
   This PR introduces distributed scanning capabilities for the 
`PartitionsTable` metadata table to improve performance when querying tables 
with many manifest files.
   
   ## Why is this needed?
   Today the partitions metadata table is processed in a single task/thread, which 
can be limiting for engines like Spark when scanning the partitions metadata of 
a very large table with a very large number of manifests. This sometimes leads to 
the process (the Spark driver) running out of memory. This PR enables parallel 
processing of partition metadata for large tables with many manifest files, 
which should reduce query latency for partitions metadata table operations.
   
   ## Key Changes
   - **New Planning Modes**: Added `LOCAL`, `DISTRIBUTED`, and `AUTO` modes via 
`METADATA_PLANNING_MODE` table property
   - **Auto-switching**: Automatically uses distributed scanning when the manifest 
count exceeds a configurable threshold (default: 10)
   - **Enhanced PartitionsTable**: Implements `DistributedPartitionsScan` for 
parallel manifest processing
   - **Comprehensive Testing**: Added tests for core functionality and Spark 
integration (v3.5, v4.0)
   - **Backward Compatibility**: Existing behavior preserved with `AUTO` mode 
as default
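
   The mode selection described in the list above could be sketched roughly as follows. This is an illustrative Python sketch, not the PR's actual Java implementation: `PlanningMode`, `resolve_planning_mode`, `DEFAULT_MANIFEST_THRESHOLD`, and the lowercase property values are hypothetical names chosen for the example. It also assumes "exceeds" means strictly greater than the threshold.

```python
from enum import Enum


class PlanningMode(Enum):
    # Hypothetical values; the real property strings are defined by the PR.
    LOCAL = "local"
    DISTRIBUTED = "distributed"
    AUTO = "auto"


# Assumed default, taken from the "default: 10" noted in the PR description.
DEFAULT_MANIFEST_THRESHOLD = 10


def resolve_planning_mode(mode, manifest_count, threshold=DEFAULT_MANIFEST_THRESHOLD):
    """Return the concrete mode (LOCAL or DISTRIBUTED) to use for one scan."""
    # An explicit LOCAL or DISTRIBUTED setting is honored as-is.
    if mode is not PlanningMode.AUTO:
        return mode
    # AUTO switches to distributed scanning only above the threshold.
    if manifest_count > threshold:
        return PlanningMode.DISTRIBUTED
    return PlanningMode.LOCAL
```

   Under these assumptions, a table with exactly 10 manifests in `AUTO` mode is still planned locally, and one with 11 switches to distributed planning.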


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


