dhruv-pratap opened a new pull request, #13903: URL: https://github.com/apache/iceberg/pull/13903
## Summary

This PR introduces distributed scanning for the `PartitionsTable` metadata table to improve performance when querying tables with many manifest files.

## Why is this needed?

Today the partitions metadata table is processed in a single task/thread, which can be limiting for engines like Spark when scanning the partitions metadata of a very large table with a very large number of manifests. In some cases it also causes the process (the Spark driver) to run out of memory. This PR enables parallel processing of partition metadata scanning for large tables with many manifest files, significantly reducing query latency for partition metadata table operations.

## Key Changes

- **New Planning Modes**: Added `LOCAL`, `DISTRIBUTED`, and `AUTO` modes via the `METADATA_PLANNING_MODE` table property
- **Auto-switching**: Automatically uses distributed scanning when the manifest count exceeds a configurable threshold (default: 10)
- **Enhanced PartitionsTable**: Implements `DistributedPartitionsScan` for parallel manifest processing
- **Comprehensive Testing**: Added tests for core functionality and Spark integration (v3.5, v4.0)
- **Backward Compatibility**: Existing behavior preserved with `AUTO` mode as default
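
The auto-switching rule described under Key Changes could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the class `PlanningModeSketch`, its methods `parse` and `resolve`, and the constant name are hypothetical. Only the three mode names and the default threshold of 10 come from the description, and "exceeds" is read here as strictly greater than.

```java
import java.util.Locale;

// Hypothetical sketch: none of these names are taken from the PR's actual code.
final class PlanningModeSketch {

  // The three modes named in the PR description.
  enum Mode { LOCAL, DISTRIBUTED, AUTO }

  // Default manifest-count threshold stated in the PR description.
  static final int DEFAULT_MANIFEST_THRESHOLD = 10;

  private PlanningModeSketch() {}

  // Parses a property value case-insensitively; an unset or blank value
  // falls back to AUTO, which the description names as the default mode.
  static Mode parse(String value) {
    if (value == null || value.trim().isEmpty()) {
      return Mode.AUTO;
    }
    return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }

  // Resolves the configured mode to LOCAL or DISTRIBUTED.
  // An explicit LOCAL or DISTRIBUTED setting is returned unchanged; AUTO
  // picks DISTRIBUTED only when the manifest count exceeds the threshold.
  static Mode resolve(Mode configured, int manifestCount, int threshold) {
    if (configured != Mode.AUTO) {
      return configured;
    }
    return manifestCount > threshold ? Mode.DISTRIBUTED : Mode.LOCAL;
  }
}
```

Under these assumptions, a table with exactly 10 manifests stays on the local path in `AUTO` mode, while one with 11 switches to distributed scanning. Setting the mode explicitly bypasses the manifest count entirely.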
