metanil opened a new pull request, #14277:
URL: https://github.com/apache/iceberg/pull/14277

   ### Enable custom partition field ordering for Spark splits during planning phase
   
   This PR adds the ability to choose which partition field is used to order 
splits during the planning phase, giving finer control over split ordering than 
the default storage (physical) partition field order. 
_**This enables more efficient resource utilization by ensuring that 
partition-specific data or caches are loaded only once, which reduces memory 
(or IO) overhead and improves performance.**_
   
   ### Problem
   By default, Spark sorts and groups splits by the storage order of the 
partition fields (e.g., if the partition fields are A, B, C, as in the storage path 
`s3://bucket/xxx/$dbname.db/$tablename/data/A=a/B=b/C=c/xxx-xxx-xxx.parquet`, 
splits are ordered by A/B/C). This change allows sorting Spark partitions by a 
specific field (e.g., just C or B) rather than by the full storage order.
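   
   The difference between the two orderings can be illustrated with a toy 
model. This is only a sketch: `Split` and `SplitOrderingDemo` are hypothetical 
names used for illustration, not Iceberg or Spark classes.
   
   ```scala
   object SplitOrderingDemo {
     // Hypothetical stand-in for a planned split and its partition values A, B, C.
     final case class Split(a: String, b: String, c: String)

     val splits: Seq[Split] = Seq(
       Split("a2", "b1", "c2"),
       Split("a1", "b2", "c1"),
       Split("a1", "b1", "c2"),
       Split("a2", "b2", "c1")
     )

     // Default behavior: order by the full partition tuple in storage order (A, B, C).
     val byStorageOrder: Seq[Split] = splits.sortBy(s => (s.a, s.b, s.c))

     // Custom behavior: order by a single field, here C.
     // sortBy is stable, so splits with the same C keep their input order.
     val byFieldC: Seq[Split] = splits.sortBy(_.c)

     def main(args: Array[String]): Unit = {
       println(byStorageOrder.map(_.c).mkString(", ")) // prints "c2, c1, c2, c1"
       println(byFieldC.map(_.c).mkString(", "))       // prints "c1, c1, c2, c2"
     }
   }
   ```
   
   In storage order the C values alternate, so anything keyed on C would be 
loaded repeatedly; ordering by C alone makes each C value contiguous.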
   
   ### Solution
   Added two new configuration options:
   - **Read option**: `partition-ordering-field-name` 
   - **Spark SQL property**: 
`spark.sql.iceberg.planning.ordering.partition-field-name`
   
   ### Usage Examples
   
   **Via read option:**
   ```scala
   spark.read
       .option("partition-ordering-field-name", "region")
       .table(s"$namespace.$dbName.$tableName")
       .createTempView(tableName)
   ```
   
   **Via Spark SQL configuration:**
   ```scala
   spark.conf.set("spark.sql.iceberg.planning.ordering.partition-field-name", 
"region")
   ```
   
   ### Implementation Details
   - Added `SPLIT_ORDERING_BY_PARTITIONED_FIELD` constant to `SparkReadOptions` 
and `SparkSQLProperties`
   - Added config method `getSplitOrderingPartitionFieldOptional()` in 
`SparkReadConf`
   - Modified `SparkPartitioningAwareScan` to support custom partition field 
ordering with type-safe comparators
   - Added test suite `TestPartitionFieldOrdering`
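   
   As a rough illustration of what a type-safe comparator with a fallback 
could look like, here is a minimal sketch in plain Scala. It is not the code in 
`SparkPartitioningAwareScan`; `PartitionFieldOrdering`, `orderingFor`, and 
`sortByField` are hypothetical names, and only three value types are handled.
   
   ```scala
   object PartitionFieldOrdering {
     // Hypothetical helper: returns an Ordering only when every value has the same
     // supported runtime type, so values of different types are never compared.
     def orderingFor(values: Seq[Any]): Option[Ordering[Any]] =
       if (values.isEmpty) None
       else if (values.forall(_.isInstanceOf[Int]))
         Some(Ordering.by[Any, Int](_.asInstanceOf[Int]))
       else if (values.forall(_.isInstanceOf[Long]))
         Some(Ordering.by[Any, Long](_.asInstanceOf[Long]))
       else if (values.forall(_.isInstanceOf[String]))
         Some(Ordering.by[Any, String](_.asInstanceOf[String]))
       else None

     // Sorts items by the chosen partition value; if the values have mixed or
     // unsupported types, the input order is returned unchanged.
     def sortByField[T](items: Seq[T])(fieldValue: T => Any): Seq[T] =
       orderingFor(items.map(fieldValue)) match {
         case Some(ord) => items.sortBy(fieldValue)(ord)
         case None      => items
       }
   }
   ```
   
   For example, `PartitionFieldOrdering.sortByField(Seq("us" -> 1, "eu" -> 2))(_._1)` 
orders the pairs by their string key, while a sequence mixing `Int` and `String` 
keys is returned as-is.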
   
   ### Limitations
   1. **Mutually exclusive with SPJ**: This feature is disabled when 
`preserve-data-grouping` (used for storage-partitioned joins) is enabled
   2. **Partition field must exist**: The specified partition field name must 
exist in the partition schema
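   
   The two limitations can be read as a guard evaluated before any custom 
ordering is applied. The sketch below is an assumption about the shape of that 
check, not the PR's code: `OrderingEligibility` and `orderingField` are 
hypothetical, and returning `None` for an unknown field is a choice made here 
for illustration (the actual implementation may report an error instead).
   
   ```scala
   object OrderingEligibility {
     // Hypothetical guard.
     //  - requested: the configured partition field name, if any
     //  - preserveDataGrouping: mirrors the `preserve-data-grouping` setting
     //  - partitionFieldNames: field names present in the partition schema
     // Returns the field to order by, or None when custom ordering does not apply.
     def orderingField(
         requested: Option[String],
         preserveDataGrouping: Boolean,
         partitionFieldNames: Set[String]): Option[String] =
       requested.filter { name =>
         !preserveDataGrouping && partitionFieldNames.contains(name)
       }
   }
   ```
   
   With `Some("region")`, grouping disabled, and `region` in the partition 
schema, this yields `Some("region")`; enabling grouping or requesting an 
unknown field yields `None`.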
   
   ### Notes
   - Falls back gracefully to the default behavior if partition field types are 
inconsistent or an error occurs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

