anuragmantri opened a new pull request, #16750:
URL: https://github.com/apache/iceberg/pull/16750

   This PR depends on #14948 
   
   This PR implements the Spark DSv2 SupportsReportOrdering API so that Iceberg 
scans can report their sort order to Spark. This lets Spark eliminate redundant 
sorts when reading partitioned Iceberg tables that have a defined sort order 
and whose files were written respecting that order.
   
   Sort order reporting can be enabled with:
     
   ```sql
   SET spark.sql.iceberg.planning.preserve-data-ordering = true; -- default: false
   ```
   
   Implementation summary:
   
   1. SortOrderAnalyzer validates two conditions before 
SparkPartitioningAwareScan.outputOrdering() reports ordering to Spark:
       - all files carry the current sort order ID
       - each grouping key maps to exactly one task group (bin-packing must 
not split partitions)
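
The two checks above can be sketched in a few lines of Python. This is only an illustration of the logic, not the SortOrderAnalyzer code from this PR: `FileInfo`, `can_report_ordering`, and the shape of `task_groups` (a list of lists of files) are hypothetical names chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for a data file planned into a scan."""
    path: str
    partition: tuple      # grouping key of the file
    sort_order_id: int    # sort order the file was written with


def can_report_ordering(task_groups, current_sort_order_id):
    """Return True only if both conditions hold for the planned task groups."""
    all_files = [f for group in task_groups for f in group]

    # Condition 1: every file carries the table's current sort order ID.
    if any(f.sort_order_id != current_sort_order_id for f in all_files):
        return False

    # Condition 2: each grouping key maps to exactly one task group,
    # i.e. no partition was split across groups by bin-packing.
    groups_per_key = Counter()
    for group in task_groups:
        for key in {f.partition for f in group}:
            groups_per_key[key] += 1
    return all(count == 1 for count in groups_per_key.values())
```

If either check fails, the sketch returns `False`, which corresponds to not reporting any ordering to Spark.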
   
   2. Merging sorted files: when ordering is reported, rows from multiple 
sorted files within a partition are combined with a k-way merge by 
MergingSortedRowDataReader, which is added in a separate PR (#14948). The 
plumbing for the merging reader (SparkRowReaderFactory, SparkBatch) is 
included here.
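
For readers unfamiliar with k-way merging, here is a minimal Python sketch of the idea using the standard library's `heapq.merge`. It is not the MergingSortedRowDataReader implementation; `merge_sorted_files` and the sample rows are hypothetical, and each "file" is assumed to be already sorted by the sort key.

```python
import heapq
from operator import itemgetter


def merge_sorted_files(files, sort_key):
    """Lazily merge rows from several individually sorted inputs."""
    # heapq.merge keeps one cursor per input on a heap, so rows are
    # produced one at a time without loading every file into memory.
    return heapq.merge(*files, key=sort_key)


# Three hypothetical files in one partition, each sorted by "id".
file_a = [{"id": 1, "v": "a"}, {"id": 4, "v": "d"}]
file_b = [{"id": 2, "v": "b"}, {"id": 5, "v": "e"}]
file_c = [{"id": 3, "v": "c"}, {"id": 6, "v": "f"}]

merged = list(merge_sorted_files([file_a, file_b, file_c], itemgetter("id")))
```

Because the merge consumes and emits one row at a time, it is inherently row-based.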
   
   Constraints:
   
     1. When `preserve-data-ordering` is enabled, bin-packing of large partitions
        is disabled. All files within a partition are placed into a single Spark
        task. This is a known limitation of the current KeyGroupedPartitioning
        approach and is expected to be addressed in
        [SPARK-56241](https://issues.apache.org/jira/browse/SPARK-56241).
     2. Vectorized reads are disabled for partitions with more than one file
        since k-way merge is row-based.
     3. This implementation only reports sort order if files are sorted in the
        current table sort order.
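
The first two constraints can be illustrated with a short Python sketch. It is only a model of the behavior described above, not the planning code in this PR: `plan_ordered_tasks` and the `reader` labels are hypothetical, and treating single-file partitions as still eligible for vectorized reads is an assumption inferred from constraint 2.

```python
from collections import defaultdict


def plan_ordered_tasks(files):
    """Group (partition_key, path) pairs into one task per partition."""
    by_partition = defaultdict(list)
    for key, path in files:
        by_partition[key].append(path)

    tasks = []
    for key, paths in by_partition.items():
        # One task per partition: files are never split across tasks.
        # More than one file means a row-based k-way merge is needed,
        # so vectorized reads are not used for that task.
        reader = "vectorized" if len(paths) == 1 else "row_merge"
        tasks.append({"partition": key, "files": paths, "reader": reader})
    return tasks
```

In this model a partition with many large files still becomes a single task, which is the cost of disabling bin-packing.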
   
   Depends on #14948 for MergingSortedRowDataReader.
   
   AI Usage: I used Claude Opus 4.6 for code generation and writing tests. I 
manually reviewed the generated code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

