twuebi opened a new pull request, #1184:
URL: https://github.com/apache/iceberg-go/pull/1184

   The per-file sort_order_id declares that a file's rows are *fully* sorted by 
that order; a missing id means unsorted 
(https://iceberg.apache.org/spec/#sorting). Since #875 every writer stamped the 
table's default sort order id unconditionally, but #1157's sort-on-write only 
sorts each record batch individually, so the claim was false for any 
multi-batch file. Position delete files were stamped too, which the spec 
forbids outright: their order is (file_path, pos), and writers must set the 
field to null (https://iceberg.apache.org/spec/#data-file-fields). Unconditional 
stamping also contradicts the community consensus: Spark stamps the id *only* 
when its distribution+ordering globally sorted the whole task input, and never 
infers it from the table default (https://github.com/apache/iceberg/pull/15150).
   
   This change makes writers claim nothing by default: data files, eq-delete 
files, pos-delete files, and registered files (add_files) leave sort_order_id 
absent, matching PyIceberg and iceberg-rust. The per-batch sort is kept as a 
pure layout optimization (tighter page statistics, better encodings). 
WriteTask.SortOrderID remains as an explicit caller claim for producers that 
guarantee fully sorted batches, mirroring how Java/Spark stamps the id only 
when the engine enforced the ordering. Leaving the field absent is fully 
spec-compliant: the id was always advisory and was never assumed to apply to 
every file (https://github.com/apache/iceberg/issues/317). The only incorrect 
behavior is stamping a claim that isn't true, because a reader could trust it 
and wrongly skip a re-sort.
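
   The "claim nothing by default" rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `fileContent`, `resolveSortOrderID`, and the `*int` representation of an optional id are all hypothetical, and the position-delete branch simply mirrors the spec rule quoted above rather than any confirmed behavior of iceberg-go.

```go
package main

import "fmt"

// fileContent is a hypothetical enum for the kind of file being written.
type fileContent int

const (
	contentData fileContent = iota
	contentPosDeletes
	contentEqDeletes
)

// resolveSortOrderID returns the id to stamp on a file, or nil to leave
// the field absent. Nothing is inferred from the table default: the id
// is set only when the caller explicitly claims a fully sorted file.
func resolveSortOrderID(content fileContent, claimed *int) *int {
	if content == contentPosDeletes {
		// Assumption: position deletes never carry an id, per the spec.
		return nil
	}
	return claimed
}

// describe renders an optional id for printing.
func describe(id *int) string {
	if id == nil {
		return "absent"
	}
	return fmt.Sprint(*id)
}

func main() {
	claimed := 1
	fmt.Println(describe(resolveSortOrderID(contentData, nil)))            // absent
	fmt.Println(describe(resolveSortOrderID(contentData, &claimed)))       // 1
	fmt.Println(describe(resolveSortOrderID(contentPosDeletes, &claimed))) // absent
}
```

   Keeping the decision in one place makes the default visibly "absent"; the only way to stamp an id is for a producer to pass one in.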
   
   References: https://github.com/apache/iceberg/issues/13634, 
https://github.com/apache/iceberg/pull/15150, 
https://github.com/apache/iceberg/issues/317, 
https://iceberg.apache.org/spec/#sorting


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

