twuebi opened a new pull request, #1184: URL: https://github.com/apache/iceberg-go/pull/1184
The per-file sort_order_id declares that a file's rows are *fully* sorted by that order; a missing id means the file is unsorted (https://iceberg.apache.org/spec/#sorting). Since #875, every writer has stamped the table's default sort order id unconditionally, but #1157's sort-on-write only sorts each record batch individually, so the claim was false for any multi-batch file. Position delete files were stamped too, which the spec forbids outright: their order is (file_path, pos) and writers must set the field to null (https://iceberg.apache.org/spec/#data-file-fields).

Unconditional stamping also contradicts the community consensus: Spark stamps the id *only* when its distribution+ordering globally sorted the whole task input, and never infers it from the table default (https://github.com/apache/iceberg/pull/15150).

This change makes writers claim nothing by default: data files, eq-delete files, pos-delete files, and registered files (add_files) leave sort_order_id absent, matching PyIceberg and iceberg-rust. The per-batch sort is kept as a pure layout optimization (tighter page statistics, better encodings). WriteTask.SortOrderID remains as an explicit caller claim for producers that guarantee fully sorted batches, mirroring how Java/Spark stamps the id only when the engine enforced the ordering.

Leaving the field absent is fully spec-compliant: it was always advisory and was never assumed to be applied to all files (https://github.com/apache/iceberg/issues/317). The only incorrect behavior is stamping a claim that isn't true, because a reader could trust it and wrongly skip a re-sort.

References:
- https://github.com/apache/iceberg/issues/13634
- https://github.com/apache/iceberg/pull/15150
- https://github.com/apache/iceberg/issues/317
- https://iceberg.apache.org/spec/#sorting
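To illustrate why a per-batch sort does not justify a file-level sort_order_id claim, here is a minimal, standalone sketch. It does not use the iceberg-go API: plain `[]int` slices stand in for record batches, and the helper names `sortEachBatch` and `isFullySorted` are hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// sortEachBatch sorts every batch in place, independently of the others.
// This models a per-batch sort-on-write (an assumption for illustration).
func sortEachBatch(batches [][]int) {
	for _, b := range batches {
		sort.Ints(b)
	}
}

// isFullySorted reports whether the rows are in non-decreasing order when
// the batches are read back to back, as they would be laid out in one file.
func isFullySorted(batches [][]int) bool {
	first := true
	prev := 0
	for _, b := range batches {
		for _, v := range b {
			if !first && v < prev {
				return false
			}
			prev, first = v, false
		}
	}
	return true
}

func main() {
	// Two batches written to the same file: each is sorted on its own,
	// but the second batch starts below the end of the first.
	multi := [][]int{{3, 1, 2}, {0, 5, 4}}
	sortEachBatch(multi)
	fmt.Println(multi, isFullySorted(multi))

	// A single-batch file is fully sorted after the same per-batch sort.
	single := [][]int{{3, 1, 2}}
	sortEachBatch(single)
	fmt.Println(single, isFullySorted(single))
}
```

In this sketch the multi-batch input is not fully sorted after the per-batch sort, while the single-batch input is, which is the case distinction the description draws when it says the stamped claim was false for any multi-batch file.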
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
