EnricoMi opened a new pull request, #44470: URL: https://github.com/apache/arrow/pull/44470
### Rationale for this change

The order of rows in a dataset might be important for users and should be preserved when writing to a filesystem. With multi-threaded writes, the order is currently not guaranteed.

### What changes are included in this PR?

Preserving the dataset order of rows requires the `SourceNode` to use `ImplicitOrdering` (this gives exec batches an index), and the `ConsumingSinkNode` to sequence exec batches (preserve the order of batches by their index).

User-facing changes:
- Add option `preserve_order` to `FileSystemDatasetWriteOptions`

Dev-facing changes:
- Add option `ordering` to `SourceNodeOptions`
- Add option `implicit_ordering` to `ScanNodeOptions`

The default behaviour is the current behaviour.

### Are these changes tested?

Unit tests have been added.

### Are there any user-facing changes?

Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) / `pyarrow.dataset.write_dataset(..., preserve_order=True)` (Python).
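
To illustrate the idea of indexing batches at the source and sequencing them at the sink, here is a minimal conceptual sketch in plain Python. It is not Arrow's implementation: `BatchSequencer` and `write_in_order` are hypothetical names invented for this example, and the shuffled thread pool merely stands in for a multi-threaded execution plan that delivers batches out of order.

```python
import heapq
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class BatchSequencer:
    """Hypothetical helper: releases indexed batches to a sink in index order."""

    def __init__(self, sink):
        self._sink = sink        # callable that receives each batch, in order
        self._next_index = 0     # index of the next batch the sink may receive
        self._pending = []       # min-heap of (index, batch) that arrived early
        self._lock = threading.Lock()

    def submit(self, index, batch):
        with self._lock:
            heapq.heappush(self._pending, (index, batch))
            # Emit every buffered batch that is contiguous with what was
            # already handed to the sink; later batches stay buffered.
            while self._pending and self._pending[0][0] == self._next_index:
                _, ready = heapq.heappop(self._pending)
                self._sink(ready)
                self._next_index += 1


def write_in_order(batches, workers=4, seed=0):
    """Simulate out-of-order delivery and return batches as the sink saw them."""
    written = []
    sequencer = BatchSequencer(written.append)

    # The "source" tags each batch with its position (the implicit ordering).
    indexed = list(enumerate(batches))
    # Shuffle to mimic batches finishing in arbitrary order across threads.
    random.Random(seed).shuffle(indexed)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, batch in indexed:
            pool.submit(sequencer.submit, index, batch)
    # Leaving the "with" block waits for all submitted tasks to finish.
    return written
```

Calling `write_in_order([[1, 2], [3], [4, 5, 6]])` returns the batches in their original order regardless of the order in which the worker threads deliver them. The trade-off this sketch makes visible is that batches arriving early must be buffered until their predecessors show up, which is one plausible reason for keeping order preservation opt-in.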
