[PR] GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded [arrow]

via GitHub Fri, 18 Oct 2024 04:16:08 -0700


EnricoMi opened a new pull request, #44470:
URL: https://github.com/apache/arrow/pull/44470


   ### Rationale for this change
   The order of rows in a dataset might be important for users and should be 
preserved when writing to a filesystem. With multi-threaded writes, the order is 
currently not guaranteed.
   
   ### What changes are included in this PR?
   Preserving the dataset order of rows requires the `SourceNode` to use 
`ImplicitOrdering` (this gives exec batches an index), and the 
`ConsumingSinkNode` to sequence exec batches (preserve order of batches by 
their index).
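
As a rough illustration of the sequencing idea described above (not the actual C++ implementation in this PR), the following minimal Python sketch assumes each batch arrives tagged with the index assigned by the source. The name `sequence_batches` and the in-memory `pending` buffer are hypothetical, chosen only for this example.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def sequence_batches(indexed: Iterable[Tuple[int, T]]) -> Iterator[T]:
    """Yield batches in index order, holding back any that arrive early.

    Hypothetical helper: `indexed` yields (index, batch) pairs in arbitrary
    arrival order, with indices assumed to be unique and to start at 0.
    """
    pending: Dict[int, T] = {}  # batches that arrived before their turn
    next_index = 0              # index of the next batch to emit
    for index, batch in indexed:
        pending[index] = batch
        # Emit every batch that is now contiguous with what was already emitted.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


# Batches finishing out of order, as might happen with several worker threads.
arrived = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
ordered = list(sequence_batches(arrived))
```

The trade-off in this sketch is that a late batch forces all later-indexed batches to be buffered until it arrives, which is one plausible reason to keep such behaviour opt-in.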
   
   User-facing changes:
   - Add option `preserve_order` to `FileSystemDatasetWriteOptions`
   
   Dev-facing changes:
   - Add option `ordering` to `SourceNodeOptions`
   - Add option `implicit_ordering` to `ScanNodeOptions`
   
   Default behaviour is current behaviour.
   
   ### Are these changes tested?
   Unit tests have been added.
   
   ### Are there any user-facing changes?
   Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) / 
`pyarrow.dataset.write_dataset(..., preserve_order=True)` (Python).
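
A minimal Python usage sketch, assuming a pyarrow build that includes this PR (older releases would reject the `preserve_order` keyword). The wrapper name `write_in_order` and the choice of Parquet as the output format are assumptions made for illustration only.

```python
def write_in_order(table, base_dir):
    """Write `table` to `base_dir`, asking the writer to keep row order.

    Hypothetical wrapper: requires a pyarrow version that provides the
    `preserve_order` option described in this PR.
    """
    # Imported lazily so this sketch can be loaded without pyarrow installed.
    import pyarrow.dataset as ds

    ds.write_dataset(
        table,
        base_dir,
        format="parquet",     # assumption: any supported format should work
        preserve_order=True,  # option added by this PR; off by default
    )
```

Since the PR states that the default behaviour is unchanged, callers who do not pass `preserve_order=True` should see the same behaviour as before.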


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

