devinjdangelo opened a new issue, #7536: URL: https://github.com/apache/arrow-datafusion/issues/7536
### Describe the bug

The initial implementation of #7452 intended to preserve the ordering of rows in CSV/JSON files when a user runs a query like:

```SQL
COPY (select * from my_table order by my_col) TO my_file.csv
```

It is reasonable to expect that the CSV should be ordered by `my_col`. When this function:

https://github.com/apache/arrow-datafusion/blob/561e0d7e87825aba224bf2eb9c3b8b5e1b725597/datafusion/core/src/datasource/file_format/write.rs#L310-L393

was updated to include the `mpsc::channel`, I believe we lost the guarantee of preserving the expected file order. The channel is nice since it introduces backpressure and ensures memory requirements do not grow without bound if ObjectStore writes are falling behind, but I am not sure how to preserve the ordering of serialized RecordBatches in the channel construct.

### To Reproduce

I have not yet verified a specific case where order is not preserved (TODO), but I don't see any reason why it is guaranteed, since the channel does not preserve ordering (it will depend on how tokio schedules the tasks).

### Expected behavior

We should guarantee that file ordering is preserved regardless of parallelization.

### Additional context

_No response_
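One possible way to keep both backpressure and ordering is to send task handles through the bounded channel in input order, rather than sending finished results, and have the writer await them in that same order. The following is a minimal sketch of that idea and not the DataFusion implementation: it uses standard-library threads and `std::sync::mpsc::sync_channel` instead of tokio so that it is self-contained, and `serialize_batch`, `write_ordered`, and `max_in_flight` are hypothetical names introduced only for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for serializing one RecordBatch to CSV bytes.
fn serialize_batch(rows: Vec<String>) -> Vec<u8> {
    let mut out = Vec::new();
    for row in rows {
        out.extend_from_slice(row.as_bytes());
        out.push(b'\n');
    }
    out
}

/// Serializes batches in parallel but concatenates the output in input order.
/// (Assumption: an in-memory Vec<u8> stands in for the ObjectStore writer.)
fn write_ordered(batches: Vec<Vec<String>>, max_in_flight: usize) -> Vec<u8> {
    // Bounded channel of *handles*: once `max_in_flight` handles are queued,
    // the producer blocks in `send`, which limits how much work piles up.
    let (tx, rx) = sync_channel::<JoinHandle<Vec<u8>>>(max_in_flight);

    let producer = thread::spawn(move || {
        for batch in batches {
            // Start serialization, then enqueue the handle in input order.
            let handle = thread::spawn(move || serialize_batch(batch));
            if tx.send(handle).is_err() {
                break; // the receiving side went away
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let mut file = Vec::new();
    // Handles arrive in the order they were sent; joining each one in turn
    // yields the serialized bytes in input order, whichever task finishes first.
    for handle in rx {
        let bytes = handle.join().expect("serialization task panicked");
        file.extend_from_slice(&bytes);
    }
    producer.join().expect("producer thread panicked");
    file
}
```

Calling `write_ordered(batches, 2)` on batches that are already sorted by `my_col` should yield bytes in that same order, because only the serialization work runs concurrently while the append step stays sequential. In a tokio-based version the same shape would presumably use a bounded `tokio::sync::mpsc::channel` carrying `tokio::task::JoinHandle` values, though that is an assumption and not something verified against the linked code.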
