[GitHub] [arrow-datafusion] devinjdangelo opened a new issue, #7536: CSV/JSON Writes Not Guaranteed to Preserve Expected Ordering

via GitHub Tue, 12 Sep 2023 17:56:26 -0700


devinjdangelo opened a new issue, #7536:
URL: https://github.com/apache/arrow-datafusion/issues/7536


   ### Describe the bug
   
   The initial implementation of #7452 was intended to preserve the ordering of 
rows in CSV/JSON files when a user runs a query like:
   
   ```SQL
   COPY (select * from my_table order by my_col)
   TO my_file.csv
   ```
   
   It is reasonable to expect that the CSV should be ordered by my_col. When 
this function: 
https://github.com/apache/arrow-datafusion/blob/561e0d7e87825aba224bf2eb9c3b8b5e1b725597/datafusion/core/src/datasource/file_format/write.rs#L310-L393
   
   was updated to include the `mpsc::channel`, I believe we lost the guarantee 
of preserving the expected file order. The channel is useful because it 
introduces backpressure and keeps memory requirements bounded when 
ObjectStore writes fall behind, but I am not sure how to preserve the 
ordering of serialized RecordBatches with the channel construct.
   
   ### To Reproduce
   
   I have not yet verified a specific case where order is not preserved (TODO), 
but I don't see any reason why it is guaranteed, since the channel does not 
preserve ordering across the concurrent serialization tasks that send into it 
(the arrival order will depend on how tokio schedules the tasks).
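
As a contrived illustration of the concern (not a reproduction against DataFusion itself), the sketch below uses `std::thread` and `std::sync::mpsc` in place of tokio tasks and the real writer. The function name `arrival_order` and the handshake channel are hypothetical. Two producers share one bounded channel, and the producer for batch 0 is forced to send after batch 1, standing in for a scheduler that happens to finish the later batch first.

```rust
use std::sync::mpsc;
use std::thread;

/// Two "serialization" threads send batch indices into one shared bounded
/// channel. Batch 0 is deliberately delayed until batch 1 has been sent,
/// mimicking a scheduler that completes the later batch first.
fn arrival_order() -> Vec<usize> {
    // Shared bounded channel (assumption: capacity 2 is enough for this demo).
    let (tx, rx) = mpsc::sync_channel::<usize>(2);
    // Hypothetical handshake used only to make the interleaving deterministic.
    let (sent_tx, sent_rx) = mpsc::channel::<()>();

    let tx_slow = tx.clone();
    let slow = thread::spawn(move || {
        // Wait until the other producer has already sent its batch.
        sent_rx.recv().unwrap();
        tx_slow.send(0).unwrap();
    });
    let fast = thread::spawn(move || {
        tx.send(1).unwrap();
        sent_tx.send(()).unwrap();
    });

    // The receiver sees batches in send order, not in submission order.
    let received: Vec<usize> = rx.iter().collect();
    slow.join().unwrap();
    fast.join().unwrap();
    received
}
```

Here the receiver observes batch 1 before batch 0. The channel itself is first-in, first-out; the reordering comes from which producer reaches `send` first.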
   
   ### Expected behavior
   
   We should guarantee that file ordering is preserved regardless of 
parallelization. 
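
One possible way to keep both backpressure and ordering, sketched here with `std::thread` rather than tokio and with hypothetical names (`serialize`, `write_in_order`), is to send a handle for each in-flight serialization through the bounded channel in submission order, and have the writer wait on the handles in the order it receives them. This is only a sketch under those assumptions, not the DataFusion implementation.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for serializing one RecordBatch to bytes.
fn serialize(batch_index: usize, delay_ms: u64) -> Vec<u8> {
    thread::sleep(Duration::from_millis(delay_ms));
    format!("batch-{batch_index}\n").into_bytes()
}

/// Serializes batches in parallel but appends them to the output in
/// submission order.
fn write_in_order(num_batches: usize) -> String {
    // The bounded channel carries handles, not finished bytes. A full
    // channel blocks the producer, which limits how much work is in flight.
    let (tx, rx) = mpsc::sync_channel::<JoinHandle<Vec<u8>>>(2);

    let producer = thread::spawn(move || {
        for i in 0..num_batches {
            // Earlier batches take longer, so they tend to finish last.
            let delay_ms = 5 * (num_batches - i) as u64;
            let handle = thread::spawn(move || serialize(i, delay_ms));
            // Handles are sent in submission order from a single producer.
            tx.send(handle).unwrap();
        }
    });

    // Stand-in for the output file: joining handles in receive order
    // restores submission order regardless of which batch finished first.
    let mut file = Vec::new();
    for handle in rx {
        file.extend(handle.join().unwrap());
    }
    producer.join().unwrap();
    String::from_utf8(file).unwrap()
}
```

Because only one producer sends into the channel, the handles arrive in submission order, and the writer never appends batch N+1 before batch N even when batch N+1 finishes serializing first.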
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

