devinjdangelo commented on PR #6987:
URL: https://github.com/apache/arrow-datafusion/pull/6987#issuecomment-1645469122
No worries, @alamb. Today is better anyway since I just pushed up changes to
support multipart incremental uploads instead of always buffering the entire
file. For parquet, it was relatively straightforward to use `AsyncArrowWriter`
and pass it the appropriate async multipart writer. For JSON lines and CSV, I
initially tried an implementation that relied on passing ownership of a buffer
around and reinitializing the writer as needed, i.e. something roughly like this:
```rust
let mut buffer = Vec::new();
while let Some(next_batch) = stream.next().await {
    let batch = next_batch?;
    // Recreate the sync writer each iteration, handing it ownership of the buffer
    let mut writer = csv::Writer::new(buffer);
    writer.write(&batch)?;
    // Take the buffer back so its contents can be uploaded
    buffer = writer.into_inner();
    multipart_writer.write_all(&buffer).await?;
    buffer.clear();
}
```
This approach mostly worked, except that recreating a `RecordBatchWriter` can in general have side effects, such as writing the CSV header to the file multiple times. To get around that issue, I followed an approach similar to `AsyncArrowWriter`, but made it generic so it can work with any `RecordBatchWriter`.
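
Roughly, the idea is the same trick `AsyncArrowWriter` uses: give the synchronous writer a shared, cloneable in-memory buffer so it only has to be constructed once (and therefore only writes its header once), and drain that buffer into the async multipart writer after each batch. Below is a minimal sketch of that pattern for CSV; `SharedBuffer` and `stream_csv` are illustrative names rather than the exact code in this PR, and it assumes `tokio`, `futures`, and the arrow CSV writer are available:
```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

use arrow::csv;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::{Stream, StreamExt};
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Cloneable in-memory buffer: the synchronous CSV writer owns one clone and
/// appends to it via `std::io::Write`; async code drains the other clone.
#[derive(Clone, Default)]
struct SharedBuffer(Arc<Mutex<Vec<u8>>>);

impl Write for SharedBuffer {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

/// Stream record batches through a CSV writer that is constructed exactly once,
/// flushing the shared buffer to an async multipart writer after each batch.
async fn stream_csv<S, M>(mut batches: S, multipart_writer: &mut M) -> Result<(), ArrowError>
where
    S: Stream<Item = Result<RecordBatch, ArrowError>> + Unpin,
    M: AsyncWrite + Unpin,
{
    let shared = SharedBuffer::default();
    // Constructed once, so the CSV header is emitted only once.
    let mut writer = csv::Writer::new(shared.clone());

    while let Some(next_batch) = batches.next().await {
        let batch = next_batch?;
        writer.write(&batch)?;

        // Hand whatever the sync writer has produced so far to the async upload.
        let bytes = std::mem::take(&mut *shared.0.lock().unwrap());
        if !bytes.is_empty() {
            multipart_writer
                .write_all(&bytes)
                .await
                .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;
        }
    }

    // Dropping the writer flushes any bytes still held in its internal buffer;
    // drain those before the caller completes the multipart upload.
    drop(writer);
    let bytes = std::mem::take(&mut *shared.0.lock().unwrap());
    if !bytes.is_empty() {
        multipart_writer
            .write_all(&bytes)
            .await
            .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;
    }
    Ok(())
}
```
The same loop shape should work for the JSON lines writer as well, since nothing here depends on CSV beyond the `csv::Writer::new` call.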