devinjdangelo commented on PR #6987:
URL: https://github.com/apache/arrow-datafusion/pull/6987#issuecomment-1645469122
No worries, @alamb. Today is better anyway since I just pushed up changes to
support multipart incremental uploads instead of always buffering the entire
file. For parquet, it was relatively straightforward to use `AsyncArrowWriter`
and pass it the appropriate async multipart writer. For JSON lines and CSV, I
initially tried an implementation that relied on passing ownership of a buffer
around and reinitializing the writer as needed, i.e. something roughly like this:
```rust
let mut buffer = Vec::new();
while let Some(next_batch) = stream.next().await {
    let batch = next_batch?;
    // Recreate the sync writer each iteration, handing it ownership of the buffer
    let mut writer = csv::Writer::new(buffer);
    writer.write(&batch)?;
    // Take the buffer back so its contents can be uploaded
    buffer = writer.into_inner();
    multipart_writer.write_all(&buffer).await?;
    buffer.clear();
}
```
This approach mostly worked, except that recreating a `RecordBatchWriter` can in general have side effects, such as writing the CSV header to the file multiple times. To get around that issue, I followed an approach similar to `AsyncArrowWriter`, but made it generic so it can work with any `RecordBatchWriter`.
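
Roughly, the idea is the same trick `AsyncArrowWriter` uses: give the synchronous writer a shared, cloneable in-memory buffer so it only has to be constructed once (and therefore only writes its header once), and drain that buffer into the async multipart writer after each batch. Below is a minimal sketch of that pattern for CSV; `SharedBuffer` and `stream_csv` are illustrative names rather than the exact code in this PR, and it assumes `tokio`, `futures`, and the arrow CSV writer are available:
```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

use arrow::csv;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::{Stream, StreamExt};
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Cloneable in-memory buffer: the synchronous CSV writer owns one clone and
/// appends to it via `std::io::Write`; async code drains the other clone.
#[derive(Clone, Default)]
struct SharedBuffer(Arc<Mutex<Vec<u8>>>);

impl Write for SharedBuffer {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

/// Stream record batches through a CSV writer that is constructed exactly once,
/// flushing the shared buffer to an async multipart writer after each batch.
async fn stream_csv<S, M>(mut batches: S, multipart_writer: &mut M) -> Result<(), ArrowError>
where
    S: Stream<Item = Result<RecordBatch, ArrowError>> + Unpin,
    M: AsyncWrite + Unpin,
{
    let shared = SharedBuffer::default();
    // Constructed once, so the CSV header is emitted only once.
    let mut writer = csv::Writer::new(shared.clone());

    while let Some(next_batch) = batches.next().await {
        let batch = next_batch?;
        writer.write(&batch)?;

        // Hand whatever the sync writer has produced so far to the async upload.
        let bytes = std::mem::take(&mut *shared.0.lock().unwrap());
        if !bytes.is_empty() {
            multipart_writer
                .write_all(&bytes)
                .await
                .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;
        }
    }

    // Dropping the writer flushes any bytes still held in its internal buffer;
    // drain those before the caller completes the multipart upload.
    drop(writer);
    let bytes = std::mem::take(&mut *shared.0.lock().unwrap());
    if !bytes.is_empty() {
        multipart_writer
            .write_all(&bytes)
            .await
            .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;
    }
    Ok(())
}
```
The same loop shape should work for the JSON lines writer as well, since nothing here depends on CSV beyond the `csv::Writer::new` call.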