DDtKey opened a new issue, #5450:
URL: https://github.com/apache/arrow-rs/issues/5450
`AsyncArrowWriter` created with the default `WriterProperties` gets a
config of `DEFAULT_MAX_ROW_GROUP_SIZE = 1024 * 1024`, which means the
inner `ArrowWriter` won't flush until that many rows have been buffered.
This leads to enormous memory consumption, because the writer caches up
to `DEFAULT_MAX_ROW_GROUP_SIZE` rows (1,048,576 by default) and ignores
the `buffer_capacity` config entirely, since the flush condition of the
sync writer is:
```rust
if in_progress.buffered_rows >= self.max_row_group_size {
    self.flush()?
}
```
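Since row count is the only flush trigger, one interim mitigation (a
sketch, not a fix; it relies on the `WriterProperties` builder, and the
4096-row limit is an arbitrary illustrative value) is to lower
`max_row_group_size` so flushes happen sooner, at the cost of more,
smaller row groups in the output file:

```rust
use parquet::file::properties::WriterProperties;

// Force smaller row groups so the inner writer flushes more often.
// 4096 rows is an arbitrary illustrative value, not a recommendation.
let props = WriterProperties::builder()
    .set_max_row_group_size(4096)
    .build();
// Pass `Some(props)` to `AsyncArrowWriter::try_new` when constructing
// the writer.
```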
**To Reproduce**
Try writing many large rows to Parquet with `AsyncArrowWriter`; you will
see that memory consumption does not respect the buffer size.
In my case it was 10 GB of data (900k records) arriving as a stream and
being written to Parquet, and it consumed roughly 10 GB of memory.
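
For reference, a minimal reproduction sketch (the file name, schema, and
record sizes are hypothetical; it assumes the `parquet` crate's `async`
feature, `tokio`, and the `try_new(writer, schema, buffer_size, props)`
signature current at the time of this issue):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::AsyncArrowWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("payload", DataType::Utf8, false),
    ]));
    let file = tokio::fs::File::create("repro.parquet").await?;

    // A 1 MiB buffer is requested, but rows keep accumulating in memory
    // until DEFAULT_MAX_ROW_GROUP_SIZE (1024 * 1024 rows) is reached.
    let mut writer =
        AsyncArrowWriter::try_new(file, schema.clone(), 1024 * 1024, None)?;

    for _ in 0..900_000 {
        // ~10 KB per record, ~9-10 GB in total, mirroring the case above.
        let col: ArrayRef = Arc::new(StringArray::from_iter_values(
            std::iter::once("x".repeat(10_000)),
        ));
        let batch = RecordBatch::try_new(schema.clone(), vec![col])?;
        writer.write(&batch).await?;
    }
    writer.close().await?;
    Ok(())
}
```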
**Expected behavior**
Ideally, it should respect the buffer config, i.e. flush when either the
buffer size or the max row group size is reached.
But even if the current behavior is intended for some reason, the
documentation should clearly highlight it.
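
A sketch of the expected combined condition (hypothetical; `buffer` and
`buffer_capacity` are assumed names for the async writer's internal byte
buffer and its configured limit, following the field naming in the
snippet above):

```rust
// Flush whenever either limit is hit, not only the row-count one.
if in_progress.buffered_rows >= self.max_row_group_size
    || self.buffer.len() >= self.buffer_capacity
{
    self.flush()?
}
```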
**Additional context**
By the way, why is the default `1024 * 1024`? It reads as if it were in
byte units rather than rows.