mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611458030
@jorisvandenbossche
```c++
Status NewRowGroup(int64_t chunk_size) override {
  if (row_group_writer_ != nullptr) {
    PARQUET_CATCH_NOT_OK(row_group_writer_->Close());
  }
  PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup());
  return Status::OK();
}
```
`parquet::arrow::FileWriterImpl::WriteTable` splits the input table into
chunks according to the user's row-group size and calls `NewRowGroup` for
every chunk, which first closes the previous row group and then calls
`AppendRowGroup` to create a non-buffered row-group writer for that chunk.
`parquet::arrow::FileWriterImpl::WriteRecordBatch` will close the previous
row group if it is not a "buffered" row group, and then append the
`RecordBatch` to the buffered row group.
So:
1. If only `WriteTable` is called, every call will create **at least one**
new row group.
2. If only `WriteRecordBatch` is called, a call might not create a new row
group (after the first call, it keeps appending to the open buffered row group).
3. If `WriteRecordBatch` is called after `WriteTable`, it will find that
there is no in-flight row group, so it will create a buffered row group and
write to it.
4. If `WriteTable` is called after `WriteRecordBatch`, it will find that
there is an in-flight row group, so it will close the buffered row group,
create a non-buffered one, and write to it.
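The four cases above boil down to a small state machine: the writer holds at most one open row group, which is either buffered or non-buffered. A toy model of that state machine (illustrative only; `ToyWriter` and its members are made-up names, not the Arrow API):

```c++
#include <cassert>

// The writer's current in-flight row group, if any.
enum class RowGroup { kNone, kBuffered, kUnbuffered };

struct ToyWriter {
  RowGroup current = RowGroup::kNone;
  int row_groups_created = 0;

  // Mirrors WriteTable with one chunk: always closes any open row group
  // and appends a fresh non-buffered one (case 1 and case 4).
  void WriteTable() {
    current = RowGroup::kUnbuffered;
    ++row_groups_created;
  }

  // Mirrors WriteRecordBatch: if the open row group is not buffered (or
  // none is open), close it and open a buffered one (case 3); otherwise
  // keep appending to the existing buffered row group (case 2).
  void WriteRecordBatch() {
    if (current != RowGroup::kBuffered) {
      current = RowGroup::kBuffered;
      ++row_groups_created;
    }
  }
};
```

Running the mixed sequence `WriteTable, WriteTable, WriteRecordBatch, WriteRecordBatch, WriteTable` against this model yields 4 row groups: two from the `WriteTable` calls, one buffered row group shared by both `WriteRecordBatch` calls, and one more when the final `WriteTable` closes the buffered group and appends a new one.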