mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611458030

   @jorisvandenbossche 
   
   ```c++
     Status NewRowGroup(int64_t chunk_size) override {
       if (row_group_writer_ != nullptr) {
         PARQUET_CATCH_NOT_OK(row_group_writer_->Close());
       }
       PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup());
       return Status::OK();
     }
   ```
   
   `parquet::arrow::FileWriterImpl::WriteTable` splits the input table into chunks according to the user's row-group size and calls `NewRowGroup` for every chunk, which first closes the previous row group (if any) and then calls `AppendRowGroup` to create a non-buffered row-group writer for that chunk.
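   
   A minimal standalone sketch of that per-chunk loop (the types here are simplified stand-ins for illustration, not the real parquet-cpp classes):
   
   ```c++
   #include <cassert>
   #include <cstdint>
   #include <memory>
   #include <vector>
   
   // Simplified stand-ins for the real parquet-cpp writer classes.
   struct RowGroupWriter {
     bool closed = false;
     void Close() { closed = true; }
   };
   
   struct ParquetFileWriter {
     std::vector<std::unique_ptr<RowGroupWriter>> row_groups;
     RowGroupWriter* AppendRowGroup() {
       row_groups.push_back(std::make_unique<RowGroupWriter>());
       return row_groups.back().get();
     }
   };
   
   struct FileWriterSketch {
     ParquetFileWriter writer;
     RowGroupWriter* row_group_writer = nullptr;
   
     // Mirrors NewRowGroup(): close the previous row group (if any),
     // then append a fresh non-buffered one.
     void NewRowGroup() {
       if (row_group_writer != nullptr) row_group_writer->Close();
       row_group_writer = writer.AppendRowGroup();
     }
   
     // WriteTable splits the input into chunks of row_group_size rows
     // and starts a new row group for every chunk.
     void WriteTable(int64_t num_rows, int64_t row_group_size) {
       for (int64_t offset = 0; offset < num_rows; offset += row_group_size) {
         NewRowGroup();
         // ... write rows [offset, offset + chunk size) here ...
       }
     }
   };
   
   int main() {
     FileWriterSketch w;
     w.WriteTable(/*num_rows=*/10, /*row_group_size=*/4);
     // 10 rows at 4 rows per group -> 3 row groups (4 + 4 + 2).
     assert(w.writer.row_groups.size() == 3);
     return 0;
   }
   ```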
   
   `parquet::arrow::FileWriterImpl::WriteRecordBatch` will close the previous row group only if it is not a "buffered" row group, and then append the `RecordBatch` to the buffered row group (creating one if needed).
   
   So:
   
   1. If only `WriteTable` is called, every call will create **at least one** 
new row group.
   2. If only `WriteRecordBatch` is called, a call might not create a new row group, since batches accumulate in the buffered row group.
   3. If `WriteRecordBatch` is called after `WriteTable`, it will close the 
in-flight non-buffered row group, create a buffered row group, and 
write to it.
   4. If `WriteTable` is called after `WriteRecordBatch`, it will find that 
there is an in-flight buffered row group, so it will close it, create 
a non-buffered one, and write to it.
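   
   The four cases above can be condensed into a small state machine; here is a standalone sketch with simplified stand-in types (not the real `parquet::arrow` classes), asserting the row-group counts each case produces:
   
   ```c++
   #include <cassert>
   #include <vector>
   
   struct FileWriterSketch {
     // Each created row group records whether it was buffered.
     std::vector<bool> row_groups;  // true = buffered, false = non-buffered
     bool has_open_row_group = false;
     bool open_is_buffered = false;
   
     void CloseOpenRowGroup() { has_open_row_group = false; }
   
     // WriteTable: always closes any in-flight row group and opens a fresh
     // non-buffered one per chunk (cases 1 and 4). One chunk shown here.
     void WriteTable() {
       if (has_open_row_group) CloseOpenRowGroup();
       row_groups.push_back(false);
       has_open_row_group = true;
       open_is_buffered = false;
     }
   
     // WriteRecordBatch: closes the in-flight row group only if it is
     // non-buffered, then appends to (or creates) a buffered one
     // (cases 2 and 3).
     void WriteRecordBatch() {
       if (has_open_row_group && !open_is_buffered) CloseOpenRowGroup();
       if (!has_open_row_group) {
         row_groups.push_back(true);
         has_open_row_group = true;
         open_is_buffered = true;
       }
       // ... append the batch's rows to the buffered row group ...
     }
   };
   
   int main() {
     FileWriterSketch w;
     w.WriteRecordBatch();  // creates one buffered row group
     w.WriteRecordBatch();  // reuses it: no new row group (case 2)
     assert(w.row_groups.size() == 1);
   
     w.WriteTable();        // closes buffered group, opens non-buffered (case 4)
     assert(w.row_groups.size() == 2);
   
     w.WriteRecordBatch();  // closes non-buffered group, opens buffered (case 3)
     assert(w.row_groups.size() == 3);
     return 0;
   }
   ```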

