wecharyu commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2742520274


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() const {
   return contents_->total_compressed_bytes_written();
 }
 
+int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
+  return contents_->total_compressed_bytes() +
+         contents_->total_compressed_bytes_written() +
+         contents_->EstimatedBufferedValueBytes();

Review Comment:
   If we choose option 1, we can estimate the batch size from the average row
   size and the number of rows already written, so that we do not ignore too
   many buffered bytes. For example:
   ```c++
       while (offset < batch.num_rows()) {
         auto avg_row_size = EstimateCompressedBytesPerRow();
         int64_t max_rows =
             avg_row_size
                 ? std::min(max_row_group_length,
                            // Ensure batch_size is at least 1 to avoid infinite loops.
                            // std::max<int64_t> is explicit because `1L` is not
                            // int64_t on every platform.
                            std::max<int64_t>(
                                1, static_cast<int64_t>(max_row_group_bytes /
                                                        avg_row_size.value())))
                 : max_row_group_length;
         if (row_group_writer_->num_rows() >= max_rows) {
           // Current row group is full, start a new one.
           RETURN_NOT_OK(NewBufferedRowGroup());
         }
         int64_t batch_size =
             std::min(max_rows - row_group_writer_->num_rows(), batch.num_rows() - offset);
         RETURN_NOT_OK(WriteBatch(offset, batch_size));
         offset += batch_size;
       }
   ```
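   
   To make the row-cap computation easier to check in isolation, here is a
   minimal standalone sketch. `MaxRowsForGroup` is a hypothetical helper, not
   part of the Parquet API, and it assumes the per-row estimate is an optional
   `double` that stays empty until an estimate exists:
   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <optional>

   // Hypothetical helper (not an Arrow/Parquet function): returns the row cap
   // for the current row group given an optional per-row size estimate.
   inline int64_t MaxRowsForGroup(std::optional<double> avg_row_size,
                                  int64_t max_row_group_length,
                                  int64_t max_row_group_bytes) {
     assert(max_row_group_length > 0);
     // Without a usable estimate, only the row-count limit can apply.
     if (!avg_row_size.has_value() || *avg_row_size <= 0.0) {
       return max_row_group_length;
     }
     const double rows_by_bytes =
         static_cast<double>(max_row_group_bytes) / *avg_row_size;
     // Compare as doubles first so the cast below cannot overflow.
     if (rows_by_bytes >= static_cast<double>(max_row_group_length)) {
       return max_row_group_length;
     }
     // Keep the cap at 1 or more so the write loop always makes progress.
     return std::max<int64_t>(1, static_cast<int64_t>(rows_by_bytes));
   }
   ```
   With no estimate the cap falls back to `max_row_group_length`, which is the
   case the question below is about.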
   
   The remaining concern is that `max_row_group_bytes` cannot take effect
   until the first page has been written. Is that acceptable?


