wgtmac commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2631538278


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() 
const {
   return contents_->total_compressed_bytes_written();
 }
 
+int64_t RowGroupWriter::current_buffered_bytes() const {

Review Comment:
   The function name is a little misleading because readers may think it is 
same as `contents_->estimated_buffered_value_bytes()`.



##########
cpp/src/parquet/properties.h:
##########
@@ -418,6 +421,13 @@ class PARQUET_EXPORT WriterProperties {
       return this;
     }
 
+    /// Specify the max number of bytes to put in a single row group.

Review Comment:
   The size is after encoding and compression, right? It would be good to 
document this.



##########
cpp/src/parquet/file_writer.cc:
##########
@@ -195,6 +201,20 @@ class RowGroupSerializer : public RowGroupWriter::Contents 
{
     return total_compressed_bytes_written;
   }
 
+  int64_t estimated_buffered_value_bytes() const override {

Review Comment:
   ```suggestion
     int64_t EstimatedBufferedValueBytes() const override {
   ```
   
   This is not a trivial getter so we need to use initial-capitalized camel 
form. Unfortunately functions like `total_compressed_bytes()` have already used 
the wrong form so it looks confusing. :/



##########
cpp/src/parquet/arrow/writer.cc:
##########
@@ -395,15 +395,24 @@ class FileWriterImpl : public FileWriter {
     RETURN_NOT_OK(CheckClosed());
     RETURN_NOT_OK(table.Validate());
 
-    if (chunk_size <= 0 && table.num_rows() > 0) {
-      return Status::Invalid("chunk size per row_group must be greater than 
0");
-    } else if (!table.schema()->Equals(*schema_, false)) {
+    if (!table.schema()->Equals(*schema_, false)) {
       return Status::Invalid("table schema does not match this writer's. 
table:'",
                              table.schema()->ToString(), "' this:'", 
schema_->ToString(),
                              "'");
     } else if (chunk_size > this->properties().max_row_group_length()) {
       chunk_size = this->properties().max_row_group_length();
     }
+    // max_row_group_bytes is applied only after the row group has accumulated 
data.
+    if (row_group_writer_ != nullptr && row_group_writer_->num_rows() > 0) {

Review Comment:
   `row_group_writer_->num_rows() > 0` can only happen when the current row 
group writer is in the buffered mode. Usually users calling `WriteTable` will 
never use buffered mode so this approach seems not working in the majority of 
cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to