kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2016989421


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1382,11 +1383,15 @@ class TypedColumnWriterImpl : public ColumnWriterImpl,
                                          chunk.levels_to_write, *chunk_array, ctx,
                                          maybe_parent_nulls));
        }
-        if (num_buffered_values_ > 0) {
+        bool is_last_chunk = i == chunks.size() - 1;
+        if (num_buffered_values_ > 0 && !is_last_chunk) {
          // Explicitly add a new data page according to the content-defined chunk
          // boundaries. This way the same chunks will have the same byte-sequence
          // in the resulting file, which can be identified by content addressible
          // storage.
+          // Note that the last chunk doesn't trigger a new data page in order to
+          // allow subsequent WriteArrow() calls to continue writing to the same
+          // data page, the chunker's state is not being reset after the last chunk.

Review Comment:
   The chunker post-checks the max size after adding a value to the chunk: if the chunk then exceeds the max chunk size, we don't add more values to it but split instead. This is similar to how `max_data_page_size` is checked.
   
   We simply don't call `AddDataPage()` after the last chunk, but that "ongoing" data page is still flushed when the writer gets closed.


