wgtmac opened a new pull request, #34327:
URL: https://github.com/apache/arrow/pull/34327

   ### Rationale for this change
   
   Parquet ColumnWriter obtains the null_count of a page from the page statistics, as shown below ([link](https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L952)):
   ```cpp
     EncodedStatistics page_stats = GetPageStatistics();
   
     int32_t null_count = static_cast<int32_t>(page_stats.null_count);
   
     DataPageV2 page(combined, num_values, null_count, num_rows, encoding_,
                     def_levels_byte_length, rep_levels_byte_length,
                     uncompressed_size, pager_->has_compressor(), page_stats);
   ```
   
   However, null_count is uninitialized if page statistics are not enabled:
   ```cpp
     EncodedStatistics GetPageStatistics() override {
       EncodedStatistics result;
       if (page_statistics_) result = page_statistics_->Encode();
       return result;
     }
   ```
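   
   One way to guard against this can be sketched as follows. `PageStatsSketch` and `NullCountFromStats` are hypothetical stand-ins written for illustration, not parquet-cpp code; the sketch only assumes that the statistics struct carries a flag saying whether null_count was actually set, and it refuses to hand out the field when the flag is false.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <optional>
   
   // Hypothetical stand-in for an encoded-statistics struct (not the real class).
   struct PageStatsSketch {
     int64_t null_count = 0;
     bool has_null_count = false;  // true only if statistics were collected
   };
   
   // Returns the null count only when the statistics actually provided one,
   // so callers cannot silently consume a value that was never set.
   std::optional<int64_t> NullCountFromStats(const PageStatsSketch& stats) {
     if (!stats.has_null_count) return std::nullopt;
     assert(stats.null_count >= 0);  // a count that was set must be non-negative
     return stats.null_count;
   }
   ```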
   
   ### What changes are included in this PR?
   
   ColumnWriter now collects null_count by itself. To be safe, it also checks the count against the one from page stats when they are available.
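   
   A minimal sketch of the idea, under stated assumptions: `CountNullsFromDefLevels` and `ResolvePageNullCount` are hypothetical names, not functions from this PR, and the sketch assumes that every entry whose definition level is below the column's maximum counts as a null (for repeated fields this would include empty lists). The real writer may count differently.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <vector>
   
   // Hypothetical helper: counts entries whose definition level is below the
   // maximum, i.e. entries that do not carry a physical value.
   int64_t CountNullsFromDefLevels(const std::vector<int16_t>& def_levels,
                                   int16_t max_def_level) {
     int64_t nulls = 0;
     for (int16_t level : def_levels) {
       if (level < max_def_level) ++nulls;
     }
     return nulls;
   }
   
   // Hypothetical helper: always uses the writer's own count, and cross-checks
   // it against the statistics only when the statistics provide a null count.
   int32_t ResolvePageNullCount(int64_t counted_nulls, bool stats_has_null_count,
                                int64_t stats_null_count) {
     if (stats_has_null_count) {
       assert(stats_null_count == counted_nulls);  // debug-only consistency check
     }
     (void)stats_null_count;  // avoids an unused warning when asserts are off
     return static_cast<int32_t>(counted_nulls);
   }
   ```
   
   With this shape the page header no longer depends on whether statistics are enabled; the statistics act only as a sanity check.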
   
   ### Are these changes tested?
   
   Added a test case verifying that the null counts of optional and repeated fields are set properly.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

