[GitHub] [arrow] emkornfield commented on a change in pull request #10729: ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

GitBox Fri, 16 Jul 2021 20:38:01 -0700


emkornfield commented on a change in pull request #10729:
URL: https://github.com/apache/arrow/pull/10729#discussion_r671602048




##########
File path: cpp/src/parquet/column_writer.cc
##########
@@ -1490,7 +1490,8 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
     // TODO(wesm): If some dictionary values are unobserved, then the

Review comment:
       functional concern (unfortunately I think the performance would be worse 
since it will require extra allocations for each batch).  Statistics are stored 
at two levels: row group (column chunk) and page.  The batching is done 
here and in other locations in the code to get some level of vectorization 
without making any individual page too large.  I might have traced the code 
incorrectly, but the current location for updating statistics will only 
produce correct statistics at the row group level.  This seems like an 
orthogonal bug, so you could do it as a follow-up.
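
To make the two levels concrete, here is a minimal sketch of batch-wise null 
counting that keeps page-level and row-group-level counts consistent.  It is 
not the Parquet C++ API: `ToyColumnWriter`, `NullStats`, `WriteInBatches`, and 
the `max_page_values` page-size limit are hypothetical names invented for this 
illustration, and a real writer would size pages by bytes rather than by value 
count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the statistics kept per page or per column chunk.
struct NullStats {
  int64_t null_count = 0;
  int64_t value_count = 0;
};

// Toy writer (an assumption for illustration, not parquet::ColumnWriter).
// Statistics are updated once per batch at the page level, then merged into
// the column chunk (row group) level whenever a page is flushed.
class ToyColumnWriter {
 public:
  explicit ToyColumnWriter(int64_t max_page_values)
      : max_page_values_(max_page_values) {
    assert(max_page_values > 0);
  }

  // `valid[i] == false` marks a null slot in the input.
  void WriteBatch(const std::vector<bool>& valid, size_t offset, size_t length) {
    assert(offset + length <= valid.size());
    for (size_t i = offset; i < offset + length; ++i) {
      ++page_.value_count;
      if (!valid[i]) ++page_.null_count;
    }
    // Pages are only closed on batch boundaries in this sketch.
    if (page_.value_count >= max_page_values_) FlushPage();
  }

  void Close() {
    if (page_.value_count > 0) FlushPage();
  }

  const std::vector<NullStats>& pages() const { return pages_; }
  const NullStats& chunk() const { return chunk_; }

 private:
  void FlushPage() {
    chunk_.null_count += page_.null_count;
    chunk_.value_count += page_.value_count;
    pages_.push_back(page_);
    page_ = NullStats{};  // start a fresh page
  }

  int64_t max_page_values_;
  NullStats page_;
  NullStats chunk_;
  std::vector<NullStats> pages_;
};

// Feed the input in fixed-size batches, as the review comment describes.
inline void WriteInBatches(const std::vector<bool>& valid, size_t batch_size,
                           ToyColumnWriter* writer) {
  assert(batch_size > 0);
  for (size_t offset = 0; offset < valid.size(); offset += batch_size) {
    const size_t length = std::min(batch_size, valid.size() - offset);
    writer->WriteBatch(valid, offset, length);
  }
  writer->Close();
}
```

If the null count were instead added once for the whole array after all 
batches were written, `chunk()` would still come out right, but every entry in 
`pages()` would report zero nulls, which is the page-level gap the comment 
points at.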




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
