sfc-gh-ebrossard commented on PR #37016:
URL: https://github.com/apache/arrow/pull/37016#issuecomment-1664953078

   > I'm OK with the change, but seems that `distinct_count == 0` only happens 
when the page is all null? Is it a common case?
   > 
   > And encode `distinct_count` seems not applied by parquet-mr and rust impl.
   
   Yeah, the case I was seeing is that we merged stats from an all-null page 
with a page that had non-null values and ended up with an unset distinct count. 
I thought it would be useful to preserve the count if we can.
   
   Technically we could make this more general and propagate distinct counts 
for pages whose min and max ranges don't overlap, too. For example, if one page 
has a range of `[0, 9]` and the other has a range of `[10, 19]`, we can simply 
add the distinct counts. What do you think? Maybe that would add too much 
complexity, though.
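   
   A minimal sketch of that merge rule, assuming a simplified, hypothetical `PageStats` struct and `MergeDistinctCount` helper (neither is the actual Arrow `parquet::Statistics` API). An all-null page is modeled as having unset min/max and a distinct count of 0:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified page statistics; not the real Parquet C++ class.
struct PageStats {
  std::optional<int64_t> min;             // unset when the page is all null
  std::optional<int64_t> max;             // unset when the page is all null
  std::optional<int64_t> distinct_count;  // unset when unknown
};

// Returns the merged distinct count, or nullopt when it cannot be known exactly.
std::optional<int64_t> MergeDistinctCount(const PageStats& a, const PageStats& b) {
  // If either side is unknown, the merged count is unknown.
  if (!a.distinct_count || !b.distinct_count) return std::nullopt;
  // An all-null page contributes no distinct values, so keep the other count.
  if (*a.distinct_count == 0) return b.distinct_count;
  if (*b.distinct_count == 0) return a.distinct_count;
  // Disjoint [min, max] ranges cannot share a value, so the counts simply add.
  if (a.min && a.max && b.min && b.max &&
      (*a.max < *b.min || *b.max < *a.min)) {
    return *a.distinct_count + *b.distinct_count;
  }
  // Overlapping ranges may share values; the exact count is not recoverable.
  return std::nullopt;
}
```

   In this sketch, overlapping ranges fall back to an unset count, since the two pages may share values.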
   
   Do you mind pointing me to the code in parquet-mr and the Rust implementation 
that I should update? If this change looks generally okay for the C++ Parquet 
code, I'll work on those as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
