sfc-gh-ebrossard commented on PR #37016: URL: https://github.com/apache/arrow/pull/37016#issuecomment-1664953078
> I'm OK with the change, but seems that `distinct_count == 0` only happens when the page is all null? Is it a common case?
>
> And encode `distinct_count` seems not applied by parquet-mr and rust impl.

Yeah, the case I was seeing is that we merged stats from an all-null page with stats from a page that had non-null values, and ended up with an unset distinct count. I thought it would be useful to preserve the count when we can.

Technically we could make this more general and also propagate distinct counts for pages whose min/max ranges don't overlap. For example, if one page has a range of `[0, 9]` and the other has a range of `[10, 19]`, the two pages cannot share a value, so we can simply add the distinct counts. What do you think? Maybe that would add too much complexity, though.

Do you mind pointing me to the code in parquet-mr and the Rust implementation that I should update? If this change looks generally okay for the C++ Parquet code, I'll work on those as well.
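As a rough sketch of the merge rule described above: the `PageStats` struct and the `MergeDistinctCount` helper below are hypothetical names made up for illustration, not the actual Arrow/Parquet `Statistics` API, and the sketch assumes integer min/max values and that an all-null page reports a distinct count of 0.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified page statistics (NOT the real parquet::Statistics).
struct PageStats {
  std::optional<int64_t> min;             // unset when the page is all null
  std::optional<int64_t> max;             // unset when the page is all null
  std::optional<int64_t> distinct_count;  // unset when unknown
};

// Returns the distinct count for the merged statistics, or std::nullopt when
// it cannot be derived from the two inputs alone.
std::optional<int64_t> MergeDistinctCount(const PageStats& a, const PageStats& b) {
  // If either side is unknown, the merged count is unknown.
  if (!a.distinct_count || !b.distinct_count) return std::nullopt;

  // An all-null page contributes no distinct values: keep the other count.
  if (*a.distinct_count == 0) return b.distinct_count;
  if (*b.distinct_count == 0) return a.distinct_count;

  // Disjoint [min, max] ranges cannot share a value, so the counts add up.
  if (a.min && a.max && b.min && b.max &&
      (*a.max < *b.min || *b.max < *a.min)) {
    return *a.distinct_count + *b.distinct_count;
  }

  // Overlapping ranges may share values, so the merged count stays unset.
  return std::nullopt;
}
```

With overlapping ranges the only safe answer is "unset", which is why the sketch falls back to `std::nullopt` rather than guessing an upper bound.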
