Re: [I] Implement (optional) distinct count population in Parquet statistics [arrow-rs]

via GitHub Fri, 24 Oct 2025 22:51:59 -0700


JanKaul commented on issue #8608:
URL: https://github.com/apache/arrow-rs/issues/8608#issuecomment-3445935443


   You're right, a sketch would be better as I'm only interested in approximate 
values for query planning. 
   
   The issue is about having some kind of standard across different query 
engines. I think it would be very difficult to establish a new standard for 
data sketches in Parquet now: you would have to agree on an appropriate sketch 
and define a new metadata field that is accepted and used across the industry. 
   
   The benefit of the 'distinct_count' metadata field is that it already 
exists. You do have the issue that it wasn't designed to be an approximate 
field. But I think as long as the approximation is an optional configuration 
option, it should be fine. 
   
   Generally a sketch will be more accurate than the distinct_count field. But 
the default row group size, and therefore the column chunk size, is about 
1 million rows, which should be large enough to provide reasonable distinct 
count approximations.
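
   To illustrate the accuracy trade-off: unlike a sketch, per-chunk distinct 
counts cannot be merged exactly, but a planner can still bound the distinct 
count of the whole column. It is at least the largest per-chunk count and at 
most the sum of the per-chunk counts. Below is a minimal sketch of that idea. 
`ChunkStats` and `distinct_count_bounds` are hypothetical names for 
illustration, not part of the parquet crate, and if the per-chunk counts are 
themselves approximate, the bounds are approximate too.

```rust
/// Hypothetical per-column-chunk summary (NOT a type from the parquet crate).
#[derive(Debug, Clone, Copy)]
pub struct ChunkStats {
    /// Number of non-null values stored in the chunk.
    pub num_values: u64,
    /// Optional, possibly approximate, distinct count for the chunk.
    pub distinct_count: Option<u64>,
}

/// Returns inclusive `(lower, upper)` bounds on the distinct count of the
/// whole column, or `None` if any chunk has no distinct count recorded.
pub fn distinct_count_bounds(chunks: &[ChunkStats]) -> Option<(u64, u64)> {
    let mut lower = 0u64;
    let mut upper = 0u64;
    for chunk in chunks {
        // A chunk cannot hold more distinct values than it has values.
        let d = chunk.distinct_count?.min(chunk.num_values);
        // Every value distinct within one chunk is distinct in the column.
        lower = lower.max(d);
        // At worst, no value is shared between any two chunks.
        upper = upper.saturating_add(d);
    }
    Some((lower, upper))
}
```

   For chunks reporting 100, 250 and 40 distinct values, this gives the range 
250 to 390. A planner could use either end, or a value in between, as its 
cardinality estimate.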
   
   For query engines, typically having some inaccurate metadata is better than 
having no metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
