asolimando commented on issue #8608:
URL: https://github.com/apache/arrow-rs/issues/8608#issuecomment-3998508400

   Late to the party, but another argument in favor of using data sketches 
(e.g., 
[HyperLogLog](https://datasketches.apache.org/docs/HLL/HllSketches.html)) for 
representing `distinct_count` is mergeability: a coherent "global" 
`distinct_count` can be computed by merging the individual sketches from row 
groups, or across multiple Parquet files for partitioned data.
   
   HLL (like sketches in general) is an approximate data structure, but 
distinct values can overlap between "chunks", so exact per-chunk counters 
cannot simply be added up: their sum is only an upper bound on the true count. 
As soon as you need to aggregate statistics across multiple "chunks", merging 
sketches will probably give you a better estimate than trying to combine exact 
counters.
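
   To make the point concrete, here is a minimal sketch of the idea in Python. It is a toy HyperLogLog written only for illustration, not the Apache DataSketches implementation and not anything defined by the Parquet format; the names `sketch`, `merge` and `estimate`, the precision `P = 12`, and the choice of hash are all assumptions made for this example.

```python
import hashlib
import math

P = 12          # assumed precision: 2**12 = 4096 registers
M = 1 << P


def _hash64(value) -> int:
    """Stable 64-bit hash of a value (hash choice is arbitrary for this toy)."""
    digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(values):
    """Build the HLL registers for one "chunk" (e.g. a row group)."""
    registers = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    return registers


def merge(a, b):
    """Merging two sketches is a register-wise max."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers):
    """Standard HLL estimator with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Two "chunks" that share 10,000 values; the true global distinct count is 50,000.
chunk_a = range(0, 30_000)
chunk_b = range(20_000, 50_000)

sketch_a = sketch(chunk_a)
sketch_b = sketch(chunk_b)
merged = merge(sketch_a, sketch_b)

naive_sum = len(set(chunk_a)) + len(set(chunk_b))   # adds exact per-chunk counts
merged_estimate = estimate(merged)

print("sum of exact per-chunk counts:", naive_sum)
print("estimate from merged sketches:", round(merged_estimate))
```

   The sum of the exact per-chunk counters counts the shared values twice, while the merged sketch is register-for-register identical to a sketch built over all the data at once, so its estimate is only off by the usual HLL error.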
   
   @JanKaul, have you tried reaching out to the Parquet community?


