asolimando commented on issue #8608: URL: https://github.com/apache/arrow-rs/issues/8608#issuecomment-3998508400
Late to the party, but another argument in favor of using data sketches (e.g., [HyperLogLog](https://datasketches.apache.org/docs/HLL/HllSketches.html)) to represent `distinct_count` is mergeability: a coherent "global" `distinct_count` can be computed by merging the individual sketches from row groups, or from multiple Parquet files for partitioned data.

HLL (and sketches in general) are approximate data structures, but distinct values can overlap between "chunks", so exact per-chunk counters cannot simply be added up: they only bound the global count between their maximum and their sum. As soon as you need to aggregate statistics across multiple "chunks", you will probably get a better estimate by merging sketches than by trying to combine exact counters.

@JanKaul, have you tried reaching out to the Parquet community?
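To make the mergeability point concrete, here is a minimal sketch in Python of a toy HyperLogLog. It is not the Apache DataSketches implementation and not anything that exists in arrow-rs or the Parquet format. The class `HllSketch`, the precision `p=12`, and the two overlapping "row groups" are all assumptions made up for illustration.

```python
import hashlib
import math


class HllSketch:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Deterministic 64-bit hash of the value's string form.
        digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - self.p)               # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is a register-wise max, so it is lossless with respect to
        # building a single sketch over the union of both inputs.
        assert self.p == other.p, "sketches must use the same precision"
        merged = HllSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two hypothetical row groups whose values overlap on 4000..5999.
group_a = range(0, 6000)
group_b = range(4000, 10000)

sketch_a, sketch_b, sketch_union = HllSketch(), HllSketch(), HllSketch()
for v in group_a:
    sketch_a.add(v)
    sketch_union.add(v)
for v in group_b:
    sketch_b.add(v)
    sketch_union.add(v)

merged = sketch_a.merge(sketch_b)
true_distinct = len(set(group_a) | set(group_b))           # 10000
summed_exact = len(set(group_a)) + len(set(group_b))       # 12000, overcounts the overlap

print("true:", true_distinct)
print("sum of exact per-group counts:", summed_exact)
print("merged sketch estimate:", round(merged.estimate()))
```

The merged sketch has exactly the same registers as a sketch built over all the values at once, so its error is only the usual HLL estimation error, whereas summing the exact per-group counts double-counts every overlapping value.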
