alamb commented on pull request #512:
URL: https://github.com/apache/arrow-rs/pull/512#issuecomment-871736348


   > Distinct count AFAIK is often not included for parquet stats as 
calculating it is expensive.
   
   This is true. One thing I have thought about recently is a "best effort 
distinct count": because the distinct count is often used to detect 
low-cardinality columns, one could track distinct values only while doing so 
consumes less than a fixed memory budget. Once that budget is exceeded, the 
distinct count would be abandoned.
   
   This still costs CPU for sure, but it would cap the memory use at some fixed size.
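
   A minimal sketch of that idea, to make it concrete. Everything here is 
hypothetical: `BestEffortDistinct`, `with_budget`, `update` and 
`distinct_count` are illustrative names, not existing parquet or arrow-rs 
APIs, and the memory accounting is a rough assumption (it charges only the 
size of each `i64` value and ignores hash table overhead).

```rust
use std::collections::HashSet;
use std::mem::size_of;

/// Hypothetical best-effort distinct counter (not part of the parquet crate).
/// Tracks distinct values until an approximate memory budget is exceeded,
/// then gives up and reports no distinct count at all.
struct BestEffortDistinct {
    /// `Some(set)` while still tracking, `None` once tracking was abandoned.
    seen: Option<HashSet<i64>>,
    /// Maximum number of distinct values allowed by the budget.
    max_entries: usize,
}

impl BestEffortDistinct {
    fn with_budget(budget_bytes: usize) -> Self {
        // Assumption: each entry costs about size_of::<i64>() bytes.
        // A real implementation would also account for table overhead.
        let max_entries = budget_bytes / size_of::<i64>();
        Self {
            seen: Some(HashSet::new()),
            max_entries,
        }
    }

    fn update(&mut self, value: i64) {
        let over_budget = match self.seen.as_mut() {
            Some(set) => {
                set.insert(value);
                set.len() > self.max_entries
            }
            // Already abandoned: only a cheap check, no hashing.
            None => return,
        };
        if over_budget {
            // Drop the set to release its memory and stop counting.
            self.seen = None;
        }
    }

    /// `Some(n)` if the count stayed within budget, `None` if abandoned.
    fn distinct_count(&self) -> Option<u64> {
        self.seen.as_ref().map(|set| set.len() as u64)
    }
}
```

   A writer could call `update` for each value in a column chunk and write 
the distinct count to the statistics only when `distinct_count` returns 
`Some`. A low-cardinality column would then get an exact count, while a 
high-cardinality one would stop paying the hashing cost soon after the 
budget is exceeded.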


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


