adriangb commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3696733969

   Could we also use a vectorized approach for distribution statistics? I think 
we should be able to store them as a union of structs and use UDFs to compute 
intersections, etc.
   
   For set statistics, at least for the `HashSet<ScalarValue>` type, we could 
use a simple size-based heuristic: in my experience these sorts of statistics 
are most useful when the sets are small. Larger sets are less useful and much 
more expensive to maintain, i.e. a cardinality of 1 vs. 1M is useful, 1M vs. 2M 
much less so. So maybe we cap it at 128 elements or something like that, and 
drop it / stop building it beyond that?
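   A rough sketch of what that heuristic could look like (the cap of 128, the 
type, and the names here are my own illustration, not DataFusion's actual API):

```rust
use std::collections::HashSet;

/// Illustrative cap; 128 is the number floated above, not a tuned value.
const MAX_DISTINCT: usize = 128;

/// Distinct-value tracking that gives up once the set grows past the cap.
#[derive(Debug)]
enum DistinctValues<T> {
    /// Exact set, maintained while it stays small.
    Exact(HashSet<T>),
    /// Cap exceeded: stop tracking, only remember that it overflowed.
    Overflowed,
}

impl<T: std::hash::Hash + Eq> DistinctValues<T> {
    fn new() -> Self {
        DistinctValues::Exact(HashSet::new())
    }

    /// Insert a value; drop the whole set once it exceeds the cap.
    fn insert(&mut self, value: T) {
        if let DistinctValues::Exact(set) = self {
            set.insert(value);
            if set.len() > MAX_DISTINCT {
                *self = DistinctValues::Overflowed;
            }
        }
    }

    /// Exact cardinality, if it is still being tracked.
    fn cardinality(&self) -> Option<usize> {
        match self {
            DistinctValues::Exact(set) => Some(set.len()),
            DistinctValues::Overflowed => None,
        }
    }
}

fn main() {
    let mut stats = DistinctValues::new();
    for v in 0..100 {
        stats.insert(v);
    }
    // Still under the cap: exact cardinality is known.
    assert_eq!(stats.cardinality(), Some(100));
    for v in 0..1_000 {
        stats.insert(v);
    }
    // Past 128 distinct values the set is dropped, per the heuristic.
    assert_eq!(stats.cardinality(), None);
}
```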
   For larger sets, I imagine estimated set sizes and approximate membership 
tests would be more useful, e.g. a bloom filter.
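   For concreteness, a minimal Bloom filter sketch (hand-rolled here for 
illustration; the bit count, hash count, and hashing scheme are assumptions, 
not a tuned design): membership queries can return false positives but never 
false negatives, which is the trade-off that makes them cheap for large sets.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter: a fixed bit array plus k seeded hash probes.
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], num_hashes }
    }

    /// Derive the i-th probe position by hashing (seed, value).
    fn index<T: Hash>(&self, value: &T, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        value.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        for seed in 0..self.num_hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain<T: Hash>(&self, value: &T) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

fn main() {
    let mut filter = BloomFilter::new(16_384, 4);
    for v in 0..1_000 {
        filter.insert(&v);
    }
    // Inserted values are always reported as possibly present.
    assert!(filter.might_contain(&42));
    // A never-inserted value is very likely, but not guaranteed,
    // to be reported absent, so no assertion on that here.
}
```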


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
