adriangb commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3696733969
Could we also use a vectorized approach for distribution statistics? I think we should be able to store them as a union of structs and use UDFs to compute intersections, etc.

For set statistics, at least for the `HashSet<ScalarValue>` type, we could use a simple size-based heuristic: in my experience these sorts of statistics are most useful when the sets are small. Larger sets are less useful and much more expensive to maintain, i.e. a cardinality of 1 vs. 1M is useful, 1M vs. 2M less so. So maybe we cap it at 128 elements or something like that and drop it / stop building it beyond that? For larger sets, I imagine estimated set sizes and membership would be more useful, e.g. via a bloom filter.
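To make the heuristic concrete, here is a minimal sketch of what the capped set could look like. This is illustrative only: `BoundedDistinctSet`, the cap of 128, and the use of `u64` as a stand-in for `ScalarValue` are all assumptions, not existing DataFusion APIs. The idea is simply to track exact distinct values while the set stays small, and drop the set entirely (reporting "unknown") once it grows past the cap.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical cap: beyond this many distinct values we stop tracking.
const MAX_DISTINCT: usize = 128;

#[derive(Debug)]
enum BoundedDistinctSet<T> {
    // Still tracking: the exact set of distinct values seen so far.
    Exact(HashSet<T>),
    // Cap exceeded: the set was dropped and is no longer maintained.
    Overflowed,
}

impl<T: Hash + Eq> BoundedDistinctSet<T> {
    fn new() -> Self {
        BoundedDistinctSet::Exact(HashSet::new())
    }

    // Insert a value; once the set grows past the cap, drop it for good.
    fn insert(&mut self, value: T) {
        if let BoundedDistinctSet::Exact(set) = self {
            set.insert(value);
            if set.len() > MAX_DISTINCT {
                *self = BoundedDistinctSet::Overflowed;
            }
        }
    }

    // Exact cardinality if still tracked, `None` once overflowed,
    // signalling that a sketch (e.g. a bloom filter) should be used instead.
    fn cardinality(&self) -> Option<usize> {
        match self {
            BoundedDistinctSet::Exact(set) => Some(set.len()),
            BoundedDistinctSet::Overflowed => None,
        }
    }
}

fn main() {
    // A small set stays exact.
    let mut small = BoundedDistinctSet::new();
    for v in 0u64..10 {
        small.insert(v);
    }
    assert_eq!(small.cardinality(), Some(10));

    // A large set overflows the cap and is dropped.
    let mut large = BoundedDistinctSet::new();
    for v in 0u64..1_000 {
        large.insert(v);
    }
    assert_eq!(large.cardinality(), None);
}
```

The `Option<usize>` return mirrors the intent above: a `Some` answer is cheap and precise for low-cardinality columns, while `None` tells the planner to fall back to estimated statistics rather than pay for a multi-million-element set.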
