okumin opened a new pull request, #5337: URL: https://github.com/apache/hive/pull/5337
### What changes were proposed in this pull request?

Make FilterStatsRule halve the estimated number of rows when the number of distinct values (NDV) is unavailable.

https://issues.apache.org/jira/browse/HIVE-28363

### Why are the changes needed?

The current algorithm easily estimates the selectivity to be 100%, which I believe is not the best estimate in most cases.

FilterStatsRule roughly estimates the number of rows that pass an `IN` predicate as `{Original # of rows} * {1 / cardinality} * {# of values in IN}`. The second term is estimated as 0.5 when column stats are unavailable. So the estimate always equals the original number of rows when `IN` contains two or more constant values, as in `col IN (1, 3)`.

### Does this PR introduce _any_ user-facing change?

No.

### Is the change a dependency upgrade?

No.

### How was this patch tested?

```
CREATE TABLE users (id INT);
INSERT INTO users VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

set hive.fetch.task.conversion=none;
set hive.stats.fetch.column.stats=false;

EXPLAIN SELECT * FROM users WHERE id IN (1);
EXPLAIN SELECT * FROM users WHERE id IN (1, 2);
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
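
The arithmetic described under "Why are the changes needed?" can be sketched as a small standalone model. This is not Hive code: the function names are hypothetical, the cap at the original row count is inferred from the statement that the estimate "always returns the original number", and the "proposed" function is only one reading of the PR summary (halve the row count when NDV is unavailable).

```python
from typing import Optional


def estimate_in_rows_current(num_rows: int, num_in_values: int,
                             ndv: Optional[int] = None) -> int:
    """Toy model of the current estimate: rows * (1 / cardinality) * (# of IN values)."""
    # Per the PR description, 1 / cardinality falls back to 0.5 without column stats.
    per_value = 1.0 / ndv if ndv else 0.5
    # Assumption: selectivity is capped at 100%, so the estimate never exceeds num_rows.
    selectivity = min(per_value * num_in_values, 1.0)
    return int(num_rows * selectivity)


def estimate_in_rows_proposed(num_rows: int, num_in_values: int,
                              ndv: Optional[int] = None) -> int:
    """Assumed reading of the proposal: halve the rows when NDV is unavailable."""
    if not ndv:
        return num_rows // 2
    # With NDV available, this model keeps the same formula as above.
    return estimate_in_rows_current(num_rows, num_in_values, ndv)


if __name__ == "__main__":
    # Mirrors the 10-row `users` table with column stats disabled.
    print(estimate_in_rows_current(10, 1))   # id IN (1)
    print(estimate_in_rows_current(10, 2))   # id IN (1, 2)
    print(estimate_in_rows_proposed(10, 2))  # id IN (1, 2), assumed new behaviour
```

In this model, the two-value `IN` list on the 10-row table yields the full 10 rows under the current formula, which is the 100% selectivity the description objects to.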