[PR] HIVE-28363: Improve heuristics of FilterStatsRule without column stats [hive]

via GitHub Mon, 08 Jul 2024 01:26:10 -0700


okumin opened a new pull request, #5337:
URL: https://github.com/apache/hive/pull/5337


   ### What changes were proposed in this pull request?
   
   Make FilterStatsRule halve the estimated number of filtered rows when the 
number of distinct values is unavailable.
   https://issues.apache.org/jira/browse/HIVE-28363
   
   ### Why are the changes needed?
   
   The current algorithm too easily estimates the selectivity as 100%, which I 
believe is not the best estimate in most cases.
   FilterStatsRule roughly estimates the number of rows selected by `IN` as 
`{Original # of rows} * {1 / cardinality} * {# of values in IN}`. The second 
term is estimated as 0.5 when column stats are unavailable, so the estimate 
always equals the original row count when `IN` contains two or more constant 
values, as in `col IN (1, 3)`.
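
   The arithmetic above can be sketched as follows. This is only an illustration of the formula as described here, not Hive's actual code: the function name `estimate_in_rows` and its parameters are hypothetical, and the cap at the original row count is an assumption inferred from the statement that the estimate equals the original number.

```python
from typing import Optional


def estimate_in_rows(num_rows: int, num_values: int, ndv: Optional[int] = None) -> int:
    """Estimate rows selected by `col IN (...)` using the formula described above."""
    # Per the description: 1 / cardinality falls back to 0.5 when column stats
    # (and therefore the number of distinct values) are unavailable.
    per_value = 1.0 / ndv if ndv else 0.5
    # Assumption: the combined factor is capped at 1.0, i.e. the estimate
    # never exceeds the original number of rows.
    factor = min(1.0, per_value * num_values)
    return round(num_rows * factor)


# Without column stats, a 10-row table and one IN value gives 10 * 0.5 * 1.
one_value = estimate_in_rows(10, 1)
# With two IN values the factor reaches 0.5 * 2 = 1.0, i.e. 100% selectivity.
two_values = estimate_in_rows(10, 2)
```

   With two or more values and no column stats, the factor is already 1.0, which is the case this pull request targets.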
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### Is the change a dependency upgrade?
   
   No.
   
   ### How was this patch tested?
   
   ```
   CREATE TABLE users (id INT);
   INSERT INTO users VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
   set hive.fetch.task.conversion=none;
   set hive.stats.fetch.column.stats=false;
   EXPLAIN SELECT * FROM users WHERE id IN (1);
   EXPLAIN SELECT * FROM users WHERE id IN (1, 2);
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


