Eric Hanson created HIVE-4701:
---------------------------------

             Summary: Optimize filter Column IN ( list-of-constants ) for 
vectorized execution
                 Key: HIVE-4701
                 URL: https://issues.apache.org/jira/browse/HIVE-4701
             Project: Hive
          Issue Type: Sub-task
            Reporter: Eric Hanson


OR filters have been optimized to run with vectorized query execution. IN 
filters of the form "Column IN (list-of-constants)" are a special case of OR. 
However, IN does not vectorized currently.

E.g. 

select ddate, count(*) from factsqlengineam_vec_orc where ddate = "2012-05-19 
00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" 
group by ddate;

takes about 23 seconds of CPU and 

select ddate, count(*) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 
00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate;

takes about 153 seconds of CPU.

A simple fix may be that for short IN lists (say <= 64 elements) we turn them 
into OR by manipulating the query tree before planning whether vectorization 
can be used.

A more complex fix that covers more cases would be to turn longer IN lists into 
a join so when we eventually support vectorized joins it will be fast.

An intermediate approach might be to implement a special IN filter operator 
that stores the constant values in a sorted array or high-performance hash 
table (like Cuckoo hashing).



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to