Eric Hanson created HIVE-4548:
---------------------------------

             Summary: Speed up vectorized LIKE filter for special cases abc%, 
%abc and %abc%
                 Key: HIVE-4548
                 URL: https://issues.apache.org/jira/browse/HIVE-4548
             Project: Hive
          Issue Type: Sub-task
    Affects Versions: vectorization-branch
            Reporter: Eric Hanson
            Assignee: Teddy Choi
            Priority: Minor
             Fix For: vectorization-branch


Speed up vectorized LIKE filter evaluation for the special-case patterns abc%, 
%abc, and %abc% (here, abc is just a placeholder for some fixed string).
  
Problem: The current vectorized LIKE implementation always calls the standard 
LIKE function code in UDFLike.java. This is expensive: it calls multiple 
functions and allocates at least one new object per call. Probably 80% of uses 
of LIKE are for the simple patterns abc%, %abc, and %abc%. These can be 
implemented much more efficiently.

Start by speeding up the case for  

    Column LIKE "abc%"
  
The goal is to minimize the cost of the inner loop. Don't use new() in the 
inner loop. Write a static function that checks, as efficiently as possible, 
whether the prefix of the string matches the LIKE pattern. It should operate 
directly on the byte array holding the UTF-8-encoded string data and avoid 
unnecessary function calls and if/else logic. Call that function in the inner 
loop.
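
A minimal sketch of such a prefix check, for illustration only. The class and 
method names (LikePrefixSketch, startsWith) and the (bytes, start, length) 
calling convention are assumptions made for this example, not code from the 
vectorization branch. Comparing raw bytes is sufficient here because a valid 
UTF-8 prefix can only match at a character boundary.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative, not taken from Hive.
class LikePrefixSketch {

    // Returns true if the len bytes of str beginning at start begin with prefix.
    // No allocation and no calls to other methods, so it is cheap in an inner loop.
    static boolean startsWith(byte[] str, int start, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (str[start + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The pattern bytes are computed once, outside any per-row loop.
        byte[] prefix = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] buf = "xxabcdef".getBytes(StandardCharsets.UTF_8);

        System.out.println(startsWith(buf, 2, 6, prefix)); // true: "abcdef"
        System.out.println(startsWith(buf, 0, 8, prefix)); // false: "xxabcdef"
    }
}
```

The pattern is converted to bytes once, so the per-row work is a length check 
and a short byte comparison.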

If feasible, consider using a template-driven approach, with an instance of the 
template expanded for each of the three cases. Start by doing the abc% (prefix 
match) case by hand, then consider templatizing for the other two cases.
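
For comparison, the other two cases might look like the hand-written sketch 
below, which a template would then have to generate. The names 
(LikeSuffixMiddleSketch, endsWith, contains) are hypothetical, and the 
substring search is a plain nested loop chosen for simplicity rather than 
speed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative, not taken from Hive.
class LikeSuffixMiddleSketch {

    // %abc case: true if the len bytes of str beginning at start end with suffix.
    static boolean endsWith(byte[] str, int start, int len, byte[] suffix) {
        if (len < suffix.length) {
            return false;
        }
        int offset = start + len - suffix.length;
        for (int i = 0; i < suffix.length; i++) {
            if (str[offset + i] != suffix[i]) {
                return false;
            }
        }
        return true;
    }

    // %abc% case: true if sub occurs anywhere in the len bytes beginning at start.
    static boolean contains(byte[] str, int start, int len, byte[] sub) {
        int last = start + len - sub.length; // last offset where sub could begin
        for (int i = start; i <= last; i++) {
            int j = 0;
            while (j < sub.length && str[i + j] == sub[j]) {
                j++;
            }
            if (j == sub.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] abc = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] buf = "xxabcyy".getBytes(StandardCharsets.UTF_8);

        System.out.println(endsWith(buf, 0, 5, abc)); // true: "xxabc"
        System.out.println(endsWith(buf, 0, 7, abc)); // false: "xxabcyy"
        System.out.println(contains(buf, 0, 7, abc)); // true
        System.out.println(contains(buf, 3, 4, abc)); // false: "bcyy"
    }
}
```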

The code is in the "vectorization" branch of the main hive repo.
  
Start by checking, in the constructor for FilterStringColLikeStringScalar.java, 
whether the pattern is one of the simple special cases. If so, record that, and 
have the evaluate() method call a separate function for each case, i.e. the 
general case and each of the 3 special cases. That way, all the dynamic 
decision-making is done once per vector, not once per element.
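
The classify-once, dispatch-per-vector structure could be sketched roughly as 
follows. This is not the FilterStringColLikeStringScalar code: the class name, 
the Kind enum, and the plain-array arguments standing in for a column vector 
and its selection array are all assumptions for this example. Classification is 
deliberately conservative (any pattern containing _, a backslash, or an inner % 
is treated as general), and only the prefix loop is written out.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the filter expression; not Hive code.
class LikeDispatchSketch {
    enum Kind { PREFIX, SUFFIX, MIDDLE, GENERAL }

    final Kind kind;
    final byte[] literal; // fixed string without the % wildcards; null for GENERAL

    // The pattern is inspected once, at construction time.
    LikeDispatchSketch(String pattern) {
        kind = classify(pattern);
        literal = (kind == Kind.GENERAL)
            ? null : strip(pattern).getBytes(StandardCharsets.UTF_8);
    }

    // Removes at most one leading and one trailing % from the pattern.
    static String strip(String p) {
        int from = p.startsWith("%") ? 1 : 0;
        int to = p.endsWith("%") ? p.length() - 1 : p.length();
        return from <= to ? p.substring(from, to) : "";
    }

    static Kind classify(String p) {
        String lit = strip(p);
        if (lit.isEmpty() || lit.indexOf('%') >= 0
                || lit.indexOf('_') >= 0 || lit.indexOf('\\') >= 0) {
            return Kind.GENERAL;
        }
        boolean lead = p.startsWith("%");
        boolean trail = p.endsWith("%");
        if (lead && trail) return Kind.MIDDLE;
        if (trail) return Kind.PREFIX;
        if (lead) return Kind.SUFFIX;
        return Kind.GENERAL;
    }

    // One switch per vector; the chosen loop contains no per-row dispatch.
    // Writes the indexes of matching rows to selected and returns their count.
    int evaluate(byte[][] vector, int[] start, int[] length, int n, int[] selected) {
        switch (kind) {
            case PREFIX:
                return filterPrefix(vector, start, length, n, selected);
            default:
                // SUFFIX, MIDDLE and GENERAL loops are omitted from this sketch.
                throw new UnsupportedOperationException(kind + " not shown");
        }
    }

    private int filterPrefix(byte[][] vector, int[] start, int[] length, int n,
                             int[] selected) {
        final byte[] lit = literal;
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (startsWith(vector[i], start[i], length[i], lit)) {
                selected[out++] = i;
            }
        }
        return out;
    }

    static boolean startsWith(byte[] str, int start, int len, byte[] prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (str[start + i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[][] rows = {
            "abcdef".getBytes(StandardCharsets.UTF_8),
            "xabc".getBytes(StandardCharsets.UTF_8),
            "abc".getBytes(StandardCharsets.UTF_8)
        };
        int[] start = {0, 0, 0};
        int[] length = {6, 4, 3};
        int[] selected = new int[rows.length];
        int n = new LikeDispatchSketch("abc%").evaluate(rows, start, length, 3, selected);
        System.out.println(n + " rows matched"); // rows 0 and 2 match
    }
}
```

Because the case is recorded in the constructor, evaluate() pays for a single 
switch per vector, and each per-case loop stays free of pattern inspection.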


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
