salim achouche created DRILL-5879:
-------------------------------------

             Summary: Optimize "Like" operator
                 Key: DRILL-5879
                 URL: https://issues.apache.org/jira/browse/DRILL-5879
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Relational Operators
         Environment: *
            Reporter: salim achouche
            Assignee: salim achouche
            Priority: Minor
             Fix For: 1.12.0

Query:
select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';

Improvement Opportunity
# Avoid repeating the isAscii computation (a full scan of the input string), since both LIKE predicates evaluate the same column value
# Optimize the "contains" for-loop

Implementation Detail
1)
* Added a new integer variable "asciiMode" to the VarCharHolder class
* The default value is -1, which indicates that this information is not known
* Otherwise the value is set to either 1 or 0
* The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
* The asciiMode is set during the first LIKE evaluation and reused by the other LIKE evaluations

2)
* The "Contains" LIKE operation is quite expensive, as the code needs to access the input string to perform character-based comparisons
* Created 4 versions of the same for-loop to a) make the loop simpler to optimize (vectorization) and b) minimize comparisons

Benchmarks
* Lineitem table 100GB
* Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
* Before changes: 33sec
* After changes: 27sec
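
For illustration only, here is a minimal Java sketch of the two ideas described above. It is not the Drill patch. The Holder class is a hypothetical stand-in for VarCharHolder, and the names isAscii, containsAscii, likeContains and asciiScans are invented for this example. The issue does not say how the 4 loop versions differ, so the sketch assumes they are specialized by pattern length (1, 2, 3, and the general case).

```java
import java.nio.charset.StandardCharsets;

final class LikeContainsSketch {

    /** Hypothetical stand-in for VarCharHolder; only the asciiMode idea comes from the issue. */
    static final class Holder {
        final byte[] bytes;
        int asciiMode = -1; // -1 = not known yet, 1 = all ASCII, 0 = contains non-ASCII bytes

        Holder(String s) {
            bytes = s.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Counts full scans of the input, to show that the cached flag is reused. */
    static int asciiScans = 0;

    /** Scans the input once per holder; later calls return the cached answer. */
    static boolean isAscii(Holder h) {
        if (h.asciiMode == -1) {
            asciiScans++;
            int mode = 1;
            for (byte b : h.bytes) {
                if (b < 0) { // high bit set, so this is not an ASCII byte
                    mode = 0;
                    break;
                }
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == 1;
    }

    /** Byte-wise "contains" with one loop per pattern length (assumed specialization). */
    static boolean containsAscii(byte[] in, byte[] pat) {
        final int m = pat.length;
        if (m == 0) return true;
        if (m > in.length) return false;
        final int last = in.length - m; // last valid start offset
        final byte p0 = pat[0];

        if (m == 1) {
            for (int i = 0; i <= last; i++) {
                if (in[i] == p0) return true;
            }
            return false;
        }
        if (m == 2) {
            final byte p1 = pat[1];
            for (int i = 0; i <= last; i++) {
                if (in[i] == p0 && in[i + 1] == p1) return true;
            }
            return false;
        }
        if (m == 3) {
            final byte p1 = pat[1], p2 = pat[2];
            for (int i = 0; i <= last; i++) {
                if (in[i] == p0 && in[i + 1] == p1 && in[i + 2] == p2) return true;
            }
            return false;
        }
        // General case: check the first byte, then compare the rest of the pattern.
        for (int i = 0; i <= last; i++) {
            if (in[i] != p0) continue;
            int j = 1;
            while (j < m && in[i + j] == pat[j]) j++;
            if (j == m) return true;
        }
        return false;
    }

    /** Evaluates: value LIKE '%pattern%' (pattern is taken literally, no wildcards inside). */
    static boolean likeContains(Holder h, String pattern) {
        if (isAscii(h)) {
            return containsAscii(h.bytes, pattern.getBytes(StandardCharsets.UTF_8));
        }
        // Non-ASCII input: fall back to character-based matching.
        return new String(h.bytes, StandardCharsets.UTF_8).contains(pattern);
    }

    public static void main(String[] args) {
        Holder comment = new Holder("carefully the final deposits");
        // Same shape as the benchmark predicate: not like '%a%' or like '%the%'
        boolean keep = !likeContains(comment, "a") || likeContains(comment, "the");
        System.out.println("keep=" + keep + ", ascii scans=" + asciiScans);
    }
}
```

In this sketch both LIKE evaluations receive the same Holder instance, so the second call finds asciiMode already set and skips the scan. The short-pattern loops have fixed bodies with no inner loop, which is the kind of shape a JIT compiler can usually optimize more easily.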