[ 
https://issues.apache.org/jira/browse/PHOENIX-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Feng updated PHOENIX-5491:
-------------------------------
    Description: 
In WhereOptimizer.pushKeyExpressionsToScan(), has a line of code: 
"extractNodes.addAll(nodesToExtract)"

When executing sqls like "select * from ... where A in (a1, a2, ..., a_n) and B 
= X", saying A in N (N > 100,000) elements, previous code execution will slow 
(> 90s in our environment).

This is because in such case, extractNodes is a HashSet, nodesToExtract is a 
List with N InListExpression (the N InListExpressions are the same instance), 
each InListExpression.values has N elements as well.

HashSet.addAll(list<InListExpression>) will call N times of 
InListExpression.hashCode(). Each time, InListExpression.hashCode() will 
calculate hashCode for every value. Therefore, the time complexity will be N^2.

A simple way to solve it is to remember of the hashCode of InListExpression and 
returns it directly if calculated once. The query will finish in 5 seconds.

  was:
In WhereOptimizer.pushKeyExpressionsToScan(), has a line of code: 
"extractNodes.addAll(nodesToExtract)" When executing sqls like "select * from 
... where A in (a1, a2, ..., a_n) and B = X", saying A in N (N > 100,000) 
elements, previous code execution will slow (> 90s in our environment).

This is because in such case, extractNodes is a HashSet, nodesToExtract is a 
List with N InListExpression (the N InListExpressions are the same instance), 
each InListExpression.values has N elements as well.

HashSet.addAll(list<InListExpression>) will call N times of 
InListExpression.hashCode(). Each time, InListExpression.hashCode() will 
calculate hashCode for every value. Therefore, the time complexity will be N^2.

A simple way to solve it is to remember of the hashCode of InListExpression and 
returns it directly if calculated once.


> Improve performance of InListExpression.hashCode
> ------------------------------------------------
>
>                 Key: PHOENIX-5491
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5491
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.14.3
>            Reporter: Chen Feng
>            Assignee: Chen Feng
>            Priority: Critical
>         Attachments: PHOENIX-5491.patch
>
>
> In WhereOptimizer.pushKeyExpressionsToScan(), has a line of code: 
> "extractNodes.addAll(nodesToExtract)"
> When executing sqls like "select * from ... where A in (a1, a2, ..., a_n) and 
> B = X", saying A in N (N > 100,000) elements, previous code execution will 
> slow (> 90s in our environment).
> This is because in such case, extractNodes is a HashSet, nodesToExtract is a 
> List with N InListExpression (the N InListExpressions are the same instance), 
> each InListExpression.values has N elements as well.
> HashSet.addAll(list<InListExpression>) will call N times of 
> InListExpression.hashCode(). Each time, InListExpression.hashCode() will 
> calculate hashCode for every value. Therefore, the time complexity will be 
> N^2.
> A simple way to solve it is to remember of the hashCode of InListExpression 
> and returns it directly if calculated once. The query will finish in 5 
> seconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to