[ 
https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206559#comment-17206559
 ] 

Mustafa Iman commented on HIVE-24205:
-------------------------------------

I added a simple min/max length check to CuckooSetBytes#lookup. The attached 
file shows some benchmark results.
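
As a rough illustration of the idea (not the actual patch), a length-bounded 
lookup could look like the sketch below. The class name LengthBoundedByteSet 
and its fields are hypothetical, and a plain HashSet stands in for the cuckoo 
hash tables. Only the min/max length short-circuit is the point here.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for CuckooSetBytes; a HashSet replaces the cuckoo tables.
final class LengthBoundedByteSet {
    private final Set<ByteBuffer> values = new HashSet<>();
    private int minLen = Integer.MAX_VALUE; // shortest value inserted so far
    private int maxLen = 0;                 // longest value inserted so far

    void insert(byte[] value) {
        values.add(ByteBuffer.wrap(value.clone()));
        minLen = Math.min(minLen, value.length);
        maxLen = Math.max(maxLen, value.length);
    }

    boolean lookup(byte[] buf, int start, int len) {
        // Early return: no stored value has this length, so skip hashing entirely.
        if (len < minLen || len > maxLen) {
            return false;
        }
        // Only candidates with a plausible length pay for hashing and comparison.
        return values.contains(ByteBuffer.wrap(buf, start, len));
    }
}
```

The check costs two integer comparisons per probe, which is why it should add 
little overhead even when it never rejects anything.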

 

*TPCH_Q12* is a select with an IN clause followed by a join. The selectivity of 
the filter is 30%.

*Synthetic* query is a simple select with an IN clause. The IN list is over two 
of the longest comment fields (both 72 characters wide), so selectivity is very 
high, at about 2%:

select o_orderkey, o_comment
from orders
where o_comment in (
  'jole quickly furiously bold escapades: regular accounts play regular req',
  's foxes. regular warhorses detect fluffily. carefully regular tithes amo',
  'grate ironic, pending sauternes. deposits do are slyly. carefully ironic')

*Synthetic Wide* query is the same as Synthetic, except that the IN clause is 
over one shortest-length and one longest-length comment. Selectivity is still 
high at 4%, but our optimization cannot eliminate any tuples, because every 
comment length falls between the min and max lengths of the IN list.

select o_orderkey, o_comment
from orders
where o_comment in (
  'jole quickly furiously bold escapades: regular accounts play regular req',
  'ts nag furiously. even');
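
To make the difference between the two synthetic queries concrete, the sketch 
below derives the length bounds that a min/max check would work with for a 
given IN list. LengthBounds is a hypothetical helper written for illustration 
only; it is not part of the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: derives the length bounds a min/max check would use.
final class LengthBounds {
    final int min;
    final int max;

    private LengthBounds(int min, int max) {
        this.min = min;
        this.max = max;
    }

    static LengthBounds of(String... inList) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String s : inList) {
            // Byte length, since the lookup works on byte ranges.
            int len = s.getBytes(StandardCharsets.UTF_8).length;
            min = Math.min(min, len);
            max = Math.max(max, len);
        }
        return new LengthBounds(min, max);
    }

    // True when a row of this length can be rejected without hashing.
    boolean canSkip(int len) {
        return len < min || len > max;
    }
}
```

When every literal in the IN list has the same length, the bounds collapse to 
a single value and any row of a different length is skipped. When the list 
mixes a shortest and a longest comment, the bounds cover every length in the 
column, so canSkip never returns true and every row is still hashed.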

 

The patch outperforms the original code by 50% on the synthetic query. For TPCH 
Q12, there is no meaningful difference between the two runs. My conclusion is 
that the optimization has very low overhead and gives a significant performance 
improvement in certain cases.

I also implemented a vectorized version of the early return from 
CuckooSetBytes. It is attached as vectorized.patch. However, in all cases the 
simpler patch outperforms the vectorized one.
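
The contents of vectorized.patch are not reproduced here. As one possible 
shape for a batch-level early return, the sketch below filters a whole batch by 
length before any hashing happens. The class name LengthPrefilter, the method 
name, and the parallel lengths/selected arrays are assumptions modelled loosely 
on a column batch with a selection vector, not code taken from the patch.

```java
// Hypothetical batch pre-filter; names and layout are illustrative assumptions.
final class LengthPrefilter {
    /**
     * Compacts selected[0..size) in place, keeping only the row indices whose
     * value length lies within [minLen, maxLen]. Returns the new selected size.
     */
    static int filterByLength(int[] lengths, int[] selected, int size,
                              int minLen, int maxLen) {
        int newSize = 0;
        for (int j = 0; j < size; j++) {
            int row = selected[j];
            int len = lengths[row];
            if (len >= minLen && len <= maxLen) {
                selected[newSize++] = row; // row survives and still needs a hash probe
            }
        }
        return newSize;
    }
}
```

A second pass would then probe the set only for the surviving rows. The extra 
pass over the batch is a plausible reason for a design like this to lose to 
the in-line check, though the benchmark does not isolate the cause.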

> Optimise CuckooSetBytes
> -----------------------
>
>                 Key: HIVE-24205
>                 URL: https://issues.apache.org/jira/browse/HIVE-24205
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Mustafa Iman
>            Priority: Major
>         Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, 
> vectorized.patch
>
>
> {{FilterStringColumnInList}}, {{StringColumnInList}}, etc. use CuckooSetBytes 
> for lookup.
> !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508!
> One option to optimize would be to add boundary conditions on "length" with 
> the min/max length stored in the hashes (ref: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]).
> This would significantly reduce the number of hash computations that need 
> to happen, e.g. in 
> [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
