[
https://issues.apache.org/jira/browse/HIVE-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157994#comment-16157994
]
Ke Jia commented on HIVE-17139:
-------------------------------
I uploaded the latest patch to fix the failed tests; the remaining failed tests
do not seem to be related to the patch.
I tested the patch on the product_reviews table of TPCx-BB using the following
SQL statement:
{code:sql}
select case when pr_review_rating=4 then upper(pr_review_content)
            when pr_review_rating=3 then upper(pr_review_content) end
from product_reviews;
{code}
The cluster has 8 nodes with 230G per node. The CPU is Intel(R) Xeon(R) CPU
E5-2699.
With a 3TB data scale and Spark as the execution engine, the results are as
follows:
|| ||without patch||with patch||improvement(s)||improvement(%)||
|HoS|28.25s|16.14s|12.11s|42.8%|
|VectorSelectOperator|12.58s|2.99s|9.59s|76.2%|
The results show that the Spark execution time drops from 28.25s to 16.14s and
the time spent in VectorSelectOperator drops from 12.58s to 2.99s.
The counts of total records, "pr_review_rating=4" records and
"pr_review_rating=3" records are as follows:
|| ||count||
|total records|9934636|
|pr_review_rating=4 records|1897804|
|pr_review_rating=3 records|792278|
With this patch, only 1897804 + 792278 = 2690082 records go through the upper
operation in the above SQL statement. Without this patch, 9934636 + 9934636 =
19869272 records go through it, because both branches are evaluated for every
record.
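
The difference in the number of upper calls can be illustrated with a minimal, self-contained Java sketch. This is not the patch and it does not use Hive's classes: the class name ConditionalEvalSketch, its methods and the five sample rows are hypothetical, and the "selected" array here only imitates the idea of a per-batch selected array.

```java
import java.util.Arrays;
import java.util.Locale;

public class ConditionalEvalSketch {
    // Counts how often the (hypothetical) upper() child expression is evaluated.
    static long upperCalls = 0;

    static String upper(String s) {
        upperCalls++;
        return s.toUpperCase(Locale.ROOT);
    }

    // Unoptimized: each THEN expression is evaluated for every row,
    // and the condition only picks which precomputed value to keep.
    static String[] evalAllRows(int[] rating, String[] content) {
        int n = rating.length;
        String[] then4 = new String[n];
        String[] then3 = new String[n];
        for (int i = 0; i < n; i++) then4[i] = upper(content[i]);
        for (int i = 0; i < n; i++) then3[i] = upper(content[i]);
        String[] out = new String[n];
        for (int i = 0; i < n; i++) {
            out[i] = rating[i] == 4 ? then4[i] : (rating[i] == 3 ? then3[i] : null);
        }
        return out;
    }

    // Optimized: after each condition is evaluated, the matching row indexes
    // are written to a selected array, and the THEN expression runs only on them.
    // The two conditions are mutually exclusive, so no row is selected twice.
    static String[] evalSelectedRows(int[] rating, String[] content) {
        int n = rating.length;
        String[] out = new String[n]; // unmatched rows stay null (no ELSE branch)
        int[] selected = new int[n];
        for (int wanted : new int[] {4, 3}) {
            int size = 0;
            for (int i = 0; i < n; i++) {
                if (rating[i] == wanted) selected[size++] = i;
            }
            for (int j = 0; j < size; j++) {
                int row = selected[j];
                out[row] = upper(content[row]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] rating = {4, 3, 5, 4, 1};
        String[] content = {"good", "ok", "great", "nice", "bad"};

        upperCalls = 0;
        String[] all = evalAllRows(rating, content);
        System.out.println("all rows:      " + Arrays.toString(all) + " calls=" + upperCalls);

        upperCalls = 0;
        String[] sel = evalSelectedRows(rating, content);
        System.out.println("selected rows: " + Arrays.toString(sel) + " calls=" + upperCalls);
    }
}
```

Both methods return the same result array; only the number of upper calls differs, which is the same effect as the record counts above.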
> Conditional expressions optimization: skip the expression evaluation if the
> condition is not satisfied for vectorization engine.
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-17139
> URL: https://issues.apache.org/jira/browse/HIVE-17139
> Project: Hive
> Issue Type: Improvement
> Reporter: Ke Jia
> Assignee: Ke Jia
> Attachments: HIVE-17139.1.patch, HIVE-17139.2.patch,
> HIVE-17139.3.patch, HIVE-17139.4.patch, HIVE-17139.5.patch,
> HIVE-17139.6.patch, HIVE-17139.7.patch, HIVE-17139.8.patch
>
>
> The execution of CASE WHEN and IF statements in Hive vectorization is not
> optimal: in the current implementation, all the conditional and else
> expressions are evaluated. The optimized approach is to update the selected
> array of the batch parameter after the conditional expression is executed, so
> that the else expression only processes the selected rows instead of all rows.