[
https://issues.apache.org/jira/browse/HIVE-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor resolved HIVE-24930.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
> Operator.setDone() short-circuit from child op is not used in vectorized
> codepath (if childSize == 1)
> -----------------------------------------------------------------------------------------------------
>
> Key: HIVE-24930
> URL: https://issues.apache.org/jira/browse/HIVE-24930
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> This looks like a possible performance regression for queries with a LIMIT.
> Consider the following query:
> {code}
> explain vectorization detail select
> ws_item_sk item_sk, d_date,
> sum(ws_sales_price) over (partition by ws_item_sk order by d_date range
> between 10 preceding and current row) cume_sales,
> last_value(ws_sales_price) over (partition by ws_item_sk order by d_date
> range between 10 preceding and current row) last_price
> from web_sales
> ,date_dim
> where ws_sold_date_sk=d_date_sk
> and d_month_seq between 1214 and 1214+11
> and ws_item_sk is not NULL
> group by ws_item_sk, d_date, ws_sales_price
> limit 100;
> {code}
> With vectorized PTF (note: the issue is independent of the PTF operator
> itself), the whole pipeline processes all the rows, which leads to a serious
> performance regression (note the 1439591782 runtime rows for every operator
> except limit).
> non-vectorized:
> {code}
> set hive.vectorized.execution.ptf.enabled=false;
> ...
> | Select Operator |
> | Statistics: Num rows: 1415172503/1439591782 Data size:
> 248969569264 Basic stats: COMPLETE Column stats: COMPLETE |
> | PTF Operator |
> | Statistics: Num rows: 1415172503/449131 Data size:
> 248969569264 Basic stats: COMPLETE Column stats: COMPLETE |
> | Select Operator |
> | Statistics: Num rows: 1415172503/11526 Data size:
> 565867418560 Basic stats: COMPLETE Column stats: COMPLETE |
> {code}
> vectorized:
> {code}
> set hive.vectorized.execution.ptf.enabled=true;
> ...
> | Select Operator |
> | Statistics: Num rows: 1415172503/1439591782 Data size:
> 248969569264 Basic stats: COMPLETE Column stats: COMPLETE |
> | PTF Operator |
> | Statistics: Num rows: 1415172503/1439591782 Data size:
> 248969569264 Basic stats: COMPLETE Column stats: COMPLETE |
> | Select Operator |
> | Statistics: Num rows: 1415172503/1439591782 Data size:
> 565867418560 Basic stats: COMPLETE Column stats: COMPLETE |
> | File Output Operator |
> | Statistics: Num rows: 100/11300 Data size: 40000
> Basic stats: COMPLETE Column stats: COMPLETE |
> {code}
> This is because the following short-circuit (from vectorForward) is missing
> when childSize == 1:
> {code}
> // if all children are done, this operator is also done
> if (childrenDone != 0 && childrenDone == childOperatorsArray.length) {
>   setDone(true);
> }
> {code}
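The effect of the missing check can be illustrated with a toy pipeline. This is only a minimal sketch, not Hive's code: the classes `Op` and `Limit`, the `run` helper, and the `checkChildren` switch are hypothetical names invented for this illustration. The sketch assumes that a driver stops feeding rows once the first operator reports done, and that an operator only notices a finished child on the row after the child finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShortCircuitSketch {

  /** Simplified stand-in for an operator; NOT Hive's Operator class. */
  static class Op {
    final List<Op> children = new ArrayList<>();
    final boolean checkChildren; // hypothetical switch: apply the short-circuit or not
    boolean done = false;
    long rowsProcessed = 0;

    Op(boolean checkChildren) {
      this.checkChildren = checkChildren;
    }

    void process(int row) {
      rowsProcessed++;
      forward(row);
    }

    void forward(int row) {
      int childrenDone = 0;
      for (Op child : children) {
        if (child.done) {
          childrenDone++;       // a finished child receives no more rows
        } else {
          child.process(row);
        }
      }
      // if all children are done, this operator is also done
      if (checkChildren && childrenDone != 0 && childrenDone == children.size()) {
        done = true;
      }
    }
  }

  /** Toy limit operator: marks itself done once it has seen `limit` rows. */
  static class Limit extends Op {
    final int limit;

    Limit(int limit) {
      super(true);
      this.limit = limit;
    }

    @Override
    void process(int row) {
      rowsProcessed++;
      if (rowsProcessed >= limit) {
        done = true;
      }
    }
  }

  /** Feeds rows through first -> second -> limit and returns the rows each one saw. */
  static long[] run(boolean shortCircuit, int totalRows, int limit) {
    Op first = new Op(shortCircuit);
    Op second = new Op(shortCircuit);
    Op last = new Limit(limit);
    first.children.add(second);
    second.children.add(last);
    // the driver stops as soon as the first operator is done
    for (int row = 0; row < totalRows && !first.done; row++) {
      first.process(row);
    }
    return new long[] {first.rowsProcessed, second.rowsProcessed, last.rowsProcessed};
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(run(true, 10, 3)));  // short-circuit applied
    System.out.println(Arrays.toString(run(false, 10, 3))); // short-circuit missing
  }
}
```

With 10 input rows and a limit of 3, the sketch reports 5, 4 and 3 rows for the three operators when the check is applied, and 10, 10 and 3 when it is not. This mirrors the shape of the runtime row counts in the two plans above: slightly growing counts upstream of the limit in the non-vectorized plan, and the full row count everywhere in the vectorized one.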