[ 
https://issues.apache.org/jira/browse/HIVE-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal Vijayaraghavan updated HIVE-23541:
----------------------------------------
    Affects Version/s: 4.0.0
                       3.1.2

> Vectorization: Unbounded following window function start producing results 
> too early
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-23541
>                 URL: https://issues.apache.org/jira/browse/HIVE-23541
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0, 3.1.2
>            Reporter: Gopal Vijayaraghavan
>            Priority: Major
>
> ReduceRecordSource signals the end of a group in the reducer input whenever 
> the entire key changes.
> ReduceRecordSource::processVectorGroup calls 
> reducer.setNextVectorBatchGroupStatus(/* isLastGroupBatch */ true); when the 
> last batch of a group is being processed.
> However, for PTF window functions with unbounded following, this is triggered 
> by the key changing and not by the partition changing.
> As a result, VectorPTFOperator treats a change in the sort key as a change of 
> the partition key and starts producing results too early.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFOperator.java#L399
> {code}
> create temporary table test2(id STRING,name STRING,event_dt date) stored as 
> orc;
> insert into test2 values ('100','A','2019-08-15'), ('100','A','2019-10-12');
> SELECT name, event_dt, first_value(event_dt) over (PARTITION BY name ORDER BY 
> event_dt desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) last_event_dt 
> FROM test2; -- streaming FIRST_VALUE with DESCENDING
> SELECT name, event_dt, last_value(event_dt) over (PARTITION BY name ORDER BY 
> event_dt asc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) 
> last_event_dt FROM test2; -- non-streaming LAST_VALUE with ASCENDING
> {code}
> These two queries should return identical results. The streaming version 
> should also be significantly faster than the non-streaming one, because 
> streaming does not need to buffer or spill rows.
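
To make the expected equivalence of the two queries concrete, here is a small Python sketch that evaluates both window definitions by hand on the sample rows from the issue. It is only a model of the SQL semantics, not Hive code; the helper names (partitions, first_value_desc_running, last_value_asc_full) are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the issue, reduced to (name, event_dt).
rows = [("A", date(2019, 8, 15)), ("A", date(2019, 10, 12))]

def partitions(rows):
    """Group event dates by the PARTITION BY column (name)."""
    parts = defaultdict(list)
    for name, event_dt in rows:
        parts[name].append(event_dt)
    return parts

def first_value_desc_running(rows):
    """first_value over ORDER BY event_dt DESC,
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    out = []
    for name, dts in partitions(rows).items():
        ordered = sorted(dts, reverse=True)
        for i, dt in enumerate(ordered):
            frame = ordered[: i + 1]          # partition start .. current row
            out.append((name, dt, frame[0]))  # first row of the frame
    return out

def last_value_asc_full(rows):
    """last_value over ORDER BY event_dt ASC,
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING."""
    out = []
    for name, dts in partitions(rows).items():
        ordered = sorted(dts)
        for dt in ordered:
            frame = ordered                    # the whole partition
            out.append((name, dt, frame[-1]))  # last row of the frame
    return out

streaming = first_value_desc_running(rows)
non_streaming = last_value_asc_full(rows)
```

Ignoring output order, both lists contain the same rows, and every row carries the latest event_dt of its partition. The running frame never needs more than the first row it has seen, which is why the first form can be streamed.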

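The grouping problem described in the issue can be reproduced with a toy model. The sketch below assumes, as a simplification, that the operator buffers rows until it is told a group has ended and then emits last_value for the buffered rows. It does not use or mirror the real ReduceRecordSource or VectorPTFOperator code; last_value_unbounded and group_key are hypothetical names.

```python
from itertools import groupby

# Reducer input, already sorted by (partition key, sort key): (name, event_dt).
rows = [("A", "2019-08-15"), ("A", "2019-10-12")]

def last_value_unbounded(rows, group_key):
    """Buffer rows until group_key changes, then emit last_value
    over the buffered rows (an unbounded-following frame)."""
    out = []
    for _, group in groupby(rows, key=group_key):
        buffered = list(group)
        last = buffered[-1][1]  # event_dt of the last buffered row
        out.extend((name, dt, last) for name, dt in buffered)
    return out

# Group ends whenever the entire key (partition key + sort key) changes.
buggy = last_value_unbounded(rows, group_key=lambda r: r)

# Group ends only when the partition key (name) changes.
expected = last_value_unbounded(rows, group_key=lambda r: r[0])
```

With the whole key as the group boundary, each row forms its own group and is flushed with its own event_dt as last_value. With the partition key as the boundary, both rows are flushed together and report 2019-10-12.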


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
