Ankit Kamboj created HIVE-7989:
----------------------------------

             Summary: Optimize Windowing function performance for row frames
                 Key: HIVE-7989
                 URL: https://issues.apache.org/jira/browse/HIVE-7989
             Project: Hive
          Issue Type: Improvement
          Components: Windows
    Affects Versions: 0.13.0
            Reporter: Ankit Kamboj


To find aggregate value for each row, current windowing function implementation 
creates a new aggregation buffer for each row, iterates over all the rows in 
respective window frame, puts them in buffer and then finds the aggregated 
value. This causes bottleneck for partitions with huge number of rows because 
this process runs in n-square complexity (n being rows in a partition) for each 
partition. So, if there are multiple partitions in a dataset, each with 
millions of rows, aggregation for all rows will take days to finish.

There is scope of optimization for row frames, for following cases:

a) For UNBOUNDED PRECEDING start and bounded end: Instead of iterating on 
window frame again for each row, we can slide the end one row at a time and 
aggregate, since we know the start is fixed for each row. This will have 
running time linear to the size of partition (O(n)).

b) For bounded start and UNBOUNDED FOLLOWING end: Instead of iterating on 
window frame again for each row, we can slide the start one row at a time and 
aggregate in reverse, since we know the end is fixed for each row. This will 
have running time linear to the size of partition (O(n)).

Also, In general for both row and value frames, we don't need to iterate over 
the range and re-create aggregation buffer if the start as well as end remain 
same. Instead, can re-use the previously created aggregation buffer.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to