[ https://issues.apache.org/jira/browse/HIVE-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt McCline updated HIVE-7405:
-------------------------------

    Description: 
Vectorize the basic case, which has no COUNT(DISTINCT(..)) aggregation.

Add a 4th processing mode to VectorGroupByOperator for the reduce side, in 
which each input VectorizedRowBatch holds values for only one key at a time, 
so the values in the batch can be aggregated quickly.

  was:

Take advantage of the fact that in most plans a reduce-side GroupBy receives 
the group keys in sorted order, so aggregation can be done "streaming" without 
large buffering of intermediate aggregation state in memory or storage.

Push any case requiring large buffering, e.g. COUNT(DISTINCT(..)), to part 2 
of Vectorize Reduce-Side GroupBy.  In theory, if there is only one 
COUNT(DISTINCT(..)), the optimizer could sort on the distinct column(s) as a 
subordinate sort key and count the distinct values as a "streaming" 
operation.  Then only multiple COUNT(DISTINCT(..)) aggregations would require 
large buffering.

        Summary: Vectorize GROUP BY on the Reduce-Side (Part 1 – Basic)
                 (was: Vectorize Reduce-Side GroupBy)

> Vectorize GROUP BY on the Reduce-Side (Part 1 – Basic)
> ------------------------------------------------------
>
>                 Key: HIVE-7405
>                 URL: https://issues.apache.org/jira/browse/HIVE-7405
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>
> Vectorize the basic case, which has no COUNT(DISTINCT(..)) aggregation.
> Add a 4th processing mode to VectorGroupByOperator for the reduce side, in 
> which each input VectorizedRowBatch holds values for only one key at a time, 
> so the values in the batch can be aggregated quickly.
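
For illustration only, here is a minimal Java sketch of the idea in the 
description: when every batch carries values for a single key and keys arrive 
in sorted order, a SUM can be kept as one running value and emitted when the 
key changes.  The class StreamingGroupBySketch, its Result type, and the 
processBatch/close methods are hypothetical names invented for this sketch; 
they are not Hive's VectorGroupByOperator or VectorizedRowBatch API, and the 
sketch assumes non-null String keys and a single long column.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of streaming aggregation; not Hive code. */
public class StreamingGroupBySketch {

    /** One (key, SUM) output row. */
    public static final class Result {
        public final String key;
        public final long sum;
        Result(String key, long sum) { this.key = key; this.sum = sum; }
    }

    private final List<Result> output = new ArrayList<>();
    private String currentKey = null;
    private long currentSum = 0;
    private boolean hasGroup = false;

    /**
     * Consume one batch in which every value belongs to the same key.
     * Assumes keys are non-null and arrive in sorted (grouped) order.
     */
    public void processBatch(String key, long[] values, int size) {
        // A key change closes the previous group; nothing else is buffered.
        if (hasGroup && !key.equals(currentKey)) {
            flush();
        }
        currentKey = key;
        hasGroup = true;
        // Tight loop: no per-row key comparison or hash-table lookup.
        for (int i = 0; i < size; i++) {
            currentSum += values[i];
        }
    }

    /** Emit the last open group once the input is exhausted. */
    public List<Result> close() {
        if (hasGroup) {
            flush();
        }
        return output;
    }

    private void flush() {
        output.add(new Result(currentKey, currentSum));
        currentSum = 0;
        hasGroup = false;
    }

    public static void main(String[] args) {
        StreamingGroupBySketch op = new StreamingGroupBySketch();
        op.processBatch("a", new long[] {1, 2, 3}, 3);
        op.processBatch("a", new long[] {4, 0, 0}, 1); // same key, second batch
        op.processBatch("b", new long[] {10, 20}, 2);
        for (Result r : op.close()) {
            System.out.println(r.key + " " + r.sum);
        }
    }
}
```

In this sketch a group may span several batches, and the only state held 
between batches is the current key and its running sum, which is why the 
sorted-key case needs no large buffering.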



--
This message was sent by Atlassian JIRA
(v6.2#6252)
