[ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630055#action_12630055 ]

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

Namit and I went over this. The following are the comments:

1. In OpForward, instead of copying the input row resolver, we could reuse 
the same one.
2. In the first job we evaluate all the input columns, plus all the group-by 
clause expressions, plus the parameters to all the aggregation functions, and 
we do not eliminate the duplicates among them, because expression resolution 
and column resolution are treated differently (to be fixed in a later txn).
3. We could be smarter about which parameters we evaluate in the first stage: 
we should only evaluate those that are common across the group-by clauses 
(to be fixed in a later txn); see the sketch after this list.
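To make the shared first stage concrete, here is a rough HiveQL sketch of the 
rewrite for the query in the issue description below. It is not taken from the 
patch: TMP1 is a hypothetical intermediate table, and the idea is only that the 
first job evaluates the common expressions and partially aggregates on the 
distinct value once, so both group bys read the partial result instead of 
rescanning SRC.

  -- Stage 1 (shared): evaluate the common expressions once and
  -- partially aggregate on the distinct value
  FROM SRC
  INSERT OVERWRITE TABLE TMP1
  SELECT SRC.key, SUBSTR(SRC.value,4) AS dval
  GROUP BY SRC.key, SUBSTR(SRC.value,4);

  -- Stage 2: each group by reads the deduplicated rows, so count(dval)
  -- here equals count(distinct SUBSTR(SRC.value,4)) over SRC
  FROM TMP1
  INSERT OVERWRITE TABLE DEST1
  SELECT TMP1.key, count(TMP1.dval) GROUP BY TMP1.key
  INSERT OVERWRITE TABLE DEST2
  SELECT TMP1.key, count(TMP1.dval) GROUP BY TMP1.key;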


> [Hive] multi group by statement is not optimized
> ------------------------------------------------
>
>                 Key: HADOOP-4139
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4139
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> A simple multi-group-by statement is not optimized. A statement like:
> FROM SRC
> INSERT OVERWRITE TABLE DEST1
> SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key
> INSERT OVERWRITE TABLE DEST2
> SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key;
> results in making 2 copies of the data (SRC). Instead, the data can first be 
> partially aggregated on the distinct value and then fully aggregated. 
> That first step can be common to all the group bys.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
