[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Ashish Thusoo (JIRA) Wed, 10 Sep 2008 09:17:05 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629845#action_12629845
 ]


Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

I should be done reviewing this in a couple of hours.

A few minor comments though:

1. In the tests we should drop the created destination tables. At some point we 
want to ensure that the cleanup code for a test is isolated within the test. 
(This is minor; I am OK with it as is for now.)
2. Can the check that disallows different distincts be moved up, potentially 
even before we generate the groupbyPlan? There is no point going through the 
entire processing if we can reject the query right up front.
3. A comment describing the algorithm somewhere would also be great.


> [Hive] multi group by statement is not optimized
> ------------------------------------------------
>
>                 Key: HADOOP-4139
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4139
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> A simple multi-group by statement is not optimized. A simple statement like:
> FROM SRC
> INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct  
> SUBSTR(SRC.value,4)) GROUP BY SRC.key
> INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct  
> SUBSTR(SRC.value,4)) GROUP BY SRC.key;
> results in making 2 copies of the data (SRC). Instead, the data can be first 
> partially aggregated on the distinct value and then aggregated. 
> The first step can be common to all group bys.
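
The two-step idea in the description above can be sketched in a few lines of Python. This is only an illustration of the shared partial aggregation, not the plan Hive actually generates: the function names and the sample rows are hypothetical, and SUBSTR(value, 4) is assumed to be 1-based, so it is modeled as value[3:].

```python
def partial_aggregate(rows, distinct_expr):
    """Shared first step: one pass over SRC that keeps each
    (group key, distinct value) pair exactly once."""
    return {(key, distinct_expr(value)) for key, value in rows}


def count_distinct_by_key(pairs):
    """Second step: count the already-deduplicated pairs per group key."""
    counts = {}
    for key, _ in pairs:
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample rows standing in for SRC(key, value).
src = [("a", "val_1"), ("a", "val_1"), ("a", "val_2"), ("b", "val_3")]

# SRC is scanned once; both destinations reuse the same partial result.
shared = partial_aggregate(src, lambda v: v[3:])  # assumed SUBSTR(value, 4)
dest1 = count_distinct_by_key(shared)
dest2 = count_distinct_by_key(shared)
```

Because both inserts use the same distinct expression, the deduplicated pairs are computed once and each destination only runs the cheap final count, which is why the first step can be common to all group bys.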

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
