[ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629845#action_12629845 ]
Ashish Thusoo commented on HADOOP-4139:
---------------------------------------
I should be done reviewing this in a couple of hours.
A few minor comments, though:
1. The tests should drop the destination tables they create. At some point we
want to ensure that the cleanup code for a test is isolated within that test.
(This is minor; I am OK with it as is for now.)
2. Can the check that disallows different distincts be moved up, potentially
even before we generate the groupbyPlan? There is no point going through the
entire processing if we can disallow it right up front.
3. A comment describing the algorithm somewhere would also be great.
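To illustrate comment 2, here is a minimal Python sketch of the fail-fast idea: validate the distinct expressions first, and only then do the expensive plan generation. The function names `check_same_distinct` and `plan_multi_group_by` are hypothetical and do not correspond to anything in the patch or in Hive.

```python
def check_same_distinct(distinct_exprs):
    """Raise ValueError unless all group bys share one DISTINCT expression.

    distinct_exprs holds one entry per group by; None means "no distinct".
    Hypothetical helper, not part of Hive.
    """
    unique = {expr for expr in distinct_exprs if expr is not None}
    if len(unique) > 1:
        raise ValueError("different distincts are not allowed: %s" % sorted(unique))


def plan_multi_group_by(distinct_exprs):
    """Run the cheap validation before any plan is generated."""
    check_same_distinct(distinct_exprs)  # fail fast, before the expensive work
    # Stand-in for the real plan generation step.
    return ["group-by plan %d" % i for i, _ in enumerate(distinct_exprs)]


plans = plan_multi_group_by(["SUBSTR(SRC.value,4)", "SUBSTR(SRC.value,4)"])
```

With this ordering, a query that mixes different distinct expressions is rejected before any plan objects are built.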
> [Hive] multi group by statement is not optimized
> ------------------------------------------------
>
> Key: HADOOP-4139
> URL: https://issues.apache.org/jira/browse/HADOOP-4139
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hive
> Reporter: Namit Jain
> Assignee: Namit Jain
> Attachments: patch1
>
>
> A simple multi-group by statement is not optimized. A simple statement like:
> FROM SRC
> INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct
> SUBSTR(SRC.value,4)) GROUP BY SRC.key
> INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct
> SUBSTR(SRC.value,4)) GROUP BY SRC.key;
> results in making 2 copies of the data (SRC). Instead, the data can be first
> partially aggregated on the distinct value and then aggregated.
> The first step can be common to all group bys.
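
The two-step idea in the description can be sketched in plain Python. This is only one possible reading of "partially aggregated on the distinct value and then aggregated", not the patch's actual implementation; the helper names and the sample rows are made up for illustration.

```python
from collections import defaultdict


def substr(value, start):
    """Mimic Hive's SUBSTR(str, start), which uses a 1-based start index."""
    return value[start - 1:]


def partial_aggregate(rows):
    """Step 1 (shared): a single pass over SRC that keeps each
    (key, distinct value) pair once."""
    return {(key, substr(value, 4)) for key, value in rows}


def count_distinct_by_key(pairs):
    """Step 2 (per group by): count the deduplicated values for each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


# Hypothetical SRC rows as (key, value) tuples.
src = [("a", "val1"), ("a", "val1"), ("a", "val2"), ("b", "val3")]

shared = partial_aggregate(src)        # SRC is read once
dest1 = count_distinct_by_key(shared)  # first INSERT ... GROUP BY SRC.key
dest2 = count_distinct_by_key(shared)  # second INSERT ... GROUP BY SRC.key
```

Both destination results are computed from the same deduplicated set, so SRC is scanned once rather than once per group by.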