[jira] [Commented] (HIVE-1772) optimize join followed by a groupby

Radhika Malik (JIRA) Sat, 05 May 2012 00:10:19 -0700

    [ https://issues.apache.org/jira/browse/HIVE-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268913#comment-13268913 ]


Radhika Malik commented on HIVE-1772:
-------------------------------------

A group of us is trying to do this for a class project. We want to perform the 
JOIN and the GROUP BY that follows it in a single map-reduce job, as follows:
The Map job stays the same: it contains the two TableScanOperators (plus any 
FilterOperators) and the two ReduceSinkOperators.
The Reduce job computes the join in the JoinOperator and, in the same pass, 
groups the results and evaluates any aggregates. It then pushes the results 
directly to a FileSinkOperator, without a separate GroupByOperator.
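
To make the idea concrete, here is a rough model of the fused reduce step in 
plain Python. It is only a sketch, not Hive code: the function name, the 
(key, tag, value) row layout and the tag values 0/1 are made up for 
illustration, and it assumes the GROUP BY key is the same as the join key (as 
in the query from the issue description) and that the only aggregate is 
count(1).

```python
from itertools import groupby
from operator import itemgetter


def join_then_count(shuffled_rows):
    """Model of a reducer that joins and aggregates in one pass.

    shuffled_rows: iterable of (key, tag, value) tuples, already sorted by
    key, as the shuffle would deliver them. tag 0 marks a row from the left
    table, tag 1 a row from the right table (hypothetical convention).
    Yields (key, count) pairs, i.e. the result of
    SELECT key, count(1) ... JOIN ... GROUP BY key.
    """
    for key, rows in groupby(shuffled_rows, key=itemgetter(0)):
        left = right = 0
        for _, tag, _ in rows:
            if tag == 0:
                left += 1
            else:
                right += 1
        # An inner join emits left * right rows for this key, so count(1)
        # can be computed without materializing the joined rows.
        joined = left * right
        if joined:
            yield key, joined


# Small example: key "a" matches on both sides, "b" and "c" do not.
rows = sorted(
    [("a", 0, "x1"), ("a", 1, "y1"), ("a", 1, "y2"),
     ("b", 0, "x2"), ("c", 1, "y3")],
    key=itemgetter(0),
)
result = list(join_then_count(rows))
```

Because all rows for one join key arrive at the reducer together, the 
aggregate for that key is complete once the key's group ends, which is why no 
second shuffle would be needed in this case.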

Does anyone have suggestions on where we can get started in the code? Looking 
at Hive's architecture overview, it seems we need to change the Query Plan 
Generator in the compiler so that it generates different map-reduce tasks for 
queries that include a JOIN followed by a GROUP BY. We are thinking of starting 
by modifying src/ql/src/java/org/apache/hadoop/hive/ql/QueryPlan.java 
but aren't sure whether this is the right approach. Any input on how you think 
we should approach this would be great!
                
> optimize join followed by a groupby
> -----------------------------------
>
>                 Key: HIVE-1772
>                 URL: https://issues.apache.org/jira/browse/HIVE-1772
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Navis
>         Attachments: HIVE-1772.1.patch
>
>
> explain SELECT x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) 
> group by x.key;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1
>   Stage-0 is a root stage
> The above query issues 2 map-reduce jobs. 
> The first MR job performs the join, whereas the second MR performs the group 
> by.
> Since the data is already sorted, the group by can be performed in the 
> reducer of the join itself.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
