[ 
https://issues.apache.org/jira/browse/HIVE-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964676#action_12964676
 ] 

Sreekanth Ramakrishnan commented on HIVE-1695:
----------------------------------------------

Currently, a query with a MapJoin followed by a ReduceSink is processed in two 
stages.

Stage-1: MapJoin + Select operator. The plan is split at this point when a 
select operator is encountered immediately after the MapJoin. A 
FileSinkOperator is added immediately after the MapJoin and the select 
operator is removed from the tree.

Stage-2: MapJoin + ReduceSink processor. The work for this stage is initialized 
from the previous stage: the output of the FileSinkOperator is used as the 
input for this stage, and a select operator is added for the columns to be 
used in the reduce stage, along with ordering and other information.

To collapse the two stages into a single stage, we would need to do the 
following:

After Stage-1 processing is done, i.e. after the NodeProcessor from 
MapJoinFactory.MapJoin is run and the next stage NodeProcessor is called, we 
need to:

# In GenMRRedSink4, access the current MapJoin operator and remove the 
FileSinkOperator that was added to mark the end of the stage.
# Add a reduce operator to the same MapJoin operator to pass the expression 
and the sort order to be used by the reducer.
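
To make the two steps above concrete, here is a rough sketch of the rewrite on 
a toy operator tree. It does not use Hive's real operator classes or the 
GenMRRedSink4 code; Op, collapse and render are made-up names for 
illustration only, and operators are matched by a plain string name, which is 
an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class CollapseSketch {
    // Hypothetical stand-in for an operator node; not Hive's Operator API.
    static class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) { this.name = name; }

        // Attach a child and return this node so calls can be chained.
        Op add(Op child) { children.add(child); return this; }
    }

    // Step 1 and 2 in one pass: find the FileSink child that marks the end
    // of the map-only stage and put the given reduce operator in its place.
    // Returns false if the MapJoin has no FileSink child.
    static boolean collapse(Op mapJoin, Op reduceSink) {
        for (int i = 0; i < mapJoin.children.size(); i++) {
            if (mapJoin.children.get(i).name.equals("FileSink")) {
                mapJoin.children.set(i, reduceSink);
                return true;
            }
        }
        return false;
    }

    // Render a tree as name[child,child] for inspection.
    static String render(Op op) {
        if (op.children.isEmpty()) {
            return op.name;
        }
        StringBuilder sb = new StringBuilder(op.name).append("[");
        for (int i = 0; i < op.children.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append(render(op.children.get(i)));
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        Op mapJoin = new Op("MapJoin").add(new Op("FileSink"));
        System.out.println("before: " + render(mapJoin));
        collapse(mapJoin, new Op("ReduceSink"));
        System.out.println("after:  " + render(mapJoin));
    }
}
```

The sketch only swaps nodes in the tree. In the real plan the reduce operator 
would also have to carry the key expressions and sort order, and the task for 
the second stage would no longer be created.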

Thoughts on the above approach? 

Also, would adding the reduce operator at the end of the MapJoin cause any 
regressions? Is there a cleaner way of doing the same, i.e. by adding a new 
rule for processing?

> MapJoin followed by ReduceSink should be done as single MapReduce Job
> ---------------------------------------------------------------------
>
>                 Key: HIVE-1695
>                 URL: https://issues.apache.org/jira/browse/HIVE-1695
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Amareshwari Sriramadasu
>
> Currently MapJoin followed by ReduceSink runs as two MapReduce jobs : One map 
> only job followed by a Map-Reduce job. It can be combined into single 
> MapReduce Job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
