Remove redundant map-reduce job for merge join
----------------------------------------------

                 Key: PIG-1116
                 URL: https://issues.apache.org/jira/browse/PIG-1116
             Project: Pig
          Issue Type: Bug
            Reporter: Daniel Dai


In merge join, when we convert the right-hand side input into a side file, we
do not remove it from the map-reduce plan; we only disconnect it from the plan.
When we run the query, the redundant load still reads the data but does nothing
with it. This operation should be removed entirely.
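
The difference between disconnecting and removing can be sketched with a toy plan model. This is only an illustration, not Pig's code: the `Plan` class, its methods, and the assumption that node 1-21 originally fed node 1-20 are hypothetical, and only the node names are taken from the explain output in this report.

```python
# Toy model of a job plan. Class and method names are hypothetical,
# not Pig's actual plan API.
class Plan:
    def __init__(self):
        self.nodes = set()
        self.edges = set()  # (from_node, to_node) pairs

    def add(self, node):
        self.nodes.add(node)

    def connect(self, src, dst):
        self.edges.add((src, dst))

    def disconnect(self, src, dst):
        # Drops only the edge; both nodes stay in the plan.
        self.edges.discard((src, dst))

    def remove(self, node):
        # Drops the node and every edge that touches it.
        self.nodes.discard(node)
        self.edges = {(s, d) for (s, d) in self.edges if node not in (s, d)}

    def jobs_to_run(self):
        # Every node still in the plan is launched, connected or not.
        return sorted(self.nodes)


def build_plan():
    plan = Plan()
    plan.add("1-20")  # merge join over the left input
    plan.add("1-21")  # load of the right input, now read as a side file
    plan.connect("1-21", "1-20")  # assumed original dependency
    return plan


buggy = build_plan()
buggy.disconnect("1-21", "1-20")  # the behaviour described in this issue
print(buggy.jobs_to_run())        # ['1-20', '1-21']

fixed = build_plan()
fixed.remove("1-21")              # the behaviour this issue asks for
print(fixed.jobs_to_run())        # ['1-20']
```

In the sketch, the disconnected node is still scheduled because scheduling walks all nodes in the plan rather than following edges from the final store.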

For example:
a = load '/user/pig/tests/data/zebra/singlefile/studentsortedtab10k'
    using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted')
    as (name, age, gpa);
b = load '/user/pig/tests/data/zebra/singlefile/votersortedtab10k'
    using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted')
    as (name, age, registration, contributions);
c = join a by name, b by name using "merge";
explain c;

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node 1-21
Map Plan
Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/votersortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-13--------
Global sort: false
----------------

MapReduce node 1-20
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-19
|
|---MergeJoin[tuple] - 1-16
    |
    |---Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/studentsortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-12--------
Global sort: false
----------------

MapReduce node 1-21 contains only the redundant load of the right-hand side
input and should be removed.
