Remove redundant map-reduce job for merge join
----------------------------------------------
Key: PIG-1116
URL: https://issues.apache.org/jira/browse/PIG-1116
Project: Pig
Issue Type: Bug
Reporter: Daniel Dai
In merge join, when we convert right hand side file into a side file, we didn't
remove it from the map-reduce plan, we only disconnect it from the plan. When
we run the query, the redundant load will load the data but doing nothing. This
operation should be removed entirely.
Eg:
a = load '/user/pig/tests/data/zebra/singlefile/studentsortedtab10k' using
org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, gpa);
b = load '/user/pig/tests/data/zebra/singlefile/votersortedtab10k' using
org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age,
registration, contributions);
c = join a by name, b by name using "merge";
explain c;
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-21
Map Plan
Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/votersortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted'))
- 1-13--------
Global sort: false
----------------
MapReduce node 1-20
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-19
|
|---MergeJoin[tuple] - 1-16
|
|---Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/studentsortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted'))
- 1-12--------
Global sort: false
----------------
1-21 should be removed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.