-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/
-----------------------------------------------------------
Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini
Palaniswamy.
Bugs: PIG-3629
https://issues.apache.org/jira/browse/PIG-3629
Repository: pig-git
Description
-------
Implement STREAM operator in Tez -
https://issues.apache.org/jira/browse/PIG-3629
In this patch, I do not add resources to pig-misc.jar, I just add them
individually. See my discussion post:
https://groups.google.com/forum/#!topic/pig-on-tez/8S80GMKhMaU
Basic Changes:
-Run the PhyPlanSetter and EndOfAllInputSetter to set the parent plan and the
end-of-all input flags necessary for STREAM, just like in MR Pig.
-Add a map to hold plan-specific extra local resources in TezOperPlan.java.
These resources can either come from the user's directory (e.g.
SHIP('/home/abain/foo')) or from HDFS (e.g. CACHE('/user/abain/bar') in HDFS).
-Add the new class TezPOStreamVisitor that assembles all the plan-specific
local resources that get added in TezOperPlan.java.
Resource Manager Changes:
-TezResourcManager resources were previously a map of java.net.URL -> Path in
HDFS. Previously, the URL's were all local files, e.g.
file://home/abain/pig-withouthHadoop.jar. However, the CACHE statement requires
that resources already present in HDFS be able to be added as local resources.
Unfortunately java.net.URL does not support hdfs:// URL's, so I changed this
data structure to be a YARN URL instead. I also added methods to the
ResourceManager to distinguish whether you are adding a local resource or a
resource already present in HDFS.
-CACHE also supports URL's with fragments at the end, which become a "shortcut"
to the name, e.g. CACHE(/input/big-data-name.gz#data.gz). I changed the
resource manager to look for a fragments and use that as the resource name (if
the fragment exist). This results in the symlink to the resource being created
with the fragment name, which is what we want.
Diffs
-----
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java
37566ab
src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java
7a1736a
src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
2584501
src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
96ccdde
src/org/apache/pig/backend/hadoop/executionengine/tez/TezJobControlCompiler.java
135b933
src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java
0cc8e17
src/org/apache/pig/backend/hadoop/executionengine/tez/TezPOStreamVisitor.java
PRE-CREATION
src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java
673fd70
src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java
0fd7575
Diff: https://reviews.apache.org/r/16309/diff/
Testing
-------
Added a unit test to TestTezCompiler.java
Added an e2e test to tez.conf
ant test-tez passes
ant test-e2e-tez has three failures - I am investigating to see if they are
releated, or perhaps just transient
Question: There is already a separate suite of STREAM tests in streaming.conf.
Maybe I should remove my e2e test and we should add streaming.conf as a
dependency to the test-e2e-tez target? I haven't tried to run streaming.conf
yet.
Thanks,
Alex Bain