-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/#review30868
-----------------------------------------------------------


Alex,
   Sorry for the late review. The shipping of local resources code needs some 
fixing. Review comments below. I know the patch is already committed. If you 
can fix these, I will commit them as an addendum patch.


src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java
<https://reviews.apache.org/r/16309/#comment59095>

    Implementation of these do not look right to me. It does not honor 
FileSystem and fs.defaultFS and assume files are to be shipped to HDFS (local 
mode, s3 etc won't work). Needs to be fixed.
    
    Also whenever URL or URI conversion is required, use the Path class.
    i.e new Path(filepath).toUri().toURL() wherever a path is required. Path 
class takes care of normalizing the url for path and does some Windows specific 
handling.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java
<https://reviews.apache.org/r/16309/#comment59096>

    You can directly call addExtraResource instead of adding to a set and 
iterating over it.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java
<https://reviews.apache.org/r/16309/#comment59097>

    Shouldn't be parsing for #. Can use URI.getPath() to get path without 
fragment



src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java
<https://reviews.apache.org/r/16309/#comment59098>

    Can directly do addExtraResource instead of adding to a map and then 
iterating over the map.
    
    Also in this case the same file could be symlinked to different names. Need 
to handle them instead of ignoring based on key being the same.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java
<https://reviews.apache.org/r/16309/#comment59090>

    The conversion to org.apache.hadoop.yarn.api.records.URL in multiple places 
for local files seems unneccessary. Does not seem to be required and 
complicates code. Only LocalResource.newInstance takes 
org.apache.hadoop.yarn.api.records.URL in getTezResources but we are still 
doing a ConverterUtils.getYarnUrlFromPath(fstat.getPath()) there. Can we just 
use java.net.URL everywhere?



src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java
<https://reviews.apache.org/r/16309/#comment59091>

    This check should be removed. You are allowed to ship from any FileSystem 
implementation to the destination. The shipToHDFS method in Utils is a 
misnomer. It ships to the fs.defaultFS. We should rename the method to 
shipToDefaultFS



src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java
<https://reviews.apache.org/r/16309/#comment59092>

    We should not be having this check. The destination filesystem could be 
anything. For eg: s3. Can we rename all variable names/methods in 
TezResourceManager, TezOperPlan that have hdfs in to say defaultFS, srcFS or 
remoteFS as appropriate.


- Rohini Palaniswamy


On Dec. 24, 2013, 1:34 a.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16309/
> -----------------------------------------------------------
> 
> (Updated Dec. 24, 2013, 1:34 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini 
> Palaniswamy.
> 
> 
> Bugs: PIG-3629
>     https://issues.apache.org/jira/browse/PIG-3629
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement STREAM operator in Tez - 
> https://issues.apache.org/jira/browse/PIG-3629
> 
> In this patch, I do not add resources to pig-misc.jar, I just add them 
> individually. See my discussion post: 
> https://groups.google.com/forum/#!topic/pig-on-tez/8S80GMKhMaU
> 
> Basic Changes:
> -Run the PhyPlanSetter and EndOfAllInputSetter to set the parent plan and the 
> end-of-all input flags necessary for STREAM, just like in MR Pig.
> -Add a map to hold plan-specific extra local resources in TezOperPlan.java. 
> These resources can either come from the user's directory (e.g. 
> SHIP('/home/abain/foo')) or from HDFS (e.g. CACHE('/user/abain/bar') in HDFS).
> -Add the new class TezPOStreamVisitor that assembles all the plan-specific 
> local resources that get added in TezOperPlan.java.
> 
> Resource Manager Changes:
> -TezResourcManager resources were previously a map of java.net.URL -> Path in 
> HDFS. Previously, the URL's were all local files, e.g. 
> file://home/abain/pig-withouthHadoop.jar. However, the CACHE statement 
> requires that resources already present in HDFS be able to be added as local 
> resources. Unfortunately java.net.URL does not support hdfs:// URL's, so I 
> changed this data structure to be a YARN URL instead. I also added methods to 
> the ResourceManager to distinguish whether you are adding a local resource or 
> a resource already present in HDFS.
> -CACHE also supports URL's with fragments at the end, which become a 
> "shortcut" to the name, e.g. CACHE(/input/big-data-name.gz#data.gz). I 
> changed the resource manager to look for a fragments and use that as the 
> resource name (if the fragment exist). This results in the symlink to the 
> resource being created with the fragment name, which is what we want.
> 
> Race condition:
> -I found a race condition that resulted from reusing the Result object in 
> POSimpleTezLoad. There are several possible solutions. After discussing in 
> the newsgroup, we decided to change POSimpleTezLoad for now.
> -I also made a small cleanup to PhysicalOperator.java.
> 
> 
> Diffs
> -----
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java
>  37566ab 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java
>  8487a0f 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POSimpleTezLoad.java 
> d57aded 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 
> 0ee7256 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 
> 191563d 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 
> df9fea6 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/TezJobControlCompiler.java
>  135b933 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java 
> 0cc8e17 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/TezPOStreamVisitor.java 
> PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 
> 673fd70 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java 
> 0fd7575 
>   src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 
> 862e637 
>   test/e2e/pig/tests/tez.conf 0e4ba4e 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC12.gld PRE-CREATION 
>   test/org/apache/pig/tez/TestTezCompiler.java 8d5e5f2 
> 
> Diff: https://reviews.apache.org/r/16309/diff/
> 
> 
> Testing
> -------
> 
> Added a unit test to TestTezCompiler.java
> Added a new unit test e2e test to tez.conf with session reuse enabled
> Ported three other e2e tests from streaming.conf to tez.conf to increase 
> coverage
> 
> ant test-tez passes
> ant test-e2e-tez passes
> Manually tested with a large subset of tests from streaming.conf (the ones 
> using features currently supported by Pig-on-Tez), they pass
> 
> 
> Thanks,
> 
> Alex Bain
> 
>

Reply via email to