Re: Review Request 16463: PIG-3636 Implement accumulator optimization in Tez

Cheolsoo Park Thu, 26 Dec 2013 17:26:26 -0800


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestAccumulator.java, line 189
> > <https://reviews.apache.org/r/16463/diff/1/?file=403084#file403084line189>
> >
> >     Any reason for adding this newly within tests? tearDown() already has 
> > it.
> 
> Cheolsoo Park wrote:
>     I added this because in the same test case, if pigServer invokes 
> openIterator() multiple times, Pig generates temporary files with the same 
> name in subsequent calls. So I am cleaning them up after each invocation.
>     
>     Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2002: Output 
> Location Validation Failed for: 
> 'hdfs://localhost:56901/tmp/temp-1456742965/tmp-1456742965 More info to 
> follow:
>     8788 Output directory 
> hdfs://localhost:56901/tmp/temp-1456742965/tmp-1456742965 already exists
>     
>     I don't see this behavior in hadoopversion=20 but in hadoopversion=23. 
> Perhaps we should fix FileLocalizer#getTemporaryPath(). I can take a closer 
> look.
> 
> Rohini Palaniswamy wrote:
>     I am sure this problem does not exist in trunk. We need to see why and 
> fix. Something must be broken in tez branch.


Random doesn't guarantee unique numbers. We haven't seen this problem because 
we were lucky.

After changing Random.getNextInt() to UUID.randomUUID(), I no longer see 
failures.


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java,
> >  lines 148-154
> > <https://reviews.apache.org/r/16463/diff/1/?file=403078#file403078line148>
> >
> >     I see that unsafe typecast to Iterable<NullableTuple> is not possible 
> > in java. 
> >     
> >     But copying over to a new arraylist seems to be inefficient just to get 
> > a Iterable<NullableTuple>. Can we write a wrapper iterator that implements 
> > Iterable<NullableTuple> wrapping Iterable<Object> instead?
> 
> Cheolsoo Park wrote:
>     Actually, that for loop does more than converting type. It also merges 
> multiple lists into a single list. See the following code snippet-
>     
>     List<NullableTuple> nTups = new ArrayList<NullableTuple>(); // nTups is 
> outside of the loop.
>     
>     for (int i = 0; i < numInputs; i++) {
>       cur = readers.get(i).getCurrentKey();
>       if (comparator.compare(min, cur) == 0) {
>         Iterable<Object> vals = readers.get(i).getCurrentValues();
>     
>         for (Object val : vals) {
>           nTups.add((NullableTuple) val); // adding vals from multiple inputs 
> to the same list.
>         }
>     
>         buffer.setKey(cur);
>         buffer.setIterator(nTups.iterator()); // reset the iterator each time.
>         bag = new AccumulativeBag(buffer, i);
>     }
>     
>     So if I wrap the iterator returned by readers.get(i).getCurrentValues(), 
> it won't work. I am open to any suggestion though.
>     
>
> 
> Rohini Palaniswamy wrote:
>     Ah. ok. That should not be a problem. You can make the wrapper iterator 
> have a list of Iterable<Object> and keep adding to the list. And when 
> retrieving the values by iterating, you can sequentially iterate over all the 
> Iterable<Object> in the list till all of them are done.

That's a good suggestion. Let me do it!


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestAccumulator.java, line 114
> > <https://reviews.apache.org/r/16463/diff/1/?file=403084#file403084line114>
> >
> >     Why do we specifically need homeDirOnDfs? Isn't that the default 
> > current directory?
> 
> Cheolsoo Park wrote:
>     I wanted to ask you about this. If I don't specify the home dir, 
> getMkDirCommandForHadoop2_0() returns "" because parentDir is empty.
> 
> Rohini Palaniswamy wrote:
>     That is expected. What you are missing is
>     
>     //Create user home directory
>     m_fileSys.mkdirs(m_fileSys.getWorkingDirectory());
>     
>     in TezMiniCluster.java. This is done in MiniCluster.java

Indeed. You're right!


- Cheolsoo


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16463/#review30867
-----------------------------------------------------------


On Dec. 26, 2013, 6:11 a.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16463/
> -----------------------------------------------------------
> 
> (Updated Dec. 26, 2013, 6:11 a.m.)
> 
> 
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini 
> Palaniswamy.
> 
> 
> Bugs: PIG-3636
>     https://issues.apache.org/jira/browse/PIG-3636
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> The patch implements accumulator optimization in Tez. The changes include-
> * Create AccumulatorOptimizer in Tez.
> * Create AccumulatorOptimizerUtil class and factor out common functions in MR 
> and Tez.
> * Implement accumulator logic in POShuffleTezLoad.
> * Update TestAccumulator to make it run in Tez mode.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/PigConfiguration.java 0a26e8c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/AccumulatorOptimizer.java
>  7f9e15a 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
>  9eed25c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java
>  6e04513 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/AccumulatorOptimizer.java
>  e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 
> 722b9f6 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POUnionTezLoad.java 
> 742a33a 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 
> d42ce89 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 
> c6af682 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 
> e33a7c6 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/AccumulatorOptimizerUtil.java
>  e69de29 
>   test/org/apache/pig/test/TestAccumulator.java b979649 
>   test/org/apache/pig/test/TestCombiner.java a227d18 
>   test/tez-tests fcb573e 
> 
> Diff: https://reviews.apache.org/r/16463/diff/
> 
> 
> Testing
> -------
> 
> * TestAccumulator passes in Tez mode.
> * All unit tests pass.
> * All e2e tests pass.
> 
> Note that 3 test cases in TestAccumulator are annotated as @Ignore because 
> SecondaryKeyOptimizer in Tez is not implement yet. The test cases expect 
> accumulator optimizer is applied when order-by and distinct are present in a 
> nested foreash because sort operator is removed by SecondaryKeyOptimizer. 
> Added TODO comments accordingly.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>

Re: Review Request 16463: PIG-3636 Implement accumulator optimization in Tez

Reply via email to