Re: Review Request 16463: PIG-3636 Implement accumulator optimization in Tez

Cheolsoo Park Thu, 26 Dec 2013 10:51:51 -0800


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > Patch looks good. Just few minor review comments.
> > 
> > > Note that 3 test cases in TestAccumulator are annotated as @Ignore 
> > > because SecondaryKeyOptimizer in Tez is not implement yet. The test cases 
> > > expect accumulator optimizer is applied when order-by and distinct are 
> > > present in a nested foreash because sort operator is removed by 
> > > SecondaryKeyOptimizer. Added TODO comments accordingly.
> > 
> >   I have the SecondaryKeyOptimizer fixed in my patch for PIG-3626. Will 
> > take care of reverting them back in that patch.
> > 
> > 
> >


That will be awesome!


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestAccumulator.java, line 114
> > <https://reviews.apache.org/r/16463/diff/1/?file=403084#file403084line114>
> >
> >     Why do we specifically need homeDirOnDfs? Isn't that the default 
> > current directory?

I wanted to ask you about this. If I don't specify the home dir, 
getMkDirCommandForHadoop2_0() returns "" because parentDir is empty.


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java,
> >  lines 148-154
> > <https://reviews.apache.org/r/16463/diff/1/?file=403078#file403078line148>
> >
> >     I see that unsafe typecast to Iterable<NullableTuple> is not possible 
> > in java. 
> >     
> >     But copying over to a new arraylist seems to be inefficient just to get 
> > a Iterable<NullableTuple>. Can we write a wrapper iterator that implements 
> > Iterable<NullableTuple> wrapping Iterable<Object> instead?

Actually, that for loop does more than converting type. It also merges multiple 
lists into a single list. See the following code snippet-

List<NullableTuple> nTups = new ArrayList<NullableTuple>(); // nTups is outside 
of the loop.

for (int i = 0; i < numInputs; i++) {
  cur = readers.get(i).getCurrentKey();
  if (comparator.compare(min, cur) == 0) {
    Iterable<Object> vals = readers.get(i).getCurrentValues();

    for (Object val : vals) {
      nTups.add((NullableTuple) val); // adding vals from multiple inputs to 
the same list.
    }

    buffer.setKey(cur);
    buffer.setIterator(nTups.iterator()); // reset the iterator each time.
    bag = new AccumulativeBag(buffer, i);
}

So if I wrap the iterator returned by readers.get(i).getCurrentValues(), it 
won't work. I am open to any suggestion though.


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestCombiner.java, line 58
> > <https://reviews.apache.org/r/16463/diff/1/?file=403085#file403085line58>
> >
> >     We seem to be doing this in a couple of places now. If there are issues 
> > we hit because of session reuse (static variables, etc), then we should try 
> > fix them so that session reuse works with grunt shell. Not important for 
> > this jira as we need to get the core stuff out first. Can you create a 
> > separate jira mentioning the issues that you faced so far with session 
> > reuse so that we come back later and fix them some time.

Thank you for asking this. After PIG-3602, I still occasionally see a dangling 
process after TestCombiner. I don't see it always, so I guess timing matters. I 
will definitely follow up with a jira.


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestAccumulator.java, line 189
> > <https://reviews.apache.org/r/16463/diff/1/?file=403084#file403084line189>
> >
> >     Any reason for adding this newly within tests? tearDown() already has 
> > it.

I added this because in the same test case, if pigServer invokes openIterator() 
multiple times, Pig generates temporary files with the same name in subsequent 
calls. So I am cleaning them up after each invocation.

Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2002: Output 
Location Validation Failed for: 
'hdfs://localhost:56901/tmp/temp-1456742965/tmp-1456742965 More info to follow:
8788 Output directory hdfs://localhost:56901/tmp/temp-1456742965/tmp-1456742965 
already exists

I don't see this behavior in hadoopversion=20 but in hadoopversion=23. Perhaps 
we should fix FileLocalizer#getTemporaryPath(). I can take a closer look.


> On Dec. 26, 2013, 4:45 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestAccumulator.java, line 530
> > <https://reviews.apache.org/r/16463/diff/1/?file=403084#file403084line530>
> >
> >     Can be removed. tearDown already does it

Same as above. pigServer invokes openIterator() inside a while loop, so I am 
cleaning up temporary files after each iteration.


- Cheolsoo


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16463/#review30867
-----------------------------------------------------------


On Dec. 26, 2013, 6:11 a.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16463/
> -----------------------------------------------------------
> 
> (Updated Dec. 26, 2013, 6:11 a.m.)
> 
> 
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini 
> Palaniswamy.
> 
> 
> Bugs: PIG-3636
>     https://issues.apache.org/jira/browse/PIG-3636
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> The patch implements accumulator optimization in Tez. The changes include-
> * Create AccumulatorOptimizer in Tez.
> * Create AccumulatorOptimizerUtil class and factor out common functions in MR 
> and Tez.
> * Implement accumulator logic in POShuffleTezLoad.
> * Update TestAccumulator to make it run in Tez mode.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/PigConfiguration.java 0a26e8c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/AccumulatorOptimizer.java
>  7f9e15a 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
>  9eed25c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java
>  6e04513 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/AccumulatorOptimizer.java
>  e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 
> 722b9f6 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POUnionTezLoad.java 
> 742a33a 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 
> d42ce89 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 
> c6af682 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 
> e33a7c6 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/AccumulatorOptimizerUtil.java
>  e69de29 
>   test/org/apache/pig/test/TestAccumulator.java b979649 
>   test/org/apache/pig/test/TestCombiner.java a227d18 
>   test/tez-tests fcb573e 
> 
> Diff: https://reviews.apache.org/r/16463/diff/
> 
> 
> Testing
> -------
> 
> * TestAccumulator passes in Tez mode.
> * All unit tests pass.
> * All e2e tests pass.
> 
> Note that 3 test cases in TestAccumulator are annotated as @Ignore because 
> SecondaryKeyOptimizer in Tez is not implement yet. The test cases expect 
> accumulator optimizer is applied when order-by and distinct are present in a 
> nested foreash because sort operator is removed by SecondaryKeyOptimizer. 
> Added TODO comments accordingly.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>

Re: Review Request 16463: PIG-3636 Implement accumulator optimization in Tez

Reply via email to