Re: Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

Daniel Dai Mon, 27 Jan 2014 14:30:38 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/#review32896
-----------------------------------------------------------


Ship it!


Ship It!

- Daniel Dai


On Jan. 27, 2014, 9:13 p.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/17375/
> -----------------------------------------------------------
> 
> (Updated Jan. 27, 2014, 9:13 p.m.)
> 
> 
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini 
> Palaniswamy.
> 
> 
> Bugs: PIG-3719
>     https://issues.apache.org/jira/browse/PIG-3719
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> This patch fixes two sets of skewed join e2e tests-
> 1) tez.conf Join_[7_8].
> 2) nightly.conf SkewedJoin_[1_10].
> 
> There are mainly two changes-
> 1) In POShuffleTezLoad, we copy PigNullableWritable using 
> PigNullableWritable.newInstance(). But this methods doesn't copy the key of 
> NullablePartitionWritable (subclass of PigNullableWritable for skewed join). 
> This was causing NPE later.
> 2) In POPartitionRearrangeTez, the init() method builds reduceMap <key, 
> pair<int, int>>. The problem was that the key of this map was always wrapped 
> by tuple even though there is only a single join key, resulting in wrong join 
> output. Now the key is wrapped only if there are more than one join keys.
> 
> 
> Diffs
> -----
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java
>  57af63f 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java
>  4ef55b8 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java
>  13d9ec9 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java
>  5e94e61 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 
> 761ae90 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java
>  8d4c2a9 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 
> d8646cb 
>   src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 
> 637246c 
> 
> Diff: https://reviews.apache.org/r/17375/diff/
> 
> 
> Testing
> -------
> 
> * ant clean test-tez passes except TestCustomPartitioner. I confirmed 
> TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
> * All nightly.conf SkewedJoin tests pass except #6 self join case.
> * All tez.conf e2e tests pass.
> 
> Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 
> 8 to 4. The reason is because for right/full outer join, it's important that 
> every reducer task assigned to skewed keys is given at least one row of those 
> skewed keys. For eg, if we assign 8 reducers to key X, we expect these 8 
> reducers are all given at least one row of key X.
> 
> However, in e2e tests, we arbitrarily assign 8 reducers to two particularly 
> keys for testing purpose. As a result, some reducer tasks are given no row of 
> these fake "skewed" keys, and that makes the tests fail. So I am reducing the 
> parallel to avoid this situation.
> 
> In MR mode, this is not a problem because 1) a single task loads the input 
> data (10k rows), and 2) a single task RRs more than 8 rows of the skewed keys 
> to 8 reducers using SkewedPartitioner. So each reducer is given at lease one 
> row of the skewed keys. But in Tez mode, 1) we transfer the input data to the 
> partition vertex that has a parallelism of 8, and 2) 8 mapper tasks RR the 
> skewed keys using SkewedPartitioner. Since each mapper task is given less 
> than 8 rows of the skewed keys, the skewed keys are not evenly distribute to 
> 8 reducer tasks. i.e. reducer #8 ends up with no rows of the skewed keys.
> 
> In fact, this is a test issue IMO. If we have enough skewed keys to RR to 8 
> reducer tasks, we won't run into this situation.
> 
> Lastly, I'd like to fix the self join case (#6) in a separate jira because it 
> requires quite a few changes in TezCompiler due to POSplit.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>

Re: Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

Reply via email to