-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/
-----------------------------------------------------------
(Updated Jan. 27, 2014, 9:13 p.m.)
Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini
Palaniswamy.
Changes
-------
Rebase to the current head.
Bugs: PIG-3719
https://issues.apache.org/jira/browse/PIG-3719
Repository: pig-git
Description
-------
This patch fixes two sets of skewed join e2e tests-
1) tez.conf Join_[7_8].
2) nightly.conf SkewedJoin_[1_10].
There are mainly two changes-
1) In POShuffleTezLoad, we copy PigNullableWritable using
PigNullableWritable.newInstance(). But this methods doesn't copy the key of
NullablePartitionWritable (subclass of PigNullableWritable for skewed join).
This was causing NPE later.
2) In POPartitionRearrangeTez, the init() method builds reduceMap <key,
pair<int, int>>. The problem was that the key of this map was always wrapped by
tuple even though there is only a single join key, resulting in wrong join
output. Now the key is wrapped only if there are more than one join keys.
Diffs (updated)
-----
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java
57af63f
src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java
4ef55b8
src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java
13d9ec9
src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java
5e94e61
src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java
761ae90
src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java
8d4c2a9
src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
d8646cb
src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java
637246c
Diff: https://reviews.apache.org/r/17375/diff/
Testing
-------
* ant clean test-tez passes except TestCustomPartitioner. I confirmed
TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
* All nightly.conf SkewedJoin tests pass except #6 self join case.
* All tez.conf e2e tests pass.
Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 8
to 4. The reason is because for right/full outer join, it's important that
every reducer task assigned to skewed keys is given at least one row of those
skewed keys. For eg, if we assign 8 reducers to key X, we expect these 8
reducers are all given at least one row of key X.
However, in e2e tests, we arbitrarily assign 8 reducers to two particularly
keys for testing purpose. As a result, some reducer tasks are given no row of
these fake "skewed" keys, and that makes the tests fail. So I am reducing the
parallel to avoid this situation.
In MR mode, this is not a problem because 1) a single task loads the input data
(10k rows), and 2) a single task RRs more than 8 rows of the skewed keys to 8
reducers using SkewedPartitioner. So each reducer is given at lease one row of
the skewed keys. But in Tez mode, 1) we transfer the input data to the
partition vertex that has a parallelism of 8, and 2) 8 mapper tasks RR the
skewed keys using SkewedPartitioner. Since each mapper task is given less than
8 rows of the skewed keys, the skewed keys are not evenly distribute to 8
reducer tasks. i.e. reducer #8 ends up with no rows of the skewed keys.
In fact, this is a test issue IMO. If we have enough skewed keys to RR to 8
reducer tasks, we won't run into this situation.
Lastly, I'd like to fix the self join case (#6) in a separate jira because it
requires quite a few changes in TezCompiler due to POSplit.
Thanks,
Cheolsoo Park