[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732207#action_12732207 ]
Alan Gates commented on PIG-792:
--------------------------------

Your code has tabs in it. It should use 4 spaces instead.

In MRCompiler.visitSkewedJoin, I don't understand the following:

{code}
int rp = op.getRequestedParallelism();
Pair<MapReduceOper, Integer> sampleJobPair = getSkewedJoinSampleJob(op, mro, fSpec, partitionFile, rp);
rp = sampleJobPair.second;
// set parallelism of SkewedJoin same as the sampling job
op.setRequestedParallelism(rp);
{code}

Why is the job parallelism being reset based on the sample?

The join sampling job writes its results in a certain format. That format should be documented clearly in the comments somewhere. It is referred to in the class comments for SkewedPartitioner, but not completely specified.

Rather than creating a separate POSkewedJoinFileSetter to correct the changes made by the SampleOptimizer, the SampleOptimizer should be changed to correctly handle file names in the case of skewed join.

Why does MapReduceOper need to know about skewedJoinPartitionFile?

In POPartitionRearrange.constructPROutput, what does

{code}
opTuple.set(1, Byte.valueOf(""+reducerIdx));
{code}

do? It looks like you're forcing reducerIdx to a String and then to a byte. That's rather inefficient. Can't we go straight from int to byte? And why is reducerIdx an Integer instead of an int?

There are still some System.out/err.println statements in the code. These should be removed or converted to log.debug statements.

Why do we need a new NullablePartitionWritable class? Couldn't Tuple be used for this?
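To make the int-to-byte question concrete, here is a minimal sketch of the two conversions side by side. The class and method names (ReducerIndexEncoding, viaString, viaCast) are hypothetical and are not part of the patch. It assumes reducer indexes are non-negative and fit in a byte. One difference to keep in mind: Byte.valueOf(String) throws NumberFormatException for out-of-range values, while a bare cast silently wraps, so the direct version carries an explicit range check.

```java
public class ReducerIndexEncoding {

    // The conversion as written in the patch: int -> String -> Byte.
    // Byte.valueOf(String) throws NumberFormatException if the value
    // is outside [-128, 127].
    static Byte viaString(int reducerIdx) {
        return Byte.valueOf("" + reducerIdx);
    }

    // Direct conversion: int -> byte -> Byte, with no intermediate String.
    // A bare (byte) cast would silently wrap out-of-range values,
    // so the range is checked explicitly first.
    static Byte viaCast(int reducerIdx) {
        if (reducerIdx < Byte.MIN_VALUE || reducerIdx > Byte.MAX_VALUE) {
            throw new IllegalArgumentException(
                "reducer index does not fit in a byte: " + reducerIdx);
        }
        return Byte.valueOf((byte) reducerIdx);
    }

    public static void main(String[] args) {
        // Both conversions should agree for every index that fits in a byte.
        for (int i = 0; i <= Byte.MAX_VALUE; i++) {
            if (!viaString(i).equals(viaCast(i))) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
    }
}
```

With reducerIdx declared as a plain int, the call in constructPROutput could then presumably become opTuple.set(1, Byte.valueOf((byte) reducerIdx)), provided the index range is validated somewhere upstream.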
> PERFORMANCE: Support skewed join in pig
> ---------------------------------------
>
>                 Key: PIG-792
>                 URL: https://issues.apache.org/jira/browse/PIG-792
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Sriranjan Manjunath
>         Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase.
> It computes a histogram of the key space to account for skewing in the input
> records. Further, it adjusts the number of reducers depending on the key
> distribution.
> We need to implement the skewed join in pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.