[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797412#action_12797412 ]
Ying He commented on PIG-480:
-----------------------------

I ran more performance tests. The results show that performance depends on the nature of the data. If the data is skewed, performance is very bad in the combiner case. If the data is uniform, the combiner case gets the largest performance gain. Each test runs a join followed by a group-by statement. For skewed data, using a skewed join gives a much better result.

I think the reason for the bad performance on skewed data is that the map plan of the second job is moved into the reducer of the first job. If the data is skewed, a single reducer has to execute the extra logic for all of its tuples, whereas without this patch that logic would be executed in multiple mappers. So we lose parallelism. The more skewed the data is, the worse the performance.

1. skewed data

combiner
          job 1         job 2         total
patch     7min 53sec    1min 1sec     8min 54sec
trunk     4min 43sec    1min 37sec    6min 20sec

combiner and using skewed join
          job 1         job 2         total
patch     1min 55sec    1min 1sec     2min 56sec
trunk     1min 44sec    1min 40sec    3min 24sec

no combiner
          job 1         job 2         total
patch     2min 26sec    2min 28sec    4min 54sec
trunk     1min 25sec    3min 24sec    4min 49sec

no combiner and using skewed join
          job 1         job 2         total
patch     1min 17sec    3min 5sec     4min 22sec
trunk     59sec         3min 7sec     4min 6sec

2. uniform data

combiner
          job 1         job 2         total
patch     6min 48sec    3min 43sec    10min 31sec
trunk     7min 32sec    7min 3sec     14min 35sec

no combiner
          job 1         job 2         total
patch     1min 25sec    2min 25sec    3min 50sec
trunk     1min 24sec    2min 28sec    3min 52sec

Each group of tests may use different data, so don't compare results across groups.

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> -------------------------------------------------------
>
>                 Key: PIG-480
>                 URL: https://issues.apache.org/jira/browse/PIG-480
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>            Assignee: Ying He
>         Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use identity mapper wherever possible in
> second and subsequent MR jobs.
> Identity mapper is about 50% faster than pig empty
> map job because it doesn't parse the data.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
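
The parallelism argument in the comment can be illustrated with a toy model. This is only a sketch, not Pig code: the function names, the modulo partitioner, the per-tuple cost, and the key distributions are all made up for illustration, and the model ignores I/O, sorting, and combiner effects. It assumes that all tuples of a key go to one reducer, while the mappers of a second job would read evenly sized splits regardless of key.

```python
def reducer_loads(key_counts, num_reducers):
    """Tuples per reducer when every tuple of a key lands on one reducer."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        # Hypothetical partitioner: integer keys assigned by modulo.
        loads[key % num_reducers] += count
    return loads


def extra_time_in_reducers(key_counts, num_reducers, cost_per_tuple=1.0):
    """Patched plan: the extra logic runs in job 1's reducers,
    so the most heavily loaded reducer determines the elapsed time."""
    return max(reducer_loads(key_counts, num_reducers)) * cost_per_tuple


def extra_time_in_mappers(key_counts, num_mappers, cost_per_tuple=1.0):
    """Unpatched plan: the same logic runs in job 2's mappers,
    which are assumed to split the tuples evenly."""
    total = sum(key_counts.values())
    per_mapper = -(-total // num_mappers)  # ceiling division
    return per_mapper * cost_per_tuple


# Two made-up data sets with 100,000 tuples each.
uniform = {k: 1000 for k in range(100)}  # 100 keys, 1000 tuples per key
skewed = {k: 100 for k in range(100)}
skewed[0] = 90_100                       # one hot key holds most of the tuples

for name, data in (("uniform", uniform), ("skewed", skewed)):
    in_reducers = extra_time_in_reducers(data, num_reducers=10)
    in_mappers = extra_time_in_mappers(data, num_mappers=10)
    print(f"{name}: reducers={in_reducers} mappers={in_mappers}")
```

In this model the two plans cost the same on uniform data, while on skewed data the reducer holding the hot key does almost all of the extra work. That matches the direction of the measurements above, but the model's numbers are not measurements.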