[ https://issues.apache.org/jira/browse/MAPREDUCE-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peng Zhang resolved MAPREDUCE-6445.
-----------------------------------
    Resolution: Cannot Reproduce

> Shuffle hang
> ------------
>
> Key: MAPREDUCE-6445
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6445
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
>
> The scale cluster has been running 2.6.0 for months.
> 2 of 200 reduces hang on shuffle.
> The log of instance 1 seems to loop on 1 map output:
> {noformat}
> 2015-08-06 21:54:14,649 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#1
> 2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=20&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
> 2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ...
> 2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#1 in 2ms
> 2015-08-06 21:54:14,651 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#5
> 2015-08-06 21:54:14,651 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#5
> 2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=20&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
> 2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
> 2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#5 in 4ms
> 2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#5
> 2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#5
> 2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=20&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
> 2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
> 2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#5 in 5ms
> 2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#5
> 2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#5
> {noformat}
> The log of node 2 seems to loop on 5 map outputs:
> {noformat}
> 2015-08-06 21:43:33,626 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-172.bj:22408 with 1 to fetcher#5
> 2015-08-06 21:43:33,626 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-172.bj:22408 to fetcher#5
> 2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
> 2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 - MergeManager returned status WAIT ...
> 2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#3 in 5ms
> 2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-179.bj:22408 with 1 to fetcher#3
> 2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-179.bj:22408 to fetcher#3
> 2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000084_0,attempt_1438689528746_10193_m_000046_0 sent hash and received reply
> 2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#4 - MergeManager returned status WAIT ...
> 2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-71.bj:22408 freed by fetcher#4 in 5ms
> 2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-71.bj:22408 with 2 to fetcher#4
> 2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-71.bj:22408 to fetcher#4
> 2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000092_0 sent hash and received reply
> 2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#2 - MergeManager returned status WAIT ...
> 2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-167.bj:22408 freed by fetcher#2 in 3ms
> 2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#2
> 2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#2
> 2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000097_0 sent hash and received reply
> 2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ...
> 2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-174.bj:22408 freed by fetcher#1 in 3ms
> 2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-174.bj:22408 with 1 to fetcher#1
> 2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-174.bj:22408 to fetcher#1
> 2015-08-06 21:43:33,629 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000093_0 sent hash and received reply
> 2015-08-06 21:43:33,629 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
> 2015-08-06 21:43:33,630 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-172.bj:22408 freed by fetcher#5 in 3ms
> 2015-08-06 21:43:33,630 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-172.bj:22408 with 1 to fetcher#5
> 2015-08-06 21:43:33,630 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-172.bj:22408 to fetcher#5
> 2015-08-06 21:43:33,630 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000089_0 sent hash and received reply
> 2015-08-06 21:43:33,630 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 - MergeManager returned status WAIT ...
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
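For anyone trying to reproduce this, one way to quantify the loop reported in the logs is to count how often each fetcher thread gets a WAIT reply from MergeManager. The following is only a rough sketch, not part of the Hadoop code base: the function name count_wait_replies and the sample text are hypothetical, and it assumes each log entry sits on a single line in the layout quoted in this issue (date, time, INFO, thread in brackets, class, message).

```python
import re
from collections import Counter

# One shuffle log entry in the layout quoted in this issue:
# "<date> <time> INFO [fetcher#N] <class>: <message>"
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"\[(?P<thread>fetcher#\d+)\] (?P<cls>[\w.]+): (?P<msg>.*)"
)


def count_wait_replies(log_text):
    """Count 'MergeManager returned status WAIT' entries per fetcher thread."""
    waits = Counter()
    for line in log_text.splitlines():
        # search() instead of match() so a leading "> " mail quote is tolerated.
        m = ENTRY.search(line)
        if m and "MergeManager returned status WAIT" in m.group("msg"):
            waits[m.group("thread")] += 1
    return waits


# Hypothetical sample: four entries taken from the instance 1 log.
sample = """\
> 2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ...
> 2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#1 in 2ms
> 2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
> 2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
"""

if __name__ == "__main__":
    for thread, n in sorted(count_wait_replies(sample).items()):
        print(thread, n)
```

A count that keeps growing for the same fetcher over a few milliseconds, with no successful fetch in between, would match the pattern described above. The sketch says nothing about why MergeManager keeps answering WAIT.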