[ https://issues.apache.org/jira/browse/HADOOP-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522843 ]
Devaraj Das commented on HADOOP-1764:
-------------------------------------
bq. Given that I'm inclined to close this bug... thoughts?
+1
> Inconsistency between Mapper/Reducer bookkeeping
> -------------------------------------------------
>
> Key: HADOOP-1764
> URL: https://issues.apache.org/jira/browse/HADOOP-1764
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Environment: Related: HADOOP-1763 (Same environment)
> Version: 0.15.0-dev, r565628
> Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
> 1400 Nodes
> Reporter: Srikanth Kakani
> Assignee: Arun C Murthy
> Priority: Blocker
>
> Refer to HADOOP-1763.
> This occurs in the scenario described there: once many task trackers are lost,
> reducers do not know where the map outputs are located. They keep retrying the
> wrong node, so the reducers run forever without ever failing.
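The failure mode above is an unbounded retry loop against a stale host. For contrast, here is a minimal Java sketch of a bounded retry policy with exponential backoff. It is not Hadoop's actual ReduceTask code: the class name BoundedFetchRetry, its parameters and the Sleeper hook are hypothetical, and the copy itself is abstracted as a Predicate<String> over the host name.

```java
import java.util.function.Predicate;

// Hypothetical sketch (not Hadoop code): bounded, exponentially backed-off
// retries for copying one map output from one host.
final class BoundedFetchRetry {
    /** Pluggable sleep so the policy can be exercised without real delays. */
    interface Sleeper {
        void sleep(long ms);
    }

    /** Sleeper backed by Thread.sleep; restores the interrupt flag if interrupted. */
    static final Sleeper REAL_SLEEPER = ms -> {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    };

    private final int maxAttempts;       // assumed cap on copies per host
    private final long initialBackoffMs; // assumed first delay
    private final long maxBackoffMs;     // assumed ceiling on the delay

    BoundedFetchRetry(int maxAttempts, long initialBackoffMs, long maxBackoffMs) {
        if (maxAttempts < 1 || initialBackoffMs < 1 || maxBackoffMs < initialBackoffMs) {
            throw new IllegalArgumentException("invalid retry parameters");
        }
        this.maxAttempts = maxAttempts;
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
    }

    /** Delay after failed attempt number `attempt` (0-based): doubles each time, capped. */
    long backoffMs(int attempt) {
        long delay = initialBackoffMs;
        for (int i = 0; i < attempt && delay < maxBackoffMs; i++) {
            // Double the delay, saturating at maxBackoffMs (also avoids overflow).
            delay = delay > maxBackoffMs / 2 ? maxBackoffMs : delay * 2;
        }
        return delay;
    }

    /**
     * Tries to copy from `host` at most maxAttempts times. Returns true on
     * success and false once the cap is hit, so the caller can report the
     * failure instead of retrying the same host forever.
     */
    boolean fetch(String host, Predicate<String> tryCopy, Sleeper sleeper) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryCopy.test(host)) {
                return true;
            }
            if (attempt + 1 < maxAttempts) {
                sleeper.sleep(backoffMs(attempt));
            }
        }
        return false;
    }
}
```

Returning false after maxAttempts gives the caller a well-defined point at which to report the failed copy and look up a fresh location, rather than sleeping and retrying the same node indefinitely.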
> Relevant logs:
> Reducer output:
> 2007-08-21 09:47:47,046 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_0
> output from node50
> 2007-08-21 09:47:53,643 WARN org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 copy failed:
> task_200708210155_0003_m_002598_0 from node50
> 2007-08-21 09:47:53,643 WARN org.apache.hadoop.mapred.ReduceTask:
> java.io.FileNotFoundException:
> http://wm511750.inktomisearch.com:50060/mapOutput?map=task_200708210155_0003_m_002598_0&reduce=6
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
>     at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:207)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)
> 2007-08-21 09:53:02,327 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_0
> output from node50
> 2007-08-21 09:53:02,333 WARN org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 copy failed:
> task_200708210155_0003_m_002598_0 from node50
> 2007-08-21 09:53:02,333 WARN org.apache.hadoop.mapred.ReduceTask:
> java.io.FileNotFoundException:
> http://node50:50060/mapOutput?map=task_200708210155_0003_m_002598_0&reduce=6
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
>     at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:207)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)
> 2007-08-21 09:57:33,899 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_0
> output from node50.inktomisearch.com.
> 2007-08-21 09:57:33,908 WARN org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 copy failed:
> task_200708210155_0003_m_002598_0 from node50.inktomisearch.com
> 2007-08-21 09:57:33,908 WARN org.apache.hadoop.mapred.ReduceTask:
> java.io.FileNotFoundException:
> http://node50:50060/mapOutput?map=task_200708210155_0003_m_002598_0&reduce=6
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
>     at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:207)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)
> 2007-08-21 10:00:56,337 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_1
> output from node75.inktomisearch.com.
> 2007-08-21 10:00:56,342 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 done copying
> task_200708210155_0003_m_002598_1 output from node75
> 2007-08-21 10:02:17,486 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200708210155_0003_r_000006_2 Ignoring obsolete copy result for Map Task:
> task_200708210155_0003_m_002598_0 from host: node50
> Looking at TIP task_200708210155_0003_m_002598:
> task_200708210155_0003_m_002598_0  node50  KILLED     0.00%    21-Aug-2007 09:38:49  Lost task tracker
> task_200708210155_0003_m_002598_1  node75  KILLED     0.00%    21-Aug-2007 11:22:42  Lost task tracker
> task_200708210155_0003_m_002598_2  node55  SUCCEEDED  100.00%  21-Aug-2007 11:22:46  21-Aug-2007 11:27:19 (4mins, 33sec)
> task_200708210155_0003_m_002598_3  node49  KILLED     100.00%  21-Aug-2007 11:22:48  21-Aug-2007 11:27:48 (4mins, 59sec)  Already completed TIP
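The TIP listing shows that only one attempt, _2 on node55, SUCCEEDED, so that is the only map output the reducers should copy. Below is a minimal Java sketch of selecting that attempt from such a listing. MapAttempt and TipOutputLocator are hypothetical classes modelled on the columns shown above, not JobTracker APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch (not Hadoop code): one row of a TIP's attempt listing.
final class MapAttempt {
    final String attemptId;
    final String node;
    final String status; // e.g. "KILLED" or "SUCCEEDED", as in the listing

    MapAttempt(String attemptId, String node, String status) {
        this.attemptId = attemptId;
        this.node = node;
        this.status = status;
    }
}

final class TipOutputLocator {
    /**
     * Returns the node of the TIP's SUCCEEDED attempt, i.e. the only output
     * reducers should copy, or empty if no attempt has succeeded yet.
     */
    static Optional<String> authoritativeNode(List<MapAttempt> attempts) {
        for (MapAttempt a : attempts) {
            if ("SUCCEEDED".equals(a.status)) {
                return Optional.of(a.node);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The four attempts from the listing above.
        List<MapAttempt> tip = Arrays.asList(
            new MapAttempt("task_200708210155_0003_m_002598_0", "node50", "KILLED"),
            new MapAttempt("task_200708210155_0003_m_002598_1", "node75", "KILLED"),
            new MapAttempt("task_200708210155_0003_m_002598_2", "node55", "SUCCEEDED"),
            new MapAttempt("task_200708210155_0003_m_002598_3", "node49", "KILLED"));
        System.out.println(authoritativeNode(tip).orElse("none")); // prints node55
    }
}
```

A reducer that keyed its cached location on this TIP-level status, rather than on the first attempt it heard about, would not keep going back to node50.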
> Notes:
> 1. Even at the end, the reducer appears to fetch data from an incorrect
> TaskTracker; it does not check with the JobTracker for the final (correct)
> map output.
> 2. It seems to retry more times than necessary and to sleep longer between
> retries (judging by the intervals between log messages: about five minutes
> between successive attempts to copy the same map output from node50).
> 3. An obvious solution may be to ask the JobTracker for the location of the
> correct map output and fetch it directly (I was able to fetch the correct map
> output from node55 over HTTP without any errors).
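Note 3's suggestion, asking for the current location of a map output rather than trusting a cached one, can be sketched in Java as follows. MapLocationSource and RefreshingMapOutputFetcher are hypothetical names: the interface only stands in for whatever query to the JobTracker a real fix would use, and both thresholds are assumed parameters, not Hadoop configuration keys.

```java
import java.util.function.Predicate;

/**
 * Hypothetical stand-in (not a Hadoop API) for asking the JobTracker where
 * the output of a map task currently lives.
 */
interface MapLocationSource {
    /** Host of the map's latest successful attempt, or null if none is known. */
    String currentHost(String mapTaskId);
}

// Hypothetical sketch: re-query the location after repeated failures instead
// of retrying a stale host forever.
final class RefreshingMapOutputFetcher {
    private final MapLocationSource locations;
    private final int maxFailuresPerHost; // assumed threshold before re-querying
    private final int maxLookups;         // assumed cap on re-queries

    RefreshingMapOutputFetcher(MapLocationSource locations, int maxFailuresPerHost, int maxLookups) {
        if (maxFailuresPerHost < 1 || maxLookups < 0) {
            throw new IllegalArgumentException("invalid thresholds");
        }
        this.locations = locations;
        this.maxFailuresPerHost = maxFailuresPerHost;
        this.maxLookups = maxLookups;
    }

    /**
     * Copies the output of mapTaskId, starting from the cached host. Returns
     * the host that served it, or null if every location tried kept failing.
     */
    String fetch(String mapTaskId, String cachedHost, Predicate<String> tryCopy) {
        String host = cachedHost;
        int lookups = 0;
        while (host != null) {
            for (int failures = 0; failures < maxFailuresPerHost; failures++) {
                if (tryCopy.test(host)) {
                    return host;
                }
            }
            if (lookups == maxLookups) {
                break; // give up; caller should fail the attempt visibly
            }
            // Too many failures against this host: treat the cached location
            // as stale and ask the location source again.
            lookups++;
            host = locations.currentHost(mapTaskId);
        }
        return null;
    }
}
```

Capping the number of lookups keeps the reducer from spinning forever if the location source keeps reporting the same lost host; at that point a real implementation would presumably fail the reduce attempt so that the job can make progress.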