Inconsistency between Mapper/Reducer bookkeeping
------------------------------------------------

                 Key: HADOOP-1764
                 URL: https://issues.apache.org/jira/browse/HADOOP-1764
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
         Environment: Related: HADOOP-1763 (Same environment)
                      Version: 0.15.0-dev, r565628
                      Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
                      1400 Nodes
            Reporter: Srikanth Kakani
            Priority: Blocker


Refer to HADOOP-1763

This occurs in the scenario described there: once many task trackers are lost,
reducers no longer know where the map outputs are located. They keep retrying
the wrong node, so the reducers run forever without failing.

Relevant logs:
Reducer output:
2007-08-21 09:47:47,046 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_0 
output from node50
2007-08-21 09:47:53,643 WARN org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 copy failed: 
task_200708210155_0003_m_002598_0 from node50
2007-08-21 09:47:53,643 WARN org.apache.hadoop.mapred.ReduceTask: 
java.io.FileNotFoundException: 
http://wm511750.inktomisearch.com:50060/mapOutput?map=task_200708210155_0003_m_002598_0&reduce=6
        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
        at 
org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:207)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)
2007-08-21 09:53:02,327 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_0 
output from node50
2007-08-21 09:53:02,333 WARN org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 copy failed: 
task_200708210155_0003_m_002598_0 from node50
2007-08-21 09:53:02,333 WARN org.apache.hadoop.mapred.ReduceTask: 
java.io.FileNotFoundException: 
http://node50:50060/mapOutput?map=task_200708210155_0003_m_002598_0&reduce=6
        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
        at 
org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:207)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)
2007-08-21 09:57:33,899 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_0 
output from node50.inktomisearch.com.
2007-08-21 09:57:33,908 WARN org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 copy failed: 
task_200708210155_0003_m_002598_0 from node50.inktomisearch.com
2007-08-21 09:57:33,908 WARN org.apache.hadoop.mapred.ReduceTask: 
java.io.FileNotFoundException: 
http://node50:50060/mapOutput?map=task_200708210155_0003_m_002598_0&reduce=6
        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
        at 
org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:207)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)
2007-08-21 10:00:56,337 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 Copying task_200708210155_0003_m_002598_1 
output from node75.inktomisearch.com.
2007-08-21 10:00:56,342 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 done copying 
task_200708210155_0003_m_002598_1 output from node75
2007-08-21 10:02:17,486 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200708210155_0003_r_000006_2 Ignoring obsolete copy result for Map Task: 
task_200708210155_0003_m_002598_0 from host: node50

Looking at TIP task_200708210155_0003_m_002598:

task_200708210155_0003_m_002598_0    node50    KILLED    0.00%    21-Aug-2007 09:38:49    Lost task tracker
task_200708210155_0003_m_002598_1    node75    KILLED    0.00%    21-Aug-2007 11:22:42    Lost task tracker
task_200708210155_0003_m_002598_2    node55    SUCCEEDED    100.00%    21-Aug-2007 11:22:46    21-Aug-2007 11:27:19 (4mins, 33sec)
task_200708210155_0003_m_002598_3    node49    KILLED    100.00%    21-Aug-2007 11:22:48    21-Aug-2007 11:27:48 (4mins, 59sec)    Already completed TIP


Notes:
1. Even at the end, the reducer still seems to fetch data from the wrong
TaskTracker; it does not check with the job tracker for the final (correct)
map output.
2. It seems to retry repeatedly and to sleep for longer between attempts
(judging by the intervals between log messages).
3. An obvious solution may be for the reducer to go to the job tracker and
directly get the correct map output location (I was able to fetch the correct
map output from node55 over HTTP without any errors).
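
To illustrate note 3, here is a rough sketch of the retry behaviour I have in
mind. It is not Hadoop code: LocationSource, Fetcher and copyWithRefresh are
hypothetical names made up for this example, and the thresholds are arbitrary.
The idea is only that after a few failed copies from a cached location, the
reducer asks the job tracker again instead of retrying the same node forever.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are real Hadoop classes. */
public class MapOutputRefetchSketch {

    /** Stand-in for asking the job tracker which attempt currently holds the output. */
    interface LocationSource {
        /** Returns "attemptId@host" for the map's successful attempt, or null if none yet. */
        String currentLocation(String mapId);
    }

    /** Stand-in for the HTTP copy of one map output; returns true on success. */
    interface Fetcher {
        boolean fetch(String location);
    }

    /**
     * Tries to copy a map output, starting from a cached (possibly stale) location.
     * After maxFailuresBeforeRefresh consecutive failures, the location is refreshed
     * from the tracker. Returns the location that worked, or null if every attempt failed.
     */
    static String copyWithRefresh(String mapId, String cachedLocation,
                                  LocationSource tracker, Fetcher fetcher,
                                  int maxFailuresBeforeRefresh, int maxAttempts) {
        String location = cachedLocation;
        int failures = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (location != null && fetcher.fetch(location)) {
                return location;
            }
            failures++;
            if (failures >= maxFailuresBeforeRefresh) {
                // Stop trusting the cached location; ask the tracker again.
                location = tracker.currentLocation(mapId);
                failures = 0;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String mapId = "task_200708210155_0003_m_002598";
        String stale = mapId + "_0@node50";   // attempt lost with its task tracker
        String good = mapId + "_2@node55";    // attempt that actually succeeded
        List<String> tried = new ArrayList<>();

        String used = copyWithRefresh(
                mapId, stale,
                id -> good,                               // tracker knows the good attempt
                loc -> { tried.add(loc); return loc.endsWith("@node55"); },
                2, 5);

        System.out.println("tried: " + tried);
        System.out.println("copied from: " + used);
    }
}
```

In this sketch the reducer gives up on node50 after two failures and then
copies from node55. A real fix would also need to handle the case where the
job tracker has no successful attempt yet (the sketch simply keeps asking
until maxAttempts is exhausted).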
