Hi all,

We are currently in the process of replacing the servers in our Hadoop 0.20.2 production cluster, and in the last couple of days have experienced an error similar to the following (from the JobTracker log) several times, which then appears to hang the JobTracker:
2010-10-15 04:13:38,980 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201010140844_0510 has completed successfully.
2010-10-15 04:13:44,192 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300 retrying...
2010-10-15 04:13:44,592 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300 retrying...
2010-10-15 04:13:44,993 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300 retrying...
2010-10-15 04:13:45,393 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300 retrying...
2010-10-15 04:13:45,794 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300 retrying...

We hadn't seen an issue like this until we added 6 new nodes to our existing 65-node cluster. The only other configuration change made recently was to set up include/exclude files for DFS and MapReduce to "enable" Hadoop's node decommissioning functionality.

Once we encounter this issue (which has happened twice in the last 24 hours), we end up needing to restart the MapReduce processes, which we cannot do on a frequent basis. After the last occurrence, I increased the value of mapred.job.tracker.handler.count to 60 and am waiting to see if it has an impact.

Has anyone else seen this behavior before? Are there any recommendations for trying to prevent this from happening in the future?

Thanks in advance,
-Bobby
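P.S. In case it helps, here is a minimal sketch of how the include/exclude files and the handler count are wired up in hdfs-site.xml and mapred-site.xml. The file paths under /etc/hadoop/conf are just examples of the layout; the property names and the value of 60 match what we actually changed.

In hdfs-site.xml:

  <property>
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/dfs.include</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.exclude</value>
  </property>

In mapred-site.xml:

  <property>
    <name>mapred.hosts</name>
    <value>/etc/hadoop/conf/mapred.include</value>
  </property>
  <property>
    <name>mapred.hosts.exclude</name>
    <value>/etc/hadoop/conf/mapred.exclude</value>
  </property>
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>60</value> <!-- raised from the default of 10 after the last hang -->
  </property>

(Changes to the DFS exclude file are picked up with "hadoop dfsadmin -refreshNodes" rather than an HDFS restart.)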
