You should check the failing reducers' logs carefully; there may be more information about the error there.
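To make the pattern in those logs easier to see, a small script can tally the "Could not complete file" retries per task attempt. This is only a sketch: the sample log lines below are stand-ins for the real 0.20.x task logs under the tasktrackers' userlogs directories, which is where you would actually point it.

```python
import re
from collections import Counter

# Stand-in for the contents of a reducer's task log; in practice, read the
# real log files from the tasktracker's userlogs directory instead.
sample_log = """\
INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003 retrying...
INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003 retrying...
INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103100812_0024_r_000003_0 done
"""

# Pull the task attempt id out of each "Could not complete file" line.
pattern = re.compile(r"Could not complete file.*?(attempt_\d+_\d+_[mr]_\d+_\d+)")

counts = Counter(
    m.group(1)
    for line in sample_log.splitlines()
    if (m := pattern.search(line))
)

# Attempts with the highest retry counts are the ones worth digging into.
for attempt, n in counts.most_common():
    print(attempt, n)
# -> attempt_201103100812_0024_r_000003_0 2
```

If one attempt dominates the counts, comparing its host and timestamps against the datanode logs for the same window is a reasonable next step.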
2011/3/10 Chris Curtin <[email protected]>

> Hi,
>
> The last couple of days we have been seeing 10's of thousands of these
> errors in the logs:
>
> INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
> retrying...
>
> When this is going on the reducer in question is always the last reducer in
> a job.
>
> Sometimes the reducer recovers. Sometimes hadoop kills that reducer, runs
> another and it succeeds. Sometimes hadoop kills the reducer and the new one
> also fails, so it gets killed and the cluster goes into a loop of
> kill/launch/kill.
>
> At first we thought it was related to the size of the data being evaluated
> (4+GB), but we've seen it several times today on < 100 MB.
>
> Searching here or online doesn't show a lot about what this error means and
> how to fix it.
>
> We are running 0.20.2, r911707
>
> Any suggestions?
>
> Thanks,
>
> Chris
