Hi,

For the last couple of days we have been seeing tens of thousands of these errors in the logs:
INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003 retrying...

While this is going on, the reducer in question is always the last reducer in the job. Sometimes the reducer recovers. Sometimes Hadoop kills that reducer, runs another, and it succeeds. Sometimes Hadoop kills the reducer and the new one also fails, so it gets killed too and the cluster goes into a kill/launch/kill loop.

At first we thought it was related to the size of the data being evaluated (4+ GB), but we've seen it several times today on less than 100 MB. Searching here and online doesn't turn up much about what this error means or how to fix it.

We are running 0.20.2, r911707.

Any suggestions?

Thanks,
Chris
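P.S. For anyone else hitting this: as far as I understand it, the message comes from the close path in DFSClient, which repeatedly asks the namenode to mark the file complete and logs this line each time the namenode says the last block is not yet minimally replicated. The sketch below is a rough, simplified paraphrase of that loop (method names, the simulated namenode, and the short sleep are illustrative assumptions, not the exact 0.20.2 source):

```java
public class CompleteFileSketch {
    static int calls = 0;

    // Stand-in for the namenode's complete() RPC: returns true once every
    // block of the file has reached minimum replication. Here we fake it
    // succeeding on the third attempt to show the retry behavior.
    static boolean namenodeComplete(String src) {
        return ++calls >= 3;
    }

    public static void main(String[] args) throws InterruptedException {
        String src = "/offline/working/3/aat/part-00003";  // illustrative path
        boolean fileComplete = false;
        while (!fileComplete) {
            fileComplete = namenodeComplete(src);
            if (!fileComplete) {
                // This is the line flooding our logs: the namenode has not
                // yet seen the file's last block sufficiently replicated.
                System.out.println("Could not complete file " + src + " retrying...");
                Thread.sleep(10);  // the real client sleeps longer between retries
            }
        }
        System.out.println("completed after " + calls + " attempts");
    }
}
```

If that reading is right, tens of thousands of these lines would mean the datanodes holding the reducer's output block are slow to report it (or keep failing), which might point at datanode or replication trouble rather than the reducer itself.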
