You should check the failing reducers' logs carefully. There may be more
information about the error there.
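For example, a minimal sketch of digging through a reducer attempt's logs, assuming the default 0.20 layout where per-attempt task logs live under ${hadoop.log.dir}/userlogs/<attempt_id>/ on the TaskTracker node that ran the attempt (the attempt id is taken from the error message below; the log directory path is an assumption, adjust it for your install):

```shell
# Assumption: default Hadoop 0.20 log layout; adjust HADOOP_LOG_DIR.
LOG_DIR=${HADOOP_LOG_DIR:-/tmp/hadoop-demo-logs}
ATTEMPT=attempt_201103100812_0024_r_000003_0

# (demo only) create a stand-in syslog so the grep below has input;
# on a real TaskTracker this directory already exists
mkdir -p "$LOG_DIR/userlogs/$ATTEMPT"
echo 'INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file ... retrying' \
  > "$LOG_DIR/userlogs/$ATTEMPT/syslog"

# pull out the retry messages plus any nearby errors/exceptions
grep -n -i 'could not complete\|exception\|error' \
  "$LOG_DIR/userlogs/$ATTEMPT/syslog"
```

The DataNode and NameNode logs for the same time window are also worth grepping, since "Could not complete file" usually means HDFS could not finish replicating the file's last block.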

2011/3/10 Chris Curtin <[email protected]>

> Hi,
>
> The last couple of days we have been seeing tens of thousands of these
> errors in the logs:
>
>  INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>
> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
> retrying...
> When this is going on the reducer in question is always the last reducer in
> a job.
>
> Sometimes the reducer recovers. Sometimes hadoop kills that reducer, runs
> another and it succeeds. Sometimes hadoop kills the reducer and the new one
> also fails, so it gets killed and the cluster goes into a loop of
> kill/launch/kill.
>
> At first we thought it was related to the size of the data being evaluated
> (4+ GB), but we've seen it several times today on jobs with under 100 MB of
> input.
>
> Searching here or online doesn't show a lot about what this error means and
> how to fix it.
>
> We are running 0.20.2, r911707
>
> Any suggestions?
>
>
> Thanks,
>
> Chris
>