Doug Cutting wrote:
...
The reason for the move is that, if the file lies on a bad disk block,
we don't want to remove it; we want to keep it around so that the
block is not reused. Local files are normally removed at the end of
the job, but files in the bad_file directory are never automatically
removed.
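
Concretely, the policy amounts to something like the sketch below
(the class and method names here are illustrative, not the actual
Hadoop source):

    // Sketch only: names are illustrative, not the actual Hadoop code.
    import java.io.File;
    import java.io.IOException;

    public class BadFileQuarantine {
      // Files under this directory are never cleaned up, so the bad
      // disk blocks beneath them stay allocated and cannot be reused.
      private static final File BAD_FILE_DIR = new File("bad_file");

      // On a checksum failure, move the file aside rather than delete
      // it; deleting would return the bad block to the free list,
      // where a later write could pick it up again.
      static void quarantine(File corrupt) throws IOException {
        if (!BAD_FILE_DIR.isDirectory() && !BAD_FILE_DIR.mkdirs()) {
          throw new IOException("cannot create " + BAD_FILE_DIR);
        }
        File dest = new File(BAD_FILE_DIR, corrupt.getName());
        if (!corrupt.renameTo(dest)) {
          throw new IOException("cannot move " + corrupt + " to " + dest);
        }
      }
    }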
Makes sense.
In your case I don't think moving the files is helping at all, since
the checksum errors are not caused by bad disk blocks, but rather by
memory errors. So not moving is fine. Maybe we should add a config
parameter to disable moving on checksum error?
Let me try this. Should probably subsume the 'io.skip.checksum.errors'
parameter.
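
As a rough sketch of what that could look like (the name
'fs.checksum.move.bad.files' is made up here; 'io.skip.checksum.errors'
is the existing parameter, and quarantine() is the helper from the
sketch above):

    // Sketch only: "fs.checksum.move.bad.files" is a hypothetical
    // name for the proposed switch.
    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;

    public class ChecksumErrorPolicy {
      static void onChecksumError(Configuration conf, File corrupt)
          throws IOException {
        // Only quarantine when the (hypothetical) switch is on; on a
        // machine with bad RAM rather than a bad disk, turn it off.
        if (conf.getBoolean("fs.checksum.move.bad.files", true)) {
          BadFileQuarantine.quarantine(corrupt);
        }
        // Existing behavior: rethrow unless the user asked to skip.
        if (!conf.getBoolean("io.skip.checksum.errors", false)) {
          throw new IOException("checksum error in " + corrupt);
        }
      }
    }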
Then the task lands on a machine that has started to exhibit checksum
errors. After each failure, the task is rescheduled and it always
seems to land back on the problematic machine. (Is there anything I
can do to randomize the machine a task gets assigned to?)
I think the job tracker now has logic that tries to prevent a task
from being re-assigned to a node that it has previously failed on.
Either that logic is buggy, or you're running an older version.
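
The intended behavior is roughly the following (a sketch under my own
naming, not the actual JobTracker code):

    // Sketch only: this is not the actual JobTracker implementation.
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class TaskPlacement {
      // For each task, the trackers it has already failed on.
      private final Map<String, Set<String>> failedOn =
          new HashMap<String, Set<String>>();

      void recordFailure(String taskId, String tracker) {
        Set<String> bad = failedOn.get(taskId);
        if (bad == null) {
          bad = new HashSet<String>();
          failedOn.put(taskId, bad);
        }
        bad.add(tracker);
      }

      // Prefer a tracker the task has not failed on; only if every
      // candidate has already failed it do we retry anywhere.
      String pickTracker(String taskId, List<String> candidates) {
        Set<String> bad = failedOn.get(taskId);
        for (String t : candidates) {
          if (bad == null || !bad.contains(t)) {
            return t;
          }
        }
        return candidates.isEmpty() ? null : candidates.get(0);
      }
    }

If a task keeps returning to the same node, either recordFailure is
never reached on this failure path or the scheduler isn't consulting
the failed set.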
I'm using the latest. I'll also look into why tasks keep going back
to the same machine.
St.Ack