Re: I get checksum errors! Was: Re: io.skip.checksum.errors was: Re: Hung job

Doug Cutting Thu, 13 Apr 2006 10:50:10 -0700

Michael Stack wrote:

What if we did not move the file? A checksum error would be thrown. If we're inside SequenceFile#next and 'io.skip.checksum.errors' is set, then we'll just try to move to next record. I do not have the experience with the code base to know if not-moving will manufacture weird scenarios elsewhere in the code base.

The reason for the move is that, if the file lies on a bad disk block, we don't want to remove it, we want to keep it around so that that block is not reused. Local files are normally removed at the end of the job, but files in the bad_file directory are never automatically removed.

In your case I don't think moving the files is helping at all, since the checksum errors are not caused by bad disk blocks, but rather by memory errors. So not moving is fine. Maybe we should add a config parameter to disable moving on checksum error?
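
As a rough sketch of what such a parameter might look like: the property name `io.checksum.move.bad.files`, the class, and the method below are all made up for illustration and are not existing Hadoop code. The default keeps today's behavior (move the file aside so a bad block is not reused); setting the property to false leaves the file where it is.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

public class ChecksumFailureHandler {
    // Hypothetical property name, chosen for this sketch only.
    public static final String MOVE_KEY = "io.checksum.move.bad.files";

    /**
     * Called when a checksum error is detected on a local file.
     * Returns the new location if the file was moved aside,
     * or the original path if moving is disabled.
     */
    public static Path onChecksumError(Path file, Path badFilesDir, Properties conf)
            throws IOException {
        // Default to "true" so the existing move-aside behavior is preserved.
        boolean move = Boolean.parseBoolean(conf.getProperty(MOVE_KEY, "true"));
        if (!move) {
            // Leave the file in place; normal end-of-job cleanup still applies.
            return file;
        }
        Files.createDirectories(badFilesDir);
        Path target = badFilesDir.resolve(file.getFileName());
        // Keep the file (and thus its disk block) around instead of deleting it.
        return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With the property set to false, a reader that honors 'io.skip.checksum.errors' could still skip past the damaged record while the file stays put.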

Then the task lands on a machine that has started to exhibit checksum errors. After each failure, the task is rescheduled and it always seems to land back at the problematic machine (Anything I can do about randomizing the machine a task gets assigned to?).

I think the job tracker now has logic that tries to prevent a task from being re-assigned to a node that it has previously failed on. Either that logic is buggy, or you're running an older version.
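
To illustrate the idea only (this is not the job tracker's actual code; the class and method names are invented for the sketch), the bookkeeping amounts to remembering, per task, the nodes it has failed on and skipping those nodes at assignment time:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class FailedNodeTracker {
    // For each task id, the set of nodes on which that task has already failed.
    private final Map<String, Set<String>> failedNodesByTask = new HashMap<>();

    /** Remember that the given task failed on the given node. */
    public void recordFailure(String taskId, String node) {
        failedNodesByTask.computeIfAbsent(taskId, k -> new HashSet<>()).add(node);
    }

    /** A task may run on a node only if it has not failed there before. */
    public boolean mayAssign(String taskId, String node) {
        return !failedNodesByTask
                .getOrDefault(taskId, Collections.emptySet())
                .contains(node);
    }

    /** Pick the first candidate node the task has not failed on, if any. */
    public Optional<String> pickNode(String taskId, List<String> candidates) {
        return candidates.stream().filter(n -> mayAssign(taskId, n)).findFirst();
    }
}
```

If every candidate has already seen a failure of the task, pickNode returns an empty result and the caller has to decide whether to wait or give up.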


Doug
