Michael Stack wrote:
What if we did not move the file? A checksum error would be thrown. If we're inside SequenceFile#next and 'io.skip.checksum.errors' is set, then we'll just try to move to next record. I do not have the experience with the code base to know if not-moving will manufacture weird scenarios elsewhere in the code base.

The reason for the move is that, if the file lies on a bad disk block, we don't want to remove it, we want to keep it around so that that block is not reused. Local files are normally removed at the end of the job, but files in the bad_file directory are never automatically removed.

In your case I don't think moving the files is helping at all, since the checksum errors are not caused by bad disk blocks, but rather by memory errors. So not moving is fine. Maybe we should add a config parameter to disable moving on checksum error?
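To make the config-parameter idea concrete, here is a rough sketch in plain Java, not a patch against the actual code. The property name `io.checksum.error.move.file`, the class and the method are all made up for illustration. The default of `true` keeps today's behaviour of moving the file aside.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

public class ChecksumErrorPolicy {
    // Hypothetical property name; not an existing Hadoop setting.
    public static final String MOVE_KEY = "io.checksum.error.move.file";

    /**
     * Called when a checksum error is reported for a local file.
     * Returns the path where the file lives afterwards.
     */
    public static Path reportChecksumFailure(Properties conf, Path file, Path badFileDir)
            throws IOException {
        // Default to "true" so that the existing move-aside behaviour is preserved.
        boolean move = Boolean.parseBoolean(conf.getProperty(MOVE_KEY, "true"));
        if (!move) {
            // Leave the file in place; the caller can still skip the bad record.
            return file;
        }
        // Park the file so that a possibly bad disk block is not reused.
        Files.createDirectories(badFileDir);
        return Files.move(file, badFileDir.resolve(file.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With the flag set to `false`, a machine with memory errors would stop filling up the bad_file directory, while the default still protects against bad disk blocks.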

Then the task lands on a machine that has started to exhibit checksum errors. After each failure, the task is rescheduled and it always seems to land back at the problematic machine (Anything I can do about randomizing the machine a task gets assigned to?).

I think the job tracker now has logic that tries to prevent a task from being re-assigned to a node that it has previously failed on. Either that logic is buggy, or you're running an older version.
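Roughly, the idea is to remember which nodes each task has failed on and to skip those nodes when the task is rescheduled. The sketch below only illustrates that idea; it is not the job tracker's actual code, and the class and method names are invented.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class FailedNodeTracker {
    // Task id -> nodes on which that task has already failed.
    private final Map<String, Set<String>> failedNodes = new HashMap<>();

    /** Remember that the given task failed on the given node. */
    public void recordFailure(String taskId, String node) {
        failedNodes.computeIfAbsent(taskId, k -> new HashSet<>()).add(node);
    }

    /**
     * Pick the first candidate node the task has not failed on.
     * Returns empty if the task has failed on every candidate.
     */
    public Optional<String> pickNode(String taskId, List<String> candidates) {
        Set<String> failed = failedNodes.getOrDefault(taskId, Collections.emptySet());
        for (String node : candidates) {
            if (!failed.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }
}
```

If a task keeps returning to the same machine, either no such bookkeeping is in effect in the version you are running, or the failure is not being recorded against that node.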

Doug
