Doug Cutting wrote:
...
The reason for the move is that, if the file lies on a bad disk block,
we don't want to remove it; we want to keep it around so that the
block is not reused. Local files are normally removed at the end of
the job, but files in the bad_file directory are never automatically
removed.
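
Concretely, the policy amounts to something like the sketch below
(the class and method names here are illustrative, not the actual
Hadoop source):

    // Sketch only: names are illustrative, not the actual Hadoop code.
    import java.io.File;
    import java.io.IOException;

    public class BadFileQuarantine {
      // Files under this directory are never cleaned up, so the bad
      // disk blocks beneath them stay allocated and cannot be reused.
      private static final File BAD_FILE_DIR = new File("bad_file");

      // On a checksum failure, move the file aside rather than delete
      // it; deleting would return the bad block to the free list,
      // where a later write could pick it up again.
      static void quarantine(File corrupt) throws IOException {
        if (!BAD_FILE_DIR.isDirectory() && !BAD_FILE_DIR.mkdirs()) {
          throw new IOException("cannot create " + BAD_FILE_DIR);
        }
        File dest = new File(BAD_FILE_DIR, corrupt.getName());
        if (!corrupt.renameTo(dest)) {
          throw new IOException("cannot move " + corrupt + " to " + dest);
        }
      }
    }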
Makes sense.
In your case I don't think moving the files is helping at all, since
the checksum errors are not caused by bad disk blocks, but rather by
memory errors. So not moving is fine. Maybe we should add a config
parameter to disable moving on checksum error?
Let me try this. Should probably subsume the 'io.skip.checksum.errors'
parameter.
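
As a rough sketch of what that could look like (the name
'fs.checksum.move.bad.files' is made up here; 'io.skip.checksum.errors'
is the existing parameter, and quarantine() is the helper from the
sketch above):

    // Sketch only: "fs.checksum.move.bad.files" is a hypothetical
    // name for the proposed switch.
    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;

    public class ChecksumErrorPolicy {
      static void onChecksumError(Configuration conf, File corrupt)
          throws IOException {
        // Only quarantine when the (hypothetical) switch is on; on a
        // machine with bad RAM rather than a bad disk, turn it off.
        if (conf.getBoolean("fs.checksum.move.bad.files", true)) {
          BadFileQuarantine.quarantine(corrupt);
        }
        // Existing behavior: rethrow unless the user asked to skip.
        if (!conf.getBoolean("io.skip.checksum.errors", false)) {
          throw new IOException("checksum error in " + corrupt);
        }
      }
    }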
Then the task lands on a machine that has started to exhibit checksum
errors. After each failure, the task is rescheduled and it always
seems to land back on the problematic machine. (Is there anything I
can do to randomize the machine a task gets assigned to?)
I think the job tracker now has logic that tries to prevent a task
from being re-assigned to a node that it has previously failed on.
Either that logic is buggy, or you're running an older version.
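
The intended behavior is roughly the following (a sketch under my own
naming, not the actual JobTracker code):

    // Sketch only: this is not the actual JobTracker implementation.
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class TaskPlacement {
      // For each task, the trackers it has already failed on.
      private final Map<String, Set<String>> failedOn =
          new HashMap<String, Set<String>>();

      void recordFailure(String taskId, String tracker) {
        Set<String> bad = failedOn.get(taskId);
        if (bad == null) {
          bad = new HashSet<String>();
          failedOn.put(taskId, bad);
        }
        bad.add(tracker);
      }

      // Prefer a tracker the task has not failed on; only if every
      // candidate has already failed it do we retry anywhere.
      String pickTracker(String taskId, List<String> candidates) {
        Set<String> bad = failedOn.get(taskId);
        for (String t : candidates) {
          if (bad == null || !bad.contains(t)) {
            return t;
          }
        }
        return candidates.isEmpty() ? null : candidates.get(0);
      }
    }

If a task keeps returning to the same node, either recordFailure is
never reached on this failure path or the scheduler isn't consulting
the failed set.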
I'm using the latest. I'll also look into why tasks keep going back
to the same machine.
St.Ack