Doug Cutting wrote:
Michael Stack wrote:
One question: 'io.skip.checksum.errors' is only read in
SequenceFile#next, but the LocalFileSystem checksum error "move-aside"
handler can be triggered by callers other than SequenceFile#next.
If so, stopping the LocalFileSystem move-aside on
checksum error is probably not the right thing to do.
Right, we ideally want SequenceFile to disable it when that flag is
set. But that would take a lot of plumbing to implement!
Yes.
Perhaps we should instead fix this by not closing the file in
LocalFileSystem.reportChecksumFailure. Then it won't be able to move
the file aside on Windows. To fix that, we can (1) try to move it
without closing it (since something on the stack will eventually close
it anyway, and may still need it open) and (2) if the move fails, try
closing it and moving it (for Windows). The net effect is that
io.skip.checksum.errors will then work on Unix but not on Windows. Or
we could skip moving it altogether, since it seems that most checksum
errors we're seeing are not disk errors but memory errors before the
data hits the disk.
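
A minimal sketch of the two-step move described above, written against
plain java.nio rather than the Hadoop code. The class MoveAside, the
method moveAside, and the bad-files directory are hypothetical names
used only for illustration; this is not what
LocalFileSystem.reportChecksumFailure actually does.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class MoveAside {
    /**
     * Hypothetical helper: moves a suspect file into badDir.
     * Returns true if the move worked while the stream was still open,
     * false if the stream had to be closed first.
     */
    static boolean moveAside(Path file, Path badDir, Closeable openStream)
            throws IOException {
        Files.createDirectories(badDir);
        Path target = badDir.resolve(file.getFileName());
        try {
            // (1) Try to move without closing; a caller further up the
            // stack may still need the stream and will close it later.
            Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            // (2) The move failed (for example, an open handle blocks it
            // on Windows): close the stream and try once more.
            openStream.close();
            Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }
}
```

When step (1) succeeds, the reader keeps its open handle, which is what
would let a skip flag keep working; when step (2) is needed, the stream
is gone and the reader cannot continue.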
What if we did not move the file? A checksum error would be thrown. If
we're inside SequenceFile#next and 'io.skip.checksum.errors' is set,
then we'll just try to move to the next record. I do not have the
experience with the code base to know if not-moving will manufacture
weird scenarios elsewhere in the code base.
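
To make the no-move behaviour concrete, here is a rough sketch of a read
loop that either rethrows a checksum error or skips to the next record,
depending on a flag. Everything in it (SkipChecksumSketch, Record,
ChecksumError, readAll, and the per-record CRC32) is a made-up stand-in,
not the SequenceFile implementation.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class SkipChecksumSketch {
    /** Hypothetical stand-in for a checksum failure. */
    static class ChecksumError extends IOException {
        ChecksumError(String msg) { super(msg); }
    }

    /** Hypothetical record: payload plus the checksum stored with it. */
    static final class Record {
        final byte[] data;
        final long storedCrc;
        Record(byte[] data, long storedCrc) {
            this.data = data;
            this.storedCrc = storedCrc;
        }
    }

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /**
     * Returns the payloads that pass verification. With the flag unset,
     * the first bad record aborts the read; with it set, bad records
     * are dropped and reading continues with the next one.
     */
    static List<byte[]> readAll(List<Record> records, boolean skipChecksumErrors)
            throws IOException {
        List<byte[]> good = new ArrayList<>();
        for (Record r : records) {
            try {
                if (crcOf(r.data) != r.storedCrc) {
                    throw new ChecksumError("checksum mismatch");
                }
                good.add(r.data);
            } catch (ChecksumError e) {
                if (!skipChecksumErrors) {
                    throw e; // flag unset: the error propagates
                }
                // flag set: skip this record and move on
            }
        }
        return good;
    }
}
```

Nothing is moved or closed here, so the reader can keep going after a
bad record, which is the scenario I am asking about.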
A checksum failure on a local file currently causes the task to fail.
But it takes multiple checksum errors per job to get a job to fail,
right? Is that what's happening?
It is. Jobs are long-running -- a day or more (I should probably try
cutting them into smaller pieces). What I usually see is a failure for
some genuinely odd reason. Then the task lands on a machine that has
started to exhibit checksum errors. After each failure, the task is
rescheduled and it always seems to land back on the problematic machine.
(Is there anything I can do to randomize which machine a task gets assigned to?)
St.Ack