Also, reading from a block supports a 'real skip', i.e., it does not
verify the checksum if an entire checksum block (usually 512 bytes)
falls within the skip range. That is another reason to implement our
own skip().
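To make the 'real skip' arithmetic concrete, here is a rough sketch of
how one could count the checksum blocks that lie entirely inside a skip
range. The class and method names are made up for illustration and are
not Hadoop code; 512 is only the usual block size mentioned above.

```java
/** Hypothetical helper, not part of Hadoop. */
public class SkipRange {
    /**
     * Counts the checksum blocks that fall entirely inside
     * [pos, pos + n). Only these can be skipped without verification.
     * Assumes pos >= 0, n >= 0 and bytesPerChecksum > 0.
     */
    public static long fullBlocksInRange(long pos, long n, int bytesPerChecksum) {
        // Index of the first block that starts at or after pos.
        long first = (pos + bytesPerChecksum - 1) / bytesPerChecksum;
        // Index one past the last block that ends at or before pos + n.
        long end = (pos + n) / bytesPerChecksum;
        return Math.max(0, end - first);
    }
}
```

For example, skipping 1000 bytes from offset 100 with 512-byte blocks
covers only one whole block (bytes 512 to 1023).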
Raghu Angadi wrote:
In Hadoop, whenever possible, we read directly into the user's buffer.
E.g. in ChecksumFileSystem we read into the user buffer and then compute
the checksum over it; I do the same in the new Block level CRCs. This is
very useful since it avoids an extra copy in most cases.
We don't define skip() for our extensions of InputStream, since we know
the default implementation calls read(). The problem is that
InputStream.skip() reads into a *static* byte buffer (from its
perspective, that makes sense, since the skipped data is discarded). So
if two skip() calls run in parallel on unrelated streams, they share
that buffer, and one can overwrite the bytes the other has read but not
yet checksummed, which shows up as checksum errors.
When this happened with Block level CRCs, I wasted time trying to find a
bug in the new code.
My preferred fix would be to implement skip() at the Hadoop level.
Always reading into an internal buffer and copying to the user buffer
would also work, but it is a very defensive fix.
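A minimal sketch of what such a skip() could look like, assuming a
simple wrapper stream. PrivateSkipInputStream and SKIP_CHUNK are
invented names for illustration, not existing Hadoop classes. The only
point is that the scratch buffer belongs to the stream instance, so
parallel skips on unrelated streams never touch the same bytes.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper, not a Hadoop class. */
public class PrivateSkipInputStream extends InputStream {
    private static final int SKIP_CHUNK = 4096; // assumed scratch size

    private final InputStream in;
    private byte[] skipBuf; // allocated lazily, owned by this stream only

    public PrivateSkipInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // A checksumming stream would verify the bytes it just placed in
        // b here, so b must not be written by another thread meanwhile.
        return in.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        if (skipBuf == null) {
            skipBuf = new byte[SKIP_CHUNK];
        }
        long remaining = n;
        while (remaining > 0) {
            int want = (int) Math.min(skipBuf.length, remaining);
            int got = read(skipBuf, 0, want);
            if (got < 0) {
                break; // end of stream
            }
            remaining -= got;
        }
        return n - remaining; // bytes actually skipped
    }
}
```

This still assumes each stream is used by one thread at a time; it only
removes the sharing between different streams.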
Raghu.
- Re: Problem with InputStream.skip() Raghu Angadi