Stephen Jung (Stripe) created HADOOP-17096:
----------------------------------------------
Summary: ZStandardCompressor throws java.lang.InternalError: Error
(generic)
Key: HADOOP-17096
URL: https://issues.apache.org/jira/browse/HADOOP-17096
Project: Hadoop Common
Issue Type: Bug
Components: io
Affects Versions: 3.2.1
Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1
linking against libzstd 1.3.1. The bug is difficult to reproduce in an
end-to-end environment (e.g. running an actual Hadoop job with zstd
compression) because it's very sensitive to the exact input and output
characteristics. I reproduced the bug by turning one of the existing unit
tests into a crude fuzzer, but I'm not sure upstream will accept that patch,
so I've attached it separately on this ticket.
Note that the existing unit test testCompressingWithOneByteOutputBuffer does
not reproduce this bug. That test uses the license file as input, and this
file is too small: libzstd has internal buffering (in our environment it
appears to be 128 kilobytes), while the license file is only 10 kilobytes.
libzstd is therefore able to consume all the input and compress it in a single
call, then return pieces of its internal buffer one byte at a time. Since all
the input is consumed in a single call, uncompressedDirectBufOff and
uncompressedDirectBufLen are both set to zero, and thus the bug does not
reproduce.
Reporter: Stephen Jung (Stripe)
A bug in index handling causes ZStandardCompressor.c to pass a malformed
ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)", which is
thrown as a java.lang.InternalError. The crux of the issue is two variables,
uncompressedDirectBufLen and uncompressedDirectBufOff. The Hadoop code counts
uncompressedDirectBufOff from the start of uncompressedDirectBuf, but counts
uncompressedDirectBufLen from uncompressedDirectBufOff. libzstd, however,
counts pos and size both from the start of the buffer. As a result, this line
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
passes a malformed buffer to libzstd, where pos > size. Here is a longer
description of the bug in case this abstract explanation is unclear:
----
Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with
five bytes of input. This results in uncompressedDirectBufOff=0 and
uncompressedDirectBufLen=5
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).
Then we call compress(), which initializes a ZSTD_inBuffer
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
The definition of those libzstd structs is here:
https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 (reproduced
after this paragraph). Note that we set size=uncompressedDirectBufLen and
pos=uncompressedDirectBufOff. The
ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd
returns from the compression function, it updates the ZSTD_inBuffer struct to
indicate how many bytes were consumed
(https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
Note that pos is advanced, but size is unchanged.
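For reference, here is the struct definition from zstd.h v1.3.1 (copied from
the header linked above, lightly reformatted):

{code:c}
typedef struct ZSTD_inBuffer_s {
    const void* src;  /**< start of input buffer */
    size_t size;      /**< size of input buffer */
    size_t pos;       /**< position where reading stopped. Will be updated.
                           Necessarily 0 <= pos <= size */
} ZSTD_inBuffer;
{code}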
Now, libzstd does not guarantee that the entire input will be compressed in a
single call of the compression function. (Some of the compression libraries
used by hadoop, such as snappy, _do_ provide this guarantee, but libzstd is not
one of them.) So the Hadoop native code updates uncompressedDirectBufOff and
uncompressedDirectBufLen using the updated ZSTD_inBuffer (sketched after this
paragraph):
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
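Paraphrasing that update (a sketch of the logic, not the verbatim Hadoop
source):

{code:c}
/* After ZSTD_compressStream() returns, the native code does roughly: */
uncompressedDirectBufOff = input.pos;               /* counted from the start of the buffer */
uncompressedDirectBufLen = input.size - input.pos;  /* counted from the new offset (the bug) */
{code}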
Now, returning to our example, we started with 5 bytes of uncompressed input.
Suppose libzstd compressed 4 of those bytes, leaving one unread. This would
result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four bytes
were consumed). The Hadoop native code would then set
uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1
(five minus four equals one).
Since some of the input was not consumed, we will eventually call compress()
again. Then we instantiate another ZSTD_inBuffer struct, this time with size=1
and pos=4. This is a bug: libzstd expects size and pos to both be counted
from the start of the buffer, so pos > size is malformed. libzstd therefore
returns an error
https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932
which gets escalated as a java.lang.InternalError. (A standalone repro is
sketched after this paragraph.)
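For illustration, here is a minimal standalone C program (not the Hadoop JNI
code) that hands libzstd the same malformed size/pos pair; against libzstd
1.3.1 it should print the same error name:

{code:c}
#include <stdio.h>
#include <zstd.h>

int main(void) {
    ZSTD_CStream* cs = ZSTD_createCStream();
    ZSTD_initCStream(cs, 1);

    const char in[5] = {'a', 'b', 'c', 'd', 'e'};
    char out[64];

    /* size=1, pos=4: the malformed pair the Hadoop native code builds
     * on the second compress() call in the example above */
    ZSTD_inBuffer  input  = { in, 1, 4 };
    ZSTD_outBuffer output = { out, sizeof(out), 0 };

    size_t rc = ZSTD_compressStream(cs, &output, &input);
    if (ZSTD_isError(rc))
        printf("%s\n", ZSTD_getErrorName(rc));  /* prints "Error (generic)" */

    ZSTD_freeCStream(cs);
    return 0;
}
{code}

Compile with, e.g., cc repro.c -lzstd.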
I will be submitting a pull request on GitHub with a fix for this bug. The
key is that the Hadoop code should handle offsets the same way libzstd does,
i.e. uncompressedDirectBufLen should be counted from the start of
uncompressedDirectBuf, not from uncompressedDirectBufOff.
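As a sketch of that direction (illustrative only, not the actual patch), the
post-call bookkeeping would become:

{code:c}
/* After ZSTD_compressStream() returns: */
uncompressedDirectBufOff = input.pos;   /* counted from the start of the buffer */
uncompressedDirectBufLen = input.size;  /* also counted from the start, matching libzstd */

/* The next call then rebuilds a well-formed ZSTD_inBuffer; for the
 * five-byte example: size=5, pos=4, so pos <= size holds. */
ZSTD_inBuffer input = { uncompressedDirectBuf, uncompressedDirectBufLen,
                        uncompressedDirectBufOff };
{code}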