[
https://issues.apache.org/jira/browse/HADOOP-17096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang resolved HADOOP-17096.
--------------------------------------
Fix Version/s: 3.1.5
               3.4.0
               3.3.1
               3.2.2
Resolution: Fixed
Great fix. Thanks.
I'm really sorry it took so long to review it.
> ZStandardCompressor throws java.lang.InternalError: Error (generic)
> -------------------------------------------------------------------
>
> Key: HADOOP-17096
> URL: https://issues.apache.org/jira/browse/HADOOP-17096
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 3.2.1
> Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1
> linking to libzstd 1.3.1. The bug is difficult to reproduce in an end-to-end
> environment (e.g. running an actual Hadoop job with zstd compression) because
> it's very sensitive to the exact input and output characteristics. I
> reproduced the bug by turning one of the existing unit tests into a crude
> fuzzer, but I'm not sure upstream will accept that patch, so I've attached it
> separately on this ticket.
> Note that the existing testCompressingWithOneByteOutputBuffer unit test
> fails to reproduce this bug. This is because it uses the license file as
> input, and that file is too small. libzstd has internal buffering (in our
> environment it seems to be 128 kilobytes), and the license file is only 10
> kilobytes. Thus libzstd is able to consume all the input and compress it in a
> single call, then return pieces of its internal buffer one byte at a time.
> Since all the input is consumed in a single call, uncompressedDirectBufOff
> and uncompressedDirectBufLen are both set to zero and thus the bug does not
> reproduce. (A quick way to check the buffer size is sketched after the issue
> fields below.)
> Reporter: Stephen Jung (Stripe)
> Assignee: Stephen Jung (Stripe)
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5
>
> Attachments: fuzztest.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
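> As noted in the Environment section above, inputs smaller than libzstd's
> internal buffer are consumed in a single call. libzstd publishes its
> recommended streaming input size via ZSTD_CStreamInSize(), which is
> consistent with the ~128 kilobyte figure we observed. A minimal check
> (assuming a system libzstd, linked with -lzstd):
>
>     #include <stdio.h>
>     #include <zstd.h>
>
>     int main(void) {
>         /* Inputs smaller than this recommended size are typically
>          * consumed by libzstd in a single streaming call. */
>         printf("recommended input size: %zu bytes\n", ZSTD_CStreamInSize());
>         return 0;
>     }
>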
> A bug in index handling causes ZStandardCompressor.c to pass a malformed
> ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" result,
> which is surfaced to the caller as a java.lang.InternalError. The crux of the
> issue is two variables, uncompressedDirectBufLen and uncompressedDirectBufOff.
> The Hadoop code counts uncompressedDirectBufOff from the start of
> uncompressedDirectBuf, then counts uncompressedDirectBufLen from
> uncompressedDirectBufOff. However, libzstd considers pos and size to both be
> counted from the start of the buffer. As a result, this line
> https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
> causes a malformed buffer, with pos>size, to be passed to libzstd. Here is a
> longer walkthrough in case the abstract explanation is unclear:
> ----
> Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with
> five bytes of input. This results in uncompressedDirectBufOff=0 and
> uncompressedDirectBufLen=5
> (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).
> Then we call compress(), which initializes a ZSTD_inBuffer
> (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
> The definition of those libzstd structs is here:
> https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261. Note that
> we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The
> ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd
> returns from the compression function, it updates the ZSTD_inBuffer struct to
> indicate how many bytes were consumed
> (https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
> Note that pos is advanced, but size is unchanged.
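>
> For reference, here is the struct in question (paraphrased from the zstd.h
> link above); the invariant 0 <= pos <= size is exactly what gets violated
> below:
>
>     typedef struct ZSTD_inBuffer_s {
>         const void* src;  /* start of input buffer */
>         size_t size;      /* size of input buffer, counted from src */
>         size_t pos;       /* position where reading stopped; updated by
>                            * libzstd, also counted from the start of src */
>     } ZSTD_inBuffer;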
> Now, libzstd does not guarantee that the entire input will be compressed in a
> single call of the compression function. (Some of the compression libraries
> used by Hadoop, such as Snappy, _do_ provide this guarantee, but libzstd is
> not one of them.) So the Hadoop native code updates uncompressedDirectBufOff
> and uncompressedDirectBufLen using the updated ZSTD_inBuffer:
> https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
> Now, returning to our example, we started with 5 bytes of uncompressed input.
> Suppose libzstd compressed 4 of those bytes, leaving one unread. This would
> result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four
> bytes were consumed). The hadoop native code would then set
> uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1
> (five minus four equals one).
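>
> In code, the effect of that update is roughly the following (a sketch of
> the behavior just described, not the actual JNI source):
>
>     /* After libzstd consumed 4 of 5 bytes: input.pos == 4, input.size == 5 */
>     uncompressedDirectBufOff = input.pos;               /* 4 */
>     uncompressedDirectBufLen = input.size - input.pos;  /* 1: counted from
>                                                          * the offset, not
>                                                          * from the start of
>                                                          * the buffer */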
> Since some of the input was not consumed, we will eventually call compress()
> again. Then we instantiate another ZSTD_inBuffer struct, this time with
> size=1 and pos=4. This is a bug: libzstd expects size and pos to both be
> counted from the start of the buffer, so pos>size is invalid. libzstd
> therefore returns an error
> (https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932),
> which gets escalated as a java.lang.InternalError.
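>
> Put together, the failing second call looks roughly like this (a sketch
> using the values from the example; uncompressedBytes is just a stand-in
> name for the pointer to the direct buffer):
>
>     ZSTD_inBuffer input;
>     input.src  = uncompressedBytes;
>     input.size = uncompressedDirectBufLen;  /* 1 */
>     input.pos  = uncompressedDirectBufOff;  /* 4: pos > size, so libzstd
>                                              * reports "Error (generic)" */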
> I will be submitting a pull request on GitHub with a fix for this bug. The
> key is that the Hadoop code should handle offsets the same way libzstd does,
> i.e. uncompressedDirectBufLen should be counted from the start of
> uncompressedDirectBuf, not from uncompressedDirectBufOff.
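>
> Concretely, the intended bookkeeping keeps libzstd's convention on both
> sides of the call, roughly like this (a sketch of the fix direction, not
> the actual patch; uncompressedBytes is again a stand-in name):
>
>     input.src  = uncompressedBytes;
>     input.size = uncompressedDirectBufLen;  /* total bytes in buffer: 5 */
>     input.pos  = uncompressedDirectBufOff;  /* bytes already consumed: 4 */
>
>     /* ... ZSTD_compressStream(stream, &output, &input) ... */
>
>     uncompressedDirectBufOff = input.pos;   /* advance the offset */
>     uncompressedDirectBufLen = input.size;  /* unchanged: still counted
>                                              * from the start of the
>                                              * buffer */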