[ https://issues.apache.org/jira/browse/HADOOP-17096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka updated HADOOP-17096:
-----------------------------------
    Fix Version/s:     (was: 3.2.3)

> ZStandardCompressor throws java.lang.InternalError: Error (generic)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-17096
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17096
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 3.2.1
>         Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1 
> linking to libzstd 1.3.1. The bug is difficult to reproduce in an end-to-end 
> environment (e.g. running an actual Hadoop job with zstd compression) because 
> it is very sensitive to the exact input and output characteristics. I 
> reproduced the bug by turning one of the existing unit tests into a crude 
> fuzzer, but I'm not sure upstream will accept that patch, so I've attached it 
> separately on this ticket.
> Note that the existing unit test for testCompressingWithOneByteOutputBuffer 
> fails to reproduce this bug. This is because it's using the license file as 
> input, and this file is too small. libzstd has internal buffering (in our 
> environment it seems to be 128 kilobytes), and the license file is only 10 
> kilobytes. Thus libzstd is able to consume all the input and compress it in a 
> single call, then return pieces of its internal buffer one byte at a time. 
> Since all the input is consumed in a single call, uncompressedDirectBufOff 
> and uncompressedDirectBufLen are both set to zero and thus the bug does not 
> reproduce.
>            Reporter: Stephen Jung (Stripe)
>            Assignee: Stephen Jung (Stripe)
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.2.2, 3.3.1, 3.4.0
>
>         Attachments: fuzztest.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> A bug in index handling causes ZStandardCompressor.c to pass a malformed 
> ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" that gets 
> thrown. The crux of the issue is two variables, uncompressedDirectBufLen and 
> uncompressedDirectBufOff. The hadoop code counts uncompressedDirectBufOff 
> from the start of uncompressedDirectBuf, then uncompressedDirectBufLen is 
> counted from uncompressedDirectBufOff. However, libzstd considers pos and 
> size to both be counted from the start of the buffer. As a result, this line 
> https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
>  causes a malformed buffer to be passed to libzstd, where pos>size. Here's a 
> longer description of the bug in case this abstract explanation is unclear:
> ----
> Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with 
> five bytes of input. This results in uncompressedDirectBufOff=0 and 
> uncompressedDirectBufLen=5 
> (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).
> Then we call compress(), which initializes a ZSTD_inBuffer 
> (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
>  The definition of those libzstd structs is here 
> https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 - note that 
> we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The 
> ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd 
> returns from the compression function, it updates the ZSTD_inBuffer struct to 
> indicate how many bytes were consumed 
> (https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
>  Note that pos is advanced, but size is unchanged.
> Now, libzstd does not guarantee that the entire input will be compressed in a 
> single call of the compression function. (Some of the compression libraries 
> used by hadoop, such as snappy, _do_ provide this guarantee, but libzstd is 
> not one of them.) So the hadoop native code updates uncompressedDirectBufOff 
> and uncompressedDirectBufLen using the updated ZSTD_inBuffer: 
> https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
> Now, returning to our example, we started with 5 bytes of uncompressed input. 
> Suppose libzstd compressed 4 of those bytes, leaving one unread. This would 
> result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four 
> bytes were consumed). The hadoop native code would then set 
> uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1 
> (five minus four equals one).
> Since some of the input was not consumed, we will eventually call compress() 
> again. Then we instantiate another ZSTD_inBuffer struct, this time with 
> size=1 and pos=4. This is the bug - libzstd expects pos and size to both be 
> counted from the start of the buffer, so pos>size is invalid. It therefore 
> returns an error 
> https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932
>  which the native wrapper surfaces as a java.lang.InternalError.
> I will be submitting a pull request on GitHub with a fix for this bug. The 
> key is that the Hadoop code should handle offsets the same way libzstd does, 
> i.e. uncompressedDirectBufLen should be counted from the start of 
> uncompressedDirectBuf, not from uncompressedDirectBufOff.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
