[ https://issues.apache.org/jira/browse/HADOOP-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042384#comment-18042384 ]

Fengyu Cao commented on HADOOP-18799:
-------------------------------------

I found that when Hadoop decompresses a file produced by the zstd command-line 
tool, and the original uncompressed size is smaller than 129 KiB, the following 
error is consistently reproducible:

java.lang.InternalError: Src size is incorrect

My scenario is Spark reading a Zstandard-compressed file through the Hadoop 
native library; I assume this behaves identically to reading the file directly 
through the Hadoop codec API.
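
For reference, this is roughly what I mean by reading the file directly through 
the API; a minimal sketch using the Hadoop codec classes (the path is a 
placeholder for my test file, and I have not verified this exact snippet):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ZstdDirectReadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder path; any .zst file whose uncompressed size is below 129 KiB should do.
    Path path = new Path("hdfs:///tmp/test_zstd/file_128KiB.txt.zst");

    Configuration conf = new Configuration();
    FileSystem fs = path.getFileSystem(conf);

    // Resolves ZStandardCodec from the .zst extension.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        codec.createInputStream(fs.open(path)), StandardCharsets.UTF_8))) {
      long lines = 0;
      while (reader.readLine() != null) {
        lines++;
      }
      System.out.println("lines read: " + lines);
    }
  }
}
{code}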

The versions used are:
- Apache Spark 3.5.7 (Scala 2.12) with Hadoop 3.3.4
- libhadoop.so from Apache Hadoop 3.3.6
- libzstd 1.5.4

Reproduction steps:
1. yes a | head -n 65536 > file_128KiB.txt  # generate a 128 KiB file
2. zstd file_128KiB.txt
3. zstd -lv file_128KiB.txt.zst && zstdcat file_128KiB.txt.zst | head -n 3  # confirm the file is a valid zstd file
4. pyspark
5. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
6. Error: java.lang.InternalError: Src size is incorrect

>>> spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
25/12/03 11:05:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: Src size is incorrect
        at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:187)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.base/java.io.InputStream.read(InputStream.java:218)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:191)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:227)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:185)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:158)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:198)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)

In addition, the following sequence results in a different error,
java.lang.InternalError: Restored data doesn't match checksum

1a. yes a | head -n 66048 > file_129KiB.txt  # generate a 129 KiB file
1b. yes a | head -n 65536 > file_128KiB.txt  # generate a 128 KiB file
2a. zstd file_129KiB.txt
2b. zstd file_128KiB.txt
3a. zstd -lv file_129KiB.txt.zst && zstdcat file_129KiB.txt.zst | head -n 3
3b. zstd -lv file_128KiB.txt.zst && zstdcat file_128KiB.txt.zst | head -n 3
4. pyspark
5. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_129KiB.txt.zst").show()
6. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()  # no error
7. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()  # read the same 128 KiB file again
8. Error: java.lang.InternalError: Restored data doesn't match checksum

>>> spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
25/12/03 11:08:48 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.InternalError: Restored data doesn't match checksum
        at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:187)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.base/java.io.InputStream.read(InputStream.java:218)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:191)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:227)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:185)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:158)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:198)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
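
To try to take Spark out of the picture, here is a hedged sketch that replays 
the same read sequence (129 KiB, then the 128 KiB file twice) directly through 
the Hadoop codec API with a pooled decompressor from CodecPool. The 
pooling/reuse is only my assumption about what the read path does, not 
something I have confirmed, and the paths are placeholders:

{code:java}
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;

public class ZstdReuseRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // Placeholder paths mirroring steps 5-7 above: 129 KiB first, then the 128 KiB file twice.
    String[] inputs = {
        "hdfs:///tmp/test_zstd/file_129KiB.txt.zst",
        "hdfs:///tmp/test_zstd/file_128KiB.txt.zst",
        "hdfs:///tmp/test_zstd/file_128KiB.txt.zst"
    };

    for (String input : inputs) {
      Path path = new Path(input);
      FileSystem fs = path.getFileSystem(conf);
      CompressionCodec codec = factory.getCodec(path);

      // Borrow a decompressor from the pool so that one instance can be reused
      // across files (my assumption about how the read path behaves).
      Decompressor decompressor = CodecPool.getDecompressor(codec);
      try (InputStream in = codec.createInputStream(fs.open(path), decompressor)) {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
          total += n;
        }
        System.out.println(input + ": " + total + " uncompressed bytes");
      } finally {
        CodecPool.returnDecompressor(decompressor);
      }
    }
  }
}
{code}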

> Zstd compressor fails with src size is incorrect
> ------------------------------------------------
>
>                 Key: HADOOP-18799
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18799
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: native
>    Affects Versions: 3.3.0
>            Reporter: Frens Jan Rumph
>            Priority: Major
>
> It seems like I've hit an issue similar to 
> https://issues.apache.org/jira/browse/HADOOP-15822. I haven't been able to 
> reproduce the issue though. I did manage to add a little bit of logging to 
> org.apache.hadoop.io.compress.zstd.ZStandardCompressor. I've captured the off 
> and len arguments of compress and the srcOffset and srcLen arguments for 
> deflateBytesDirect:
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress           0 131591}}
> {{deflateBytesDirect 131072 519}}
> Just after that last line the process dies with a java.lang.InternalError: 
> Src size is incorrect:
> {{org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.InternalError: Src size is incorrect}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.deflateBytesDirect(Native Method)}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.compress(ZStandardCompressor.java:220)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:76)}}
> {{at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)}}
> {{at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)}}
> {{at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1569)}}
> {{...}}
> I have also seen this error: java.lang.InternalError: Error (generic):
> {{java.lang.InternalError: Error (generic)}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.deflateBytesDirect(Native Method)}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.compress(ZStandardCompressor.java:220)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:76)}}
> {{at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)}}
> {{at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)}}
> {{at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:15}}
> {{...}}
> Note that the arguments `131072 519` are _always_ given to 
> `deflateBytesDirect` when things go wrong. In other cases the offset argument 
> is zero and the size argument is smaller, but not zero; e.g., 0 and 7772.
> As for some context: we're using the compression as part of writing sequence 
> files, with data serialised with Kryo, to Backblaze via the S3A file system / 
> S3 client in a map-reduce job on YARN. The job has no issues with smaller 
> values, but with larger ones this situation happens. I've seen very large 
> values being written successfully, but at some point this error is raised all 
> over the place (after a few larger values). Perhaps some buffer is filling up?
> Unfortunately, I'm developing on a Mac with an M1 processor, so reproducing 
> the issue locally is not straightforward. If I can somehow produce more leads 
> to investigate this, I'd be happy to.
> As an aside: we're considering working around this using the 
> hbase-compression-zstd module. This is an alternative compression codec that 
> uses the zstd-jni library without depending on hadoop native.
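
For completeness, the write path described in the original report above can be 
approximated without YARN or S3A. The sketch below is only an illustration of 
that setup (local placeholder path, a filled byte array standing in for the 
Kryo-serialised values), not the reporter's actual job:

{code:java}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.ZStandardCodec;

public class ZstdSequenceFileWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    ZStandardCodec codec = new ZStandardCodec();
    codec.setConf(conf);

    // Large-ish values standing in for the Kryo-serialised records mentioned in the report.
    byte[] largeValue = new byte[200_000];
    Arrays.fill(largeValue, (byte) 'a');

    // Block-compressed SequenceFile using the zstd codec, as in the original scenario.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("file:///tmp/zstd-write-sketch.seq")),  // placeholder path
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK, codec))) {
      for (long i = 0; i < 1000; i++) {
        writer.append(new LongWritable(i), new BytesWritable(largeValue));
      }
    }
  }
}
{code}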


