[
https://issues.apache.org/jira/browse/HADOOP-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042384#comment-18042384
]
Fengyu Cao commented on HADOOP-18799:
-------------------------------------
I found that when Hadoop decompresses a file that was compressed with the zstd
command-line tool, and the original uncompressed size is smaller than 129 KiB,
the following error is reproduced consistently:
java.lang.InternalError: Src size is incorrect
My scenario is Spark reading a Zstandard-compressed file through the Hadoop
native library; I assume this behaves the same as reading the file directly
through the Hadoop compression codec API (see the sketch after the stack trace
below).
The versions used are:
• Apache Spark 3.5.7 (Scala 2.12) with Hadoop 3.3.4
• libhadoop.so from Apache Hadoop 3.3.6
• libzstd 1.5.4
Reproduction steps:
1. yes a | head -n 65536 > file_128KiB.txt   # generate a 128 KiB file
2. zstd file_128KiB.txt
3. zstd -lv file_128KiB.txt.zst && zstdcat file_128KiB.txt.zst | head -n 3   # confirm the file is a valid zstd file
4. pyspark
5. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
6. Error: java.lang.InternalError: Src size is incorrect
>>> spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
25/12/03 11:05:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: Src size is incorrect
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:187)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.base/java.io.InputStream.read(InputStream.java:218)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:191)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:227)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:185)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:158)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:198)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
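For reference, this is roughly what I mean by reading the file directly through the API. It is a minimal, untested sketch: CompressionCodecFactory resolves the codec from the .zst extension, the same way LineRecordReader does, and the path is simply the test file from the steps above.
{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadZstdDirectly {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst");
    FileSystem fs = path.getFileSystem(conf);

    // Resolve the codec from the .zst extension, as LineRecordReader does.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

    try (InputStream in = codec.createInputStream(fs.open(path));
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      long lines = 0;
      while (reader.readLine() != null) {
        lines++;  // just drain the stream; the InternalError surfaces from read()
      }
      System.out.println("read " + lines + " lines");
    }
  }
}
{code}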
In addition, the following flow results in java.lang.InternalError: Restored data doesn't match checksum:
1a. yes a | head -n 66048 > file_129KiB.txt
1b. yes a | head -n 65536 > file_128KiB.txt
2a. zstd file_129KiB.txt
2b. zstd file_128KiB.txt
3a. zstd -lv file_129KiB.txt.zst && zstdcat file_129KiB.txt.zst | head -n 3
3b. zstd -lv file_128KiB.txt.zst && zstdcat file_128KiB.txt.zst | head -n 3
4. pyspark
5. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_129KiB.txt.zst").show()
6. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()   # no error
7. spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
8. Error: java.lang.InternalError: Restored data doesn't match checksum
>>> spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()
25/12/03 11:08:48 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.InternalError: Restored data doesn't match checksum
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:187)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.base/java.io.InputStream.read(InputStream.java:218)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:191)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:227)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:185)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:158)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:198)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
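The same 128 KiB file reads cleanly or fails depending on what was read before it, which makes me suspect state carried over in a reused decompressor rather than a problem with the file itself. That is only a guess, but the reuse pattern I have in mind is sketched below (untested; CodecPool.getDecompressor/returnDecompressor is the path LineRecordReader goes through):
{code:java}
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;

public class ReusePooledDecompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst");
    FileSystem fs = path.getFileSystem(conf);
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

    // Read the same file twice. On the second pass the pool normally hands
    // back the same Decompressor instance that the first pass returned,
    // which is the reuse I suspect is involved in the checksum error.
    for (int pass = 1; pass <= 2; pass++) {
      Decompressor decompressor = CodecPool.getDecompressor(codec);
      try (InputStream in = codec.createInputStream(fs.open(path), decompressor)) {
        byte[] buf = new byte[64 * 1024];
        while (in.read(buf) != -1) {
          // discard the decompressed data
        }
        System.out.println("pass " + pass + " finished");
      } finally {
        CodecPool.returnDecompressor(decompressor);
      }
    }
  }
}
{code}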
> Zstd compressor fails with src size is incorrect
> ------------------------------------------------
>
> Key: HADOOP-18799
> URL: https://issues.apache.org/jira/browse/HADOOP-18799
> Project: Hadoop Common
> Issue Type: Bug
> Components: native
> Affects Versions: 3.3.0
> Reporter: Frens Jan Rumph
> Priority: Major
>
> It seems like I've hit an issue similar to
> https://issues.apache.org/jira/browse/HADOOP-15822. I haven't been able to
> reproduce the issue though. I did manage to add a little bit of logging to
> org.apache.hadoop.io.compress.zstd.ZStandardCompressor. I've captured the off
> and len arguments of compress and the srcOffset and srcLen arguments for
> deflateBytesDirect:
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 0 131591}}
> {{compress 0 131591}}
> {{deflateBytesDirect 131072 519}}
> Just after that last line the process dies with a java.lang.InternalError: Src size is incorrect:
> {{org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.InternalError: Src size is incorrect}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.deflateBytesDirect(Native Method)}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.compress(ZStandardCompressor.java:220)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:76)}}
> {{at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)}}
> {{at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)}}
> {{at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1569)}}
> {{...}}
> I have also seen this error: java.lang.InternalError: Error (generic):
> {{java.lang.InternalError: Error (generic)}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.deflateBytesDirect(Native Method)}}
> {{at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.compress(ZStandardCompressor.java:220)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)}}
> {{at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:76)}}
> {{at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)}}
> {{at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)}}
> {{at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:15}}
> {{...}}
> Note that the arguments `131072 519` are _always_ the ones passed to
> `deflateBytesDirect` when things go wrong. In the other cases the offset
> argument is zero and the size argument is smaller, but not zero; e.g., 0 and
> 7772.
> As for some context: we're using the compression as part of writing sequence
> files, with data serialised with Kryo, to Backblaze using the S3A file system /
> S3 client in a map-reduce job on YARN. The job has no issues with smaller
> values, but for larger ones this situation happens. I've seen very large
> values being written successfully, but at some point this error is raised all
> over the place (after a few larger values). Perhaps some buffer is filling up?
> Unfortunately, I'm developing on a Mac with an M1 processor, so reproducing
> the issue locally is not a simple feat. If I can somehow produce more leads
> to investigate this, I'd be happy to.
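> In case it helps with reproducing, here is a stripped-down sketch of the kind of write that hits this for us: block-compressed SequenceFile output with the zstd codec and values larger than the 128 KiB buffer. The real job serialises the values with Kryo and writes through S3A; both are left out here, and the output path is just a placeholder.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.compress.zstd.ZStandardCodec;
>
> public class WriteLargeValues {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     ZStandardCodec codec = new ZStandardCodec();
>     codec.setConf(conf);
>
>     // A value comfortably larger than the 128 KiB (131072-byte) native
>     // buffer; the exact threshold at which things break is a guess.
>     byte[] value = new byte[131591];
>
>     try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
>         SequenceFile.Writer.file(new Path("/tmp/large-values.seq")),
>         SequenceFile.Writer.keyClass(Text.class),
>         SequenceFile.Writer.valueClass(BytesWritable.class),
>         SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec))) {
>       for (int i = 0; i < 1000; i++) {
>         writer.append(new Text("key-" + i), new BytesWritable(value));
>       }
>     }
>   }
> }
> {code}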
> As an aside: we're considering working around this using the
> hbase-compression-zstd module. This is an alternative compression codec that
> uses the zstd-jni library without depending on the Hadoop native library.
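> If we go that route, the job wiring would presumably look something like the sketch below. Note that I'm assuming the codec class is org.apache.hadoop.hbase.io.compress.zstd.ZstdCodec; I haven't verified that yet.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
> // Assumed class name from the hbase-compression-zstd module (not verified).
> import org.apache.hadoop.hbase.io.compress.zstd.ZstdCodec;
>
> public class ZstdJniWorkaround {
>   public static void main(String[] args) throws Exception {
>     Job job = Job.getInstance(new Configuration(), "seqfile-output-via-zstd-jni");
>
>     // Keep block-compressed SequenceFile output, but swap the codec class so
>     // compression goes through zstd-jni instead of the libhadoop binding.
>     FileOutputFormat.setCompressOutput(job, true);
>     FileOutputFormat.setOutputCompressorClass(job, ZstdCodec.class);
>     SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
>   }
> }
> {code}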