xuzq created HDFS-14099:
---------------------------

             Summary: Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor
                 Key: HDFS-14099
                 URL: https://issues.apache.org/jira/browse/HDFS-14099
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.0.3
         Environment: Hadoop Version: hadoop-3.0.3
Java Version: 1.8.0_144
            Reporter: xuzq


I need to use zstd compression in Hadoop, so I wrote a simple demo like this:

{code:java}
while ((size = fsDataInputStream.read(bufferV2)) > 0) {
  countSize += size;
  if (countSize == 65536 * 8) {
    if (!isFinished) {
      // finish a frame in zstd
      cmpOut.finish();
      isFinished = true;
    }
    fsDataOutputStream.flush();
    fsDataOutputStream.hflush();
  }
  if (isFinished) {
    LOG.info("Will resetState. N=" + n);
    // reset the stream and write again
    cmpOut.resetState();
    isFinished = false;
  }
  cmpOut.write(bufferV2, 0, size);
  bufferV2 = new byte[5 * 1024 * 1024];
  n++;
}
{code}

Then I read the file back with *hadoop fs -text*, which failed with the error below:

{code:java}
java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.
	at org.apache.hadoop.io.compress.ZStandardCodec.checkNativeCodeLoaded(ZStandardCodec.java:65)
	at org.apache.hadoop.io.compress.ZStandardCodec.getDecompressorType(ZStandardCodec.java:211)
	at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
	at org.apache.hadoop.io.compress.CompressionCodec$Util.createInputStreamWithCodecPool(CompressionCodec.java:157)
	at org.apache.hadoop.io.compress.ZStandardCodec.createInputStream(ZStandardCodec.java:182)
	at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:157)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
{code}

So I looked into the code, including the JNI layer, and found this bug: *ZSTD_initDStream(stream)* may be called twice for the same frame.

The first call is in *ZStandardDecompressor.c*:

{code:c}
if (size == 0) {
    (*env)->SetBooleanField(env, this, ZStandardDecompressor_finished, JNI_TRUE);
    size_t result = dlsym_ZSTD_initDStream(stream);
    if (dlsym_ZSTD_isError(result)) {
        THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(result));
        return (jint) 0;
    }
}
{code}

This call is correct in itself, but *finished* is never set back to false, even if there is still data in *compressedBuffer* that needs to be decompressed.

The second call happens in *org.apache.hadoop.io.compress.DecompressorStream* via *decompressor.reset()*, because *finished* stays true after a frame has been decompressed:

{code:java}
if (decompressor.finished()) {
  // First see if there was any leftover buffered input from previous
  // stream; if not, attempt to refill buffer.  If refill -> EOF, we're
  // all done; else reset, fix up input buffer, and get ready for next
  // concatenated substream/"member".
  int nRemaining = decompressor.getRemaining();
  if (nRemaining == 0) {
    int m = getCompressedData();
    if (m == -1) {
      // apparently the previous end-of-stream was also end-of-file:
      // return success, as if we had never called getCompressedData()
      eof = true;
      return -1;
    }
    decompressor.reset();
    decompressor.setInput(buffer, 0, m);
    lastBytesSent = m;
  } else {
    // looks like it's a concatenated stream:  reset low-level zlib (or
    // other engine) and buffers, then "resend" remaining input data
    decompressor.reset();
    int leftoverOffset = lastBytesSent - nRemaining;
    assert (leftoverOffset >= 0);
    // this recopies userBuf -> direct buffer if using native libraries:
    decompressor.setInput(buffer, leftoverOffset, nRemaining);
    // NOTE: this is the one place we do NOT want to save the number
    // of bytes sent (nRemaining here) into lastBytesSent: since we
    // are resending what we've already sent before, offset is nonzero
    // in general (only way it could be zero is if it already equals
    // nRemaining), which would then screw up the offset calculation
    // _next_ time around.  IOW, getRemaining() is in terms of the
    // original, zero-offset bufferload, so lastBytesSent must be as
    // well.  Cheesy ASCII art:
    //
    //          <------------ m, lastBytesSent ----------->
    //          +===============================================+
    // buffer:  |1111111111|22222222222222222|333333333333|     |
    //          +===============================================+
    //     #1:  <-- off -->|<-------- nRemaining --------->
    //     #2:  <----------- off ----------->|<-- nRem. -->
    //     #3:  (final substream: nRemaining == 0; eof = true)
    //
    // If lastBytesSent is anything other than m, as shown, then "off"
    // will be calculated incorrectly.
  }
}
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
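The offset bookkeeping in the comment above is easy to check in isolation. Below is a minimal, Hadoop-free sketch of the same arithmetic (class and method names are mine, not Hadoop's): DecompressorStream resends the unconsumed tail of the last bufferload starting at offset lastBytesSent - nRemaining, which is only correct while lastBytesSent keeps referring to the original, zero-offset bufferload.

```java
public class LeftoverOffsetDemo {
    // Same arithmetic as DecompressorStream's concatenated-stream branch:
    // the decompressor reports how many bytes of the last bufferload it has
    // not consumed, and the stream resends exactly that tail at this offset.
    static int leftoverOffset(int lastBytesSent, int nRemaining) {
        if (nRemaining > lastBytesSent) {
            throw new IllegalArgumentException(
                "cannot have more bytes remaining than were sent");
        }
        return lastBytesSent - nRemaining;
    }

    public static void main(String[] args) {
        int m = 40; // bytes in the original bufferload (lastBytesSent)

        // Substream #1 of the ASCII art: 30 bytes still unconsumed,
        // so the resend starts at offset 40 - 30 = 10.
        System.out.println(leftoverOffset(m, 30)); // 10

        // Substream #2: had lastBytesSent been overwritten with the 30 bytes
        // resent above, this offset would come out wrong; measured against
        // the original bufferload it is 40 - 12 = 28.
        System.out.println(leftoverOffset(m, 12)); // 28
    }
}
```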
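For comparison, the concatenated-"member" behavior that this DecompressorStream logic is supposed to provide can be observed with the JDK's gzip classes, which need no native library: each finished gzip member plays the role of one finished zstd frame, and the decoder carries on past the member boundary instead of stopping (or, as in this bug, failing on the next frame). This is an analogy only, a sketch using gzip instead of zstd; the class name is mine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatenatedFramesDemo {
    // Compress one payload into a complete, self-contained gzip member
    // (the analogue of one finished zstd frame).
    static byte[] member(String payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress a byte stream containing several concatenated members.
    static String readAll(byte[] concatenated) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(concatenated))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream cat = new ByteArrayOutputStream();
        cat.write(member("frame-one "));
        cat.write(member("frame-two"));
        // The JDK decoder notices the next member header after each trailer
        // and keeps decoding -- the behavior DecompressorStream tries to
        // provide for zstd via decompressor.reset().
        System.out.println(readAll(cat.toByteArray()));
    }
}
```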