xuzq created HDFS-14099:
---------------------------

             Summary: Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor
                 Key: HDFS-14099
                 URL: https://issues.apache.org/jira/browse/HDFS-14099
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.0.3
         Environment: Hadoop Version: hadoop-3.0.3
Java Version: 1.8.0_144
            Reporter: xuzq


I need to use zstd compression in Hadoop, so I wrote a simple demo like this:

{code:java}
while ((size = fsDataInputStream.read(bufferV2)) > 0) {
  countSize += size;
  if (countSize == 65536 * 8) {
    if (!isFinished) {
      // finish a frame in zstd
      cmpOut.finish();
      isFinished = true;
    }
    fsDataOutputStream.flush();
    fsDataOutputStream.hflush();
  }
  if (isFinished) {
    LOG.info("Will resetState. N=" + n);
    // reset the stream and write again
    cmpOut.resetState();
    isFinished = false;
  }
  cmpOut.write(bufferV2, 0, size);
  bufferV2 = new byte[5 * 1024 * 1024];
  n++;
}
{code}

Then I read the file back with *hadoop fs -text*, which failed with the error below:

{code:java}
java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.
	at org.apache.hadoop.io.compress.ZStandardCodec.checkNativeCodeLoaded(ZStandardCodec.java:65)
	at org.apache.hadoop.io.compress.ZStandardCodec.getDecompressorType(ZStandardCodec.java:211)
	at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
	at org.apache.hadoop.io.compress.CompressionCodec$Util.createInputStreamWithCodecPool(CompressionCodec.java:157)
	at org.apache.hadoop.io.compress.ZStandardCodec.createInputStream(ZStandardCodec.java:182)
	at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:157)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
{code}

So I looked into the code, including the JNI layer, and found this bug: *ZSTD_initDStream(stream)* may be called twice for the same frame.

The first call is in *ZStandardDecompressor.c*:

{code:c}
if (size == 0) {
    (*env)->SetBooleanField(env, this, ZStandardDecompressor_finished, JNI_TRUE);
    size_t result = dlsym_ZSTD_initDStream(stream);
    if (dlsym_ZSTD_isError(result)) {
        THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(result));
        return (jint) 0;
    }
}
{code}

This call is correct in itself, but *finished* is never set back to false, even if there is still data in *compressedBuffer* that needs to be decompressed.

The second call happens in *org.apache.hadoop.io.compress.DecompressorStream* via *decompressor.reset()*, because *finished* stays true after a frame has been decompressed:

{code:java}
if (decompressor.finished()) {
  // First see if there was any leftover buffered input from previous
  // stream; if not, attempt to refill buffer.  If refill -> EOF, we're
  // all done; else reset, fix up input buffer, and get ready for next
  // concatenated substream/"member".
  int nRemaining = decompressor.getRemaining();
  if (nRemaining == 0) {
    int m = getCompressedData();
    if (m == -1) {
      // apparently the previous end-of-stream was also end-of-file:
      // return success, as if we had never called getCompressedData()
      eof = true;
      return -1;
    }
    decompressor.reset();
    decompressor.setInput(buffer, 0, m);
    lastBytesSent = m;
  } else {
    // looks like it's a concatenated stream:  reset low-level zlib (or
    // other engine) and buffers, then "resend" remaining input data
    decompressor.reset();
    int leftoverOffset = lastBytesSent - nRemaining;
    assert (leftoverOffset >= 0);
    // this recopies userBuf -> direct buffer if using native libraries:
    decompressor.setInput(buffer, leftoverOffset, nRemaining);
    // NOTE: this is the one place we do NOT want to save the number
    // of bytes sent (nRemaining here) into lastBytesSent: since we
    // are resending what we've already sent before, offset is nonzero
    // in general (only way it could be zero is if it already equals
    // nRemaining), which would then screw up the offset calculation
    // _next_ time around.  IOW, getRemaining() is in terms of the
    // original, zero-offset bufferload, so lastBytesSent must be as
    // well.  Cheesy ASCII art:
    //
    //          <------------ m, lastBytesSent ----------->
    //          +===============================================+
    // buffer:  |1111111111|22222222222222222|333333333333|     |
    //          +===============================================+
    //     #1:  <-- off -->|<-------- nRemaining --------->
    //     #2:  <----------- off ----------->|<-- nRem. -->
    //     #3:  (final substream: nRemaining == 0; eof = true)
    //
    // If lastBytesSent is anything other than m, as shown, then "off"
    // will be calculated incorrectly.
  }
}
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
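The offset bookkeeping in the comment above is easy to check in isolation. Below is a minimal, Hadoop-free sketch of the same arithmetic (class and method names are mine, not Hadoop's): DecompressorStream resends the unconsumed tail of the last bufferload starting at offset lastBytesSent - nRemaining, which is only correct while lastBytesSent keeps referring to the original, zero-offset bufferload.

```java
public class LeftoverOffsetDemo {
    // Same arithmetic as DecompressorStream's concatenated-stream branch:
    // the decompressor reports how many bytes of the last bufferload it has
    // not consumed, and the stream resends exactly that tail at this offset.
    static int leftoverOffset(int lastBytesSent, int nRemaining) {
        if (nRemaining > lastBytesSent) {
            throw new IllegalArgumentException(
                "cannot have more bytes remaining than were sent");
        }
        return lastBytesSent - nRemaining;
    }

    public static void main(String[] args) {
        int m = 40; // bytes in the original bufferload (lastBytesSent)

        // Substream #1 of the ASCII art: 30 bytes still unconsumed,
        // so the resend starts at offset 40 - 30 = 10.
        System.out.println(leftoverOffset(m, 30)); // 10

        // Substream #2: had lastBytesSent been overwritten with the 30 bytes
        // resent above, this offset would come out wrong; measured against
        // the original bufferload it is 40 - 12 = 28.
        System.out.println(leftoverOffset(m, 12)); // 28
    }
}
```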
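For comparison, the concatenated-"member" behavior that this DecompressorStream logic is supposed to provide can be observed with the JDK's gzip classes, which need no native library: each finished gzip member plays the role of one finished zstd frame, and the decoder carries on past the member boundary instead of stopping (or, as in this bug, failing on the next frame). This is an analogy only, a sketch using gzip instead of zstd; the class name is mine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatenatedFramesDemo {
    // Compress one payload into a complete, self-contained gzip member
    // (the analogue of one finished zstd frame).
    static byte[] member(String payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress a byte stream containing several concatenated members.
    static String readAll(byte[] concatenated) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(concatenated))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream cat = new ByteArrayOutputStream();
        cat.write(member("frame-one "));
        cat.write(member("frame-two"));
        // The JDK decoder notices the next member header after each trailer
        // and keeps decoding -- the behavior DecompressorStream tries to
        // provide for zstd via decompressor.reset().
        System.out.println(readAll(cat.toByteArray()));
    }
}
```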