[ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830570#comment-15830570 ]
Jason Lowe commented on HADOOP-12990:
-------------------------------------

After having worked on the ZStandard codec in HADOOP-13578, I have a fresher perspective on this. One main problem with creating a separate codec for LZ4 compatibility is that the existing Hadoop LZ4 codec has claimed the standard '.lz4' extension. That means when users upload files into HDFS that were compressed with the standard LZ4 CLI tool, Hadoop will try to use the existing, broken LZ4 codec rather than any new one. Users would have to rename the files to some non-standard LZ4 extension to select the new codec, which is not ideal.

In hindsight, the Hadoop LZ4 codec really should have used the LZ4 streaming APIs rather than the one-shot (single-step) APIs. Then it wouldn't need the extra framing bytes that broke compatibility with the existing LZ4 CLI, and it wouldn't fail in odd ways when the decoder is handed data that was encoded with a larger buffer size. The streaming API solves all of those problems: it can decode with an arbitrary user-supplied buffer size and without the extra block-header hints that Hadoop added.

The cleanest solution from an end-user standpoint would be to have the existing LZ4 codec automatically detect the format when decoding, so that we keep a single codec that works with both the old (IMHO broken) format and the standard LZ4 format. I'm hoping there are some key signature bytes that LZ4 always places at the beginning of the compressed data stream so that we can automatically detect which format we're reading. If that is possible, it would be my preference for tackling this issue. If it isn't, the end-user story is much less compelling: two codecs with significant confusion over which one to use. However, there is one gotcha even if we can pull off the auto-detection approach: files generated on clusters with the updated LZ4 codec could not be decoded on clusters that only have the old codec.
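As a sketch of the auto-detection idea: the standard LZ4 frame format does begin with a fixed magic number, 0x184D2204, stored little-endian on disk, so a decoder could peek at the first four bytes and fall back to the legacy Hadoop block format when the magic is absent. The class and method names below are hypothetical, not part of any existing Hadoop API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

/** Hypothetical sketch: sniff whether a stream is standard LZ4 frame format. */
public class Lz4FormatSniffer {
  /** Magic number that begins every standard LZ4 frame (little-endian on disk). */
  static final int LZ4_FRAME_MAGIC = 0x184D2204;

  /**
   * Peeks at the first four bytes and pushes them back, so the real
   * decompressor still sees the stream from the beginning.
   */
  public static boolean isStandardLz4(PushbackInputStream in) throws IOException {
    byte[] hdr = new byte[4];
    int n = in.read(hdr, 0, 4);
    if (n > 0) {
      in.unread(hdr, 0, n); // leave the stream positioned at the start
    }
    if (n < 4) {
      return false; // too short to carry an LZ4 frame header
    }
    int magic = (hdr[0] & 0xff)
        | (hdr[1] & 0xff) << 8
        | (hdr[2] & 0xff) << 16
        | (hdr[3] & 0xff) << 24;
    return magic == LZ4_FRAME_MAGIC;
  }

  public static void main(String[] args) throws IOException {
    byte[] standard = {0x04, 0x22, 0x4D, 0x18}; // the magic as written on disk
    System.out.println(isStandardLz4(
        new PushbackInputStream(new ByteArrayInputStream(standard), 4)));
  }
}
```

Note this is only a heuristic: in the legacy Hadoop framing those same four bytes would be a big-endian uncompressed block length, so a collision would require a legacy stream declaring a single block of roughly 69 MB, which seems unlikely in practice but is not impossible.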
If decoding on old-codec clusters has to be supported, then we have no choice but to develop a new codec and make users live with the non-standard LZ4 file extensions used by the new codec. .lz4 files uploaded to Hadoop would continue to fail as they do today until renamed to the non-standard extension.

> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>
>                 Key: HADOOP-12990
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12990
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io, native
>    Affects Versions: 2.6.0
>            Reporter: John Zhuge
>            Priority: Minor
>
> {{hdfs dfs -text}} hit an exception when trying to view a compressed file created by the Linux lz4 tool.
> The Hadoop version has HADOOP-11184 "update lz4 to r123", thus it is using the LZ4 library from release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr 1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}
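For what it's worth, the OutOfMemoryError in the trace above is consistent with the framing mismatch: BlockDecompressorStream reads 4-byte big-endian length fields before each block, while a standard LZ4 file begins with the frame magic 0x184D2204 (bytes 04 22 4D 18 on disk), so frame header bytes get misinterpreted as block lengths and drive huge buffer allocations. A sketch of the misread (the class name is hypothetical, and which misread byte sequence triggers the failing allocation depends on the frame descriptor that follows the magic):

```java
/** Hypothetical demo: how LZ4 frame header bytes look to Hadoop's block framing. */
public class Lz4MisreadDemo {
  /** Interprets four bytes as a big-endian int, the way Hadoop reads block lengths. */
  public static int asBigEndianLength(byte[] b) {
    return (b[0] & 0xff) << 24 | (b[1] & 0xff) << 16 | (b[2] & 0xff) << 8 | (b[3] & 0xff);
  }

  public static void main(String[] args) {
    // First four bytes of a standard LZ4 frame as written on disk
    // (little-endian encoding of the magic number 0x184D2204).
    byte[] frameMagic = {0x04, 0x22, 0x4D, 0x18};
    // Misread as a block length: roughly a 69 MB "block".
    System.out.println(asBigEndianLength(frameMagic)); // prints 69356824
  }
}
```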