[ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830570#comment-15830570 ]
Jason Lowe commented on HADOOP-12990:
-------------------------------------

After having worked on the ZStandard codec in HADOOP-13578, I have a fresher perspective on this. One main problem with creating a separate codec for LZ4 compatibility is that the existing Hadoop LZ4 codec has claimed the standard '.lz4' extension. That means when users upload files into HDFS that were compressed with the standard LZ4 CLI tool, Hadoop will try to use the existing, broken LZ4 codec rather than any new one. Users would have to rename the files to some non-standard LZ4 extension to select the new codec, which is not ideal.

In hindsight, the Hadoop LZ4 codec really should have used the LZ4 streaming APIs rather than the one-shot (single-step) APIs. Then it wouldn't need the extra framing bytes that broke compatibility with the existing LZ4 CLI, and it wouldn't fail in odd ways when the decoder is handed data that was encoded with a larger buffer size. The streaming API solves all of those problems: it can decode with an arbitrary user-supplied buffer size and without the extra block-header hints that Hadoop added.

The cleanest solution from an end-user standpoint would be to have the existing LZ4 codec automatically detect the format when decoding, so that we keep a single codec that works with both the old (IMHO broken) format and the standard LZ4 format. I'm hoping there are some key signature bytes that LZ4 always places at the beginning of the compressed data stream so that we can automatically detect which format we're reading. If that is possible, it would be my preference for tackling this issue. If it isn't, the end-user story is much less compelling: two codecs with significant confusion over which one to use. However, there is one gotcha even if we can pull off the auto-detection approach: files generated on clusters with the updated LZ4 codec could not be decoded on clusters that only have the old codec.
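As a sketch of the auto-detection idea: the standard LZ4 frame format does begin with a fixed magic number, 0x184D2204, stored little-endian on disk, so a decoder could peek at the first four bytes and fall back to the legacy Hadoop block format when the magic is absent. The class and method names below are hypothetical, not part of any existing Hadoop API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

/** Hypothetical sketch: sniff whether a stream is standard LZ4 frame format. */
public class Lz4FormatSniffer {
  /** Magic number that begins every standard LZ4 frame (little-endian on disk). */
  static final int LZ4_FRAME_MAGIC = 0x184D2204;

  /**
   * Peeks at the first four bytes and pushes them back, so the real
   * decompressor still sees the stream from the beginning.
   */
  public static boolean isStandardLz4(PushbackInputStream in) throws IOException {
    byte[] hdr = new byte[4];
    int n = in.read(hdr, 0, 4);
    if (n > 0) {
      in.unread(hdr, 0, n); // leave the stream positioned at the start
    }
    if (n < 4) {
      return false; // too short to carry an LZ4 frame header
    }
    int magic = (hdr[0] & 0xff)
        | (hdr[1] & 0xff) << 8
        | (hdr[2] & 0xff) << 16
        | (hdr[3] & 0xff) << 24;
    return magic == LZ4_FRAME_MAGIC;
  }

  public static void main(String[] args) throws IOException {
    byte[] standard = {0x04, 0x22, 0x4D, 0x18}; // the magic as written on disk
    System.out.println(isStandardLz4(
        new PushbackInputStream(new ByteArrayInputStream(standard), 4)));
  }
}
```

Note this is only a heuristic: in the legacy Hadoop framing those same four bytes would be a big-endian uncompressed block length, so a collision would require a legacy stream declaring a single block of roughly 69 MB, which seems unlikely in practice but is not impossible.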
If decoding on old-codec clusters has to be supported, then we have no choice but to develop a new codec and make users live with the non-standard LZ4 file extensions used by the new codec. .lz4 files uploaded to Hadoop would continue to fail as they do today until renamed to the non-standard extension.

> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>
>                 Key: HADOOP-12990
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12990
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io, native
>    Affects Versions: 2.6.0
>            Reporter: John Zhuge
>            Priority: Minor
>
> {{hdfs dfs -text}} hit an exception when trying to view a compressed file created by the Linux lz4 tool.
> The Hadoop version has HADOOP-11184 "update lz4 to r123", thus it is using the LZ4 library from release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr 1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}
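For what it's worth, the OutOfMemoryError in the trace above is consistent with the framing mismatch: BlockDecompressorStream reads 4-byte big-endian length fields before each block, while a standard LZ4 file begins with the frame magic 0x184D2204 (bytes 04 22 4D 18 on disk), so frame header bytes get misinterpreted as block lengths and drive huge buffer allocations. A sketch of the misread (the class name is hypothetical, and which misread byte sequence triggers the failing allocation depends on the frame descriptor that follows the magic):

```java
/** Hypothetical demo: how LZ4 frame header bytes look to Hadoop's block framing. */
public class Lz4MisreadDemo {
  /** Interprets four bytes as a big-endian int, the way Hadoop reads block lengths. */
  public static int asBigEndianLength(byte[] b) {
    return (b[0] & 0xff) << 24 | (b[1] & 0xff) << 16 | (b[2] & 0xff) << 8 | (b[3] & 0xff);
  }

  public static void main(String[] args) {
    // First four bytes of a standard LZ4 frame as written on disk
    // (little-endian encoding of the magic number 0x184D2204).
    byte[] frameMagic = {0x04, 0x22, 0x4D, 0x18};
    // Misread as a block length: roughly a 69 MB "block".
    System.out.println(asBigEndianLength(frameMagic)); // prints 69356824
  }
}
```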