[ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289809#comment-15289809 ]
Jason Lowe commented on HADOOP-12990:
-------------------------------------
I agree with [~qwertymaniac] that the right way to fix this is to add a
separate codec intended to be compatible with the LZ4 CLI tools. I don't think
it would be too hard to accomplish. We'd need to avoid using
BlockCompressorStream/BlockDecompressorStream, since those write the
big-endian block lengths that will totally confuse the CLI tools. Instead we
need to implement the same little-endian block-length logic used by the CLI
tool, along with the 4 bytes of zeros for the end marker. It seems to simply
be a 4-byte little-endian length per block, but sometimes the high bit can be
set? I haven't looked at the lz4 CLI code to see what the exact logic is, but
a rough sketch of the decode side is below.
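To make that concrete, here is a minimal, untested sketch of the block-header
parsing, assuming the layout described above (a 4-byte little-endian block
size, a high bit that flags a block stored uncompressed, and an all-zero size
as the end mark). The class and method names are made up for illustration;
none of this is an existing Hadoop API.
{code}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical reader for the per-block headers the lz4 CLI writes.
public class Lz4CliBlockHeader {
  // Assumption: the high bit of the block size flags an uncompressed block.
  private static final int UNCOMPRESSED_FLAG = 0x80000000;

  /**
   * Reads one 4-byte little-endian block header.
   * Returns -1 on the all-zero end mark; otherwise the block length,
   * with uncompressed[0] set from the high bit.
   */
  static int readBlockHeader(InputStream in, boolean[] uncompressed)
      throws IOException {
    int size = readLittleEndianInt(in);
    if (size == 0) {
      return -1;                               // end mark: 4 bytes of zeros
    }
    uncompressed[0] = (size & UNCOMPRESSED_FLAG) != 0;
    return size & ~UNCOMPRESSED_FLAG;          // strip the flag bit
  }

  static int readLittleEndianInt(InputStream in) throws IOException {
    int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
    if ((b0 | b1 | b2 | b3) < 0) {
      throw new EOFException("truncated block header");
    }
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
  }
}
{code}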
Bonus points for including the xxhash code so Hadoop can generate and validate
checksums.
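For reference, if I'm reading the frame spec right, the content checksum is
just xxHash32 over the uncompressed data with seed 0. The lz4-java bindings
already expose that hash; Hadoop uses its own native bindings, so the snippet
below is purely illustrative:
{code}
import net.jpountz.xxhash.XXHash32;
import net.jpountz.xxhash.XXHashFactory;

// Illustrative only: net.jpountz (lz4-java) is not a Hadoop dependency.
public class Lz4ContentChecksum {
  // The LZ4 frame format's content checksum: xxHash32 with seed 0 over
  // the original (uncompressed) data.
  public static int contentChecksum(byte[] uncompressed) {
    XXHash32 xxh32 = XXHashFactory.fastestInstance().hash32();
    return xxh32.hash(uncompressed, 0, uncompressed.length, 0 /* seed */);
  }
}
{code}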
> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>
> Key: HADOOP-12990
> URL: https://issues.apache.org/jira/browse/HADOOP-12990
> Project: Hadoop Common
> Issue Type: Bug
> Components: io, native
> Affects Versions: 2.6.0
> Reporter: John Zhuge
> Priority: Minor
>
> {{hdfs dfs -text}} hits an exception when trying to view a compressed file
> created by the Linux lz4 tool.
> The Hadoop version includes HADOOP-11184 "update lz4 to r123", so it is
> using the LZ4 library from release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr 1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
> at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
> at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
> at java.io.InputStream.read(InputStream.java:101)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
> at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
> at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
> at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
> at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
> at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
> at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
> at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
> at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
> at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}