[ 
https://issues.apache.org/jira/browse/HADOOP-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711831#comment-15711831
 ] 

Steve Loughran commented on HADOOP-13849:
-----------------------------------------

Well, if you want to work on it, feel free. 

However, note that the native codec uses the standard {{libbz2}}; there's not 
much that can be done in the Hadoop code to speed that up, other than improving 
how data is moved between the Java memory structures and those of libbz2. If 
memory copies are taking place, that could be hurting performance. Anything 
that can help there would be good.


bq. I think the "system native" should have better compress/decompress 
performance than "java builtin".

That's something to explore. The latest Java 8 JIT compilers are fast, and if 
the algorithms aren't doing lots of object creation, then bit operations in 
Java should be on a par with C-language operations on general-purpose 
registers. Where you would expect differences is if the native code uses 
special CPU registers and instructions (for example, Intel SSE2) for a 
significant speedup. I don't know if bzip2 does that.

The fun part in benchmarking is isolating things. For codec performance, maybe 
pre-generate some test data and cache it in RAM, in standard formats (Avro, 
ORC), then run it through the different codecs, compressing to RAM rather than 
HDD, so that the compression code is isolated from disk IO and everything else.
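
As a rough illustration of that kind of isolation, here is a minimal sketch. 
It is not Hadoop code: the class name, the {{StreamFactory}} hook and the word 
list are all made up for the example, and because the JDK has no bzip2 
implementation it uses {{GZIPOutputStream}} and {{DeflaterOutputStream}} as 
stand-ins. With Hadoop on the classpath, the factory could presumably wrap 
{{CompressionCodec.createOutputStream(OutputStream)}} on a {{BZip2Codec}} 
configured for each library setting instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

final class InMemoryCodecBench {

    /** Hypothetical hook: wraps a raw sink in a compressing stream. */
    interface StreamFactory {
        OutputStream wrap(OutputStream sink) throws IOException;
    }

    /** Builds repetitive pseudo-text in memory; a fixed seed keeps runs comparable. */
    static byte[] generateText(int size, long seed) {
        String[] words = {"hadoop", "codec", "bzip2", "native", "java", "block", "stream", "\n"};
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder(size + 16);
        while (sb.length() < size) {
            sb.append(words[rnd.nextInt(words.length)]).append(' ');
        }
        return sb.substring(0, size).getBytes(StandardCharsets.US_ASCII);
    }

    /** Compresses the input into a heap buffer, so no disk or network IO is timed. */
    static byte[] compress(StreamFactory factory, byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(input.length / 2 + 64);
        try (OutputStream out = factory.wrap(sink)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Returns the best (minimum) wall-clock time in nanoseconds over the measured runs. */
    static long bestNanos(StreamFactory factory, byte[] input, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            compress(factory, input); // warm-up passes so the JIT has compiled the hot path
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            compress(factory, input);
            best = Math.min(best, System.nanoTime() - t0);
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] data = generateText(8 << 20, 42L); // 8 MiB, generated once and held in RAM
        StreamFactory gzip = GZIPOutputStream::new;
        StreamFactory deflate = DeflaterOutputStream::new;
        System.out.printf("gzip:    %d ms%n", bestNanos(gzip, data, 3, 5) / 1_000_000);
        System.out.printf("deflate: %d ms%n", bestNanos(deflate, data, 3, 5) / 1_000_000);
    }
}
```

Taking the minimum over a few runs after a warm-up is only a crude way of 
reducing JIT and GC noise; a proper harness such as JMH would be the more 
rigorous choice for real numbers.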

If the isolated native code is faster than the Java one, yet the full job 
takes the same time either way, then the implication is that the bottleneck is 
elsewhere in the workflow, not the codec. Again: that's interesting 
information.

bq. My hardware CPU/Memory/Network bandwidh/Disk bandwidh are not bottleneck

One of them is. Always. It can be things like CPU cache latencies or excess 
synchronization in the code; even branch misprediction in the CPU can hurt 
efficiency. FWIW, flame graphs are currently the tool of choice for 
visualising performance during microbenchmarks.





> Bzip2 java-builtin and system-native have almost the same compress speed
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-13849
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13849
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 2.6.0
>         Environment: os version: redhat6
> hadoop version: 2.6.0
> native bzip2 version: bzip2-devel-1.0.5-7.el6_0.x86_64
>            Reporter: Tao Li
>
> I tested bzip2 java-builtin and system-native compression, and I found the 
> compress speed is almost the same. (I think the system-native should have 
> better compress speed than java-builtin)
> My test case:
> 1. input file: 2.7GB text file without compression
> 2. after bzip2 java-builtin compress: 457MB, 12min 4sec
> 3. after bzip2 system-native compress: 457MB, 12min 19sec
> My MapReduce Config:
> conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
> conf.set("mapreduce.output.fileoutputformat.compress", "true");
> conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
> conf.set("mapreduce.output.fileoutputformat.compress.codec", 
> "org.apache.hadoop.io.compress.BZip2Codec");
> conf.set("io.compression.codec.bzip2.library", "java-builtin"); // for 
> java-builtin
> conf.set("io.compression.codec.bzip2.library", "system-native"); // for 
> system-native
> And I am sure I have enable the bzip2 native, the output of command "hadoop 
> checknative -a" is as follows:
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:    true /lib64/libz.so.1
> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
> lz4:     true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: true /usr/lib64/libcrypto.so


