[
https://issues.apache.org/jira/browse/KAFKA-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126346#comment-15126346
]
Ismael Juma edited comment on KAFKA-3174 at 2/1/16 3:02 PM:
------------------------------------------------------------
[~becket_qin] We started recommending Java 8 around the same time we
released 0.9.0.0 (we also mention there that LinkedIn is using Java 8):
http://kafka.apache.org/documentation.html#java
I did some investigation to understand the specifics of the CRC32
improvement in the JDK. It relies on SSE2, SSE 4.1, AVX and CLMUL. SSE has been
available for a long time, CLMUL since Intel Westmere (2010) and AVX since
Intel Sandy Bridge (2011), so it's probably OK to assume that these instructions
will be available for those who are constrained by CPU performance.
Note that this does not use the CRC32 CPU instruction, as we would have to use
CRC32C for that (see KAFKA-1449 for more details on what is possible if we are
willing to support CRC32C).
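As an aside, the two are different algorithms with different polynomials, and the easiest way to tell them apart is the standard check value for the ASCII string "123456789". A small sketch (note: java.util.zip only gained a CRC32C class in Java 9, so this assumes a newer JDK than the ones benchmarked here):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C; // only available since Java 9

public class CrcCheckValues {
    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);

        // CRC-32 (polynomial 0x04C11DB7) -- what java.util.zip.CRC32 and
        // Kafka's Crc32 compute.
        CRC32 crc32 = new CRC32();
        crc32.update(data, 0, data.length);
        System.out.println(Long.toHexString(crc32.getValue())); // cbf43926

        // CRC-32C / Castagnoli (polynomial 0x1EDC6F41) -- what the SSE 4.2
        // crc32 instruction computes.
        CRC32C crc32c = new CRC32C();
        crc32c.update(data, 0, data.length);
        System.out.println(Long.toHexString(crc32c.getValue())); // e3069283
    }
}
```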
I wrote a simple JMH benchmark:
https://gist.github.com/ijuma/f86ad935715cfd4e258e
I tested it on my Ivy Bridge MacBook with JDK 7 update 80 and JDK 8 update 76,
configuring JMH to use 10 one-second measurement iterations, 10 one-second
warmup iterations and 1 fork.
JDK 8 update 76 results:
{code}
[info] Benchmark               (bytesSize)  Mode  Cnt       Score       Error  Units
[info] Crc32Bench.jdkCrc32               8  avgt   10      24.902 ±     0.728  ns/op
[info] Crc32Bench.jdkCrc32              16  avgt   10      48.819 ±     2.550  ns/op
[info] Crc32Bench.jdkCrc32              32  avgt   10      83.434 ±     2.668  ns/op
[info] Crc32Bench.jdkCrc32             128  avgt   10     127.679 ±     5.185  ns/op
[info] Crc32Bench.jdkCrc32            1024  avgt   10     450.105 ±    18.943  ns/op
[info] Crc32Bench.jdkCrc32           65536  avgt   10   25579.406 ±   683.017  ns/op
[info] Crc32Bench.jdkCrc32         1048576  avgt   10  408708.242 ± 12183.543  ns/op
[info] Crc32Bench.kafkaCrc32             8  avgt   10      14.761 ±     0.647  ns/op
[info] Crc32Bench.kafkaCrc32            16  avgt   10      19.114 ±     0.423  ns/op
[info] Crc32Bench.kafkaCrc32            32  avgt   10      34.243 ±     1.066  ns/op
[info] Crc32Bench.kafkaCrc32           128  avgt   10     114.481 ±     2.812  ns/op
[info] Crc32Bench.kafkaCrc32          1024  avgt   10     835.630 ±    22.412  ns/op
[info] Crc32Bench.kafkaCrc32         65536  avgt   10   52234.713 ±  2229.624  ns/op
[info] Crc32Bench.kafkaCrc32       1048576  avgt   10  822903.613 ± 20950.560  ns/op
{code}
JDK 7 update 80 results:
{code}
[info] Benchmark               (bytesSize)  Mode  Cnt       Score       Error  Units
[info] Crc32Bench.jdkCrc32               8  avgt   10     114.802 ±     8.289  ns/op
[info] Crc32Bench.jdkCrc32              16  avgt   10     122.030 ±     3.153  ns/op
[info] Crc32Bench.jdkCrc32              32  avgt   10     131.082 ±     5.501  ns/op
[info] Crc32Bench.jdkCrc32             128  avgt   10     154.116 ±     6.164  ns/op
[info] Crc32Bench.jdkCrc32            1024  avgt   10     512.151 ±    15.934  ns/op
[info] Crc32Bench.jdkCrc32           65536  avgt   10   25460.014 ±  1532.627  ns/op
[info] Crc32Bench.jdkCrc32         1048576  avgt   10  401996.290 ± 18606.012  ns/op
[info] Crc32Bench.kafkaCrc32             8  avgt   10      14.493 ±     0.494  ns/op
[info] Crc32Bench.kafkaCrc32            16  avgt   10      20.329 ±     2.019  ns/op
[info] Crc32Bench.kafkaCrc32            32  avgt   10      37.706 ±     0.338  ns/op
[info] Crc32Bench.kafkaCrc32           128  avgt   10     124.197 ±     6.368  ns/op
[info] Crc32Bench.kafkaCrc32          1024  avgt   10     908.327 ±    32.487  ns/op
[info] Crc32Bench.kafkaCrc32         65536  avgt   10   57000.705 ±  2976.852  ns/op
[info] Crc32Bench.kafkaCrc32       1048576  avgt   10  940433.528 ± 26257.962  ns/op
{code}
Using a VM intrinsic avoids JNI set-up costs, making JDK 8 much faster than JDK
7 for small byte arrays. Having said that, Kafka's pure-Java implementation is
still faster for byte arrays of up to 128 bytes according to this benchmark.
Surprisingly, the JDK results are similar on JDK 7 and JDK 8 for larger byte
arrays. I had a quick look at the assembly generated by JDK 8 and it seems to
use AVX and CLMUL as per the OpenJDK commit I linked to. Unfortunately, it's a
bit more work to inspect the assembly generated by JDK 7 on a Mac, so I
didn't. More investigation would be required to understand why this is so (and
to be able to trust the numbers).
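For reference, Kafka's Crc32 is a table-driven pure-Java implementation (it uses slicing-by-8; the single-table version below is a simplified stand-in, not the actual class), and any correct implementation must agree with java.util.zip.CRC32 on every input, so the choice between them is purely a performance question:

```java
import java.util.zip.CRC32;

public class PureJavaCrc32Demo {
    // Simple table-driven CRC-32 (reflected polynomial 0xEDB88320).
    // Kafka's real Crc32 uses eight tables (slicing-by-8) for speed.
    private static final int[] TABLE = new int[256];
    static {
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++)
                c = (c >>> 1) ^ ((c & 1) != 0 ? 0xEDB88320 : 0);
            TABLE[n] = c;
        }
    }

    static long crc32(byte[] buf) {
        int c = 0xFFFFFFFF; // initial value
        for (byte b : buf)
            c = (c >>> 8) ^ TABLE[(c ^ b) & 0xFF];
        return (c ^ 0xFFFFFFFF) & 0xFFFFFFFFL; // final xor, as unsigned long
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        new java.util.Random(42).nextBytes(data);

        CRC32 jdk = new CRC32();
        jdk.update(data, 0, data.length);

        // Both compute the same CRC-32, so the results must agree.
        System.out.println(jdk.getValue() == crc32(data)); // true
    }
}
```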
Looking at how we compute CRCs in `Record`, there are two different code paths
depending on whether we call it from `Compressor` or not. The former invokes
the Crc32 update methods several times (both the byte-array and int versions),
while the latter invokes the byte-array version only once.
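A minimal illustration (hypothetical payload bytes, not the actual `Record` layout): feeding the same bytes through several update calls yields exactly the same checksum as a single byte-array update, so the two code paths can only differ in per-call overhead, not in the result:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcUpdateStyles {
    public static void main(String[] args) {
        byte[] payload = "an example record payload".getBytes(StandardCharsets.UTF_8);

        // One-shot: a single byte-array update over the whole payload.
        CRC32 oneShot = new CRC32();
        oneShot.update(payload, 0, payload.length);

        // Incremental: the same bytes fed through several update calls,
        // in the style of the Compressor code path.
        CRC32 incremental = new CRC32();
        incremental.update(payload[0]);                     // int version, one byte
        incremental.update(payload, 1, 4);                  // small array slice
        incremental.update(payload, 5, payload.length - 5); // the rest

        // Same input bytes => same CRC, regardless of update granularity.
        System.out.println(oneShot.getValue() == incremental.getValue()); // true
    }
}
```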
To really understand the impact of this change, I think we need to benchmark
the producer with varying message sizes with both implementations.
[~becket_qin], how did you come up with the 2x as fast figure?
> Re-evaluate the CRC32 class performance.
> ----------------------------------------
>
> Key: KAFKA-3174
> URL: https://issues.apache.org/jira/browse/KAFKA-3174
> Project: Kafka
> Issue Type: Improvement
> Affects Versions: 0.9.0.0
> Reporter: Jiangjie Qin
> Assignee: Jiangjie Qin
> Fix For: 0.9.0.1
>
>
> We used org.apache.kafka.common.utils.CRC32 in clients because it has better
> performance than java.util.zip.CRC32 in Java 1.6.
> In a recent test I ran, it looks like in Java 1.8 the CRC32 class is 2x as
> fast as the Crc32 class we are using now. We may want to re-evaluate the
> performance of the Crc32 class and see if it makes sense to simply use the
> Java CRC32 instead.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)