[ 
https://issues.apache.org/jira/browse/CASSANDRA-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314622#comment-17314622
 ] 

Alexey Zotov edited comment on CASSANDRA-16360 at 4/5/21, 6:54 AM:
-------------------------------------------------------------------

I've made an initial research on this task. Looks like the _CRC32C_ hardware 
support is only available when intrinsic code is used. Here are a few reference 
tickets mentioning that:
 * [https://bugs.openjdk.java.net/browse/JDK-8189745]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8191328]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8073583]

And it looks like intrinsic code is only available for the [native _CRC32C_ 
class|[https://docs.oracle.com/javase/9/docs/api/java/util/zip/CRC32C.html]] 
which comes with Java 9.

Here are the results of a microbenchmark test for native _CRC32_ and _CRC32C_ 
classes (run was made using Java 11.0.9): 
{code:java}
[java] Benchmark                             (bufferSize)  Mode  Cnt     Score  
  Error   Units
[java] ChecksumBench.benchCrc32                        31  avgt    5    98.823 
±   2.667  ns/op
[java] ChecksumBench.benchCrc32                       131  avgt    5   133.014 
±   5.831  ns/op
[java] ChecksumBench.benchCrc32                       517  avgt    5   173.939 
±  14.456  ns/op
[java] ChecksumBench.benchCrc32                      2041  avgt    5   270.847 
±   7.071  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic             31  avgt    5   118.761 
±  77.032  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            131  avgt    5   157.799 
± 377.481  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            517  avgt    5   238.125 
± 900.150  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic           2041  avgt    5   276.828 
±   7.814  ns/op
[java] ChecksumBench.benchCrc32c                       31  avgt    5    50.619 
±   2.178  ns/op
[java] ChecksumBench.benchCrc32c                      131  avgt    5    69.229 
±   2.186  ns/op
[java] ChecksumBench.benchCrc32c                      517  avgt    5   190.943 
±   3.741  ns/op
[java] ChecksumBench.benchCrc32c                     2041  avgt    5   276.401 
±   4.161  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic            31  avgt    5    56.111 
±   1.834  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           131  avgt    5    75.475 
±   1.912  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           517  avgt    5   196.557 
±   5.209  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic          2041  avgt    5   281.095 
±   5.765  ns/op
{code}
Two obvious facts were reaffirmed by this run:
 # intrinsic code works faster than non-intrinsic
 # _CRC32C_ works faster than _CRC32_

----
As I mentioned before, _CRC32C_ class is unavailable in Java 8. That means we 
cannot directly use that in the code base, otherwise it won't compile with Java 
8. So there are two ways to proceed: 1) we wait until Java 8 support is 
abandoned and start using _CRC32C_ 2) if Java 9+ is used then we use _CRC32C_ , 
otherwise we fallback to some Java-based implementation. The second approach 
sounds reasonable to me since we need to target for newer versions and faster 
solutions.

So I've given a try to two available Java-based implementations (Guava - 
_hasherCrc32c_ and Snappy - _pureJavaCrc32c_) in another microbenchmark test 
(run was made using Java 11.0.9):
{code:java}
[java] Benchmark                          (bufferSize)  Mode  Cnt     Score     
Error  Units
[java] ChecksumBench.benchCrc32                     31  avgt    5   113.076 ±  
33.568  ns/op
[java] ChecksumBench.benchCrc32                    131  avgt    5    88.348 ±  
11.433  ns/op
[java] ChecksumBench.benchCrc32                    517  avgt    5   175.464 ±  
56.718  ns/op
[java] ChecksumBench.benchCrc32                   2041  avgt    5   273.945 ±  
16.331  ns/op
[java] ChecksumBench.benchHasherCrc32c              31  avgt   25   150.860 ±   
2.883  ns/op
[java] ChecksumBench.benchHasherCrc32c             131  avgt   25   559.423 ± 
141.932  ns/op
[java] ChecksumBench.benchHasherCrc32c             517  avgt   25  1599.504 ±  
82.253  ns/op
[java] ChecksumBench.benchHasherCrc32c            2041  avgt   25  5988.707 ± 
103.558  ns/op
[java] ChecksumBench.benchPureJavaCrc32c            31  avgt   25    99.486 ±   
1.464  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           131  avgt   25   252.278 ±  
20.278  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           517  avgt   25   822.142 ±  
38.860  ns/op
[java] ChecksumBench.benchPureJavaCrc32c          2041  avgt   25  3083.762 ± 
200.493  ns/op
{code}
This test shows that:
 # Snappy's implementation outperforms Guava's  - which basically is not really 
important, we can use either of them
 # They both work slower than native _CRC32C_ implementation with intrinsic - 
which is actually expected
 # They both work slower than native _CRC32C_ implementation without intrinsic 
- which is totally fine
 # They both work slower than native _CRC32_ implementation - *which appears to 
be a problem*

 Basically that means that the users who use Java 9+ would benefit, whereas the 
users who use Java 8 would suffer from this change.

Here I need some input/feedback from the maintainers/community whether we need 
to proceed with this change or not (I have some draft changes though). Maybe 
[~tjake] and [~benedict] can help with the decision since they worked on the 
microbenchmarks and performance-related functionality in C.
----
On a separate note, Hadoop used Snappy's Java-based implementation. Then they 
migrated to native CRC32C implementation and started using Snappy's Java-based 
implementation as a fallback option if Java 8 is used. Here are the details:  
[https://github.com/apache/hadoop/pull/291|https://github.com/apache/hadoop/pull/291.].
 We can do something similar, if we're fine that for Java 8 users performance 
will slightly degrade.
----
The benchmark I used is available here (see ChecksumBench class):  
[https://github.com/apache/cassandra/pull/951]. I feel that PR can be merged 
because it contains only benchmark changes which seem to be useful for further 
developments.


was (Author: azotcsit):
I've made an initial research on this task. Looks like the _CRC32C_ hardware 
support is only available when intrinsic code is used. Here are a few reference 
tickets mentioning that:
 * [https://bugs.openjdk.java.net/browse/JDK-8189745]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8191328]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8073583]

And it looks like intrinsic code is only available for the [native _CRC32C_ 
class|[https://docs.oracle.com/javase/9/docs/api/java/util/zip/CRC32C.html]] 
which comes with Java 9.

Here are the results of a microbenchmark test for native _CRC32_ and _CRC32C_ 
classes (run was made using Java 11.0.9): 
{code:java}
[java] Benchmark                             (bufferSize)  Mode  Cnt     Score  
  Error   Units
[java] ChecksumBench.benchCrc32                        31  avgt    5    98.823 
±   2.667  ns/op
[java] ChecksumBench.benchCrc32                       131  avgt    5   133.014 
±   5.831  ns/op
[java] ChecksumBench.benchCrc32                       517  avgt    5   173.939 
±  14.456  ns/op
[java] ChecksumBench.benchCrc32                      2041  avgt    5   270.847 
±   7.071  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic             31  avgt    5   118.761 
±  77.032  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            131  avgt    5   157.799 
± 377.481  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            517  avgt    5   238.125 
± 900.150  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic           2041  avgt    5   276.828 
±   7.814  ns/op
[java] ChecksumBench.benchCrc32c                       31  avgt    5    50.619 
±   2.178  ns/op
[java] ChecksumBench.benchCrc32c                      131  avgt    5    69.229 
±   2.186  ns/op
[java] ChecksumBench.benchCrc32c                      517  avgt    5   190.943 
±   3.741  ns/op
[java] ChecksumBench.benchCrc32c                     2041  avgt    5   276.401 
±   4.161  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic            31  avgt    5    56.111 
±   1.834  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           131  avgt    5    75.475 
±   1.912  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           517  avgt    5   196.557 
±   5.209  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic          2041  avgt    5   281.095 
±   5.765  ns/op
{code}
Two obvious facts were reaffirmed by this run:
 # intrinsic code works faster than non-intrinsic
 # _CRC32C_ works faster than _CRC32_

----
As I mentioned before, _CRC32C_ class is unavailable in Java 8. That means we 
cannot directly use that in the code base, otherwise it won't compile with Java 
8. So there are two ways to proceed: 1) we wait until Java 8 support is 
abandoned and start using _CRC32C_ 2) if Java 9+ is used then we use _CRC32C_ , 
otherwise we fallback to some Java-based implementation. The second approach 
sounds reasonable to me since we need to target for newer versions and faster 
solutions.

So I've given a try to two available Java-based implementations (Guava - 
_hasherCrc32c_ and Snappy - _pureJavaCrc32c_) in another microbenchmark test 
(run was made using Java 11.0.9):
{code:java}
[java] Benchmark                          (bufferSize)  Mode  Cnt     Score     
Error  Units
[java] ChecksumBench.benchCrc32                     31  avgt    5   113.076 ±  
33.568  ns/op
[java] ChecksumBench.benchCrc32                    131  avgt    5    88.348 ±  
11.433  ns/op
[java] ChecksumBench.benchCrc32                    517  avgt    5   175.464 ±  
56.718  ns/op
[java] ChecksumBench.benchCrc32                   2041  avgt    5   273.945 ±  
16.331  ns/op
[java] ChecksumBench.benchHasherCrc32c              31  avgt   25   150.860 ±   
2.883  ns/op
[java] ChecksumBench.benchHasherCrc32c             131  avgt   25   559.423 ± 
141.932  ns/op
[java] ChecksumBench.benchHasherCrc32c             517  avgt   25  1599.504 ±  
82.253  ns/op
[java] ChecksumBench.benchHasherCrc32c            2041  avgt   25  5988.707 ± 
103.558  ns/op
[java] ChecksumBench.benchPureJavaCrc32c            31  avgt   25    99.486 ±   
1.464  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           131  avgt   25   252.278 ±  
20.278  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           517  avgt   25   822.142 ±  
38.860  ns/op
[java] ChecksumBench.benchPureJavaCrc32c          2041  avgt   25  3083.762 ± 
200.493  ns/op

{code}
This test shows that:
 # Snappy's implementation outperforms Guava's  - which basically is not really 
important, we can use either of them
 # They both work slower than native _CRC32C_ implementation with intrinsic - 
which is actually expected
 # They both work slower than native _CRC32C_ implementation without intrinsic 
- which is totally fine
 # They both work slower than native _CRC32_ implementation - *which appears to 
be a problem*

 Basically that means that the users who use Java 9+ would benefit, whereas the 
users who use Java 8 would suffer from this change.

Here I need some input/feedback from the maintainers/community whether we need 
to proceed with this change or not (I have some draft changes though). Maybe 
[~tjake] and [~benedict] can help with the decision since they worked on the 
microbenchmarks and performance-related functionality in C.
----
On a separate note, Hadoop used Snappy's Java-based implementation. Then they 
migrated to native CRC32C implementation and started using Snappy's Java-based 
implementation as a fallback option if Java 8 is used. Here are the details:  
[https://github.com/apache/hadoop/pull/291|https://github.com/apache/hadoop/pull/291.].
 We can do something similar, if we're fine that for Java 8 users performance 
will slightly degrade.
----
The benchmark I used is available here (see ChecksumBench class):  
[https://github.com/apache/cassandra/pull/951]. I feel that PR can be merged 
because it contains only benchmark changes which seem to be useful for further 
developments.

> CRC32 is inefficient on x86
> ---------------------------
>
>                 Key: CASSANDRA-16360
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16360
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Avi Kivity
>            Priority: Normal
>              Labels: protocolv5
>             Fix For: 4.0.x
>
>
> The client/server protocol specifies CRC24 and CRC32 as the checksum 
> algorithm (cql_protocol_V5_framing.asc). Those however are expensive to 
> compute; this affects both the client and the server.
>  
> A better checksum algorithm is CRC32C, which has hardware support on x86 (as 
> well as other modern architectures).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to