[jira] [Comment Edited] (CASSANDRA-21075) Optimize UTF8Validator.validate for ASCII prefixed Strings

Dmitry Konstantinov (Jira) Mon, 15 Dec 2025 06:14:06 -0800


    [ https://issues.apache.org/jira/browse/CASSANDRA-21075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045165#comment-18045165 ]


Dmitry Konstantinov edited comment on CASSANDRA-21075 at 12/15/25 2:13 PM:
---------------------------------------------------------------------------

[https://github.com/apache/cassandra/compare/trunk...netudima:cassandra:CASSANDRA-21075-trunk-experiments]

Initial microbenchmark results for a short (12 bytes) ASCII string:

Temurin-11.0.28+6
{code:java}
     [java] Benchmark                                                  Mode  Cnt   Score   Error  Units
     [java] UTF8ValidatorBench.testOldBimorphic                        avgt   15  40.309 ± 3.404  ns/op
     [java] UTF8ValidatorBench.testNewBimorphic                        avgt   15  13.775 ± 0.067  ns/op

     [java] UTF8ValidatorBench.testOldMonomorphicArray                 avgt   15  23.723 ± 2.159  ns/op
     [java] UTF8ValidatorBench.testNewMonomorphicArray                 avgt   15  10.708 ± 0.332  ns/op

     [java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer        avgt   15  44.297 ± 0.849  ns/op
     [java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer        avgt   15  13.497 ± 0.140  ns/op
{code}

Temurin-17.0.12+7 
{code}
     [java] UTF8ValidatorBench.testNewBimorphic                        avgt   15  11.405 ± 0.172  ns/op
     [java] UTF8ValidatorBench.testOldBimorphic                        avgt   15  30.383 ± 0.696  ns/op

     [java] UTF8ValidatorBench.testOldMonomorphicArray                 avgt   15  18.528 ± 0.785  ns/op
     [java] UTF8ValidatorBench.testNewMonomorphicArray                 avgt   15   8.791 ± 0.070  ns/op

     [java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer        avgt   15  27.887 ± 0.161  ns/op
     [java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer        avgt   15  11.705 ± 0.213  ns/op
{code}


was (Author: dnk):
[https://github.com/apache/cassandra/compare/trunk...netudima:cassandra:CASSANDRA-21075-trunk-experiments]

Initial microbenchmark results for a short (12 bytes) ASCII string:
{code:java}
     [java] Benchmark                                                  Mode  Cnt   Score   Error  Units
     [java] UTF8ValidatorBench.testOldBimorphic                        avgt   15  40.309 ± 3.404  ns/op
     [java] UTF8ValidatorBench.testNewBimorphic                        avgt   15  13.775 ± 0.067  ns/op

     [java] UTF8ValidatorBench.testOldMonomorphicArray                 avgt   15  23.723 ± 2.159  ns/op
     [java] UTF8ValidatorBench.testNewMonomorphicArray                 avgt   15  10.708 ± 0.332  ns/op

     [java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer        avgt   15  44.297 ± 0.849  ns/op
     [java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer        avgt   15  13.497 ± 0.140  ns/op
{code}

> Optimize UTF8Validator.validate for ASCII prefixed Strings
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-21075
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21075
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: CQL/Interpreter
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: before_cpu.html
>
>
> In my batch write test, UTF-8 validation accounts for 2.1% of CPU: 
> [^before_cpu.html] 
> In UTF8Validator.validate we can apply the same optimization that Guava and 
> the JDK use: a plain loop first checks whether the bytes are ASCII before 
> falling back to the more complicated UTF-8 parsing:
>  * [https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L123]
> {code:java}
> for (int i = off; i < end; i++) {
>     if (bytes[i] < 0) {
>         return isWellFormedSlowPath(bytes, i, end);
>     }
> } {code}
>  * java.lang.StringCoding#decodeUTF8 
> {code:java}
> // ascii-bais, which has a relative impact to the non-ascii-only bytes
> if (COMPACT_STRINGS && !hasNegatives(src, sp, len))
>     return resultCached().with(Arrays.copyOfRange(src, sp, sp + len),
>                                    LATIN1);
> return decodeUTF8_0(src, sp, len, doReplace);
> // where:
> public static boolean hasNegatives(byte[] ba, int off, int len) {
>     for (int i = off; i < off + len; i++) {
>         if (ba[i] < 0) {
>             return true;
>         }
>     }
>     return false;
> } {code}
> See also: 
> [https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/]
> Additionally, going through ValueAccessor is not free: by avoiding it 
> we can get an extra boost, especially in non-monomorphic cases.
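
For illustration, the ASCII fast path described above could look roughly like the sketch below. This is not the code from the linked branch. The class name AsciiFastPathUtf8 and its methods are hypothetical, and the slow path simply delegates to the JDK's strict UTF-8 CharsetDecoder as a stand-in for the real validator, so its accept/reject rules may differ from those of UTF8Validator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not Cassandra's UTF8Validator.
final class AsciiFastPathUtf8 {
    private AsciiFastPathUtf8() {}

    static boolean isValid(byte[] bytes) {
        return isValid(bytes, 0, bytes.length);
    }

    // Validates bytes[off, end) by working directly on the array.
    static boolean isValid(byte[] bytes, int off, int end) {
        // Fast path: ASCII is 0x00..0x7F, which is non-negative as a signed Java byte.
        for (int i = off; i < end; i++) {
            if (bytes[i] < 0) {
                // First non-ASCII byte found. Everything before it was complete
                // single-byte code points, so only the remainder needs full validation.
                return isValidSlowPath(bytes, i, end);
            }
        }
        return true;
    }

    // Assumption: the JDK's strict decoder stands in for the real slow path.
    private static boolean isValidSlowPath(byte[] bytes, int from, int end) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes, from, end - from));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For an all-ASCII input, such as the 12-byte string used in the microbenchmark, only the plain loop runs and the decoder is never created. Taking byte[] directly also means there is no accessor call per byte in the hot loop.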



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
