[ 
https://issues.apache.org/jira/browse/CASSANDRA-21020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082417#comment-18082417
 ] 

Dmitry Konstantinov commented on CASSANDRA-21020:
-------------------------------------------------

Initial draft for metrics only: [https://github.com/apache/cassandra/pull/4831]

JMH results for ThreadLocal retrieval:
{code:java}
     [java] Benchmark                             Mode  Cnt  Score   Error  Units
     [java] CassandraThreadLocalBench.casssandra  avgt    4  0.652 ± 0.369  ns/op
     [java] CassandraThreadLocalBench.netty       avgt    4  1.278 ± 0.444  ns/op
{code}
JMH results for ThreadLocalCounter increment:
{code:java}
     [java] Benchmark                                        (type)  Mode  Cnt  Score   Error  Units
     [java] ThreadLocalMetricsBench.increment      NettyThreadLocal  avgt    5  2.570 ± 0.333  ns/op
     [java] ThreadLocalMetricsBench.increment  CassandraThreadLocal  avgt    5  1.657 ± 0.202  ns/op
{code}
 

 

A Claude analysis of the perfasm output, obtained with {{-Djmh.args="-prof xctraceasm"}}:
h4. Cassandra inner loop: 3 instructions (excluding the branch)
{code:java}
0x11ea00932:  mov 0x1f8(%r10),%r8d   ; load threadLocalMetrics field off CassandraThread  (0.44%)
0x11ea00939:  test %r8d,%r8d         ; null check
              je <never taken>
0x11ea00946:  shl $0x3,%r8           ; decode compressed oop                             (24.98%)
              ; → back edge
{code}
The JIT has proven that the {{instanceof CassandraThread}} check is always true (the thread type is monomorphic), so it hoisted the check out of the loop entirely and it does not appear in the hot path at all. The entire get is one field load, one null check, and one shift. The 25% weight on the {{shl}} is a retirement stall from the preceding load rather than the shift itself being slow.
h4. Netty inner loop: 12 instructions (excluding branches)
{code:java}
0x120e06c40:  mov 0x174(%rax),%r11d       ; load threadLocalMap off FastThreadLocalThread      (4.5%)
0x120e06c47:  mov 0x54(%r12,%r11,8),%r10d ; load indexedVariables[] off threadLocalMap         (4.2%)
0x120e06c4c:  mov 0xc(%r12,%r10,8),%ecx   ; load array length                                  (6.2%)
0x120e06c51:  mov 0xc(%rsi),%edi          ; load FastThreadLocal.index (constant)              (5.3%)
0x120e06c54:  cmp %ecx,%edi               ; bounds check                                       (4.6%)
              jge <never taken>
0x120e06c5c:  cmp %ecx,%edi               ; redundant bounds check (JIT failed to elim)        (5.6%)
              jae <never taken>
0x120e06c68:  shl $0x3,%r10               ; decode compressed oop on array base                (5.3%)
0x120e06c6c:  mov 0x10(%r10,%rdi,4),%r10d ; load indexedVariables[index]                       (4.6%)
0x120e06c71:  cmp $0xf9e2dbe7,%r10d       ; UNSET sentinel check                               (5.0%)
              je <never taken>
0x120e06c80:  mov 0x8(%r12,%r10,8),%r11d  ; load klass for checkcast                           (5.2%)
0x120e06c85:  cmp $0x1b2a10,%r11d         ; checkcast ThreadLocalMetricsV2                     (7.2%)
              jne <never taken>
0x120e06c92:  shl $0x3,%r10               ; decode result                                      (4.9%)
              ; → back edge
{code}
Netty does *three pointer dereferences* in a chain ({{thread → threadLocalMap → indexedVariables[] → element}}), plus an UNSET sentinel check and a {{checkcast}}. Each dereference is a potential cache miss and a data-dependency stall: the next load cannot start until the previous one completes. Even with everything in L1, dependent loads have ~4 cycles of latency each, so 3 chained loads cost at least ~12 cycles, versus ~4 cycles for the Cassandra loop's single load.
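The chained lookup described above can be modelled in plain Java to show where each dependent load comes from. This is a simplified sketch of the lookup shape only, not Netty's source; {{SlotMap}}, {{SlotThread}}, {{Slot}} and {{IndexedLookupDemo}} are hypothetical names:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical names; a simplified model of an indexed thread-local lookup, not Netty's code.
final class SlotMap {
    static final Object UNSET = new Object(); // sentinel meaning "no value stored yet"
    Object[] slots = new Object[8];

    SlotMap() {
        Arrays.fill(slots, UNSET);
    }
}

class SlotThread extends Thread {
    final SlotMap map = new SlotMap();

    SlotThread(Runnable task) {
        super(task);
    }
}

final class Slot<T> {
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();
    private final int index = NEXT_INDEX.getAndIncrement(); // fixed per Slot instance
    private final Supplier<T> initial;

    Slot(Supplier<T> initial) {
        this.initial = initial;
    }

    // Must be called on a SlotThread; this sketch has no fallback for other threads.
    @SuppressWarnings("unchecked")
    T get() {
        SlotMap map = ((SlotThread) Thread.currentThread()).map; // load 1: thread -> map
        Object[] slots = map.slots;                              // load 2: map -> array
        if (index >= slots.length) {                             // grow, keeping new cells UNSET
            int oldLength = slots.length;
            slots = Arrays.copyOf(slots, Math.max(index + 1, oldLength * 2));
            Arrays.fill(slots, oldLength, slots.length, SlotMap.UNSET);
            map.slots = slots;
        }
        Object value = slots[index];                             // load 3: array -> element
        if (value == SlotMap.UNSET) {                            // sentinel check
            value = initial.get();
            slots[index] = value;
        }
        return (T) value; // erased here; the caller's use of T is what emits the checkcast
    }
}

final class IndexedLookupDemo {
    // Increments a per-thread counter the given number of times on a SlotThread.
    static int incrementOnSlotThread(int times) {
        Slot<int[]> counter = new Slot<>(() -> new int[1]);
        int[] result = new int[1];
        SlotThread thread = new SlotThread(() -> {
            for (int i = 0; i < times; i++) {
                counter.get()[0]++;
            }
            result[0] = counter.get()[0];
        });
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }
}
```

Each of the three loads in {{get()}} needs the result of the previous one, which is the dependency chain the analysis refers to. A field on the thread object needs only the first.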

> Optimize thread local for metrics and tracing
> ---------------------------------------------
>
>                 Key: CASSANDRA-21020
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21020
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Observability/Metrics, Observability/Tracing
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, Netty thread-local logic is used in tracing to keep state and in 
> metrics (the thread-local counters logic introduced in CASSANDRA-20250), so we do 
> thread-local lookups many times while processing a request. These cases 
> can be optimized by storing these objects in fields of the Thread itself, 
> by introducing CassandraThread as a child of FastThreadLocalThread.
> A similar idea can be found even in the JDK (the ThreadLocalRandom logic was 
> introduced to speed up ForkJoinPool).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
