[
https://issues.apache.org/jira/browse/GROOVY-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054973#comment-18054973
]
ASF GitHub Bot commented on GROOVY-10307:
-----------------------------------------
jamesfredley opened a new pull request, #2374:
URL: https://github.com/apache/groovy/pull/2374
This PR significantly improves invokedynamic performance in Grails 7 /
Groovy 4 by implementing several key optimizations:
https://issues.apache.org/jira/browse/GROOVY-10307
https://github.com/apache/grails-core/issues/15293
Using https://github.com/jamesfredley/grails7-performance-regression as a
test application
**Notation:** 10K/10K = `groovy.indy.optimize.threshold` / `groovy.indy.fallback.threshold` (both set to 10,000)
### Key Results
| Configuration | Runtime (4GB) | vs No-Indy | vs Baseline Indy | Startup |
|--------------|---------------|------------|------------------|---------|
| **No-Indy** | **80ms** | 1.0x | 3.5x faster | ~27s |
| **Optimized 10K/10K** | **155ms** | 1.9x slower | **1.8x faster** | ~20s |
| **Optimized 200/200** | **144ms** | 1.8x slower | **1.9x faster** | ~31s |
| **Baseline Indy** | 280ms | 3.5x slower | 1.0x | ~21s |
## Problem Statement
### Root Cause Analysis
The performance regression was traced to the **global SwitchPoint guard** in
Groovy's invokedynamic implementation:
1. **Global SwitchPoint Invalidation:** When ANY metaclass changes anywhere
in the application, ALL call sites are invalidated (see the sketch after this list)
2. **Grails Framework Behavior:** Grails frequently modifies metaclasses
during request processing (controllers, services, domain classes)
3. **Cascade Effect:** Each metaclass change triggers mass invalidation,
forcing all call sites to re-resolve their targets
4. **Guard Failures:** The constant invalidation causes guards to fail
repeatedly, preventing the JIT from optimizing method handle chains
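To make point 1 concrete, the pattern can be illustrated with the plain `java.lang.invoke` API. The following is a minimal, hypothetical sketch of a single global `SwitchPoint` guarding every call-site target; it is not Groovy's actual implementation, and the names `GLOBAL`, `guard`, and `metaClassChanged` are illustrative only.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.SwitchPoint;

class GlobalSwitchPointSketch {
    // One JVM-wide switch point shared by every indy call site.
    static volatile SwitchPoint GLOBAL = new SwitchPoint();

    // Each call site's target is wrapped like this: while GLOBAL is valid the
    // cached target runs; once invalidated, every call funnels into the slow
    // fallback that re-resolves the method.
    static MethodHandle guard(MethodHandle cachedTarget, MethodHandle fallback) {
        return GLOBAL.guardWithTest(cachedTarget, fallback);
    }

    // A metaclass change anywhere flips the switch for ALL call sites at once.
    static void metaClassChanged() {
        SwitchPoint.invalidateAll(new SwitchPoint[]{GLOBAL});
        GLOBAL = new SwitchPoint();
    }
}
```

Because the switch point is shared, one metaclass change sends every call site back through its fallback, producing the cascade and repeated guard failures described above.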
---
## Solution Implemented
### Optimizations Applied
1. **Disabled Global SwitchPoint Guard**
- Removed the global SwitchPoint that caused ALL call sites to
invalidate when ANY metaclass changed
- Individual call sites now manage their own invalidation via cache
clearing
- Can be re-enabled with: `-Dgroovy.indy.switchpoint.guard=true`
2. **Monomorphic Fast-Path** (sketched after this list)
- Added volatile fields (`latestClassName`/`latestMethodHandleWrapper`)
to `CacheableCallSite`
- Skips synchronization and map lookups for call sites that consistently
see the same types
- This is the common case in practice (most call sites are monomorphic)
3. **Call Site Cache Invalidation**
- Implemented call site registration mechanism (`ALL_CALL_SITES` set
with weak references)
- When metaclass changes, all registered call sites have their caches
cleared
- Targets reset to default (fromCache) path, ensuring metaclass changes
are visible
- Garbage-collected call sites are automatically removed
4. **Optimized Cache Operations**
- Added `get()` method for cache lookup without automatic put
- Added `putIfAbsent()` for thread-safe cache population
- Added `clearCache()` for metaclass change invalidation
- Reduced allocations on the hot path
5. **Pre-Guard Method Handle Storage**
- Selector now stores the method handle before argument type guards are
applied (`handleBeforeArgGuards`)
- Enables potential future optimizations for direct invocation
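To make the fast-path and invalidation mechanics above concrete (points 2-4), here is a minimal sketch of a `CacheableCallSite`-style class. The identifiers `latestClassName`, `latestMethodHandleWrapper`, `get()`, `putIfAbsent()`, `clearCache()`, `MethodHandleWrapper`, and `ALL_CALL_SITES` are taken from the PR description; the surrounding structure, types, and weak-set implementation are assumptions for illustration, not the actual Groovy code.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the wrapper type mentioned in the PR.
class MethodHandleWrapper {
    final MethodHandle targetMethodHandle;
    MethodHandleWrapper(MethodHandle mh) { this.targetMethodHandle = mh; }
}

class CacheableCallSiteSketch extends MutableCallSite {

    // Weakly referenced registry of live call sites (ALL_CALL_SITES in the PR);
    // garbage-collected call sites drop out of the set automatically.
    private static final Set<CacheableCallSiteSketch> ALL_CALL_SITES =
            Collections.newSetFromMap(new WeakHashMap<>());

    // Monomorphic fast path: the receiver class and handle seen most recently.
    private volatile String latestClassName;
    private volatile MethodHandleWrapper latestMethodHandleWrapper;

    // Per-call-site cache keyed by receiver class name.
    private final Map<String, MethodHandleWrapper> cache = new ConcurrentHashMap<>();

    CacheableCallSiteSketch(MethodType type) {
        super(type);
        synchronized (ALL_CALL_SITES) { ALL_CALL_SITES.add(this); }
    }

    /** Cache lookup without populating the cache on a miss. */
    MethodHandleWrapper get(String className) {
        MethodHandleWrapper latest = latestMethodHandleWrapper;
        if (latest != null && className.equals(latestClassName)) {
            return latest;                 // fast path: no locking, no map lookup
        }
        return cache.get(className);
    }

    /** Thread-safe population; also refreshes the monomorphic fast path. */
    MethodHandleWrapper putIfAbsent(String className, MethodHandleWrapper wrapper) {
        MethodHandleWrapper existing = cache.putIfAbsent(className, wrapper);
        MethodHandleWrapper winner = existing != null ? existing : wrapper;
        latestClassName = className;
        latestMethodHandleWrapper = winner;
        return winner;
    }

    /** Per-call-site invalidation, used instead of the global SwitchPoint. */
    void clearCache() {
        cache.clear();
        latestClassName = null;
        latestMethodHandleWrapper = null;
        // Per the PR description, the real implementation also resets the
        // target to the fromCache() path here.
    }

    /** Called on a metaclass change: clear every registered call site. */
    static void invalidateAllCallSites() {
        synchronized (ALL_CALL_SITES) {
            for (CacheableCallSiteSketch cs : ALL_CALL_SITES) cs.clearCache();
        }
    }
}
```

The volatile pair keeps the common monomorphic case to two field reads, while the weak set bounds the cost of metaclass-change invalidation to the call sites that are still live.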
### What Was NOT Changed
- **Default thresholds remain 10,000/10,000** (same as baseline GROOVY_4_0_X)
- **All guards remain intact** - guards are NOT bypassed, only the global
SwitchPoint is disabled
- **Cache semantics unchanged** - LRU eviction, soft references, same
default size (4)
---
## Performance Results
### Test Environment
- **Java:** JetBrains JDK 25
- **OS:** Windows 10/11
- **Test Application:** Grails 7 performance benchmark
- **Data Set:** 5 Companies, 29 Departments, 350 Employees, 102 Projects,
998 Tasks, 349 Milestones, 20 Skills
- **Test Method:** 5 warmup + 50 test iterations per run, 5 runs per
configuration
### Runtime Performance Comparison (4GB Heap - Recommended)
| Configuration | Min | Max | Avg | Stability |
|--------------|-----|-----|-----|-----------|
| **No-Indy** | 78ms | ~92ms | **82ms** | Excellent |
| **Optimized 200/200** | 129ms | ~187ms | **144ms** | Excellent |
| **Optimized 10K/10K** | 149ms | ~166ms | **155ms** | Excellent |
| **Optimized 1K/1K** | 183ms | ~261ms | 185ms | Good |
| **Baseline Indy** | 277ms | ~299ms | 280ms | Good |
### Startup Time Comparison
| Configuration | Default (~16GB) | 4GB | 8GB | Notes |
|--------------|-----------------|-----|-----|-------|
| No-Indy | 27s | 27s | 27s | Consistent |
| Baseline Indy | 28s | **21s** | **21s** | Faster with constrained memory |
| Optimized 10K/10K | 21s | **20s** | **20s** | Fastest indy startup |
| Optimized 1K/1K | 24s | 29s | 23s | Variable |
| Optimized 200/200 | **47s** | **31s** | **37s** | Longer due to eager optimization |
### Threshold Trade-off Analysis
| Thresholds | Best Startup | Best Runtime | Best For |
|------------|--------------|--------------|----------|
| **10,000/10,000** | **20s** | 155ms | Fast startup, most production |
| **1,000/1,000** | 23-29s | 185-260ms | Not recommended |
| **200/200** | 31-47s | **144ms** | Best runtime, long-running services |
### Visual Performance Summary (4GB Heap)
```
Runtime Performance - Lower is Better
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
No-Indy: ████ 82ms
Opt 200/200: ███████ 144ms ← BEST INDY RUNTIME
Opt 10K/10K: ████████ 155ms ← RECOMMENDED
Opt 1K/1K: █████████ 185ms
Baseline Indy: ██████████████ 280ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Startup Time - Lower is Better
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Opt 10K/10K: ████████ 20s ← FASTEST INDY
Baseline Indy: █████████ 21s
No-Indy: ███████████ 27s
Opt 1K/1K: ████████████ 29s
Opt 200/200: █████████████ 31s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
### Complete 15-Configuration Test Matrix
All tests were performed without GC logging so that logging overhead would not
skew the measurements. Runtime values are stabilized averages (runs 3-5).
| Configuration | Memory | Startup | Runtime |
|--------------|--------|---------|---------|
| **No-Indy** | Default (~16GB) | 27s | 78ms |
| **No-Indy** | 4GB | 27s | 82ms |
| **No-Indy** | 8GB | 27s | 81ms |
| **Baseline Indy** | Default (~16GB) | 28s | 460ms |
| **Baseline Indy** | 4GB | 21s | 280ms |
| **Baseline Indy** | 8GB | 21s | 285ms |
| **Optimized 10K/10K** | Default (~16GB) | 21s | 155ms |
| **Optimized 10K/10K** | 4GB | 20s | 155ms |
| **Optimized 10K/10K** | 8GB | 20s | 154ms |
| **Optimized 1K/1K** | Default (~16GB) | 24s | 260ms |
| **Optimized 1K/1K** | 4GB | 29s | 185ms |
| **Optimized 1K/1K** | 8GB | 23s | 186ms |
| **Optimized 200/200** | Default (~16GB) | 47s | 200ms |
| **Optimized 200/200** | 4GB | 31s | 144ms |
| **Optimized 200/200** | 8GB | 37s | 147ms |
## Performance Improvement Analysis
### Gap Reduction vs No-Indy (4GB Heap)
| Configuration | Runtime | Gap to No-Indy | Multiplier | Gap Reduction |
|--------------|---------|----------------|------------|---------------|
| No-Indy | 82ms | - | 1.0x | - |
| Baseline Indy | 280ms | 198ms | 3.4x slower | - |
| Optimized 10K/10K | 155ms | 73ms | 1.9x slower | **63%** |
| Optimized 1K/1K | 185ms | 103ms | 2.3x slower | 48% |
| **Optimized 200/200** | **144ms** | **62ms** | **1.8x slower** | **69%** |
## Garbage Collection Analysis
GC analysis was performed using `-Xlog:gc*` with G1GC on key 4GB heap
configurations.
### GC Analysis (4GB Heap Configurations)
| Configuration | Young GC | Full GC | Avg Pause | Max Pause | Metaspace |
|--------------|----------|---------|-----------|-----------|-----------|
| **No-Indy** | 38 | 0 | 20.84ms | 33.29ms | 217MB |
| **Baseline Indy** | 100 | 0 | 14.21ms | 31.64ms | 126MB |
| **Optimized 10K/10K** | 68 | 0 | 20.68ms | 33.88ms | 159MB |
| **Optimized 200/200** | 62 | 0 | 21.99ms | 33.23ms | 214MB |
### Key GC Findings
1. **Baseline Indy Triggers 2.6x More GC Events**
- Baseline indy: 100 Young GC events during benchmark
- No-indy: 38 Young GC events
   - Optimized: 62-68 Young GC events (a 32-38% reduction vs baseline)
- The constant guard failures and method re-resolution in baseline indy
cause significantly more object allocations
2. **Optimized Configurations Reduce GC Events by 32-38%**
- Optimized 10K/10K: 68 Young GC events (32% fewer than baseline)
- Optimized 200/200: 62 Young GC events (38% fewer than baseline)
- Reduced allocations from fewer method re-resolutions
3. **No Full GC in Any Configuration**
- 4GB heap provides sufficient headroom
- G1GC handles all workloads with Young GC only
4. **Metaspace Usage Correlates with JIT Optimization**
- Baseline indy: 126MB (lowest - less JIT optimization of method handles)
- Optimized 10K/10K: 159MB (medium - moderate caching)
- No-indy: 217MB (high - more direct JIT compilation)
- Optimized 200/200: 214MB (high - more compiled method handles due to
aggressive optimization)
5. **GC Pause Times Are Similar**
- All configurations: ~21ms average, ~33ms maximum
- GC is NOT the primary cause of performance differences
- The performance gap is from method handle overhead, not GC
### Heap Usage Patterns
| Configuration | Typical Before GC | Typical After GC | Live Data |
|--------------|-------------------|------------------|-----------|
| No-Indy | ~1320MB | ~142MB | ~142MB |
| Baseline Indy | ~1350MB | ~130MB | ~130MB |
| Optimized Indy | ~1320MB | ~145MB | ~145MB |
All configurations show similar heap patterns: the optimizations do not
significantly change memory usage; they reduce CPU overhead.
### Memory Recommendations
1. **4GB Heap is Optimal** - Provides headroom without waste
2. **Default G1GC Settings Work Well** - No tuning needed
3. **Monitor Metaspace** - Optimized indy uses ~230MB; ensure no restrictive
metaspace limit is set
4. **GC Tuning Won't Improve Indy Performance** - The issue is CPU-bound,
not memory-bound
---
## Failed Optimization Attempts
These approaches were tried but caused test failures. They are documented
here to prevent repeating the same mistakes.
### Failed Attempt 1: Skip Metaclass Guards for Cache Hits
**What it tried:** Store method handle before guards, use unguarded handle
for cache hits.
**Why it failed:** Call sites can be polymorphic (called with multiple
receiver types). The unguarded handles contain `explicitCastArguments` with
embedded type casts that throw `ClassCastException` when different types pass
through.
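The `ClassCastException` failure mode is easy to reproduce with the plain `java.lang.invoke` API. The following self-contained illustration (not code from the PR) shows why an unguarded handle whose receiver type has been erased to `Object` is unsafe at a polymorphic call site:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class UnguardedHandleDemo {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        // Handle for String.length(), exact type (String)int.
        MethodHandle length = lookup.findVirtual(
                String.class, "length", MethodType.methodType(int.class));

        // Erase the receiver type to Object, the shape a call site sees.
        // explicitCastArguments embeds a cast-to-String inside the handle.
        MethodHandle erased = MethodHandles.explicitCastArguments(
                length, MethodType.methodType(int.class, Object.class));

        System.out.println(erased.invoke((Object) "hello"));  // 5: the cached receiver type

        // A different receiver type reaching the same unguarded handle fails:
        // the embedded cast throws ClassCastException (Integer is not a String).
        System.out.println(erased.invoke((Object) 42));
    }
}
```

The receiver-class guard in the full method handle chain exists precisely to route the second call to the fallback instead of into the embedded cast.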
### Failed Attempt 2: Version-Based Metaclass Guards
**What it tried:** Use metaclass version instead of identity for guards,
skip guards for "cache hits."
**Why it failed:** Same polymorphic call site issue - skipping guards for
cache hits is fundamentally unsafe.
### Failed Attempt 3: Skip Argument Guards After Warmup
**What it tried:** After warmup period, use handles without argument type
guards.
**Why it failed:** Call sites can receive different argument types over
their lifetime. Skipping argument guards causes `ClassCastException` or
`IncompatibleClassChangeError`.
### Key Learning
**Any optimization that bypasses guards based on cache state is unsafe
because:**
1. The cache key (receiver Class) identifies which handle to use
2. But the call site itself can receive different types over its lifetime
3. Guards in the method handle chain ensure proper fallback when types don't
match
4. The JVM's `explicitCastArguments` will throw if types don't match
**Safe optimizations must:**
1. Keep all guards intact
2. Focus on reducing guard overhead, not bypassing guards
3. Reduce cache misses rather than optimize the miss path unsafely
---
## Remaining Performance Gap
The optimized indy is still roughly 1.8-2.3x slower than no-indy, depending on
thresholds. The remaining gap comes from the following sources (a minimal
guarded-chain sketch follows the list):
1. **Guard evaluation overhead** - Every call must check type guards
2. **Method handle chain traversal** - Composed handles have inherent
overhead
3. **Cache lookup cost** - Even with monomorphic fast-path, there's lookup
overhead
4. **Fallback path cost** - Cache misses trigger full method resolution
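To show where points 1 and 2 come from, here is a minimal, hypothetical guarded chain built with `MethodHandles.guardWithTest`; it is not Groovy's selector code, and the class and method names are illustrative only.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class GuardedChainSketch {
    // Guard test: stand-in for "is the receiver the cached class?"
    static boolean receiverIsCachedClass(Object receiver) {
        return receiver instanceof String;
    }

    // Stand-in for the slow path that re-resolves against the metaclass.
    static int slowPath(Object receiver) {
        return receiver.toString().length();
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType objToInt = MethodType.methodType(int.class, Object.class);

        MethodHandle test = lookup.findStatic(GuardedChainSketch.class,
                "receiverIsCachedClass",
                MethodType.methodType(boolean.class, Object.class));
        MethodHandle cachedTarget = lookup.findVirtual(String.class, "length",
                MethodType.methodType(int.class)).asType(objToInt);
        MethodHandle fallback = lookup.findStatic(GuardedChainSketch.class,
                "slowPath", objToInt);

        // Every invocation dispatches into the guard test before the target;
        // that extra hop is the per-call overhead that remains even on cache hits.
        MethodHandle site = MethodHandles.guardWithTest(test, cachedTarget, fallback);
        System.out.println(site.invoke((Object) "hello"));  // 5, via cachedTarget
        System.out.println(site.invoke((Object) 12345));    // 5, via slowPath
    }
}
```

The chain is only cheap when the JIT can inline it down to the equivalent of a plain virtual call, which is why JIT-friendly patterns appear in the future optimization areas below.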
### Potential Future Optimization Areas
1. **Reduce guard evaluation cost** - Make guards cheaper (not skip them)
2. **Improve cache hit rate** - Larger cache, better eviction policy
3. **Reduce method handle chain depth** - Fewer composed handles
4. **JIT-friendly patterns** - Ensure method handle chains can be inlined
5. **Profile-guided optimization** - Use runtime profiles to specialize hot
paths
---
## Configuration Reference
### System Properties
| Property | Default | Description |
|----------|---------|-------------|
| `groovy.indy.optimize.threshold` | 10000 | Hit count before setting guarded target on call site |
| `groovy.indy.fallback.threshold` | 10000 | Fallback count before resetting to cache lookup path |
| `groovy.indy.callsite.cache.size` | 4 | Size of LRU cache per call site |
| `groovy.indy.switchpoint.guard` | false | Enable global SwitchPoint guard (NOT recommended) |
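For the benchmark configurations above, the 10K/10K rows correspond to these defaults. A long-running service that favors runtime throughput over startup can lower both thresholds, for example by passing `-Dgroovy.indy.optimize.threshold=200 -Dgroovy.indy.fallback.threshold=200` on the JVM command line.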
## Files Modified
### Source Files
| File | Changes |
|------|---------|
| `IndyInterface.java` | Call site registration, cache invalidation on metaclass change, optimized fromCache() |
| `CacheableCallSite.java` | Monomorphic fast-path, get(), putIfAbsent(), clearCache() methods |
| `Selector.java` | Disabled SwitchPoint guard, handleBeforeArgGuards storage |
| `MethodHandleWrapper.java` | directMethodHandle field, hit count tracking |
> Groovy 4 runtime performance on average 2.4x slower than Groovy 3
> -----------------------------------------------------------------
>
> Key: GROOVY-10307
> URL: https://issues.apache.org/jira/browse/GROOVY-10307
> Project: Groovy
> Issue Type: Bug
> Components: bytecode, performance
> Affects Versions: 4.0.0-beta-1, 3.0.9
> Environment: OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9
> (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
> WIN10 (tests) / REL 8 (web application)
> IntelliJ 2021.2
> Reporter: mgroovy
> Priority: Major
> Attachments: groovy_3_0_9_gc.png, groovy_3_0_9_loop2.png,
> groovy_3_0_9_loop4.png, groovy_3_0_9_mem.png, groovy_4_0_0_b1_loop2.png,
> groovy_4_0_0_b1_loop4.png, groovy_4_0_0_b1_loop4_gc.png,
> groovy_4_0_0_b1_loop4_mem.png,
> groovysql_performance_groovy4_2_xx_yy_zzzz.groovy, loops.groovy,
> profile3.txt, profile4-loops.txt, profile4.txt, profile4d.txt
>
>
> Groovy 4.0.0-beta-1 runtime performance in our framework is on average 2 to 3
> times slower compared to using Groovy 3.0.9 (regular i.e. non-INDY)
> * Our complete framework and application code is completely written in
> Groovy, spread over multiple IntelliJ modules
> ** mixed @CompileDynamic/@TypeChecked and @CompileStatic
> ** No Java classes left in project, i.e. no cross compilation occurs
> * We build using IntelliJ 2021.2 Groovy build process, then run / deploy the
> compiled class files
> ** We do _not_ use a Groovy based DSL, nor do we execute Groovy scripts
> during execution
> * Performance degradation when using Groovy 4.0.0-beta-1 instead of Groovy
> 3.0.9 (non-INDY):
> ** The performance of the largest of our web applications has dropped 3x
> (startup) / 2x (table refresh) respectively
> *** Stack: Tomcat/Vaadin/Ebean plus framework generated SQL
> ** Our test suite runs about 2.4 times as long as before (120 min when using
> G4, compared to about 50 min with G3)
> *** JUnit 5
> *** test suite also contains no scripts / dynamic code execution
> *** Individual test performance varies: A small number of tests runs faster,
> but the majority is slower, with some extreme cases taking nearly 10x as long
> to finish
> * Using Groovy 3.0.9 INDY displays nearly identical performance degradation,
> so it seems that the use of invoke dynamic is somehow at fault
--
This message was sent by Atlassian Jira
(v8.20.10#820010)