[
https://issues.apache.org/jira/browse/GROOVY-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054973#comment-18054973
]
ASF GitHub Bot commented on GROOVY-10307:
-----------------------------------------
jamesfredley opened a new pull request, #2374:
URL: https://github.com/apache/groovy/pull/2374
This PR significantly improves invokedynamic performance in Grails 7 /
Groovy 4 by implementing several key optimizations:
https://issues.apache.org/jira/browse/GROOVY-10307
https://github.com/apache/grails-core/issues/15293
Using https://github.com/jamesfredley/grails7-performance-regression as a
test application
**Notation:** 10K/10K = `groovy.indy.optimize.threshold` / `groovy.indy.fallback.threshold` (both set to 10,000)
### Key Results
| Configuration | Runtime (4GB) | vs No-Indy | vs Baseline Indy | Startup |
|--------------|---------------|------------|------------------|---------|
| **No-Indy** | **80ms** | 1.0x | 3.5x faster | ~27s |
| **Optimized 10K/10K** | **155ms** | 1.9x slower | **1.8x faster** | ~20s |
| **Optimized 200/200** | **144ms** | 1.8x slower | **1.9x faster** | ~31s |
| **Baseline Indy** | 280ms | 3.5x slower | 1.0x | ~21s |
## Problem Statement
### Root Cause Analysis
The performance regression was traced to the **global SwitchPoint guard** in
Groovy's invokedynamic implementation:
1. **Global SwitchPoint Invalidation:** When ANY metaclass changes anywhere
in the application, ALL call sites are invalidated (see the sketch after this list)
2. **Grails Framework Behavior:** Grails frequently modifies metaclasses
during request processing (controllers, services, domain classes)
3. **Cascade Effect:** Each metaclass change triggers mass invalidation,
forcing all call sites to re-resolve their targets
4. **Guard Failures:** The constant invalidation causes guards to fail
repeatedly, preventing the JIT from optimizing method handle chains
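To make point 1 concrete, the pattern can be illustrated with the plain `java.lang.invoke` API. The following is a minimal, hypothetical sketch of a single global `SwitchPoint` guarding every call-site target; it is not Groovy's actual implementation, and the names `GLOBAL`, `guard`, and `metaClassChanged` are illustrative only.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.SwitchPoint;

class GlobalSwitchPointSketch {
    // One JVM-wide switch point shared by every indy call site.
    static volatile SwitchPoint GLOBAL = new SwitchPoint();

    // Each call site's target is wrapped like this: while GLOBAL is valid the
    // cached target runs; once invalidated, every call funnels into the slow
    // fallback that re-resolves the method.
    static MethodHandle guard(MethodHandle cachedTarget, MethodHandle fallback) {
        return GLOBAL.guardWithTest(cachedTarget, fallback);
    }

    // A metaclass change anywhere flips the switch for ALL call sites at once.
    static void metaClassChanged() {
        SwitchPoint.invalidateAll(new SwitchPoint[]{GLOBAL});
        GLOBAL = new SwitchPoint();
    }
}
```

Because the switch point is shared, one metaclass change sends every call site back through its fallback, producing the cascade and repeated guard failures described above.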
---
## Solution Implemented
### Optimizations Applied
1. **Disabled Global SwitchPoint Guard**
- Removed the global SwitchPoint that caused ALL call sites to
invalidate when ANY metaclass changed
- Individual call sites now manage their own invalidation via cache
clearing
- Can be re-enabled with: `-Dgroovy.indy.switchpoint.guard=true`
2. **Monomorphic Fast-Path** (sketched after this list)
- Added volatile fields (`latestClassName`/`latestMethodHandleWrapper`)
to `CacheableCallSite`
- Skips synchronization and map lookups for call sites that consistently
see the same types
- This is the common case in practice (most call sites are monomorphic)
3. **Call Site Cache Invalidation**
- Implemented call site registration mechanism (`ALL_CALL_SITES` set
with weak references)
- When metaclass changes, all registered call sites have their caches
cleared
- Targets reset to default (fromCache) path, ensuring metaclass changes
are visible
- Garbage-collected call sites are automatically removed
4. **Optimized Cache Operations**
- Added `get()` method for cache lookup without automatic put
- Added `putIfAbsent()` for thread-safe cache population
- Added `clearCache()` for metaclass change invalidation
- Reduced allocations on the hot path
5. **Pre-Guard Method Handle Storage**
- Selector now stores the method handle before argument type guards are
applied (`handleBeforeArgGuards`)
- Enables potential future optimizations for direct invocation
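To make the fast-path and invalidation mechanics above concrete (points 2-4), here is a minimal sketch of a `CacheableCallSite`-style class. The identifiers `latestClassName`, `latestMethodHandleWrapper`, `get()`, `putIfAbsent()`, `clearCache()`, `MethodHandleWrapper`, and `ALL_CALL_SITES` are taken from the PR description; the surrounding structure, types, and weak-set implementation are assumptions for illustration, not the actual Groovy code.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the wrapper type mentioned in the PR.
class MethodHandleWrapper {
    final MethodHandle targetMethodHandle;
    MethodHandleWrapper(MethodHandle mh) { this.targetMethodHandle = mh; }
}

class CacheableCallSiteSketch extends MutableCallSite {

    // Weakly referenced registry of live call sites (ALL_CALL_SITES in the PR);
    // garbage-collected call sites drop out of the set automatically.
    private static final Set<CacheableCallSiteSketch> ALL_CALL_SITES =
            Collections.newSetFromMap(new WeakHashMap<>());

    // Monomorphic fast path: the receiver class and handle seen most recently.
    private volatile String latestClassName;
    private volatile MethodHandleWrapper latestMethodHandleWrapper;

    // Per-call-site cache keyed by receiver class name.
    private final Map<String, MethodHandleWrapper> cache = new ConcurrentHashMap<>();

    CacheableCallSiteSketch(MethodType type) {
        super(type);
        synchronized (ALL_CALL_SITES) { ALL_CALL_SITES.add(this); }
    }

    /** Cache lookup without populating the cache on a miss. */
    MethodHandleWrapper get(String className) {
        MethodHandleWrapper latest = latestMethodHandleWrapper;
        if (latest != null && className.equals(latestClassName)) {
            return latest;                 // fast path: no locking, no map lookup
        }
        return cache.get(className);
    }

    /** Thread-safe population; also refreshes the monomorphic fast path. */
    MethodHandleWrapper putIfAbsent(String className, MethodHandleWrapper wrapper) {
        MethodHandleWrapper existing = cache.putIfAbsent(className, wrapper);
        MethodHandleWrapper winner = existing != null ? existing : wrapper;
        latestClassName = className;
        latestMethodHandleWrapper = winner;
        return winner;
    }

    /** Per-call-site invalidation, used instead of the global SwitchPoint. */
    void clearCache() {
        cache.clear();
        latestClassName = null;
        latestMethodHandleWrapper = null;
        // Per the PR description, the real implementation also resets the
        // target to the fromCache() path here.
    }

    /** Called on a metaclass change: clear every registered call site. */
    static void invalidateAllCallSites() {
        synchronized (ALL_CALL_SITES) {
            for (CacheableCallSiteSketch cs : ALL_CALL_SITES) cs.clearCache();
        }
    }
}
```

The volatile pair keeps the common monomorphic case to two field reads, while the weak set bounds the cost of metaclass-change invalidation to the call sites that are still live.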
### What Was NOT Changed
- **Default thresholds remain 10,000/10,000** (same as baseline GROOVY_4_0_X)
- **All guards remain intact** - guards are NOT bypassed, only the global
SwitchPoint is disabled
- **Cache semantics unchanged** - LRU eviction, soft references, same
default size (4)
---
## Performance Results
### Test Environment
- **Java:** JetBrains JDK 25
- **OS:** Windows 10/11
- **Test Application:** Grails 7 performance benchmark
- **Data Set:** 5 Companies, 29 Departments, 350 Employees, 102 Projects,
998 Tasks, 349 Milestones, 20 Skills
- **Test Method:** 5 warmup + 50 test iterations per run, 5 runs per
configuration
### Runtime Performance Comparison (4GB Heap - Recommended)
| Configuration | Min | Max | Avg | Stability |
|--------------|-----|-----|-----|-----------|
| **No-Indy** | 78ms | ~92ms | **82ms** | Excellent |
| **Optimized 200/200** | 129ms | ~187ms | **144ms** | Excellent |
| **Optimized 10K/10K** | 149ms | ~166ms | **155ms** | Excellent |
| **Optimized 1K/1K** | 183ms | ~261ms | 185ms | Good |
| **Baseline Indy** | 277ms | ~299ms | 280ms | Good |
### Startup Time Comparison
| Configuration | Default (~16GB) | 4GB | 8GB | Notes |
|--------------|-----------------|-----|-----|-------|
| No-Indy | 27s | 27s | 27s | Consistent |
| Baseline Indy | 28s | **21s** | **21s** | Faster with constrained memory |
| Optimized 10K/10K | 21s | **20s** | **20s** | Fastest indy startup |
| Optimized 1K/1K | 24s | 29s | 23s | Variable |
| Optimized 200/200 | **47s** | **31s** | **37s** | Longer due to eager optimization |
### Threshold Trade-off Analysis
| Thresholds | Best Startup | Best Runtime | Best For |
|------------|--------------|--------------|----------|
| **10,000/10,000** | **20s** | 155ms | Fast startup, most production |
| **1,000/1,000** | 23-29s | 185-260ms | Not recommended |
| **200/200** | 31-47s | **144ms** | Best runtime, long-running services |
### Visual Performance Summary (4GB Heap)
```
Runtime Performance - Lower is Better
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
No-Indy: ████ 82ms
Opt 200/200: ███████ 144ms ← BEST INDY RUNTIME
Opt 10K/10K: ████████ 155ms ← RECOMMENDED
Opt 1K/1K: █████████ 185ms
Baseline Indy: ██████████████ 280ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Startup Time - Lower is Better
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Opt 10K/10K: ████████ 20s ← FASTEST INDY
Baseline Indy: █████████ 21s
No-Indy: ███████████ 27s
Opt 1K/1K: ████████████ 29s
Opt 200/200: █████████████ 31s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
### Complete 15-Configuration Test Matrix
All tests were performed without GC logging so that logging overhead would not
skew the measurements. Runtime values are stabilized averages (runs 3-5).
| Configuration | Memory | Startup | Runtime |
|--------------|--------|---------|---------|
| **No-Indy** | Default (~16GB) | 27s | 78ms |
| **No-Indy** | 4GB | 27s | 82ms |
| **No-Indy** | 8GB | 27s | 81ms |
| **Baseline Indy** | Default (~16GB) | 28s | 460ms |
| **Baseline Indy** | 4GB | 21s | 280ms |
| **Baseline Indy** | 8GB | 21s | 285ms |
| **Optimized 10K/10K** | Default (~16GB) | 21s | 155ms |
| **Optimized 10K/10K** | 4GB | 20s | 155ms |
| **Optimized 10K/10K** | 8GB | 20s | 154ms |
| **Optimized 1K/1K** | Default (~16GB) | 24s | 260ms |
| **Optimized 1K/1K** | 4GB | 29s | 185ms |
| **Optimized 1K/1K** | 8GB | 23s | 186ms |
| **Optimized 200/200** | Default (~16GB) | 47s | 200ms |
| **Optimized 200/200** | 4GB | 31s | 144ms |
| **Optimized 200/200** | 8GB | 37s | 147ms |
## Performance Improvement Analysis
### Gap Reduction vs No-Indy (4GB Heap)
| Configuration | Runtime | Gap to No-Indy | Multiplier | Gap Reduction |
|--------------|---------|----------------|------------|---------------|
| No-Indy | 82ms | - | 1.0x | - |
| Baseline Indy | 280ms | 198ms | 3.4x slower | - |
| Optimized 10K/10K | 155ms | 73ms | 1.9x slower | **63%** |
| Optimized 1K/1K | 185ms | 103ms | 2.3x slower | 48% |
| **Optimized 200/200** | **144ms** | **62ms** | **1.8x slower** | **69%** |
## Garbage Collection Analysis
GC analysis was performed using `-Xlog:gc*` with G1GC on key 4GB heap
configurations.
### GC Analysis (4GB Heap Configurations)
| Configuration | Young GC | Full GC | Avg Pause | Max Pause | Metaspace |
|--------------|----------|---------|-----------|-----------|-----------|
| **No-Indy** | 38 | 0 | 20.84ms | 33.29ms | 217MB |
| **Baseline Indy** | 100 | 0 | 14.21ms | 31.64ms | 126MB |
| **Optimized 10K/10K** | 68 | 0 | 20.68ms | 33.88ms | 159MB |
| **Optimized 200/200** | 62 | 0 | 21.99ms | 33.23ms | 214MB |
### Key GC Findings
1. **Baseline Indy Triggers 2.6x More GC Events**
- Baseline indy: 100 Young GC events during benchmark
- No-indy: 38 Young GC events
   - Optimized: 62-68 Young GC events (a 32-38% reduction vs baseline)
- The constant guard failures and method re-resolution in baseline indy
cause significantly more object allocations
2. **Optimized Configurations Reduce GC Events by 32-38%**
- Optimized 10K/10K: 68 Young GC events (32% fewer than baseline)
- Optimized 200/200: 62 Young GC events (38% fewer than baseline)
- Reduced allocations from fewer method re-resolutions
3. **No Full GC in Any Configuration**
- 4GB heap provides sufficient headroom
- G1GC handles all workloads with Young GC only
4. **Metaspace Usage Correlates with JIT Optimization**
- Baseline indy: 126MB (lowest - less JIT optimization of method handles)
- Optimized 10K/10K: 159MB (medium - moderate caching)
- No-indy: 217MB (high - more direct JIT compilation)
- Optimized 200/200: 214MB (high - more compiled method handles due to
aggressive optimization)
5. **GC Pause Times Are Similar**
- All configurations: ~21ms average, ~33ms maximum
- GC is NOT the primary cause of performance differences
- The performance gap is from method handle overhead, not GC
### Heap Usage Patterns
| Configuration | Typical Before GC | Typical After GC | Live Data |
|--------------|-------------------|------------------|-----------|
| No-Indy | ~1320MB | ~142MB | ~142MB |
| Baseline Indy | ~1350MB | ~130MB | ~130MB |
| Optimized Indy | ~1320MB | ~145MB | ~145MB |
All configurations show similar heap patterns: the optimizations do not
significantly change memory usage; they reduce CPU overhead.
### Memory Recommendations
1. **4GB Heap is Optimal** - Provides headroom without waste
2. **Default G1GC Settings Work Well** - No tuning needed
3. **Monitor Metaspace** - Optimized indy uses ~230MB; ensure no restrictive
metaspace limit is set
4. **GC Tuning Won't Improve Indy Performance** - The issue is CPU-bound,
not memory-bound
---
## Failed Optimization Attempts
These approaches were tried but caused test failures. They are documented
here to prevent repeating the same mistakes.
### Failed Attempt 1: Skip Metaclass Guards for Cache Hits
**What it tried:** Store method handle before guards, use unguarded handle
for cache hits.
**Why it failed:** Call sites can be polymorphic (called with multiple
receiver types). The unguarded handles contain `explicitCastArguments` with
embedded type casts that throw `ClassCastException` when different types pass
through.
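The `ClassCastException` failure mode is easy to reproduce with the plain `java.lang.invoke` API. The following self-contained illustration (not code from the PR) shows why an unguarded handle whose receiver type has been erased to `Object` is unsafe at a polymorphic call site:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class UnguardedHandleDemo {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        // Handle for String.length(), exact type (String)int.
        MethodHandle length = lookup.findVirtual(
                String.class, "length", MethodType.methodType(int.class));

        // Erase the receiver type to Object, the shape a call site sees.
        // explicitCastArguments embeds a cast-to-String inside the handle.
        MethodHandle erased = MethodHandles.explicitCastArguments(
                length, MethodType.methodType(int.class, Object.class));

        System.out.println(erased.invoke((Object) "hello"));  // 5: the cached receiver type

        // A different receiver type reaching the same unguarded handle fails:
        // the embedded cast throws ClassCastException (Integer is not a String).
        System.out.println(erased.invoke((Object) 42));
    }
}
```

The receiver-class guard in the full method handle chain exists precisely to route the second call to the fallback instead of into the embedded cast.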
### Failed Attempt 2: Version-Based Metaclass Guards
**What it tried:** Use metaclass version instead of identity for guards,
skip guards for "cache hits."
**Why it failed:** Same polymorphic call site issue - skipping guards for
cache hits is fundamentally unsafe.
### Failed Attempt 3: Skip Argument Guards After Warmup
**What it tried:** After warmup period, use handles without argument type
guards.
**Why it failed:** Call sites can receive different argument types over
their lifetime. Skipping argument guards causes `ClassCastException` or
`IncompatibleClassChangeError`.
### Key Learning
**Any optimization that bypasses guards based on cache state is unsafe
because:**
1. The cache key (receiver Class) identifies which handle to use
2. But the call site itself can receive different types over its lifetime
3. Guards in the method handle chain ensure proper fallback when types don't
match
4. The JVM's `explicitCastArguments` will throw if types don't match
**Safe optimizations must:**
1. Keep all guards intact
2. Focus on reducing guard overhead, not bypassing guards
3. Reduce cache misses rather than optimize the miss path unsafely
---
## Remaining Performance Gap
The optimized indy is still roughly 1.8-2.3x slower than no-indy, depending on
thresholds. The remaining gap comes from the following sources (a minimal
guarded-chain sketch follows the list):
1. **Guard evaluation overhead** - Every call must check type guards
2. **Method handle chain traversal** - Composed handles have inherent
overhead
3. **Cache lookup cost** - Even with monomorphic fast-path, there's lookup
overhead
4. **Fallback path cost** - Cache misses trigger full method resolution
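To show where points 1 and 2 come from, here is a minimal, hypothetical guarded chain built with `MethodHandles.guardWithTest`; it is not Groovy's selector code, and the class and method names are illustrative only.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class GuardedChainSketch {
    // Guard test: stand-in for "is the receiver the cached class?"
    static boolean receiverIsCachedClass(Object receiver) {
        return receiver instanceof String;
    }

    // Stand-in for the slow path that re-resolves against the metaclass.
    static int slowPath(Object receiver) {
        return receiver.toString().length();
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType objToInt = MethodType.methodType(int.class, Object.class);

        MethodHandle test = lookup.findStatic(GuardedChainSketch.class,
                "receiverIsCachedClass",
                MethodType.methodType(boolean.class, Object.class));
        MethodHandle cachedTarget = lookup.findVirtual(String.class, "length",
                MethodType.methodType(int.class)).asType(objToInt);
        MethodHandle fallback = lookup.findStatic(GuardedChainSketch.class,
                "slowPath", objToInt);

        // Every invocation dispatches into the guard test before the target;
        // that extra hop is the per-call overhead that remains even on cache hits.
        MethodHandle site = MethodHandles.guardWithTest(test, cachedTarget, fallback);
        System.out.println(site.invoke((Object) "hello"));  // 5, via cachedTarget
        System.out.println(site.invoke((Object) 12345));    // 5, via slowPath
    }
}
```

The chain is only cheap when the JIT can inline it down to the equivalent of a plain virtual call, which is why JIT-friendly patterns appear in the future optimization areas below.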
### Potential Future Optimization Areas
1. **Reduce guard evaluation cost** - Make guards cheaper (not skip them)
2. **Improve cache hit rate** - Larger cache, better eviction policy
3. **Reduce method handle chain depth** - Fewer composed handles
4. **JIT-friendly patterns** - Ensure method handle chains can be inlined
5. **Profile-guided optimization** - Use runtime profiles to specialize hot
paths
---
## Configuration Reference
### System Properties
| Property | Default | Description |
|----------|---------|-------------|
| `groovy.indy.optimize.threshold` | 10000 | Hit count before setting guarded target on call site |
| `groovy.indy.fallback.threshold` | 10000 | Fallback count before resetting to cache lookup path |
| `groovy.indy.callsite.cache.size` | 4 | Size of LRU cache per call site |
| `groovy.indy.switchpoint.guard` | false | Enable global SwitchPoint guard (NOT recommended) |
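For the benchmark configurations above, the 10K/10K rows correspond to these defaults. A long-running service that favors runtime throughput over startup can lower both thresholds, for example by passing `-Dgroovy.indy.optimize.threshold=200 -Dgroovy.indy.fallback.threshold=200` on the JVM command line.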
## Files Modified
### Source Files
| File | Changes |
|------|---------|
| `IndyInterface.java` | Call site registration, cache invalidation on metaclass change, optimized fromCache() |
| `CacheableCallSite.java` | Monomorphic fast-path, get(), putIfAbsent(), clearCache() methods |
| `Selector.java` | Disabled SwitchPoint guard, handleBeforeArgGuards storage |
| `MethodHandleWrapper.java` | directMethodHandle field, hit count tracking |
> Groovy 4 runtime performance on average 2.4x slower than Groovy 3
> -----------------------------------------------------------------
>
> Key: GROOVY-10307
> URL: https://issues.apache.org/jira/browse/GROOVY-10307
> Project: Groovy
> Issue Type: Bug
> Components: bytecode, performance
> Affects Versions: 4.0.0-beta-1, 3.0.9
> Environment: OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9
> (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
> WIN10 (tests) / REL 8 (web application)
> IntelliJ 2021.2
> Reporter: mgroovy
> Priority: Major
> Attachments: groovy_3_0_9_gc.png, groovy_3_0_9_loop2.png,
> groovy_3_0_9_loop4.png, groovy_3_0_9_mem.png, groovy_4_0_0_b1_loop2.png,
> groovy_4_0_0_b1_loop4.png, groovy_4_0_0_b1_loop4_gc.png,
> groovy_4_0_0_b1_loop4_mem.png,
> groovysql_performance_groovy4_2_xx_yy_zzzz.groovy, loops.groovy,
> profile3.txt, profile4-loops.txt, profile4.txt, profile4d.txt
>
>
> Groovy 4.0.0-beta-1 runtime performance in our framework is on average 2 to 3
> times slower compared to using Groovy 3.0.9 (regular i.e. non-INDY)
> * Our complete framework and application code is completely written in
> Groovy, spread over multiple IntelliJ modules
> ** mixed @CompileDynamic/@TypeChecked and @CompileStatic
> ** No Java classes left in project, i.e. no cross compilation occurs
> * We build using IntelliJ 2021.2 Groovy build process, then run / deploy the
> compiled class files
> ** We do _not_ use a Groovy based DSL, nor do we execute Groovy scripts
> during execution
> * Performance degradation when using Groovy 4.0.0-beta-1 instead of Groovy
> 3.0.9 (non-INDY):
> ** The performance of the largest of our web applications has dropped 3x
> (startup) / 2x (table refresh) respectively
> *** Stack: Tomcat/Vaadin/Ebean plus framework generated SQL
> ** Our test suite runs about 2.4 times as long as before (120 min when using
> G4, compared to about 50 min with G3)
> *** JUnit 5
> *** test suite also contains no scripts / dynamic code execution
> *** Individual test performance varies: A small number of tests runs faster,
> but the majority is slower, with some extreme cases taking nearly 10x as long
> to finish
> * Using Groovy 3.0.9 INDY displays nearly identical performance degradation,
> so it seems that the use of invoke dynamic is somehow at fault
--
This message was sent by Atlassian Jira
(v8.20.10#820010)