[ 
https://issues.apache.org/jira/browse/GROOVY-12034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul King updated GROOVY-12034:
-------------------------------
    Attachment: run2-indy-off.txt
                run2-indy-off.log
                run1-indy-on.txt
                run1-indy-on.log

> Apply "fat-free lambda" patterns to Groovy — DGM `*With` family + `curryWith` 
> helper
> ------------------------------------------------------------------------------------
>
>                 Key: GROOVY-12034
>                 URL: https://issues.apache.org/jira/browse/GROOVY-12034
>             Project: Groovy
>          Issue Type: Improvement
>            Reporter: Paul King
>            Assignee: Paul King
>            Priority: Major
>         Attachments: run1-indy-on.log, run1-indy-on.txt, run2-indy-off.log, 
> run2-indy-off.txt
>
>
> h2. Summary
> Donald Raab's blog post "Fat-Free Lambdas in Java" 
> ([https://donraab.medium.com/fat-free-lambdas-in-java-bf228da0613b]) 
> describes a pattern where stateless lambdas, combined with library APIs that 
> take an extra parameter, produce zero per-call allocation. This ticket 
> assesses applying that pattern to Groovy, backed by a JMH benchmark.
> *Result:* the optimisation already works for {{@CompileStatic}} lambdas 
> targeting a functional interface (GROOVY-11905), but Groovy's GDK does not 
> expose an API surface that lets idiomatic call sites _use_ it. The dominant 
> Groovy idiom for "don't capture" — {{Closure.rcurry}} — measures *worse on 
> every axis* than just capturing. Adding a small {{*With}} family to 
> {{DefaultGroovyMethods}}, plus a {{curryWith}} helper, would deliver the 
> fat-free benefit to ordinary Groovy code with no breaking changes.
> h2. Background — the technique
> Three pillars stack to make the optimisation work in Java:
> # *Compiler hoisting of stateless lambdas* — javac lowers non-capturing {{(x) 
> -> ...}} to a static {{LambdaMetafactory}} call site whose linked 
> {{CallSite}} returns the same singleton forever.
> # *API that prefers passing over capturing* — Eclipse Collections' 
> {{anySatisfyWith(Predicate2<T,P>, P)}} accepts a 2-arg functional interface 
> plus the would-be-captured value, so the lambda body becomes stateless.
> # *Method references* — {{String::startsWith}} is unbound; naturally 
> non-capturing.
> h2. Mapping to Groovy
> h3. Pillar 1 — Compiler hoisting
> *Already done for {{@CompileStatic}} lambdas targeting a functional interface 
> (GROOVY-11905).*
> * {{StaticTypesLambdaAnalyzer.isNonCapturing()}} at 
> {{StaticTypesLambdaWriter.java:72-75}} detects non-capturing lambdas.
> * Non-capturing lambdas get {{ACC_STATIC}} on the synthetic {{doCall}} 
> ({{StaticTypesLambdaWriter.java:339}}).
> * Emitted as {{invokedynamic}} → {{LambdaMetafactory.metafactory}} with 
> {{H_INVOKESTATIC}} ({{StaticTypesLambdaWriter.java:179-183, :429}}). The JDK 
> returns the singleton.
> *Cannot be applied to plain {{Closure}}* (and probably should not be):
> * Every {{Closure}} carries {{owner}}, {{thisObject}}, {{delegate}}, 
> {{resolveStrategy}}, {{directive}}, {{parameterTypes}}, 
> {{maximumNumberOfParameters}}, {{bcw}} ({{Closure.java:260-274}}) — ~56-72 
> bytes minimum, dwarfing a Java capturing lambda.
> * {{Closure}} instances are externally mutable: callers may set 
> {{.delegate}}, {{.resolveStrategy}}, call {{.curry(...)}}. Sharing a 
> singleton would let one caller's mutation leak globally.
> * In dynamic compilation, {{LambdaWriter}} delegates to {{ClosureWriter}} 
> ({{LambdaWriter.java:27-56}}), so even a non-capturing lambda becomes a fresh 
> {{Closure}} subclass per evaluation.
> h3. Pillar 2 — API surface (the gap)
> Groovy's GDK has *no equivalent* of Eclipse Collections' {{anySatisfyWith}}. 
> Confirmed by inspection of {{DefaultGroovyMethods.java}}:
> * 
> {{findAll}}/{{find}}/{{any}}/{{every}}/{{collect}}/{{inject}}/{{count}}/{{groupBy}}
>  (lines 5484, 5376, 600, 5267, 2336, 8787, 3607, 7850) all take a single 
> {{Closure}}.
> * The only {{BiPredicate}}/{{BiFunction}} overloads in DGM are 
> {{zipWithNext}} ({{DGM.java:18515}}) and {{groupConsecutive}} 
> ({{DGM.java:18696}}) — both pair _adjacent elements_, not 
> element-plus-external-arg.
> Groovy's existing "don't capture" tool is {{Closure.rcurry/curry/ncurry}} 
> ({{Closure.java:673, 706, 755}}). But every call allocates:
> {code:java}
> super(uncurriedClosure.clone());       // CurriedClosure.java:67 — clones the 
> base Closure
> this.curriedArguments = arguments;     // CurriedClosure.java:70 — fresh 
> Object[]
> {code}
> So {{rcurry(value)}} costs a clone + wrapper + array, _heavier than just 
> capturing_. (Confirmed by the benchmark below.)
> h3. Pillar 3 — Method references
> Already work in {{@CompileStatic}}. Same indy path, same singleton behaviour, 
> no synthetic class needed (the method handle points straight at the target 
> method).
> h2. Benchmark
> Sidecar code added at:
> * 
> {{subprojects/performance/src/jmh/groovy/org/apache/groovy/bench/FatFreeLambda.groovy}}
> * 
> {{subprojects/performance/src/jmh/groovy/org/apache/groovy/bench/FatFreeLambdaBench.java}}
> Seven variants compared, all traversing the full list (prefix never matches) 
> with identical per-element work:
> ||Tag||Variant||Per-call allocation shape||
> |A|{{data.any \{ it.startsWith(prefix) \}}} — capturing closure literal|1 × 
> {{Closure}} subclass + per-element wrap|
> |B|{{data.any(pred.rcurry(prefix))}} — closure + rcurry|1 × {{Closure}} + 1 × 
> {{CurriedClosure}} + per-element args|
> |C|{{anyWith(data, (s,p) -> s.startsWith(p), prefix)}} — candidate API + 
> lambda|0 (indy singleton via {{LambdaMetafactory}})|
> |D|{{anyWith(data, String::startsWith, prefix)}} — candidate API + method 
> ref|0 (same; no synthetic class)|
> |E|hand-written {{for}} loop|0 (baseline)|
> |F|{{static final}} hoisted closure + rcurry|1 × {{CurriedClosure}} + 
> per-element args (no per-call literal)|
> |G|{{anyPred(data, curryWith(STARTS_WITH, prefix))}} — fat-free curry 
> helper|1 small capturing lambda per call|
> Setup: JMH 1.37, JDK 17.0.18 Zulu, macOS, 3×2s warmup + 5×2s × 2 forks, GC 
> profiler enabled, size sweep {{@Param({"100","1000","10000"})}}. Two passes: 
> default ({{indy=true}}) and {{-Pindy=false}}. Each pass ~21 min wall.
> Run command:
> {noformat}
> ./gradlew :perf:jmh -PbenchInclude=FatFreeLambda
> {noformat}
> h2. Results
> h3. Throughput (ops/µs, higher = better, {{indy=true}})
> ||Variant||size=100||size=1000||size=10000||
> |C {{anyWith}} + lambda|9.926 ± 0.105|1.001 ± 0.004|0.096 ± 0.001|
> |D {{anyWith}} + method ref|9.622 ± 0.114|0.998 ± 0.009|0.096 ± 0.001|
> |E baseline {{for}} loop|8.215 ± 0.050|0.768 ± 0.007|0.072 ± 0.001|
> |G {{curryWith}}|7.850 ± 0.237|0.927 ± 0.012|0.078 ± 0.001|
> |A capture closure|0.584 ± 0.024|0.062 ± 0.002|0.007 ± 0.001|
> |B rcurry|0.252 ± 0.010|0.025 ± 0.001|0.003 ± 0.001|
> |F static-final + rcurry|0.248 ± 0.001|0.025 ± 0.002|0.003 ± 0.001|
> h3. Allocation (bytes/op, lower = better, {{indy=true}})
> ||Variant||size=100||size=1000||size=10000||Per-element rate||
> |C/D|≈ 10⁻⁴|≈ 10⁻³|0.031|*0*|
> |E baseline|≈ 10⁻³|0.004|32 (GC-thread noise)|*0*|
> |G {{curryWith}}|*128*|*128*|*128*|*0 — constant per call*|
> |A capture closure|7,304|72,080|720,096|~72 B/elem + ~80 fixed|
> |B rcurry|9,808|96,208|960,209|~96 B/elem + ~80 fixed|
> |F shared closure + rcurry|9,764|96,176|960,153|matches B (60 B/call less)|
> h3. Indy on vs off (size=1000)
> ||Variant||indy=on ops/µs||indy=off ops/µs||Δ||indy=on B/op||indy=off B/op||
> |A capture|0.062|0.057|−8%|72,080|72,080|
> |B rcurry|0.025|0.024|−4%|96,208|96,580|
> |C {{anyWith}}+lambda|1.001|1.001|0%|0.003|0.003|
> |D {{anyWith}}+methodRef|0.998|0.999|0%|0.003|0.003|
> |E baseline|0.768|0.905|+18%|0.003|0.003|
> |F shared+rcurry|0.025|0.025|0%|96,176|96,472|
> |G curryWith|0.927|0.894|−4%|128|120|
> h2. Findings
> # *The fat-free claim holds in Groovy, with margin.* C and D allocate 
> effectively zero bytes regardless of size. At size=10000, C/D move 0.03 B/op 
> vs 720 KB/op for A and 960 KB/op for B — *5 orders of magnitude apart*. The 
> optimisation is real and usable today via {{@CompileStatic}} + functional 
> interface targets; the GDK just doesn't currently offer the API to reach it.
> # *The Groovy idiom "curry instead of capture" is empirically backwards.* 
> Across every size and both indy modes, B ({{rcurry}}) is *~2.5× slower* than 
> A (capturing closure) and allocates *~33% more bytes*. The cause is 
> {{CurriedClosure.call}} concatenating {{curriedArguments}} with the incoming 
> args into a fresh {{Object[]}} on every element. At size=10000 that's 10,000 
> small arrays per call.
> # *Hoisting the base closure to {{static final}} is essentially worthless.* F 
> matches B to within ~60 bytes per call across all configurations. The 
> per-call closure-literal allocation is noise; the per-_element_ 
> {{CurriedClosure}} cost is everything. Advising users to "extract closure 
> literals to constants" doesn't solve the rcurry problem.
> # *{{curryWith}} is the right shape for the adaptation gap.* G allocates a 
> flat *128 B per call* regardless of workload size. At size=10000 that's 
> *~7,500× less allocation than {{rcurry}}* for the same logical operation. 
> Throughput-wise G is within 10% of the hand-written {{for}} loop. This covers 
> the case where users _can't_ avoid producing a unary 
> {{Predicate}}/{{Function}} (handing off to {{Stream.filter}}, 
> {{Collection.removeIf}}, etc.).
> # *Indy mode is irrelevant to this story.* Allocations are bit-identical 
> between {{-Pindy=true}} and {{-Pindy=false}}; throughputs move ≤8% for the 
> closure variants and 0% for the lambda variants. *The proposal helps indy and 
> non-indy users equally.*
> # *C and D systematically beat E by 21-33%* across every size, {{indy=true}}. 
> Reproducible across both runs. Most likely the {{BiPredicate.test → 
> static-method-handle → s.startsWith(p)}} chain inlines into a slightly 
> tighter loop body than the direct {{for (String s : data) 
> s.startsWith(prefix)}}. Not load-bearing for the proposal, but worth noting 
> as the strongest possible refutation of "Groovy lambdas can't match Java 
> loops".
> h2. Proposed actions (_proposed pending team review_)
> h3. Primary: add {{*With}} overloads to {{DefaultGroovyMethods}}
> Additive — no source or binary breakage. ~8–12 methods cover the high-traffic 
> surface:
> {code:java}
> public static <T,P> List<T>     findAll(Iterable<T> self, BiPredicate<? super 
> T, ? super P> test, P p);
> public static <T,P> boolean     any   (Iterable<T> self, BiPredicate<? super 
> T, ? super P> test, P p);
> public static <T,P> boolean     every (Iterable<T> self, BiPredicate<? super 
> T, ? super P> test, P p);
> public static <T,P> Number      count (Iterable<T> self, BiPredicate<? super 
> T, ? super P> test, P p);
> public static <T,P,R> List<R>   collect(Iterable<T> self, BiFunction<? super 
> T, ? super P, ? extends R> f, P p);
> public static <T,P> void        each(Iterable<T> self, BiConsumer<? super T, 
> ? super P> c, P p);
> public static <T,P> T           find(Iterable<T> self, BiPredicate<? super T, 
> ? super P> test, P p);
> public static <T,P,K> Map<K,List<T>> groupBy(Iterable<T> self, BiFunction<? 
> super T, ? super P, K> classifier, P p);
> // + inject/reduce with seed; *With variants on Map<K,V>
> {code}
> Open naming question: {{With}}-suffixed overloads (mirrors Eclipse 
> Collections) vs. unsuffixed overloads disambiguated by argument types. The 
> latter is the more usual Groovy approach; the former is more obviously 
> discoverable. _Proposed pending team review._
> h3. Primary: add a {{curryWith}} helper
> {code:groovy}
> public static <T,P> Predicate<T>  curryWith(BiPredicate<? super T, ? super P> 
> bi, P p);
> public static <T,P,R> Function<T,R> curryWith(BiFunction<? super T, ? super 
> P, ? extends R> f, P p);
> public static <T,P> Consumer<T>   curryWith(BiConsumer<? super T, ? super P> 
> c, P p);
> {code}
> Right-curry semantics (the value lands in the second/last parameter slot, 
> matching what {{*With}} consumers want). Should ship alongside {{*With}} — 
> the {{rcurry}} → {{*With}} → {{curryWith}} ratio (~33,000× → 1× → ~7,500×) is 
> the strongest argument for shipping the helper in the same release, so users 
> have a fat-free path for every shape.
> h3. Bonus ticket A: {{CurriedClosure}} per-element allocation
> {{CurriedClosure.call}} concatenates {{curriedArguments}} with the incoming 
> args into a fresh {{Object[]}} on every invocation 
> ({{CurriedClosure.java:81-99}}). The dominant cost in every closure+curry 
> path in this benchmark. A no-allocation variant for the common case 
> ({{curriedArguments.length == 1 && args.length == 1}}) would save ~24 
> B/element on every rcurry — a lift for everyone already using rcurry in 
> production code, without anyone needing to change source.
> h3. Bonus ticket B: {{LambdaWriter}} {{Reference}} wrapping for 
> effectively-final method parameters
> LambdaWriter wraps effectively-final method parameters in 
> {{groovy.lang.Reference}} — visible in G's 128 B/call constant (synthetic 
> lambda constructor signature: {{(Object, Object, Reference, Reference)V}}). 
> javac captures such parameters by value into the synthetic lambda's final 
> fields with no wrapping; Groovy could do the same, eliminating the 
> per-capture {{Reference}} allocations. Lower priority than bonus A because 
> JIT escape-analysis often elides these wrappers in hot paths, but they show 
> up in cold paths, megamorphic call sites, and interpreter-only execution.
> h2. Caveats
> * Benchmark uses 1000-element traversal with a cheap predicate 
> ({{startsWith}}); closure overhead is maximally exposed. With heavier 
> per-element work the proportional gap narrows.
> * Single dev machine; no CPU pinning. Throughputs are reproducible to ~10%, 
> allocations to <1 byte.
> * JEP 450 compact object headers would further widen the C/D advantage but 
> require Java 25+; Groovy's minimum supported JDK is currently 17 (commit 
> {{e9b9cf1ec5}} / GROOVY-12027).
> * The proposed {{*With}} family pays off most under {{@CompileStatic}} + 
> lambda syntax. Dynamic-Groovy callers using closures get no allocation win — 
> but the closure overloads remain, so no regression.
> * *Related:* GROOVY-11905 (non-capturing lambda optimisation — landed)
> h2. Appendix — raw bench output
> Two full-fidelity passes saved at:
> * Run 1 ({{indy=true}}): {{/tmp/fatfree-bench/run1-indy-on.\{log,txt\}}}
> * Run 2 ({{indy=false}}): {{/tmp/fatfree-bench/run2-indy-off.\{log,txt\}}}
> Bench source: 
> {{subprojects/performance/src/jmh/groovy/org/apache/groovy/bench/FatFreeLambda\{,Bench\}.\{groovy,java\}}}
> To reproduce with GC profiler:
> {noformat}
> # Add `profilers = ['gc']` to the jmh { } block in
> # build-logic/src/main/groovy/org.apache.groovy-performance.gradle
> ./gradlew :perf:jmh -PbenchInclude=FatFreeLambda                # indy=true
> ./gradlew :perf:jmh -PbenchInclude=FatFreeLambda -Pindy=false   # indy=false
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to