gerlowskija commented on code in PR #4514: URL: https://github.com/apache/solr/pull/4514#discussion_r3454742751
########## changelog/unreleased/SOLR-18284-load-and-memory-circuit-breakers.yml: ########## @@ -0,0 +1,8 @@ +# See https://github.com/apache/solr/blob/main/dev-docs/changelog.adoc +title: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks Review Comment: [0] "stampede"? "trip"? I understand what this is pointing to, but IMO there's not quite enough information here for a user reading our changelog to judge whether they're impacted or whether they need to read more. IMO we'd be better off breaking this into two changelog entries (i.e. totally separate files) that give a tad more detail on the negative behavior users will no longer see. e.g. > LoadAverageCircuitBreaker now uses cached metric values to avoid consuming unnecessary CPU at high RPS. and... > MemoryCircuitBreaker now looks at post-GC heap utilization to avoid false-positives when the heap is filled with collectible garbage. ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/TtlSampledMetric.java: ########## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicReference; +import java.util.function.Supplier; + +/** + * Single-flight, time-bounded cache around an expensive metric sample. + * + * <ul> Review Comment: [Q] What is this list representing and does it belong in Javadocs? AFAICT this class doesn't expose the fresh/stale/cold distinction in any way, so it's largely an implementation detail? ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java: ########## @@ -17,75 +17,145 @@ package org.apache.solr.util.circuitbreaker; -import java.io.IOException; import java.lang.invoke.MethodHandles; import java.lang.management.ManagementFactory; import java.lang.management.MemoryMXBean; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryType; +import java.lang.management.MemoryUsage; +import java.util.ArrayList; +import java.util.List; import java.util.Locale; -import org.apache.solr.util.RefCounted; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.regex.Pattern; +import org.apache.solr.common.util.EnvUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; /** - * Tracks the current JVM heap usage and triggers if a moving heap usage average over 30 seconds - * exceeds the defined percentage of the maximum heap size allocated to the JVM. Once the average - * memory usage goes below the threshold, it will start allowing queries again. + * Trips when post-collection live data in the JVM heap exceeds a configured percentage of the + * maximum heap size. * - * <p>The memory threshold is defined as a percentage of the maximum memory allocated -- see - * memThreshold in <code>solrconfig.xml</code>. + * <p>The signal is read from {@link MemoryPoolMXBean#getCollectionUsage()} on the old/tenured heap + * pool, which reports memory usage immediately after the most recent collection that affected that + * pool. This is the only memory reading that distinguishes "live data" from "garbage waiting to be + * collected." + * + * <p>Earlier versions of this breaker sampled {@link MemoryMXBean#getHeapMemoryUsage()} on a + * 30-second moving average, which produced a high signal during normal operation: with a + * generational collector, {@code used} climbs toward {@code max} between collections — that's the + * steady-state shape, not a problem. The new signal updates only when an old-gen GC runs, which is + * the only point at which "how full is the heap really?" has a defined answer. + * + * <p>Pool selection by collector: + * + * <ul> + * <li><b>G1 / Parallel / Serial / generational ZGC:</b> uses the pool whose name matches the + * word-boundary pattern {@code \b(Old|Tenured)\b}. + * <li><b>Non-generational ZGC and Shenandoah:</b> single combined heap pool — the breaker sums + * {@code getCollectionUsage()} across every {@link MemoryType#HEAP} pool instead. + * </ul> + * + * <p>Pre-first-GC, {@link MemoryPoolMXBean#getCollectionUsage()} can return {@code null} on every + * pool; in that case the breaker reports {@code 0} live bytes and will not trip until the JVM has + * performed at least one collection on a heap pool. + * + * <p>The threshold semantics are unchanged: configure a percentage of the maximum heap size, and Review Comment: [0] "semantics are unchanged" - Similar to my comment on L44 above, references to how things _used to behave_ or past history are comprehensible in this PR but won't make much sense six months from now. Change-history is best left out of Javadocs IMO ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java: ########## @@ -17,75 +17,145 @@ package org.apache.solr.util.circuitbreaker; -import java.io.IOException; import java.lang.invoke.MethodHandles; import java.lang.management.ManagementFactory; import java.lang.management.MemoryMXBean; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryType; +import java.lang.management.MemoryUsage; +import java.util.ArrayList; +import java.util.List; import java.util.Locale; -import org.apache.solr.util.RefCounted; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.regex.Pattern; +import org.apache.solr.common.util.EnvUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; /** - * Tracks the current JVM heap usage and triggers if a moving heap usage average over 30 seconds - * exceeds the defined percentage of the maximum heap size allocated to the JVM. Once the average - * memory usage goes below the threshold, it will start allowing queries again. + * Trips when post-collection live data in the JVM heap exceeds a configured percentage of the + * maximum heap size. * - * <p>The memory threshold is defined as a percentage of the maximum memory allocated -- see - * memThreshold in <code>solrconfig.xml</code>. + * <p>The signal is read from {@link MemoryPoolMXBean#getCollectionUsage()} on the old/tenured heap + * pool, which reports memory usage immediately after the most recent collection that affected that + * pool. This is the only memory reading that distinguishes "live data" from "garbage waiting to be + * collected." + * + * <p>Earlier versions of this breaker sampled {@link MemoryMXBean#getHeapMemoryUsage()} on a Review Comment: [Q] Is there a reason that a javadocs reader would care what the behavior used to be? This feels like LLM output that could be probably be trimmed back ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/TtlSampledMetric.java: ########## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicReference; +import java.util.function.Supplier; + +/** + * Single-flight, time-bounded cache around an expensive metric sample. + * + * <ul> + * <li><b>Fresh:</b> within the TTL window every caller returns the cached value with one volatile + * read. + * <li><b>Stale:</b> exactly one thread runs the underlying sampler at a time (chosen by CAS); all + * other concurrent callers immediately return the most recent published value rather than + * queueing behind the refresh or piling onto the sampler. + * <li><b>Cold (no sample yet):</b> guarded by a one-time {@code synchronized (this)} block so + * only the first caller computes; the rest see the result. The monitor is held across the + * sampler invocation, so concurrent first-callers will block — but this path runs at most + * once per instance, and only callers that arrive before any value has been published are + * affected. + * </ul> + * + * <p>Used by {@link LoadAverageCircuitBreaker} and {@link MemoryCircuitBreaker} so that high-QPS + * admission control cannot stampede the OS load-average syscall or the post-GC heap-pool walk: even + * under thousands of concurrent {@code isTripped()} callers, the underlying sampler is invoked at + * most once per TTL window. + * + * <p><b>Exception behavior:</b> if the {@code source} supplier throws, the exception propagates to + * the calling thread and no new sample is published. Any previously-published value remains and + * other concurrent callers continue to see it; the next caller to find the entry stale will retry + * the supplier. The {@code refreshing} flag is always released, so a thrown sampler does not wedge + * the single-flight latch. + */ +final class TtlSampledMetric<T> { Review Comment: [Q] The implementation here is great and I applaud your concurrency code. But I'm a little surprised that we'd have to implement this functionality ourselves and that there's nothing "off the shelf" that we could use instead. Surely we're not the first folks out there to calculate primitives with a TTL. Did you consider (and discard) the idea of using Caffeine or something similar here instead? Or might that be worth considering? ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java: ########## @@ -96,19 +166,51 @@ public MemoryCircuitBreaker setThreshold(double thresholdValueInPercentage) { @Override public boolean isTripped() { - - long localAllowedMemory = getCurrentMemoryThreshold(); + long localAllowedMemory = heapMemoryThreshold; long localSeenMemory = getAvgMemoryUsage(); allowedMemory.set(localAllowedMemory); - seenMemory.set(localSeenMemory); - return (localSeenMemory >= localAllowedMemory); + return localSeenMemory >= localAllowedMemory; } + /** + * Returns post-GC live bytes for use by {@link #isTripped()}, cached for {@link + * CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS} ms so high-QPS callers don't repeatedly walk the + * heap pool list. + * + * <p>The historical name is preserved for source-compatibility. Tests that need to inject a + * synthetic value typically override this method directly (which bypasses the cache); tests that + * want to feed a synthetic sample <em>through</em> the cache should override {@link + * #samplePostGcLiveBytes()} instead. The implementation no longer averages anything. + */ Review Comment: [0] Can you expand a bit more on the rationale for retaining the now-incorrect name? I understand the bit about having tests that override this method, but seemingly those tests could be updated if we decided to change the name of this method? ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java: ########## @@ -17,75 +17,145 @@ package org.apache.solr.util.circuitbreaker; -import java.io.IOException; import java.lang.invoke.MethodHandles; import java.lang.management.ManagementFactory; import java.lang.management.MemoryMXBean; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryType; +import java.lang.management.MemoryUsage; +import java.util.ArrayList; +import java.util.List; import java.util.Locale; -import org.apache.solr.util.RefCounted; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.regex.Pattern; +import org.apache.solr.common.util.EnvUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; /** - * Tracks the current JVM heap usage and triggers if a moving heap usage average over 30 seconds - * exceeds the defined percentage of the maximum heap size allocated to the JVM. Once the average - * memory usage goes below the threshold, it will start allowing queries again. + * Trips when post-collection live data in the JVM heap exceeds a configured percentage of the + * maximum heap size. * - * <p>The memory threshold is defined as a percentage of the maximum memory allocated -- see - * memThreshold in <code>solrconfig.xml</code>. + * <p>The signal is read from {@link MemoryPoolMXBean#getCollectionUsage()} on the old/tenured heap + * pool, which reports memory usage immediately after the most recent collection that affected that + * pool. This is the only memory reading that distinguishes "live data" from "garbage waiting to be + * collected." + * + * <p>Earlier versions of this breaker sampled {@link MemoryMXBean#getHeapMemoryUsage()} on a + * 30-second moving average, which produced a high signal during normal operation: with a + * generational collector, {@code used} climbs toward {@code max} between collections — that's the + * steady-state shape, not a problem. The new signal updates only when an old-gen GC runs, which is + * the only point at which "how full is the heap really?" has a defined answer. + * + * <p>Pool selection by collector: + * + * <ul> + * <li><b>G1 / Parallel / Serial / generational ZGC:</b> uses the pool whose name matches the Review Comment: [Q] Can you provide a bit more context on this and the stuff at L75-L112 below? Specifically: what value does this differentiation serve? MemoryCircuitBreaker was agnostic of pool and GC-algorithm prior to this PR and there's nothing in SOLR-18284 that points to that being an issue. Is this a fix for an additional bug that you found while working on this PR? If so, is that additional bug documented somewhere? ########## solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java: ########## @@ -17,75 +17,145 @@ package org.apache.solr.util.circuitbreaker; -import java.io.IOException; import java.lang.invoke.MethodHandles; import java.lang.management.ManagementFactory; import java.lang.management.MemoryMXBean; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryType; +import java.lang.management.MemoryUsage; +import java.util.ArrayList; +import java.util.List; import java.util.Locale; -import org.apache.solr.util.RefCounted; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.regex.Pattern; +import org.apache.solr.common.util.EnvUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; /** - * Tracks the current JVM heap usage and triggers if a moving heap usage average over 30 seconds - * exceeds the defined percentage of the maximum heap size allocated to the JVM. Once the average - * memory usage goes below the threshold, it will start allowing queries again. + * Trips when post-collection live data in the JVM heap exceeds a configured percentage of the Review Comment: [0] "collection" is a pretty overloaded term in Solr; consider rephrasing to indicate garbage-collection more explicitly ########## solr/core/src/test/org/apache/solr/util/circuitbreaker/TestHeapPressureCircuitBreakers.java: ########## @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryType; +import java.util.concurrent.atomic.AtomicInteger; +import org.apache.solr.SolrTestCase; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +/** + * Verifies that {@link MemoryCircuitBreaker} reads post-GC live data — not the raw, garbage- + * inflated heap usage — so it does not trip on transient pre-collection peaks that GC will reclaim. + */ +public class TestHeapPressureCircuitBreakers extends SolrTestCase { + + @Before + public void setUpProps() { + System.setProperty(CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS, "0"); + } + + @After + public void tearDownProps() { + System.clearProperty(CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS); + } + + @Test + public void memoryBreakerReadsRealHeapPools() { + // Sanity: the in-process JVM exposes at least one heap pool. samplePostGcLiveBytes() must + // return a non-negative value computed from real pool data, regardless of which collector + // is active. (Pre-first-GC, getCollectionUsage() can be zero on every pool — accept that.) + boolean hasHeapPool = false; + for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) { + if (pool.getType() == MemoryType.HEAP) { + hasHeapPool = true; + break; + } + } + assertTrue("JVM must expose at least one HEAP MemoryPoolMXBean", hasHeapPool); + + long bytes = new MemoryCircuitBreaker().samplePostGcLiveBytes(); + assertTrue("post-GC live bytes must be non-negative, got " + bytes, bytes >= 0); + } + + @Test + public void memoryBreakerTripsWhenOverThreshold() { + MemoryCircuitBreaker breaker = + new MemoryCircuitBreaker() { + @Override + protected long getAvgMemoryUsage() { + return Long.MAX_VALUE; // simulate "heap is exhausted post-GC" + } + }; + breaker.setThreshold(50.0); + assertTrue(breaker.isTripped()); + } + + @Test + public void memoryBreakerDoesNotTripWhenUnderThreshold() { + MemoryCircuitBreaker breaker = + new MemoryCircuitBreaker() { + @Override + protected long getAvgMemoryUsage() { + return 0L; // simulate "post-GC live data is empty" + } + }; + breaker.setThreshold(50.0); + assertFalse(breaker.isTripped()); + } + + @Test + public void samplePostGcLiveBytesIsCachedThroughGetAvgMemoryUsage() { + // With a long TTL, getAvgMemoryUsage() should consult samplePostGcLiveBytes() at most once + // across many isTripped() invocations — i.e. the cache wraps the overridable hook, not just + // the static heap-pool walk. + System.setProperty(CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS, "60000"); + try { + AtomicInteger calls = new AtomicInteger(); + MemoryCircuitBreaker breaker = + new MemoryCircuitBreaker() { + @Override + protected long samplePostGcLiveBytes() { + calls.incrementAndGet(); + return 0L; + } + }; + breaker.setThreshold(50.0); + for (int i = 0; i < 50; i++) { + breaker.isTripped(); + } + assertEquals( + "samplePostGcLiveBytes() must be invoked at most once per TTL window", 1, calls.get()); + } finally { + System.setProperty(CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS, "0"); Review Comment: [0] This can probably be dropped given the `@After` method above that clears this between test methods (and the test-rule in `SolrTestCase` that clears it between test-classes). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
