Thanks so much for sharing a thorough analysis! I'm sure we'll figure out something for MMapDirectory eventually. My team plans to look at it more closely if nobody beats us to it. Having a benchmark is the most important next step, so that any proposed solution can be verified rather than just hoped for.
Uwe recommended[1] a possible workaround: set
org.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=16

[1]: https://lists.apache.org/thread/732pzogtmmq9gmm9bztws7o06c5c1b0j

RE commits -- I kind of wish that in SolrCloud, commits (sent from the outside) were soft by default. But it's generally not a big issue, since most apps shouldn't be sending commits frequently. Tests are the exception.

~ David

On Mon, Mar 30, 2026 at 10:06 AM Bram Luyten <[email protected]> wrote:
> > Thank you David, > > Mystery, for sure. Fun ... depends on your definition I guess ;) > > TL;DR We were able to get the end-to-end duration of our test suite > back down to baseline. The root cause of the majority of the > regression turned out to be in our application's ORM layer > (Hibernate 7 upgrade), not Solr itself. However, the Solr-related > investigation uncovered real performance observations around > MMapDirectory, commit overhead, and soft vs hard commits on Solr 10 > that we wanted to share in case they're useful to others. > > *ROOT CAUSE: NOT SOLR (FOR THE MOST PART)* > > Our migration upgraded both Solr (8 -> 10) and Hibernate (6 -> 7) > simultaneously, which made attribution difficult. After detailed > profiling, we found that a Hibernate 7 behavioral change around > auto-flushing with JOINED inheritance tables was responsible for > the majority of the regression (116 min -> ~27 min after the fix). > The Solr-related findings below are independent observations that > contribute to the remaining difference between our Solr 8 and > Solr 10 test suite timings, though we have not isolated exactly > how much of the remaining ~8 minute gap they explain versus other > framework changes in our upgrade (Hibernate 7, JUnit 6, Jackson 3, > Spring 7). > > *FINDING 1: MMAPDIRECTORY PANAMA MEMORYSEGMENT OVERHEAD* > > After ruling out Jetty, we used JFR analysis to find that the SB4 > branch generates 3,906 HandshakeAllThreads VM operations in 61 > seconds vs 65 on main. 
Using -Xlog:handshake*=trace we identified > the handshake type: CloseScopedMemory, with 163,637 operations in > 74 seconds (2,211/sec). > > This is caused by Lucene 10's MMapDirectory using Panama > MemorySegment instead of the old MappedByteBuffer approach. Each > Arena.close() triggers a CloseScopedMemory handshake that > deoptimizes the top frame of every JVM thread. This is documented > in: > > https://github.com/apache/lucene/issues/13325 > https://bugs.openjdk.org/browse/JDK-8335480 (fix in JDK 24) > > This was also reported on the Solr side: > https://issues.apache.org/jira/browse/SOLR-17375 > > The suggested workaround there is to set > -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false, > which works on Lucene 9. However, this option was removed in > Lucene 10: the Panama MemorySegment implementation is now the only > MMapDirectory code path, with no fallback and no disable flag. > > The key difference across Solr versions: > - Solr 8 / Lucene 8: MappedByteBuffer + Unsafe.invokeCleaner(), > no handshakes at all > - Solr 9 / Lucene 9: Panama MemorySegment, but > -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false > can disable it (as suggested in SOLR-17375) > - Solr 10 / Lucene 10: Panama is mandatory, the > enableMemorySegments flag no longer exists > > Our workaround for Solr 10: set > -Dsolr.directoryFactory=solr.NIOFSDirectoryFactory in the test > JVM args. This avoids MMap entirely and reduced CloseScopedMemory > from 163,637 to 12,379 operations (13x fewer). In CI, this > brought our IT suite from 116 minutes to 109 minutes, a modest > 6% improvement. Locally (macOS, 10 cores) the difference was > similarly small (74s to 71s for a single test class), because > the handshake coordination is cheap with many cores. The remaining > 12k CloseScopedMemory events likely come from Lucene internals > that still use MemorySegment regardless of the directory factory. 
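[Editor's note] For convenience, the Finding 1 mitigations and the diagnostic logging mentioned above can be collected as JVM/test options. This is only a sketch: the flag names are taken from this thread (including Uwe's sharedArenaMaxPermits suggestion later in it) and should be verified against your Solr/Lucene version.

```shell
# Sketch of test-JVM options, collecting what is discussed in this thread.

# Workaround used for Solr 10: avoid MMapDirectory entirely.
SOLR_TEST_OPTS="-Dsolr.directoryFactory=solr.NIOFSDirectoryFactory"

# Alternative suggested workaround: cap the shared-arena permits instead.
# SOLR_TEST_OPTS="-Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=16"

# Diagnostics used above to identify the CloseScopedMemory handshakes.
DIAG_OPTS="-Xlog:handshake*=trace"

echo "$SOLR_TEST_OPTS $DIAG_OPTS"
```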
> > *FINDING 2: HARD COMMITS ARE 25% SLOWER ON LUCENE 10* > > We instrumented SolrServiceImpl.commit() with System.nanoTime() > timing around the SolrClient.commit() call and ran the same test class > (DiscoveryRestControllerIT, 83 tests, ~2000 Solr commits) on both > branches: > > Main (Solr 8 / Lucene 8.11.4): > 1,972 commits, total 24,600 ms, avg 12 ms/commit > > SB4 (Solr 10 / Lucene 10.3.2): > 1,966 commits, total 30,180 ms, avg 15 ms/commit > > That is a 25% increase per commit, and commits account for 43% > of the total test class runtime on both branches. With ~2000 > commits per test class and hundreds of test classes in the full > IT suite, this adds up to tens of minutes. > > We believe this is at least partly related to the lucene103 codec > TrieReader regression documented in: > > https://github.com/apache/lucene/issues/15820 > > which reports a 4x slower seekExact for _id lookups compared to > the lucene90 FST approach. Every solr.add() does an ID lookup > for deduplication, and the hard commit path triggers segment > reopening which exercises this code path. > > *FINDING 3: SOFT COMMITS ELIMINATE THE OVERHEAD* > > Our DSpace codebase was calling solr.commit() (hard commit) > everywhere, despite having autoCommit configured in solrconfig.xml > with maxTime=10000 and openSearcher=true. This meant every > explicit commit was doing an fsync + opening a new IndexSearcher, > which is redundant when autoCommit handles durability on a > 10-second schedule. > > We switched all solr.commit() calls to solr.commit(true, true, true) > (soft commit: waitFlush=true, waitSearcher=true, softCommit=true). > > Results for DiscoveryRestControllerIT (83 tests): > > Main, hard commit (Solr 8): 57.48s > SB4, hard commit (Solr 10): 70.05s (+22%) > SB4, soft commit (Solr 10): 42.94s (-25% vs main!) > > Soft commits on Solr 10 are faster than hard commits on Solr 8. 
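[Editor's note] For anyone reproducing the measurement, the System.nanoTime() instrumentation described in Finding 2 can be sketched roughly as below. The class and method names here are hypothetical, not DSpace's actual SolrServiceImpl code; in practice the wrapped action would be the SolrClient.commit() call.

```java
// Hypothetical sketch of the nanoTime() commit instrumentation described
// in Finding 2; names are illustrative, not DSpace's actual code.
import java.util.concurrent.atomic.AtomicLong;

public class CommitTimer {
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong commits = new AtomicLong();

    // Wrap a commit-like action and accumulate its wall-clock duration.
    public void timed(Runnable commitAction) {
        long start = System.nanoTime();
        try {
            commitAction.run();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            commits.incrementAndGet();
        }
    }

    public long count() {
        return commits.get();
    }

    public long avgMillis() {
        long n = commits.get();
        return n == 0 ? 0 : totalNanos.get() / n / 1_000_000;
    }
}
```

With ~2,000 timed commits per test class, count() and avgMillis() give the commit counts and per-commit averages of the kind reported above; the soft-commit switch itself is the SolrJ overload commit(waitFlush, waitSearcher, softCommit) already quoted in Finding 3.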
> We filed this as a separate improvement for DSpace since it > benefits all versions: > > https://github.com/DSpace/DSpace/issues/12153 > https://github.com/DSpace/DSpace/pull/12154 > > Note: when switching to soft commits in tests, we did encounter > a subtle behavioral difference with EmbeddedSolrServer on Solr 10: > deleteByQuery("*:*") followed by a soft commit does not reliably > make the deletes visible to NRT readers. We had to add an explicit > hard commit after deleteByQuery in our test cleanup to ensure > proper test isolation. This was not an issue with Solr 8. > > *SUMMARY* > > Our embedded Solr 10 test suite (6 cores, ~2000 commits per test > class, JDK 21) went from 19 minutes (Solr 8 baseline) to 116 > minutes after upgrading both Solr and Hibernate simultaneously. > > After investigation, the regression broke down as: > > 1. Hibernate 7 auto-flush behavioral change: 116 -> ~27 min > (the dominant factor, not Solr-related) > 2. NIOFSDirectoryFactory to avoid Panama MMap overhead: ~6% gain > 3. Soft commits instead of hard commits: eliminated the per-class > Solr 10 vs Solr 8 gap entirely > > The final CI result after all fixes: 27 minutes for the full > 3,400+ integration test suite, which is close to the Solr 8 > baseline of 19 minutes. The remaining gap (~8 min) is likely > a combination of the Lucene 10 codec overhead and the general > overhead of EmbeddedSolrServer on Solr 10. > > If any of this analysis seems off or if there are better > approaches we should consider, we would welcome the feedback. > > Best regards, > Bram Luyten > > On Wed, Mar 25, 2026 at 2:27 AM David Smiley <[email protected]> wrote: > > > No problem. Good luck to you Bram! Looks like a fun mystery. > > > > On Tue, Mar 24, 2026 at 11:34 AM Bram Luyten <[email protected]> > > wrote: > > > > > > Hi David, > > > > > > Thank you for the detailed response. 
I owe you an apology: after > > > re-examining our data based on your feedback, the wall-clock profiling > > > led us to an incorrect attribution. Sorry for the noise. > > > > > > You were right to question the wall-clock numbers for background > > > threads. When we re-checked with CPU profiling (async-profiler -e cpu), > > > AdaptiveExecutionStrategy.produce() shows exactly 0 CPU samples on the > > > Solr 10 branch. The selector thread is idle, not busy-polling. > > > Wall-clock profiling inflated it because it samples all threads > > > regardless of state. Total CPU samples are nearly identical between > > > branches (17,519 vs 17,119), same distribution. > > > > > > To answer your question: we create exactly one CoreContainer for the > > > entire test suite, held as a static singleton with 6 cores. Between > > > tests we clear data via deleteByQuery + commit, but the container stays > > > alive for the full JVM lifetime. So the "lots of CoreContainers" > > > scenario does not apply here. > > > > > > Given identical CPU profiles and zero Jetty CPU samples, the Solr path > > > is almost certainly not our bottleneck. We will look elsewhere. I don't > > > think the SolrStartup benchmark would be productive at this point. > > > > > > Again, apologies for the false alarm, and thank you for steering us in > > > the right direction. > > > > > > Best regards, > > > Bram Luyten > > > > > > On Tue, Mar 24, 2026 at 2:43 PM David Smiley <[email protected]> wrote: > > > > > > > Hello Bram, > > > > > > > > Some of what you are sharing confuses me. I don't think sharing the > > > > wall-clock-time is pertinent for background threads -- and I assume > > > > those Jetty HttpClients are in the background doing nothing. Yes, > > > > CoreContainer creates a Jetty HttpClient that is unused in an embedded > > > > mode. Curious; are you creating lots of CoreContainers (perhaps > > > > indirectly via creating EmbeddedSolrServer)? Maybe we have a > > > > regression there. 
I suspect a test environment would be doing this, > > > > creating a CoreContainer for each test, basically. Solr's tests do > > > > this too! And a slowdown as big as you show sounds like something > > > > we'd notice... most likely. On the other hand, if your CI/tests > > > > creates very few CoreContainers and there's all this slowdown you > > > > report, then CoreContainer startup is mostly irrelevant. > > > > > > > > We do have a benchmark that should capture a slowdown in this area -- > > > > > > > > > > https://github.com/apache/solr/blob/9c911e7337cd1026accc1a825e26906039982328/solr/benchmark/src/java/org/apache/solr/bench/lifecycle/SolrStartup.java > > > > (scope is a bit larger but good enough) but we don't have continuous > > > > benchmarking over releases to make relative comparisons. We've been > > > > talking about that, but the recent discussions are unlikely to support > > > > a way to do this for embedded Solr. I've been working on this > > > > benchmark code lately as well. *Anyway*, I recommend that you try > > > > this benchmark, starting with its great README, mostly documenting JMH > > > > itself. If you do that and find some curious/suspicious things, I'd > > > > love to hear more! > > > > > > > > On Tue, Mar 24, 2026 at 3:51 AM Bram Luyten <[email protected]> > > > > wrote: > > > > > > > > > > Hi all, > > > > > > > > > > Disclaimer: I am a DSpace developer, not a Solr/Jetty internals > > > > > expert. Much of the profiling and analysis below was done with heavy > > > > > assistance from Claude. I'm sharing this because the data seems > > > > > significant, > > > > > but I may be misinterpreting some of it. Corrections and guidance are > > > > very > > > > > welcome. > > > > > > > > > > > > > > > CONTEXT > > > > > --------------- > > > > > > > > > > We are upgrading DSpace (open-source repository software) from > > > > > Spring Boot 3 / Solr 8 to Spring Boot 4 / Solr 10. 
Our integration > > > > > test suite uses embedded Solr via solr-core as a test dependency > > > > > (EmbeddedSolrServer style, no HTTP traffic -- everything is > > > > > in-process in a single JVM). > > > > > > > > > > After the upgrade, our IT suite went from ~31 minutes to ~2 hours > > > > > in CI. We spent considerable time profiling and eliminating other > > > > > causes (Hibernate 7, Spring 7, H2 database, GC, lock contention, > > > > > caching). Wall-clock profiling with async-profiler ultimately > > > > > pointed to embedded Solr as the primary bottleneck. > > > > > > > > > > Note: we previously reported the Solr 10 POM issue with missing > > > > > Jackson 2 dependency versions (solr-core, solr-solrj, solr-api). > > > > > We have the workaround in place (explicit dependency declarations), > > > > > so the embedded Solr 10 has a complete classpath. > > > > > > > > > > > > > > > THE PROBLEM > > > > > ---------------------- > > > > > > > > > > Wall-clock profiling (async-profiler -e wall) of the same test class > > > > > (DiscoveryRestControllerIT, 83 tests) on both branches shows: > > > > > > > > > > Component Main (Solr 8) SB4 (Solr 10) Difference > > > > > ---------------------------------------------------------------- > > > > > Solr total 3.6s 11.5s +7.9s > > > > > Hibernate 0.2s 0.2s 0.0s > > > > > H2 Database 0.1s 0.1s 0.0s > > > > > Spring 0.1s 0.1s 0.0s > > > > > Test total 68.4s 84.3s +15.9s > > > > > > > > > > Solr accounts for 50% of the total wall-clock difference (7.9s out > > > > > of 15.9s). Hibernate, H2, and Spring are essentially unchanged. 
> > > > > > > > > > > > > > > THE ROOT CAUSE > > > > > --------------------------- > > > > > > > > > > Breaking down the Solr wall-clock time by operation: > > > > > > > > > > Operation Main SB4 > > > > > --------------------------------------------------------------- > > > > > Jetty EatWhatYouKill.produce() 2558 (58%) -- > > > > > Jetty AdaptiveExecutionStrategy.produce() -- 12786 (91%) > > > > > DirectUpdateHandler2.commit() 522 (12%) 707 (5%) > > > > > SpellChecker.newSearcher() 119 (3%) 261 (2%) > > > > > > > > > > (Numbers are async-profiler wall-clock samples) > > > > > > > > > > The dominant operation is Jetty's NIO selector execution strategy: > > > > > > > > > > - Solr 8 / Jetty 9: EatWhatYouKill.produce(): 2558 samples (58%) > > > > > - Solr 10 / Jetty 12: AdaptiveExecutionStrategy.produce(): 12786 > > > > samples > > > > > (91%) > > > > > - That is a 5x increase in wall-clock time > > > > > > > > > > The full stack trace shows: > > > > > > > > > > ThreadPoolExecutor > > > > > -> MDCAwareThreadPoolExecutor > > > > > -> ManagedSelector (Jetty NIO selector) > > > > > -> AdaptiveExecutionStrategy.produce() > > > > > -> AdaptiveExecutionStrategy.tryProduce() > > > > > -> AdaptiveExecutionStrategy.produceTask() > > > > > -> ... -> KQueue.poll (macOS NIO) > > > > > > > > > > This is the Jetty HTTP client's NIO event loop. Even though we use > > > > > EmbeddedSolrServer (no HTTP traffic), Solr 10's CoreContainer > > > > > appears to create an internal Jetty HTTP client (likely for > > > > > inter-shard communication via HttpJettySolrClient). In embedded > > > > > single-node mode, this client has no work to do, but its NIO > > > > > selector thread still runs, and AdaptiveExecutionStrategy.produce() > > > > > idles much less efficiently than Jetty 9's EatWhatYouKill did. > > > > > > > > > > On macOS this manifests as busy-polling in KQueue.poll. The impact > > > > > may differ on Linux (epoll). 
> > > > > > > > > > > > > > > PROFILING METHODOLOGY > > > > > ----------------------------------------- > > > > > > > > > > - Tool: async-profiler 4.3 (wall-clock mode, safepoint-free) > > > > > - JDK: OpenJDK 21.0.9 > > > > > - Both branches use the same H2 2.4.240 test database > > > > > - Both branches use the same test code and Solr schema/config > > > > > - The only Solr-related difference is the Solr version (8.11.4 vs > > > > 10.0.0) > > > > > - Profiling was done on macOS (Apple Silicon), but the CI slowdown > > > > > (GitHub Actions, Ubuntu) shows the same pattern at larger scale > > > > > > > > > > > > > > > WHAT WE RULED OUT > > > > > --------------------------------- > > > > > > > > > > Before identifying the Solr/Jetty issue, we investigated and ruled > > > > > out many other causes: > > > > > > > > > > - Hibernate 7 overhead: SQL query count is similar (fewer on SB4), > > > > > query execution time is <40ms total for 1400+ queries > > > > > - H2 database: same version (2.4.240) on both branches, negligible > > > > > wall-clock difference > > > > > - GC pauses: only +0.7s extra on SB4 (1.4% of total difference) > > > > > - Lock contention: main actually has MORE lock contention than SB4 > > > > > - Hibernate session.clear(): tested with/without, no effect > > > > > - JaCoCo coverage: tested with/without, no effect > > > > > - Hibernate caching (L2, query cache): disabled both, no effect > > > > > - Hibernate batch fetch size: tested, no effect > > > > > > > > > > > > > > > QUESTIONS FOR THE SOLR TEAM > > > > > -------------------------------------------------- > > > > > > > > > > 1. Does embedded mode (EmbeddedSolrServer / CoreContainer without > > > > > an HTTP listener) need to create a Jetty HTTP client at all? > > > > > If the client is only for shard-to-shard communication, it > > > > > seems unnecessary in single-node embedded testing. > > > > > > > > > > 2. 
If the HTTP client is required, can its NIO selector / thread > > > > > pool be configured with minimal resources for embedded mode? > > > > > (e.g., fewer selector threads, smaller thread pool, or an > > > > > idle-friendly execution strategy) > > > > > > > > > > 3. Is there a Solr configuration (solr.xml property, system > > > > > property, or CoreContainer API) that we can use from the > > > > > consuming application to reduce this overhead? > > > > > > > > > > 4. Is this specific to macOS (KQueue) or does it also affect > > > > > Linux (epoll)? Our CI runs on Ubuntu and shows a larger > > > > > slowdown (3.8x) than local macOS (1.28x), which could be > > > > > related. > > > > > > > > > > ENVIRONMENT > > > > > ----------------------- > > > > > > > > > > Solr: 10.0.0 (solr-core as test dependency for embedded server) > > > > > Jetty: 12.0.x (pulled in transitively by Solr 10) > > > > > JDK: 21 > > > > > OS: macOS (profiled), Ubuntu (CI where the 4x slowdown manifests) > > > > > Project: DSpace (https://github.com/DSpace/DSpace) > > > > > PR: https://github.com/DSpace/DSpace/pull/11810 > > > > > > > > > > Happy to provide the full async-profiler flame graph files or > > > > > additional profiling data if useful. > > > > > > > > > > Thanks, > > > > > Bram Luyten, Atmire > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: [email protected] > > > > For additional commands, e-mail: [email protected]
