Thanks so much for sharing a thorough analysis!

I'm sure we'll figure out something for MMapDirectory eventually.  My
team plans to look at it more closely if nobody beats us to it.  Having
a benchmark is the most important next step, so that any proposed
solution can be verified rather than just hoped for.

Uwe recommended[1] a possible work-around: set
org.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=16

[1]: https://lists.apache.org/thread/732pzogtmmq9gmm9bztws7o06c5c1b0j
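For anyone wanting to try that work-around programmatically: the property
has to be in place before the first MMapDirectory (and its shared Arena) is
created, so JVM args are the safest spot, but setting it at the very start
of test bootstrap should work too. A minimal sketch (property name from
Uwe's mail; 16 is his suggested value, not a tuned number, and the class
name is just illustrative):

```java
public class SolrTestBootstrap {

    // Must run before any index directory is opened; property name and
    // value are from Uwe's suggested work-around, not an official knob.
    static {
        System.setProperty(
                "org.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits",
                "16");
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty(
                "org.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits"));
        // prints "16"
    }
}
```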

RE commits -- I kind of wish that in SolrCloud, commits (sent from the
outside) were soft by default.  But it's generally not a big issue
since most apps shouldn't be sending a commit frequently.  Tests are
the exception.

~ David


On Mon, Mar 30, 2026 at 10:06 AM Bram Luyten <[email protected]> wrote:
>
> Thank you David,
>
> Mystery, for sure. Fun ... depends on your definition I guess ;)
>
> TL;DR We were able to get the end-to-end duration of our test suite
> back down to baseline. The root cause of the majority of the
> regression turned out to be in our application's ORM layer
> (Hibernate 7 upgrade), not Solr itself. However, the Solr-related
> investigation uncovered real performance observations around
> MMapDirectory, commit overhead, and soft vs hard commits on Solr 10
> that we wanted to share in case they're useful to others.
>
> *ROOT CAUSE: NOT SOLR (FOR THE MOST PART)*
>
> Our migration upgraded both Solr (8 -> 10) and Hibernate (6 -> 7)
> simultaneously, which made attribution difficult. After detailed
> profiling, we found that a Hibernate 7 behavioral change around
> auto-flushing with JOINED inheritance tables was responsible for
> the majority of the regression (116 min -> ~27 min after the fix).
> The Solr-related findings below are independent observations that
> contribute to the remaining difference between our Solr 8 and
> Solr 10 test suite timings, though we have not isolated exactly
> how much of the remaining ~8 minute gap they explain versus other
> framework changes in our upgrade (Hibernate 7, JUnit 6, Jackson 3,
> Spring 7).
>
> *FINDING 1: MMAPDIRECTORY PANAMA MEMORYSEGMENT OVERHEAD*
>
> After ruling out Jetty, JFR analysis showed that the SB4 branch
> generates 3,906 HandshakeAllThreads VM operations in 61 seconds,
> versus only 65 such operations on main. Using -Xlog:handshake*=trace
> we identified the handshake type: CloseScopedMemory, with 163,637
> operations in 74 seconds (2,211/sec).
>
> This is caused by Lucene 10's MMapDirectory using Panama
> MemorySegment instead of the old MappedByteBuffer approach. Each
> Arena.close() triggers a CloseScopedMemory handshake that
> deoptimizes the top frame of every JVM thread. This is documented
> in:
>
> https://github.com/apache/lucene/issues/13325
> https://bugs.openjdk.org/browse/JDK-8335480 (fix in JDK 24)
>
> This was also reported on the Solr side:
> https://issues.apache.org/jira/browse/SOLR-17375
>
> The suggested workaround there is to set
> -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false,
> which works on Lucene 9. However, this option was removed in
> Lucene 10: the Panama MemorySegment implementation is now the only
> MMapDirectory code path, with no fallback and no disable flag.
>
> The key difference across Solr versions:
> - Solr 8 / Lucene 8: MappedByteBuffer + Unsafe.invokeCleaner(),
>   no handshakes at all
> - Solr 9 / Lucene 9: Panama MemorySegment, but
>   -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
>   can disable it (as suggested in SOLR-17375)
> - Solr 10 / Lucene 10: Panama is mandatory, the
>   enableMemorySegments flag no longer exists
>
> Our workaround for Solr 10: set
> -Dsolr.directoryFactory=solr.NIOFSDirectoryFactory in the test
> JVM args. This avoids MMap entirely and reduced CloseScopedMemory
> from 163,637 to 12,379 operations (13x fewer). In CI, this
> brought our IT suite from 116 minutes to 109 minutes, a modest
> 6% improvement. Locally (macOS, 10 cores) the difference was
> similarly small (74s to 71s for a single test class), because
> the handshake coordination is cheap with many cores. The remaining
> 12k CloseScopedMemory events likely come from Lucene internals
> that still use MemorySegment regardless of the directory factory.
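> For completeness, the same override can be applied programmatically
> before the first CoreContainer is created, since the stock
> solrconfig.xml resolves the factory class through the
> solr.directoryFactory property substitution. A sketch (the class name
> is ours, not a Solr API):

```java
public class DirectoryFactoryOverride {

    // solrconfig.xml typically declares
    //   <directoryFactory class="${solr.directoryFactory:...}"/>
    // so this property must be set before any core is loaded.
    public static void applyNioWorkaround() {
        System.setProperty("solr.directoryFactory", "solr.NIOFSDirectoryFactory");
    }

    public static void main(String[] args) {
        applyNioWorkaround();
        System.out.println(System.getProperty("solr.directoryFactory"));
        // prints "solr.NIOFSDirectoryFactory"
    }
}
```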
>
> *FINDING 2: HARD COMMITS ARE 25% SLOWER ON LUCENE 10*
>
> We instrumented SolrServiceImpl.commit() with System.nanoTime()
> timing around the SolrClient.commit() call and ran the same test class
> (DiscoveryRestControllerIT, 83 tests, ~2000 Solr commits) on both
> branches:
>
>   Main (Solr 8 / Lucene 8.11.4):
>     1,972 commits, total 24,600 ms, avg 12 ms/commit
>
>   SB4 (Solr 10 / Lucene 10.3.2):
>     1,966 commits, total 30,180 ms, avg 15 ms/commit
>
> That is a 25% increase per commit, and commits account for 43%
> of the total test class runtime on both branches. With ~2000
> commits per test class and hundreds of test classes in the full
> IT suite, this adds up to tens of minutes.
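> For reference, the instrumentation was nothing fancier than a shared
> nanoTime accumulator around the commit call; a stdlib-only sketch of
> the pattern (names are ours, the real code wraps SolrClient.commit()):

```java
import java.util.concurrent.atomic.AtomicLong;

// Accumulates commit count and total wall time across a test class,
// the way our SolrServiceImpl.commit() instrumentation did.
public class CommitTimer {
    private static final AtomicLong COUNT = new AtomicLong();
    private static final AtomicLong NANOS = new AtomicLong();

    public static void timed(Runnable commit) {
        long start = System.nanoTime();
        try {
            commit.run(); // e.g. () -> solrClient.commit()
        } finally {
            COUNT.incrementAndGet();
            NANOS.addAndGet(System.nanoTime() - start);
        }
    }

    public static String report() {
        long count = COUNT.get();
        long ms = NANOS.get() / 1_000_000;
        return count + " commits, total " + ms + " ms, avg "
                + (count == 0 ? 0 : ms / count) + " ms/commit";
    }

    public static void main(String[] args) {
        timed(() -> { });
        System.out.println(report());
    }
}
```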
>
> We believe this is at least partly related to the lucene103 codec
> TrieReader regression documented in:
>
>  https://github.com/apache/lucene/issues/15820
>
> which reports a 4x slower seekExact for _id lookups compared to
> the lucene90 FST approach. Every solr.add() does an ID lookup
> for deduplication, and the hard commit path triggers segment
> reopening which exercises this code path.
>
> *FINDING 3: SOFT COMMITS ELIMINATE THE OVERHEAD*
>
> Our DSpace codebase was calling solr.commit() (hard commit)
> everywhere, despite having autoCommit configured in solrconfig.xml
> with maxTime=10000 and openSearcher=true. This meant every
> explicit commit was doing an fsync + opening a new IndexSearcher,
> which is redundant when autoCommit handles durability on a
> 10-second schedule.
>
> We switched all solr.commit() calls to solr.commit(true, true, true)
> (soft commit: waitFlush=true, waitSearcher=true, softCommit=true).
>
> Results for DiscoveryRestControllerIT (83 tests):
>
>   Main, hard commit (Solr 8):  57.48s
>   SB4, hard commit (Solr 10):  70.05s  (+22%)
>   SB4, soft commit (Solr 10):  42.94s  (-25% vs main!)
>
> Soft commits on Solr 10 are faster than hard commits on Solr 8.
> We filed this as a separate improvement for DSpace since it
> benefits all versions:
>
>   https://github.com/DSpace/DSpace/issues/12153
>   https://github.com/DSpace/DSpace/pull/12154
>
> Note: when switching to soft commits in tests, we did encounter
> a subtle behavioral difference with EmbeddedSolrServer on Solr 10:
> deleteByQuery("*:*") followed by a soft commit does not reliably
> make the deletes visible to NRT readers. We had to add an explicit
> hard commit after deleteByQuery in our test cleanup to ensure
> proper test isolation. This was not an issue with Solr 8.
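> To make the two changes concrete, this is roughly the shape of the
> code after both fixes. The sketch uses a minimal stand-in interface so
> it is self-contained; the real calls are SolrClient.commit(waitFlush,
> waitSearcher, softCommit) and SolrClient.deleteByQuery():

```java
import java.util.ArrayList;
import java.util.List;

public class SolrCommitSketch {

    // Minimal stand-in for the two SolrClient methods involved.
    interface Client {
        void deleteByQuery(String q);
        void commit(boolean waitFlush, boolean waitSearcher, boolean softCommit);
    }

    // After the fix: routine per-test commits are soft; autoCommit
    // (maxTime=10000, openSearcher=true in solrconfig.xml) handles fsync.
    static void commitAfterIndexing(Client solr) {
        solr.commit(true, true, true);
    }

    // ...but cleanup still hard-commits: on Solr 10 a soft commit after
    // deleteByQuery("*:*") did not reliably make the deletes visible to
    // NRT readers in our EmbeddedSolrServer setup.
    static void clearIndexBetweenTests(Client solr) {
        solr.deleteByQuery("*:*");
        solr.commit(true, true, false);
    }

    static final List<String> CALLS = new ArrayList<>();

    public static void main(String[] args) {
        Client fake = new Client() {
            public void deleteByQuery(String q) { CALLS.add("delete:" + q); }
            public void commit(boolean f, boolean s, boolean soft) {
                CALLS.add(soft ? "softCommit" : "hardCommit");
            }
        };
        commitAfterIndexing(fake);
        clearIndexBetweenTests(fake);
        System.out.println(CALLS); // [softCommit, delete:*:*, hardCommit]
    }
}
```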
>
> *SUMMARY*
>
> Our embedded Solr 10 test suite (6 cores, ~2000 commits per test
> class, JDK 21) went from 19 minutes (Solr 8 baseline) to 116
> minutes after upgrading both Solr and Hibernate simultaneously.
>
> After investigation, the regression broke down as:
>
> 1. Hibernate 7 auto-flush behavioral change: 116 -> ~27 min
>    (the dominant factor, not Solr-related)
> 2. NIOFSDirectoryFactory to avoid Panama MMap overhead: ~6% gain
> 3. Soft commits instead of hard commits: eliminated the per-class
>    Solr 10 vs Solr 8 gap entirely
>
> The final CI result after all fixes: 27 minutes for the full
> 3,400+ integration test suite, which is close to the Solr 8
> baseline of 19 minutes. The remaining gap (~8 min) is likely
> a combination of the Lucene 10 codec overhead and the general
> overhead of EmbeddedSolrServer on Solr 10.
>
> If any of this analysis seems off or if there are better
> approaches we should consider, we would welcome the feedback.
>
> Best regards,
> Bram Luyten
>
> On Wed, Mar 25, 2026 at 2:27 AM David Smiley <[email protected]> wrote:
>
> > No problem.  Good luck to you Bram!  Looks like a fun mystery.
> >
> > On Tue, Mar 24, 2026 at 11:34 AM Bram Luyten <[email protected]>
> > wrote:
> > >
> > > Hi David,
> > >
> > > Thank you for the detailed response. I owe you an apology: after
> > > re-examining our data based on your feedback, the wall-clock profiling
> > > led us to an incorrect attribution. Sorry for the noise.
> > >
> > > You were right to question the wall-clock numbers for background
> > > threads. When we re-checked with CPU profiling (async-profiler -e cpu),
> > > AdaptiveExecutionStrategy.produce() shows exactly 0 CPU samples on the
> > > Solr 10 branch. The selector thread is idle, not busy-polling.
> > > Wall-clock profiling inflated it because it samples all threads
> > > regardless of state. Total CPU samples are nearly identical between
> > > branches (17,519 vs 17,119), same distribution.
> > >
> > > To answer your question: we create exactly one CoreContainer for the
> > > entire test suite, held as a static singleton with 6 cores. Between
> > > tests we clear data via deleteByQuery + commit, but the container stays
> > > alive for the full JVM lifetime. So the "lots of CoreContainers"
> > > scenario does not apply here.
> > >
> > > Given identical CPU profiles and zero Jetty CPU samples, the Solr path
> > > is almost certainly not our bottleneck. We will look elsewhere. I don't
> > > think the SolrStartup benchmark would be productive at this point.
> > >
> > > Again, apologies for the false alarm, and thank you for steering us in
> > > the right direction.
> > >
> > > Best regards,
> > > Bram Luyten
> > >
> > > On Tue, Mar 24, 2026 at 2:43 PM David Smiley <[email protected]> wrote:
> > >
> > > > Hello Bram,
> > > >
> > > > Some of what you are sharing confuses me.  I don't think sharing the
> > > > wall-clock-time is pertinent for background threads -- and I assume
> > > > those Jetty HttpClients are in the background doing nothing. Yes,
> > > > CoreContainer creates a Jetty HttpClient that is unused in an embedded
> > > > mode.  Curious; are you creating lots of CoreContainers (perhaps
> > > > indirectly via creating EmbeddedSolrServer)?  Maybe we have a
> > > > regression there.  I suspect a test environment would be doing this,
> > > > creating a CoreContainer for each test, basically.  Solr's tests do
> > > > this too!  And a slowdown as big as you show sounds like something
> > > > we'd notice... most likely.  On the other hand, if your CI/tests
> > > > creates very few CoreContainers and there's all this slowdown you
> > > > report, then CoreContainer startup is mostly irrelevant.
> > > >
> > > > We do have a benchmark that should capture a slowdown in this area --
> > > >
> > > >
> > > > https://github.com/apache/solr/blob/9c911e7337cd1026accc1a825e26906039982328/solr/benchmark/src/java/org/apache/solr/bench/lifecycle/SolrStartup.java
> > > > (scope is a bit larger but good enough) but we don't have continuous
> > > > benchmarking over releases to make relative comparisons.  We've been
> > > > talking about that, but the recent discussions are unlikely to support
> > > > a way to do this for embedded Solr.  I've been working on this
> > > > benchmark code lately as well.  *Anyway*, I recommend that you try
> > > > this benchmark, starting with its great README, mostly documenting JMH
> > > > itself.  If you do that and find some curious/suspicious things, I'd
> > > > love to hear more!
> > > >
> > > > On Tue, Mar 24, 2026 at 3:51 AM Bram Luyten <[email protected]>
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Disclaimer: I am a DSpace developer, not a Solr/Jetty internals
> > > > > expert. Much of the profiling and analysis below was done with heavy
> > > > > assistance from Claude. I'm sharing this because the data seems
> > > > > significant, but I may be misinterpreting some of it. Corrections
> > > > > and guidance are very welcome.
> > > > >
> > > > >
> > > > > CONTEXT
> > > > > ---------------
> > > > >
> > > > > We are upgrading DSpace (open-source repository software) from
> > > > > Spring Boot 3 / Solr 8 to Spring Boot 4 / Solr 10. Our integration
> > > > > test suite uses embedded Solr via solr-core as a test dependency
> > > > > (EmbeddedSolrServer style, no HTTP traffic -- everything is
> > > > > in-process in a single JVM).
> > > > >
> > > > > After the upgrade, our IT suite went from ~31 minutes to ~2 hours
> > > > > in CI. We spent considerable time profiling and eliminating other
> > > > > causes (Hibernate 7, Spring 7, H2 database, GC, lock contention,
> > > > > caching). Wall-clock profiling with async-profiler ultimately
> > > > > pointed to embedded Solr as the primary bottleneck.
> > > > >
> > > > > Note: we previously reported the Solr 10 POM issue with missing
> > > > > Jackson 2 dependency versions (solr-core, solr-solrj, solr-api).
> > > > > We have the workaround in place (explicit dependency declarations),
> > > > > so the embedded Solr 10 has a complete classpath.
> > > > >
> > > > >
> > > > > THE PROBLEM
> > > > > ----------------------
> > > > >
> > > > > Wall-clock profiling (async-profiler -e wall) of the same test class
> > > > > (DiscoveryRestControllerIT, 83 tests) on both branches shows:
> > > > >
> > > > >   Component        Main (Solr 8)    SB4 (Solr 10)    Difference
> > > > >   ----------------------------------------------------------------
> > > > >   Solr total          3.6s            11.5s            +7.9s
> > > > >   Hibernate           0.2s             0.2s             0.0s
> > > > >   H2 Database         0.1s             0.1s             0.0s
> > > > >   Spring              0.1s             0.1s             0.0s
> > > > >   Test total         68.4s            84.3s           +15.9s
> > > > >
> > > > > Solr accounts for 50% of the total wall-clock difference (7.9s out
> > > > > of 15.9s). Hibernate, H2, and Spring are essentially unchanged.
> > > > >
> > > > >
> > > > > THE ROOT CAUSE
> > > > > ---------------------------
> > > > >
> > > > > Breaking down the Solr wall-clock time by operation:
> > > > >
> > > > >   Operation                                    Main        SB4
> > > > >   ---------------------------------------------------------------
> > > > >   Jetty EatWhatYouKill.produce()              2558 (58%)     --
> > > > >   Jetty AdaptiveExecutionStrategy.produce()     --        12786 (91%)
> > > > >   DirectUpdateHandler2.commit()                522 (12%)    707  (5%)
> > > > >   SpellChecker.newSearcher()                   119  (3%)    261  (2%)
> > > > >
> > > > >   (Numbers are async-profiler wall-clock samples)
> > > > >
> > > > > The dominant operation is Jetty's NIO selector execution strategy:
> > > > >
> > > > >   - Solr 8 / Jetty 9: EatWhatYouKill.produce(): 2558 samples (58%)
> > > > >   - Solr 10 / Jetty 12: AdaptiveExecutionStrategy.produce(): 12786 samples (91%)
> > > > >   - That is a 5x increase in wall-clock time
> > > > >
> > > > > The full stack trace shows:
> > > > >
> > > > >   ThreadPoolExecutor
> > > > >     -> MDCAwareThreadPoolExecutor
> > > > >       -> ManagedSelector (Jetty NIO selector)
> > > > >         -> AdaptiveExecutionStrategy.produce()
> > > > >           -> AdaptiveExecutionStrategy.tryProduce()
> > > > >             -> AdaptiveExecutionStrategy.produceTask()
> > > > >               -> ... -> KQueue.poll (macOS NIO)
> > > > >
> > > > > This is the Jetty HTTP client's NIO event loop. Even though we use
> > > > > EmbeddedSolrServer (no HTTP traffic), Solr 10's CoreContainer
> > > > > appears to create an internal Jetty HTTP client (likely for
> > > > > inter-shard communication via HttpJettySolrClient). In embedded
> > > > > single-node mode, this client has no work to do, but its NIO
> > > > > selector thread still runs, and AdaptiveExecutionStrategy.produce()
> > > > > idles much less efficiently than Jetty 9's EatWhatYouKill did.
> > > > >
> > > > > On macOS this manifests as busy-polling in KQueue.poll. The impact
> > > > > may differ on Linux (epoll).
> > > > >
> > > > >
> > > > > PROFILING METHODOLOGY
> > > > > -----------------------------------------
> > > > >
> > > > >   - Tool: async-profiler 4.3 (wall-clock mode, safepoint-free)
> > > > >   - JDK: OpenJDK 21.0.9
> > > > >   - Both branches use the same H2 2.4.240 test database
> > > > >   - Both branches use the same test code and Solr schema/config
> > > > >   - The only Solr-related difference is the Solr version (8.11.4 vs 10.0.0)
> > > > >   - Profiling was done on macOS (Apple Silicon), but the CI slowdown
> > > > >     (GitHub Actions, Ubuntu) shows the same pattern at larger scale
> > > > >
> > > > >
> > > > > WHAT WE RULED OUT
> > > > > ---------------------------------
> > > > >
> > > > > Before identifying the Solr/Jetty issue, we investigated and ruled
> > > > > out many other causes:
> > > > >
> > > > >   - Hibernate 7 overhead: SQL query count is similar (fewer on SB4),
> > > > >     query execution time is <40ms total for 1400+ queries
> > > > >   - H2 database: same version (2.4.240) on both branches, negligible
> > > > >     wall-clock difference
> > > > >   - GC pauses: only +0.7s extra on SB4 (1.4% of total difference)
> > > > >   - Lock contention: main actually has MORE lock contention than SB4
> > > > >   - Hibernate session.clear(): tested with/without, no effect
> > > > >   - JaCoCo coverage: tested with/without, no effect
> > > > >   - Hibernate caching (L2, query cache): disabled both, no effect
> > > > >   - Hibernate batch fetch size: tested, no effect
> > > > >
> > > > >
> > > > > QUESTIONS FOR THE SOLR TEAM
> > > > > --------------------------------------------------
> > > > >
> > > > > 1. Does embedded mode (EmbeddedSolrServer / CoreContainer without
> > > > >    an HTTP listener) need to create a Jetty HTTP client at all?
> > > > >    If the client is only for shard-to-shard communication, it
> > > > >    seems unnecessary in single-node embedded testing.
> > > > >
> > > > > 2. If the HTTP client is required, can its NIO selector / thread
> > > > >    pool be configured with minimal resources for embedded mode?
> > > > >    (e.g., fewer selector threads, smaller thread pool, or an
> > > > >    idle-friendly execution strategy)
> > > > >
> > > > > 3. Is there a Solr configuration (solr.xml property, system
> > > > >    property, or CoreContainer API) that we can use from the
> > > > >    consuming application to reduce this overhead?
> > > > >
> > > > > 4. Is this specific to macOS (KQueue) or does it also affect
> > > > >    Linux (epoll)? Our CI runs on Ubuntu and shows a larger
> > > > >    slowdown (3.8x) than local macOS (1.28x), which could be
> > > > >    related.
> > > > >
> > > > > ENVIRONMENT
> > > > > -----------------------
> > > > >
> > > > >   Solr: 10.0.0 (solr-core as test dependency for embedded server)
> > > > >   Jetty: 12.0.x (pulled in transitively by Solr 10)
> > > > >   JDK: 21
> > > > >   OS: macOS (profiled), Ubuntu (CI where the 4x slowdown manifests)
> > > > >   Project: DSpace (https://github.com/DSpace/DSpace)
> > > > >   PR: https://github.com/DSpace/DSpace/pull/11810
> > > > >
> > > > > Happy to provide the full async-profiler flame graph files or
> > > > > additional profiling data if useful.
> > > > >
> > > > > Thanks,
> > > > > Bram Luyten, Atmire
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > > >
> >
