Thank you, David,

Mystery, for sure. Fun ... depends on your definition I guess ;)

TL;DR We were able to get the end-to-end duration of our test suite
back down to baseline. The root cause of the majority of the
regression turned out to be in our application's ORM layer
(Hibernate 7 upgrade), not Solr itself. However, the Solr-related
investigation uncovered real performance observations around
MMapDirectory, commit overhead, and soft vs hard commits on Solr 10
that we wanted to share in case they're useful to others.

*ROOT CAUSE: NOT SOLR (FOR THE MOST PART)*

Our migration upgraded both Solr (8 -> 10) and Hibernate (6 -> 7)
simultaneously, which made attribution difficult. After detailed
profiling, we found that a Hibernate 7 behavioral change around
auto-flushing with JOINED inheritance tables was responsible for
the majority of the regression (116 min -> ~27 min after the fix).
The Solr-related findings below are independent observations that
contribute to the remaining ~8 minute gap between our Solr 8 and
Solr 10 test suite timings, though we have not isolated exactly
how much of that gap they explain versus other framework changes
in our upgrade (Hibernate 7, JUnit 6, Jackson 3, Spring 7).

*FINDING 1: MMAPDIRECTORY PANAMA MEMORYSEGMENT OVERHEAD*

After ruling out Jetty, JFR analysis showed that the SB4 branch
generates 3,906 HandshakeAllThreads VM operations in 61 seconds,
compared to just 65 on main. Using -Xlog:handshake*=trace we
identified the handshake type: CloseScopedMemory, with 163,637
operations in 74 seconds (2,211/sec).

This is caused by Lucene 10's MMapDirectory using Panama
MemorySegment instead of the old MappedByteBuffer approach. Each
Arena.close() triggers a CloseScopedMemory handshake that
deoptimizes the top frame of every JVM thread. This is documented
in:

https://github.com/apache/lucene/issues/13325
https://bugs.openjdk.org/browse/JDK-8335480 (fix in JDK 24)

This was also reported on the Solr side:
https://issues.apache.org/jira/browse/SOLR-17375

The suggested workaround there is to set
-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false,
which works on Lucene 9. However, this option was removed in
Lucene 10: the Panama MemorySegment implementation is now the only
MMapDirectory code path, with no fallback and no disable flag.

The key difference across Solr versions:
- Solr 8 / Lucene 8: MappedByteBuffer + Unsafe.invokeCleaner(),
  no handshakes at all
- Solr 9 / Lucene 9: Panama MemorySegment, but
  -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
  can disable it (as suggested in SOLR-17375)
- Solr 10 / Lucene 10: Panama is mandatory, the
  enableMemorySegments flag no longer exists

Our workaround for Solr 10: set
-Dsolr.directoryFactory=solr.NIOFSDirectoryFactory in the test
JVM args. This avoids MMap entirely and reduced CloseScopedMemory
from 163,637 to 12,379 operations (13x fewer). In CI, this
brought our IT suite from 116 minutes to 109 minutes, a modest
6% improvement. Locally (macOS, 10 cores) the difference was
similarly small (74s to 71s for a single test class), because
the handshake coordination is cheap with many cores. The remaining
12k CloseScopedMemory events likely come from Lucene internals
that still use MemorySegment regardless of the directory factory.
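For reference, the same workaround can be applied programmatically
instead of via JVM args (a sketch; it assumes your solrconfig.xml
resolves the directory factory from the standard solr.directoryFactory
system property, as the stock configs do, and that this runs before
the CoreContainer is created):

```java
// Equivalent to passing -Dsolr.directoryFactory=solr.NIOFSDirectoryFactory
// on the test JVM command line. Must run before the CoreContainer /
// EmbeddedSolrServer is created, e.g. in a static initializer of the
// test base class.
public final class SolrTestDirectoryConfig {

    private SolrTestDirectoryConfig() {}

    public static void avoidMMap() {
        System.setProperty("solr.directoryFactory",
                "solr.NIOFSDirectoryFactory");
    }
}
```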

*FINDING 2: HARD COMMITS ARE 25% SLOWER ON LUCENE 10*

We instrumented SolrServiceImpl.commit() with System.nanoTime()
timing around the SolrClient.commit() call and ran the same test class
(DiscoveryRestControllerIT, 83 tests, ~2000 Solr commits) on both
branches:

  Main (Solr 8 / Lucene 8.11.4):
    1,972 commits, total 24,600 ms, avg 12 ms/commit

  SB4 (Solr 10 / Lucene 10.3.2):
    1,966 commits, total 30,180 ms, avg 15 ms/commit

That is a 25% increase per commit, and commits account for 43%
of the total test class runtime on both branches. With ~2000
commits per test class and hundreds of test classes in the full
IT suite, this adds up to tens of minutes.
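For anyone who wants to reproduce the measurement: the instrumentation
was conceptually just a System.nanoTime() wrapper around the commit
call. A minimal self-contained sketch (the class and method names here
are ours, not DSpace's):

```java
import java.util.concurrent.TimeUnit;

// Accumulates call count and total duration for an operation, the way
// we wrapped the SolrClient.commit() call inside SolrServiceImpl.commit().
final class CommitTimer {
    private long count;
    private long totalNanos;

    // Times one call; in our instrumentation the Runnable body was the
    // actual solrClient.commit() invocation.
    void time(Runnable commitCall) {
        long start = System.nanoTime();
        try {
            commitCall.run();
        } finally {
            totalNanos += System.nanoTime() - start;
            count++;
        }
    }

    long commits()     { return count; }
    long totalMillis() { return TimeUnit.NANOSECONDS.toMillis(totalNanos); }
    long avgMillis()   { return count == 0 ? 0 : totalMillis() / count; }
}
```

Summing these per test class is what produced the commit counts and
millisecond totals above.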

We believe this is at least partly related to the lucene103 codec
TrieReader regression documented in:

 https://github.com/apache/lucene/issues/15820

which reports a 4x slower seekExact for _id lookups compared to
the lucene90 FST approach. Every solr.add() does an ID lookup
for deduplication, and the hard commit path reopens segments,
which exercises this code path.

*FINDING 3: SOFT COMMITS ELIMINATE THE OVERHEAD*

Our DSpace codebase was calling solr.commit() (hard commit)
everywhere, despite having autoCommit configured in solrconfig.xml
with maxTime=10000 and openSearcher=true. This meant every
explicit commit was doing an fsync + opening a new IndexSearcher,
which is redundant when autoCommit handles durability on a
10-second schedule.
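For context, the relevant autoCommit section of our solrconfig.xml
(values as described above):

```xml
<autoCommit>
  <maxTime>10000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>
```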

We switched all solr.commit() calls to solr.commit(true, true, true)
(soft commit: waitFlush=true, waitSearcher=true, softCommit=true).
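In code, the change is just the three-argument SolrJ overload. Sketched
here against a stand-in interface so the example is self-contained (the
real call site is a SolrClient, whose commit(boolean, boolean, boolean)
overload has the same parameter order):

```java
// Stand-in for the SolrJ overload we call; the real signature is
// SolrClient.commit(boolean waitFlush, boolean waitSearcher, boolean softCommit).
interface CommitCall {
    void commit(boolean waitFlush, boolean waitSearcher, boolean softCommit);
}

final class CommitStyle {
    // Before: solr.commit(), i.e. commit(true, true, false):
    // fsync plus opening a new IndexSearcher on every call.
    static void hard(CommitCall solr) {
        solr.commit(true, true, false);
    }

    // After: a soft commit opens an NRT reader for visibility and
    // leaves durability (the fsync) to the autoCommit schedule.
    static void soft(CommitCall solr) {
        solr.commit(true, true, true);
    }
}
```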

Results for DiscoveryRestControllerIT (83 tests):

  Main, hard commit (Solr 8):  57.48s
  SB4, hard commit (Solr 10):  70.05s  (+22%)
  SB4, soft commit (Solr 10):  42.94s  (-25% vs main!)

Soft commits on Solr 10 are faster than hard commits on Solr 8.
We filed this as a separate improvement for DSpace since it
benefits all versions:

  https://github.com/DSpace/DSpace/issues/12153
  https://github.com/DSpace/DSpace/pull/12154

Note: when switching to soft commits in tests, we did encounter
a subtle behavioral difference with EmbeddedSolrServer on Solr 10:
deleteByQuery("*:*") followed by a soft commit does not reliably
make the deletes visible to NRT readers. We had to add an explicit
hard commit after deleteByQuery in our test cleanup to ensure
proper test isolation. This was not an issue with Solr 8.
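The test-cleanup pattern we ended up with, again sketched against a
stand-in interface rather than the real SolrClient (on Solr 8 the hard
commit after deleteByQuery was not needed):

```java
// Stand-in for the two SolrClient methods the cleanup uses.
interface CleanupClient {
    void deleteByQuery(String query);
    void commit(boolean waitFlush, boolean waitSearcher, boolean softCommit);
}

final class SolrTestCleanup {
    // Between tests: wipe the index, then force a hard commit so the
    // deletes are reliably visible to the next test. A soft commit here
    // was not enough with EmbeddedSolrServer on Solr 10.
    static void clearIndex(CleanupClient solr) {
        solr.deleteByQuery("*:*");
        solr.commit(true, true, false); // hard commit for test isolation
    }
}
```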

*SUMMARY*

Our embedded Solr 10 test suite (6 cores, ~2000 commits per test
class, JDK 21) went from 19 minutes (Solr 8 baseline) to 116
minutes after upgrading both Solr and Hibernate simultaneously.

After investigation, the regression broke down as:

1. Hibernate 7 auto-flush behavioral change: 116 -> ~27 min
   (the dominant factor, not Solr-related)
2. NIOFSDirectoryFactory to avoid Panama MMap overhead: ~6% gain
3. Soft commits instead of hard commits: eliminated the per-class
   Solr 10 vs Solr 8 gap entirely

The final CI result after all fixes: 27 minutes for the full
3,400+ integration test suite, which is close to the Solr 8
baseline of 19 minutes. The remaining gap (~8 min) is likely
a combination of the Lucene 10 codec overhead and the general
overhead of EmbeddedSolrServer on Solr 10.

If any of this analysis seems off or if there are better
approaches we should consider, we would welcome the feedback.

Best regards,
Bram Luyten

On Wed, Mar 25, 2026 at 2:27 AM David Smiley <[email protected]> wrote:

> No problem.  Good luck to you Bram!  Looks like a fun mystery.
>
> On Tue, Mar 24, 2026 at 11:34 AM Bram Luyten <[email protected]>
> wrote:
> >
> > Hi David,
> >
> > Thank you for the detailed response. I owe you an apology: after
> > re-examining our data based on your feedback, the wall-clock profiling
> > led us to an incorrect attribution. Sorry for the noise.
> >
> > You were right to question the wall-clock numbers for background
> > threads. When we re-checked with CPU profiling (async-profiler -e cpu),
> > AdaptiveExecutionStrategy.produce() shows exactly 0 CPU samples on the
> > Solr 10 branch. The selector thread is idle, not busy-polling.
> > Wall-clock profiling inflated it because it samples all threads
> > regardless of state. Total CPU samples are nearly identical between
> > branches (17,519 vs 17,119), same distribution.
> >
> > To answer your question: we create exactly one CoreContainer for the
> > entire test suite, held as a static singleton with 6 cores. Between
> > tests we clear data via deleteByQuery + commit, but the container stays
> > alive for the full JVM lifetime. So the "lots of CoreContainers"
> > scenario does not apply here.
> >
> > Given identical CPU profiles and zero Jetty CPU samples, the Solr path
> > is almost certainly not our bottleneck. We will look elsewhere. I don't
> > think the SolrStartup benchmark would be productive at this point.
> >
> > Again, apologies for the false alarm, and thank you for steering us in
> > the right direction.
> >
> > Best regards,
> > Bram Luyten
> >
> > On Tue, Mar 24, 2026 at 2:43 PM David Smiley <[email protected]> wrote:
> >
> > > Hello Bram,
> > >
> > > Some of what you are sharing confuses me.  I don't think sharing the
> > > wall-clock-time is pertinent for background threads -- and I assume
> > > those Jetty HttpClients are in the background doing nothing. Yes,
> > > CoreContainer creates a Jetty HttpClient that is unused in an embedded
> > > mode.  Curious; are you creating lots of CoreContainers (perhaps
> > > indirectly via creating EmbeddedSolrServer)?  Maybe we have a
> > > regression there.  I suspect a test environment would be doing this,
> > > creating a CoreContainer for each test, basically.  Solr's tests do
> > > this too!  And a slowdown as big as you show sounds like something
> > > we'd notice... most likely.  On the other hand, if your CI/tests
> > > creates very few CoreContainers and there's all this slowdown you
> > > report, then CoreContainer startup is mostly irrelevant.
> > >
> > > We do have a benchmark that should capture a slowdown in this area --
> > >
> > >
> > > https://github.com/apache/solr/blob/9c911e7337cd1026accc1a825e26906039982328/solr/benchmark/src/java/org/apache/solr/bench/lifecycle/SolrStartup.java
> > > (scope is a bit larger but good enough) but we don't have continuous
> > > benchmarking over releases to make relative comparisons.  We've been
> > > talking about that, but the recent discussions are unlikely to support
> > > a way to do this for embedded Solr.  I've been working on this
> > > benchmark code lately as well.  *Anyway*, I recommend that you try
> > > this benchmark, starting with its great README, mostly documenting JMH
> > > itself.  If you do that and find some curious/suspicious things, I'd
> > > love to hear more!
> > >
> > > On Tue, Mar 24, 2026 at 3:51 AM Bram Luyten <[email protected]>
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Disclaimer: I am a DSpace developer, not a Solr/Jetty internals
> > > > expert. Much of the profiling and analysis below was done with heavy
> > > > assistance from Claude. I'm sharing this because the data seems
> > > > significant, but I may be misinterpreting some of it. Corrections
> > > > and guidance are very welcome.
> > > >
> > > >
> > > > CONTEXT
> > > > ---------------
> > > >
> > > > We are upgrading DSpace (open-source repository software) from
> > > > Spring Boot 3 / Solr 8 to Spring Boot 4 / Solr 10. Our integration
> > > > test suite uses embedded Solr via solr-core as a test dependency
> > > > (EmbeddedSolrServer style, no HTTP traffic -- everything is
> > > > in-process in a single JVM).
> > > >
> > > > After the upgrade, our IT suite went from ~31 minutes to ~2 hours
> > > > in CI. We spent considerable time profiling and eliminating other
> > > > causes (Hibernate 7, Spring 7, H2 database, GC, lock contention,
> > > > caching). Wall-clock profiling with async-profiler ultimately
> > > > pointed to embedded Solr as the primary bottleneck.
> > > >
> > > > Note: we previously reported the Solr 10 POM issue with missing
> > > > Jackson 2 dependency versions (solr-core, solr-solrj, solr-api).
> > > > We have the workaround in place (explicit dependency declarations),
> > > > so the embedded Solr 10 has a complete classpath.
> > > >
> > > >
> > > > THE PROBLEM
> > > > ----------------------
> > > >
> > > > Wall-clock profiling (async-profiler -e wall) of the same test class
> > > > (DiscoveryRestControllerIT, 83 tests) on both branches shows:
> > > >
> > > >   Component        Main (Solr 8)    SB4 (Solr 10)    Difference
> > > >   ----------------------------------------------------------------
> > > >   Solr total          3.6s            11.5s            +7.9s
> > > >   Hibernate           0.2s             0.2s             0.0s
> > > >   H2 Database         0.1s             0.1s             0.0s
> > > >   Spring              0.1s             0.1s             0.0s
> > > >   Test total         68.4s            84.3s           +15.9s
> > > >
> > > > Solr accounts for 50% of the total wall-clock difference (7.9s out
> > > > of 15.9s). Hibernate, H2, and Spring are essentially unchanged.
> > > >
> > > >
> > > > THE ROOT CAUSE
> > > > ---------------------------
> > > >
> > > > Breaking down the Solr wall-clock time by operation:
> > > >
> > > >   Operation                                    Main        SB4
> > > >   ---------------------------------------------------------------
> > > >   Jetty EatWhatYouKill.produce()              2558 (58%)     --
> > > >   Jetty AdaptiveExecutionStrategy.produce()     --        12786 (91%)
> > > >   DirectUpdateHandler2.commit()                522 (12%)    707  (5%)
> > > >   SpellChecker.newSearcher()                   119  (3%)    261  (2%)
> > > >
> > > >   (Numbers are async-profiler wall-clock samples)
> > > >
> > > > The dominant operation is Jetty's NIO selector execution strategy:
> > > >
> > > >   - Solr 8 / Jetty 9: EatWhatYouKill.produce(): 2558 samples (58%)
> > > >   - Solr 10 / Jetty 12: AdaptiveExecutionStrategy.produce(): 12786 samples (91%)
> > > >   - That is a 5x increase in wall-clock time
> > > >
> > > > The full stack trace shows:
> > > >
> > > >   ThreadPoolExecutor
> > > >     -> MDCAwareThreadPoolExecutor
> > > >       -> ManagedSelector (Jetty NIO selector)
> > > >         -> AdaptiveExecutionStrategy.produce()
> > > >           -> AdaptiveExecutionStrategy.tryProduce()
> > > >             -> AdaptiveExecutionStrategy.produceTask()
> > > >               -> ... -> KQueue.poll (macOS NIO)
> > > >
> > > > This is the Jetty HTTP client's NIO event loop. Even though we use
> > > > EmbeddedSolrServer (no HTTP traffic), Solr 10's CoreContainer
> > > > appears to create an internal Jetty HTTP client (likely for
> > > > inter-shard communication via HttpJettySolrClient). In embedded
> > > > single-node mode, this client has no work to do, but its NIO
> > > > selector thread still runs, and AdaptiveExecutionStrategy.produce()
> > > > idles much less efficiently than Jetty 9's EatWhatYouKill did.
> > > >
> > > > On macOS this manifests as busy-polling in KQueue.poll. The impact
> > > > may differ on Linux (epoll).
> > > >
> > > >
> > > > PROFILING METHODOLOGY
> > > > -----------------------------------------
> > > >
> > > >   - Tool: async-profiler 4.3 (wall-clock mode, safepoint-free)
> > > >   - JDK: OpenJDK 21.0.9
> > > >   - Both branches use the same H2 2.4.240 test database
> > > >   - Both branches use the same test code and Solr schema/config
> > > >   - The only Solr-related difference is the Solr version (8.11.4 vs 10.0.0)
> > > >   - Profiling was done on macOS (Apple Silicon), but the CI slowdown
> > > >     (GitHub Actions, Ubuntu) shows the same pattern at larger scale
> > > >
> > > >
> > > > WHAT WE RULED OUT
> > > > ---------------------------------
> > > >
> > > > Before identifying the Solr/Jetty issue, we investigated and ruled
> > > > out many other causes:
> > > >
> > > >   - Hibernate 7 overhead: SQL query count is similar (fewer on SB4),
> > > >     query execution time is <40ms total for 1400+ queries
> > > >   - H2 database: same version (2.4.240) on both branches, negligible
> > > >     wall-clock difference
> > > >   - GC pauses: only +0.7s extra on SB4 (1.4% of total difference)
> > > >   - Lock contention: main actually has MORE lock contention than SB4
> > > >   - Hibernate session.clear(): tested with/without, no effect
> > > >   - JaCoCo coverage: tested with/without, no effect
> > > >   - Hibernate caching (L2, query cache): disabled both, no effect
> > > >   - Hibernate batch fetch size: tested, no effect
> > > >
> > > >
> > > > QUESTIONS FOR THE SOLR TEAM
> > > > --------------------------------------------------
> > > >
> > > > 1. Does embedded mode (EmbeddedSolrServer / CoreContainer without
> > > >    an HTTP listener) need to create a Jetty HTTP client at all?
> > > >    If the client is only for shard-to-shard communication, it
> > > >    seems unnecessary in single-node embedded testing.
> > > >
> > > > 2. If the HTTP client is required, can its NIO selector / thread
> > > >    pool be configured with minimal resources for embedded mode?
> > > >    (e.g., fewer selector threads, smaller thread pool, or an
> > > >    idle-friendly execution strategy)
> > > >
> > > > 3. Is there a Solr configuration (solr.xml property, system
> > > >    property, or CoreContainer API) that we can use from the
> > > >    consuming application to reduce this overhead?
> > > >
> > > > 4. Is this specific to macOS (KQueue) or does it also affect
> > > >    Linux (epoll)? Our CI runs on Ubuntu and shows a larger
> > > >    slowdown (3.8x) than local macOS (1.28x), which could be
> > > >    related.
> > > >
> > > > ENVIRONMENT
> > > > -----------------------
> > > >
> > > >   Solr: 10.0.0 (solr-core as test dependency for embedded server)
> > > >   Jetty: 12.0.x (pulled in transitively by Solr 10)
> > > >   JDK: 21
> > > >   OS: macOS (profiled), Ubuntu (CI where the 4x slowdown manifests)
> > > >   Project: DSpace (https://github.com/DSpace/DSpace)
> > > >   PR: https://github.com/DSpace/DSpace/pull/11810
> > > >
> > > > Happy to provide the full async-profiler flame graph files or
> > > > additional profiling data if useful.
> > > >
> > > > Thanks,
> > > > Bram Luyten, Atmire
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
>
>
