Roy Teeuwen created FELIX-6828:
----------------------------------
Summary: Whiteboard startup race produces permanent 404s
Key: FELIX-6828
URL: https://issues.apache.org/jira/browse/FELIX-6828
Project: Felix
Issue Type: Bug
Components: HTTP Service
Affects Versions: http.jetty12-1.1.8, http.base-5.1.16
Reporter: Roy Teeuwen
*Summary*
When `ConfigurationAdmin` updates for PID `org.apache.felix.http` run
concurrently with shadowed `ServletContextHelper` churn, `WhiteboardManager`
can get stuck in a state where every URL returns 404. Servlet registrations
made afterwards log success, but Jetty's URL routing never picks them up.
The corruption persists for the lifetime of the JVM unless the Felix HTTP
bundle (or the JVM itself) is restarted.
*When it happens*
The bug is triggered when an OSGi container combines both of the following:
1. **A stop/start cycle of the HTTP stack mid-traffic.** Any
`ConfigurationAdmin` update for PID `org.apache.felix.http` causes
`JettyService.updated()` to call `stopJetty()` then `startJetty()`, which
tears down and rebuilds `WhiteboardManager`. Sling Starter 14 hits this
on package install: `org.apache.sling.installer.factory.packages`
triggers a bundle refresh, which restarts the Felix HTTP bundle and
re-delivers the persisted config.
2. **Concurrent `ServletContextHelper` registration / unregistration on
another thread.** In the bundle-refresh wave, several Sling bundles
(Sling Engine, `SlingHttpContext`, etc.) register/unregister their SCHs
while the HTTP stack is stopping/starting.
*Root cause*
`WhiteboardManager.stop()` nulls `this.webContext` **before** it closes its
service trackers. While the trackers are still open, a concurrent
`registerService(ServletContextHelper)` from another thread synchronously
fires `addingService` on the open tracker, which calls
`addContextHelper(...)` and reads the now-`null` `webContext` into a new
`WhiteboardContextHandler`. The handler's `activate(...)` then tries to
build a `SharedServletContextImpl(webContext, ...)`, whose constructor
unconditionally calls `webContext.getContextPath()` and NPEs.
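
The ordering problem can be illustrated with a minimal, single-threaded sketch. It is not the Felix code: `Manager`, `trackerOpen`, `onContextHelperRegistered`, `stopNullFirst` and `stopCloseFirst` are hypothetical names, and the concurrent registration is simulated by a callback run inside the race window. Closing the trackers before nulling the field is an assumed alternative ordering, not a confirmed fix.

```java
public class StopOrderingSketch {

    /** Hypothetical, heavily simplified stand-in for WhiteboardManager. */
    static final class Manager {
        // Stands in for the ServletContext held in webContext.
        private volatile Object webContext = new Object();
        // Stands in for the open/closed state of the service trackers.
        private volatile boolean trackerOpen = true;

        /** Models addingService -> addContextHelper -> activate. */
        void onContextHelperRegistered() {
            if (!trackerOpen) {
                return; // a closed tracker delivers no callbacks
            }
            // Models the unconditional webContext.getContextPath() call.
            webContext.hashCode();
        }

        /** Order described in this report: null the field, then close the trackers. */
        void stopNullFirst(Runnable concurrentRegistration) {
            webContext = null;
            concurrentRegistration.run(); // registration lands in the race window
            trackerOpen = false;
        }

        /** Assumed alternative: close the trackers, then null the field. */
        void stopCloseFirst(Runnable concurrentRegistration) {
            trackerOpen = false;
            concurrentRegistration.run(); // same window, but no callback fires
            webContext = null;
        }
    }

    /** Returns true if the simulated registration hit a NullPointerException. */
    static boolean npeDuringStop(boolean closeTrackersFirst) {
        Manager m = new Manager();
        try {
            if (closeTrackersFirst) {
                m.stopCloseFirst(m::onContextHelperRegistered);
            } else {
                m.stopNullFirst(m::onContextHelperRegistered);
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null first:  NPE = " + npeDuringStop(false));
        System.out.println("close first: NPE = " + npeDuringStop(true));
    }
}
```

In this model the null-first order hits the NPE and the close-first order does not. The sketch covers only the ordering; callbacks already in flight when the trackers close would need separate handling.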
The same teardown cascade exposes a second race in
`WhiteboardManager.deactivate()`: a TOCTOU between the existing
`if (handler.getRegistry() != null)` check and the subsequent
`handler.getRegistry().getEventListenerRegistry()...` call. In addition,
several unguarded `handler.getRegistry()` chains in
`registerWhiteboardService`, `unregisterWhiteboardService`,
`addWhiteboardService`, `removeWhiteboardService` and `sessionIdChanged`
NPE if the registry has just been nulled.
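
The check-then-act race can be sketched as follows. The `Registry` and `Handler` types are hypothetical stand-ins for `PerContextHandlerRegistry` and `WhiteboardContextHandler`, and the concurrent nulling is simulated by a callback run between the two reads. Reading the registry once into a local variable is one common way to close this kind of window; it is shown here as an assumption, not as the actual patch.

```java
public class RegistrySnapshotSketch {

    /** Hypothetical stand-in for PerContextHandlerRegistry. */
    static final class Registry {
        String listenerRegistry() {
            return "listeners";
        }
    }

    /** Hypothetical stand-in for WhiteboardContextHandler. */
    static final class Handler {
        private volatile Registry registry = new Registry();

        Registry getRegistry() {
            return registry;
        }

        void deactivate() {
            registry = null; // models the registry being nulled by another thread
        }
    }

    /** Racy shape: the null check and the use are two separate reads. */
    static String racy(Handler h, Runnable betweenReads) {
        if (h.getRegistry() != null) {
            betweenReads.run(); // simulated interleaving from another thread
            return h.getRegistry().listenerRegistry(); // second read may be null
        }
        return null;
    }

    /** Snapshot shape: one read into a local, then check and use the local. */
    static String snapshot(Handler h, Runnable betweenReads) {
        final Registry r = h.getRegistry();
        if (r != null) {
            betweenReads.run(); // the field may change, the local cannot
            return r.listenerRegistry();
        }
        return null;
    }

    public static void main(String[] args) {
        Handler a = new Handler();
        try {
            racy(a, a::deactivate);
            System.out.println("racy: no exception");
        } catch (NullPointerException e) {
            System.out.println("racy: NullPointerException");
        }

        Handler b = new Handler();
        System.out.println("snapshot: " + snapshot(b, b::deactivate));
    }
}
```

The snapshot variant avoids the NPE but may still operate on a registry that has just been detached, so whether it is sufficient depends on what the callers do with the result.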
*Logs*
*ERROR* org.apache.felix.http: Exception during controller unregister
java.lang.NullPointerException: Cannot invoke
"PerContextHandlerRegistry.getEventListenerRegistry()" because
the return value of "WhiteboardContextHandler.getRegistry()" is null
at WhiteboardManager.deactivate(WhiteboardManager.java:340)
at WhiteboardManager.removeContextHelper(WhiteboardManager.java:462)
at ServletContextHelperTracker.removed(ServletContextHelperTracker.java:106)
...
at WhiteboardManager.stop(WhiteboardManager.java:202)
at HttpServiceController.unregister(HttpServiceController.java:158)
at JettyService.stopJetty(JettyService.java:230)
at JettyService.updated(JettyService.java:206)
at JettyManagedService.updated(JettyManagedService.java:38)
at ConfigurationManager$UpdateConfiguration.run(ConfigurationManager.java:1418)
java.lang.NullPointerException: Cannot invoke
"jakarta.servlet.ServletContext.getContextPath()" because
"webContext" is null
at SharedServletContextImpl.<init>(SharedServletContextImpl.java:86)
at WhiteboardContextHandler.activate(WhiteboardContextHandler.java:94)
at WhiteboardManager.activate(WhiteboardManager.java:253)
at WhiteboardManager.addContextHelper(WhiteboardManager.java:369)
at ServletContextHelperTracker.addingService(...)