[ https://issues.apache.org/jira/browse/FELIX-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roy Teeuwen updated FELIX-6828:
-------------------------------
    Description: 
*Summary*

Under concurrent `org.apache.felix.http` `ConfigurationAdmin` updates and
shadowed `ServletContextHelper` churn, `WhiteboardManager` gets stuck in a
state where every URL returns 404. Servlet registrations made afterwards
log success but Jetty's URL routing never picks them up. The corruption is
permanent for the JVM lifetime; only restarting the Felix HTTP bundle (or
the JVM) restores routing.

*When it happens*

The bug is triggered in an OSGi container when both of the following occur:

1. *A stop/start cycle of the HTTP stack mid-traffic.* Any
`ConfigurationAdmin` update for PID `org.apache.felix.http` causes
`JettyService.updated()` to call `stopJetty()` then `startJetty()`, which
tears down and rebuilds `WhiteboardManager`. Sling Starter 14 hits this
on package install: `org.apache.sling.installer.factory.packages`
triggers a bundle refresh, which restarts the Felix HTTP bundle and
re-delivers the persisted config.

2. *Concurrent `ServletContextHelper` registration / unregistration on
another thread.* In the bundle-refresh wave, several Sling bundles
(Sling Engine, `SlingHttpContext`, etc.) register/unregister their SCHs
while the HTTP stack is stopping/starting.

*Root cause*

`WhiteboardManager.stop()` nulls `this.webContext` *before* it closes its
service trackers. While the trackers are still open, a concurrent
`registerService(ServletContextHelper)` from another thread synchronously
fires `addingService` on the open tracker, which calls
`addContextHelper(...)` and reads the now-`null` `webContext` into a new
`WhiteboardContextHandler`. The handler's `activate(...)` then tries to
build a `SharedServletContextImpl(webContext, ...)`, whose constructor
unconditionally calls `webContext.getContextPath()` and NPEs.
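
The ordering problem described above can be replayed deterministically in a toy model. This is only a sketch: `ToyTracker`, `ToyManager` and both `stop...` methods are hypothetical stand-ins and not Felix classes, and the racing registration is injected as a callback on a single thread rather than run on a real second thread. It suggests that closing the trackers before nulling the context would close this window; whether that is the right fix for the real `WhiteboardManager` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the stop() ordering; none of these types are Felix classes. */
final class StopOrderingSketch {

    /** Stand-in for a service tracker: delivers callbacks only while open. */
    static final class ToyTracker {
        private final Runnable onAdding;
        private boolean open = true;

        ToyTracker(Runnable onAdding) { this.onAdding = onAdding; }

        /** Models a registerService(...) call reaching the tracker. */
        void serviceRegistered() { if (open) onAdding.run(); }

        void close() { open = false; }
    }

    /** Stand-in for the manager that owns the shared context. */
    static final class ToyManager {
        final List<String> failures = new ArrayList<>();
        final ToyTracker tracker = new ToyTracker(this::addContextHelper);
        private Object webContext = new Object();

        /** Records a failure where the real code would hit the NPE. */
        void addContextHelper() {
            if (webContext == null) {
                failures.add("helper activated against a null webContext");
            }
        }

        /** Ordering from the report: context nulled while the tracker is still open. */
        void stopContextFirst(Runnable racingRegistration) {
            webContext = null;
            racingRegistration.run(); // the other thread registers a helper here
            tracker.close();
        }

        /** Assumed alternative ordering: tracker closed before the context is nulled. */
        void stopTrackersFirst(Runnable racingRegistration) {
            tracker.close();
            racingRegistration.run(); // same interleaving, but the tracker ignores it
            webContext = null;
        }
    }

    static int failuresWhenContextNulledFirst() {
        ToyManager manager = new ToyManager();
        manager.stopContextFirst(manager.tracker::serviceRegistered);
        return manager.failures.size();
    }

    static int failuresWhenTrackersClosedFirst() {
        ToyManager manager = new ToyManager();
        manager.stopTrackersFirst(manager.tracker::serviceRegistered);
        return manager.failures.size();
    }

    public static void main(String[] args) {
        System.out.println("context nulled first: " + failuresWhenContextNulledFirst());
        System.out.println("trackers closed first: " + failuresWhenTrackersClosedFirst());
    }
}
```

In the toy model the same interleaving produces a failure only when the context is nulled first, which matches the second stack trace under *Logs*.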

The same teardown cascade exposes a second race in
`WhiteboardManager.deactivate()`: a TOCTOU between the existing
`if (handler.getRegistry() != null)` check and the subsequent
`handler.getRegistry().getEventListenerRegistry()...` call. In addition, a
number of unguarded `handler.getRegistry()` chains in
`register/unregisterWhiteboardService`, `addWhiteboardService`,
`removeWhiteboardService` and `sessionIdChanged` throw an NPE if the registry
has just been nulled.
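
The check-then-use pattern behind the second race can be illustrated with a small sketch. `ToyHandler` and `ToyRegistry` are hypothetical stand-ins, not the Felix types, and the concurrent nulling is simulated by having `getRegistry()` clear the field right after its first read. The sketch assumes that reading the registry once into a local variable is an acceptable guard; it is not taken from an actual patch.

```java
/** Toy model of the getRegistry() check-then-use race; not Felix code. */
final class RegistrySnapshotSketch {

    /** Stand-in for the per-context registry. */
    static final class ToyRegistry {
        int cleanups;

        void removeListeners() { cleanups++; }
    }

    /** Stand-in handler whose registry disappears right after the first read. */
    static final class ToyHandler {
        private ToyRegistry registry = new ToyRegistry();
        private int reads;

        ToyRegistry getRegistry() {
            ToyRegistry current = registry;
            if (++reads == 1) {
                registry = null; // simulates another thread deactivating the handler
            }
            return current;
        }
    }

    /** Pattern from the report: null check and use are two separate reads. */
    static boolean checkThenUse(ToyHandler handler) {
        try {
            if (handler.getRegistry() != null) {
                handler.getRegistry().removeListeners(); // second read may return null
            }
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    /** Guarded variant: one read into a local, then check and use that local. */
    static boolean snapshotThenUse(ToyHandler handler) {
        ToyRegistry registry = handler.getRegistry();
        if (registry != null) {
            registry.removeListeners();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("check-then-use survived: " + checkThenUse(new ToyHandler()));
        System.out.println("snapshot-then-use survived: " + snapshotThenUse(new ToyHandler()));
    }
}
```

The local snapshot only removes the NPE; it does not by itself decide what should happen to a handler whose registry was cleared concurrently.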

*Logs*
{code:java}
ERROR org.apache.felix.http: Exception during controller unregister
java.lang.NullPointerException: Cannot invoke
"PerContextHandlerRegistry.getEventListenerRegistry()" because
the return value of "WhiteboardContextHandler.getRegistry()" is null
at WhiteboardManager.deactivate(WhiteboardManager.java:340)
at WhiteboardManager.removeContextHelper(WhiteboardManager.java:462)
at ServletContextHelperTracker.removed(ServletContextHelperTracker.java:106)
...
at WhiteboardManager.stop(WhiteboardManager.java:202)
at HttpServiceController.unregister(HttpServiceController.java:158)
at JettyService.stopJetty(JettyService.java:230)
at JettyService.updated(JettyService.java:206)
at JettyManagedService.updated(JettyManagedService.java:38)
at ConfigurationManager$UpdateConfiguration.run(ConfigurationManager.java:1418){code}
{code:java}
java.lang.NullPointerException: Cannot invoke
"jakarta.servlet.ServletContext.getContextPath()" because
"webContext" is null
at SharedServletContextImpl.<init>(SharedServletContextImpl.java:86)
at WhiteboardContextHandler.activate(WhiteboardContextHandler.java:94)
at WhiteboardManager.activate(WhiteboardManager.java:253)
at WhiteboardManager.addContextHelper(WhiteboardManager.java:369)
at ServletContextHelperTracker.addingService(...){code}

> Whiteboard startup race produces permanent 404s
> -----------------------------------------------
>
>                 Key: FELIX-6828
>                 URL: https://issues.apache.org/jira/browse/FELIX-6828
>             Project: Felix
>          Issue Type: Bug
>          Components: HTTP Service
>    Affects Versions: http.base-5.1.16, http.jetty12-1.1.8
>            Reporter: Roy Teeuwen
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
