Chris M. Hostetter created SOLR-16918:
-----------------------------------------

             Summary: MetricUtils.instrumentedExecutorService() is not safe in 
any excutors with a lifecycle that isn't forever
                 Key: SOLR-16918
                 URL: https://issues.apache.org/jira/browse/SOLR-16918
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter


I have a custom Solr RequestHandler that internally uses a subclass of 
{{HttpShardHandler}} for dispatching some special types of distributed requests.

I recently upgraded form Solr 9.1 to Solr 9.3 - and in doing so started getting 
the following exceptions on SolrCore RELOAD...
{noformat}
2023-07-31 17:46:19.548 ERROR (qtp1690101810-23) [] o.a.s.s.HttpSolrCall 500 
Exception => org.apache.solr.common.SolrException: Unable to reload core 
[foo_shard1_replica_n2]
        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:2028)
org.apache.solr.common.SolrException: Unable to reload core 
[foo_shard1_replica_n2]
        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:2028) 
~[?:?]
        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1946) 
~[?:?]
        at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$2(CoreAdminOperation.java:136)
 ~[?:?]
        at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:400)
 ~[?:?]
        at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:353)
 ~[?:?]
        at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
 ~[?:?]
        at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224)
 ~[?:?]
        at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:928) ~[?:?]
...
Caused by: org.apache.solr.common.SolrException: A metric named 
QUERY./bar.shardHandler.threadPool.httpShardExecutor.pool.size already exists
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1228) ~[?:?]
        at org.apache.solr.core.SolrCore.reload(SolrCore.java:788) ~[?:?]
        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1978) 
~[?:?]
        ... 62 more
Caused by: java.lang.IllegalArgumentException: A metric named 
QUERY./bar.shardHandler.threadPool.httpShardExecutor.pool.size already exists
        at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:168) 
~[metrics-core-4.2.19.jar:4.2.19]
        at 
com.codahale.metrics.MetricRegistry.registerGauge(MetricRegistry.java:88) 
~[metrics-core-4.2.19.jar:4.2.19]
        at 
com.codahale.metrics.InstrumentedExecutorService.<init>(InstrumentedExecutorService.java:61)
 ~[metrics-core-4.2.19.jar:4.2.19]
        at 
org.apache.solr.util.stats.MetricUtils.instrumentedExecutorService(MetricUtils.java:751)
 ~[?:?]
        at 
org.apache.solr.handler.component.HttpShardHandlerFactory.initializeMetrics(HttpShardHandlerFactory.java:415)
 ~[?:?]
...
{noformat}
The problem evidently comes from {{{}HttpShardHandler{}}}'s usage of 
{{MetricUtils.instrumentedExecutorService()}} which returns an instance of the 
(third-party) dropwizard {{{}InstrumentedExecutorService{}}}.

{{InstrumentedExecutorService}} made a change in their 4.2.x releases (now used 
by Solr 9.3 ... not sure if it was in 9.2) so that in addition to some previous 
{{Metric}} instances that {{InstrumentedExecutorService}} registered in older 
versions, it also includes some "type specific" metrics that are _Gauges_ 
driven by the behavior of the delegate {{{}ExecutorService{}}}...

[https://github.com/dropwizard/metrics/commit/5f7ba7f4eb15bab0ddd458b56cdd313ea795eba7]

The lifecycles of Gauges are extremely finicky, because {{MetricRegistry}} 
can't just "re-use" existing instances (like it can with {{{}Meter{}}}, 
{{{}Counter{}}}, {{{}Timer{}}}, etc...). This is why Solr's 
{{SolrMetricManager}} jumps through a lot of hoops when registering gauges - 
and even has a custom {{GaugeWrapper}} - to ensure we can handle things like 
SolrCore {{RELOAD}} (where we have two copies of the SolrCore, and all of it's 
associated plugins, running at once - and reporting metrics - until the new 
SolrCore is fully initialized and we swap over to it)

But {{InstrumentedExecutorService}} and it's new use of Gauges doesn't know 
anything about {{SolrMetricManager}} - it doesn't even de-register it's Gauges 
in it's {{shutdown()}} method, so you can't even do this...
{code:java}
(new InstrumentedExecutorService(delegate, registry, "xyz")).shutdown();
(new InstrumentedExecutorService(delegate, registry, "xyz")).shutdown();
{code}
...w/o triggering this type of exception.
----
{{HttpShardHandler}} (and {{{}UpdateShardHandler{}}}) _tend_ to only be 
initialized at the {{CoreContainer}} level, so I think it's unlikely you'll 
encounter this kind of SolrCore RELOAD error in "stock" Solr - but 
{{MetricUtils.instrumentedExecutorService(...)}} is really just a time bomb 
waiting to go off.

If anyone starts using method in other places in solr code, it's unlikely any 
tests will notice the problem unless they do a core reload.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to