[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13709:
----------------------------
    Attachment: apache_Lucene-Solr-Tests-8.x_449.log.txt
        Status: Open  (was: Open)

I've attached the logs from the jenkins run in question...

Interestingly: Even though the logs indicate several problems in trying to 
reload/unload the SolrCore, the test itself didn't seem to care enough about 
the state of the collection to notice the problems – the only junit failure 
recorded was a suite level failure from the ObjectTracker due to unreleased 
threads/objects.

The first sign of trouble in the logs is this WARN from a ZK watcher registered 
to monitor the schema for changes (in schemaless mode) in order to re-load the 
SolrCore – it fails with an NullPointerException...
{noformat}
   [junit4]   2> 4309877 WARN  (Thread-7591) [n:localhost:38920_solr 
c:testCloudExamplePrompt s:shard2 r:core_node7 
x:testCloudExamplePrompt_shard2_replica_n4 ] o.a.s.c.ZkController liste
ner throws error
   [junit4]   2>           => org.apache.solr.common.SolrException: Unable to 
reload core [testCloudExamplePrompt_shard2_replica_n6]
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557)
   [junit4]   2> org.apache.solr.common.SolrException: Unable to reload core 
[testCloudExamplePrompt_shard2_replica_n6]
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557) ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.core.SolrCore.lambda$getConfListener$21(SolrCore.java:3099) 
~[java/:?]
   [junit4]   2>        at 
org.apache.solr.cloud.ZkController.lambda$fireEventListeners$14(ZkController.java:2514)
 ~[java/:?]
   [junit4]   2>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
   [junit4]   2> Caused by: java.lang.NullPointerException
   [junit4]   2>        at 
org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:172) ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.core.SolrCore.reload(SolrCore.java:683) ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1507) ~[java/:?]
   [junit4]   2>        ... 3 more
{noformat}
...AFAICT the only way this NPE is possily is if the CoreDescriptor for the 
_original_ SolrCore is NULL at the time the watcher fires, and the only 
concievable way that seems to be possible is if the original SolrCore hadn't 
completley finished loading.

Aparently as a result of this failure to reload, a 
SolrCoreInitializationException is recorded for the core name, and that 
ultimately causes a fast-failure response when trying to unload the core...
{noformat}
   [junit4]   2> 4310314 ERROR (qtp373709619-50629) [n:localhost:38920_solr    
x:testCloudExamplePrompt_shard2_replica_n6 ] o.a.s.h.RequestHandlerBase 
org.apache.solr.common.SolrException
: Error unregistering core [testCloudExamplePrompt_shard2_replica_n6] from 
cloud state
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.unload(CoreContainer.java:1672)
   [junit4]   2>        at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$1(CoreAdminOperation.java:105)
   [junit4]   2>        at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
   [junit4]   2>        at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
   [junit4]   2>        at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
   [junit4]   2>        at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:200)
   [junit4]   2>        at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:820)
   [junit4]   2>        at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:786)
   [junit4]   2>        at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:546)
   [junit4]   2>        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
   [junit4]   2>        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
   [junit4]   2>        at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
   [junit4]   2>        at 
org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:165)
   [junit4]   2>        at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
   [junit4]   2>        at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
   [junit4]   2>        at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
   [junit4]   2>        at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
   [junit4]   2>        at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
   [junit4]   2>        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
   [junit4]   2>        at 
org.eclipse.jetty.server.Server.handle(Server.java:505)
   [junit4]   2>        at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
   [junit4]   2>        at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
   [junit4]   2>        at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
   [junit4]   2>        at 
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
   [junit4]   2>        at 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> Caused by: 
org.apache.solr.core.SolrCoreInitializationException: SolrCore 
'testCloudExamplePrompt_shard2_replica_n6' is not available due to init 
failure: null
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1740)
   [junit4]   2>        at 
org.apache.solr.cloud.ShardLeaderElectionContext.cancelElection(ElectionContext.java:294)
   [junit4]   2>        at 
org.apache.solr.cloud.ZkController.unregister(ZkController.java:1658)
   [junit4]   2>        at 
org.apache.solr.cloud.ZkController.unregister(ZkController.java:1637)
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.unload(CoreContainer.java:1665)
   [junit4]   2>        ... 35 more
   [junit4]   2> Caused by: java.lang.NullPointerException
   [junit4]   2>        at 
org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:172)
   [junit4]   2>        at 
org.apache.solr.core.SolrCore.reload(SolrCore.java:683)
   [junit4]   2>        at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1507)
   [junit4]   2>        at 
org.apache.solr.core.SolrCore.lambda$getConfListener$21(SolrCore.java:3099)
   [junit4]   2>        at 
org.apache.solr.cloud.ZkController.lambda$fireEventListeners$14(ZkController.java:2514)
   [junit4]   2>        ... 1 more
   [junit4]   2> 
{noformat}
As i mentioned before, none of this caused any failures directly from 
{{TestSolrCLIRunExample.testInteractiveSolrCloudExample}} (the minimal checks 
it does that the collection was created and the one doc indexed can be found 
succeeded, likeley because they were handled by other cores) but the main point 
i want to draw attention to is that this situation even seems to be possible.
----
FWIW: I briefly looked into the code to try and understand where/why a SolrCore 
could be allowed to "reload" itself if it hadn't finished loading (regardless 
of wether SolrCloud was in use) and the first thing thta jumped out at me was 
the javadocs for {{SolrCores.getCoreDescriptor(String)}} (which is what 
{{SolrCore.getCoreDescriptor()}} ultimately delegates too, and would be the 
source of the 'null' that causes the NPE: {{SolrCore.reload(...)}} calls 
{{getCoreDescriptor()}} and passes the – evidently null – result to the copy 
constructor of {{new CoreDescriptor(...)}}) ...
{code:java}
  /**
   * Return the CoreDescriptor corresponding to a given core name.
   * Blocks if the SolrCore is still loading until it is ready.
   * @param coreName the name of the core
   * @return the CoreDescriptor
   */
  public CoreDescriptor getCoreDescriptor(String coreName) {
    synchronized (modifyLock) {
      if (residentDesciptors.containsKey(coreName))
        return residentDesciptors.get(coreName);
      return getTransientCacheHandler().getTransientDescriptor(coreName);
    }
  }
{code}
Note the javadocs say this method will "Blocks if the SolrCore is still loading 
until it is ready." – but there doesn't seem to be any (obvious) evidence of 
that, unless the expecation is that {{modifyLock}} will be held for the 
durration of a SolrCore initialization (but i don't see any sign/intent of that 
either).

[~markrmil...@gmail.com] : you added that comment to the javadocs as part of 
SOLR-7361, but made no other changes to the method so i'm not clear what/where 
it was being enforced at the time you wrote it.

[~erickerickson] : pinging you as well since you've done a lot of work on 
(transient) SolrCore loading in the past few years, and wondering if you have 
any insight into where/how this could happen.

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to