[ https://issues.apache.org/jira/browse/SOLR-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222974#comment-17222974 ]

Andreas Hubold commented on SOLR-14969:
---------------------------------------

Thank you! I'm not too familiar with all this code, but your suggestion sounds 
reasonable. I had been thinking about a similar fix in a custom 
CoreAdminHandler, but I still have to check whether that handler can be 
customized.

I don't have a stable reproducer yet, but I can still try to test a proposed 
fix. However, I will be unavailable next week.

> Race condition when creating cores leads to NPE in CoreAdmin STATUS
> -------------------------------------------------------------------
>
>                 Key: SOLR-14969
>                 URL: https://issues.apache.org/jira/browse/SOLR-14969
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: multicore
>    Affects Versions: 8.6, 8.6.3
>            Reporter: Andreas Hubold
>            Priority: Major
>
> CoreContainer#create does not correctly handle concurrent requests to create 
> the same core. There's a race condition (see also the existing TODO comment 
> in the code), so CoreContainer#createFromDescriptor may be called a second 
> time for the same core name.
> The _second call_ then fails to create an IndexWriter, and its exception 
> handling leaves the CoreContainer in an inconsistent state.
> {noformat}
> 2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [   ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
>          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
>          at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
>          at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core [blueprint_acgqqafsogyc_comments]
>          at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
>          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
>          ... 47 more
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
>          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
>          at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
>          ... 48 more
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>          at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
>          at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
>          at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
>          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
>          ... 50 more
> Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
>          at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
>          at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
>          at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
>          at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
>          at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
>          at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
>          at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
>          at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
>          at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
>          at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145)
> {noformat}
> CoreContainer#createFromDescriptor removes the CoreDescriptor when handling 
> this exception. The SolrCore created for the first successful call is still 
> registered in SolrCores.cores, but now there's no corresponding 
> CoreDescriptor for that name anymore.
> This inconsistency leads to subsequent NullPointerExceptions, for example 
> when using CoreAdmin STATUS with the core name: 
> CoreAdminOperation#getCoreStatus first gets the non-null SolrCore 
> (cores.getCore(cname)) but core.getInstancePath() throws an NPE, because the 
> CoreDescriptor is not registered anymore:
> {noformat}
> 2020-10-27 00:29:25.353 INFO  (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=500 QTime=0
> 2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
>          at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
>          at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
>          at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
> ...
> Caused by: java.lang.NullPointerException
>          at org.apache.solr.core.SolrCore.getInstancePath(SolrCore.java:333)
>          at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
>          at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
>          at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
> {noformat}
> STATUS keeps failing until Solr is restarted.
> The NPE for CoreAdmin STATUS is a regression in 8.6. It seems to be caused by 
> https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324
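
To illustrate the race described in the issue, here is a minimal sketch of one 
way to reject a concurrent second create for the same name before it touches 
any state owned by the first. This is not Solr code and not the proposed fix: 
GuardedCoreRegistry, its fields, and the Supplier-based factory are 
hypothetical names used only for illustration, and the real CoreContainer has 
more state (CoreDescriptors, transient cores, and so on) that this ignores.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/**
 * Hypothetical stand-in for a core registry; NOT Solr's CoreContainer.
 * Only one caller at a time may create a core for a given name, and a
 * rejected caller never modifies what the winning caller registered.
 */
final class GuardedCoreRegistry<C> {
    // Successfully created cores, keyed by core name.
    private final ConcurrentMap<String, C> cores = new ConcurrentHashMap<>();
    // Names for which a create call is currently in flight.
    private final ConcurrentMap<String, Boolean> pendingCreates = new ConcurrentHashMap<>();

    C create(String name, Supplier<C> factory) {
        // putIfAbsent is atomic: exactly one concurrent caller claims the name.
        if (pendingCreates.putIfAbsent(name, Boolean.TRUE) != null) {
            throw new IllegalStateException("Core '" + name + "' is already being created");
        }
        try {
            // Check for an existing core only after claiming the name, so a
            // create that finished just before our claim is still detected.
            if (cores.containsKey(name)) {
                throw new IllegalStateException("Core '" + name + "' already exists");
            }
            C core = factory.get(); // may throw; nothing is registered in that case
            cores.put(name, core);
            return core;
        } finally {
            // Release only our own claim; the cores map is left untouched on failure.
            pendingCreates.remove(name);
        }
    }

    C get(String name) {
        return cores.get(name);
    }
}
```

With a guard of this kind, the losing request fails fast with an "already 
exists" style error instead of reaching the IndexWriter lock, and its error 
handling has nothing to roll back, so the entry registered by the first 
successful call stays consistent.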


