[ 
https://issues.apache.org/jira/browse/SOLR-11484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe reopened SOLR-11484:
-------------------------------

Reopening to address a 100% reproducible {{CloudSolrClient}} NPE that first 
appears at commit 59109e1b1 on this issue, from 
[https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/685]:

{noformat}
Checking out Revision 14bb48e3268c8e2f4aa509b1a71a9fb3e361b082 
(refs/remotes/origin/branch_7x)
[...]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestCloudSchemaless 
-Dtests.method=test -Dtests.seed=2542B6AD904A6296 -Dtests.multiplier=3 
-Dtests.slow=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH 
-Dtests.timezone=America/Nipigon -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   48.2s J0 | TestCloudSchemaless.test <<<
   [junit4]    > Throwable #1: java.lang.NullPointerException
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([2542B6AD904A6296:AD1689773EB60F6E]:0)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:925)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:808)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:178)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
   [junit4]    >        at 
org.apache.solr.schema.TestCloudSchemaless.test(TestCloudSchemaless.java:171)
   [junit4]    >        at 
org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:993)
   [junit4]    >        at 
org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:968)
[...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
docValues:{}, maxPointsInLeafNode=350, maxMBSortInHeap=6.677333545717319, 
sim=RandomSimilarity(queryNorm=true): {}, locale=th-TH-u-nu-thai-x-lvariant-TH, 
timezone=America/Nipigon
   [junit4]   2> NOTE: Linux 4.10.0-33-generic i386/Oracle Corporation 
1.8.0_144 (32-bit)/cpus=8,threads=1,free=305236536,total=523501568
{noformat}

> CloudSolrClient's cache of collection clusterstate can cause RouteExceptions 
> when attempting directUpdates after collection modifications
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11484
>                 URL: https://issues.apache.org/jira/browse/SOLR-11484
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Noble Paul
>             Fix For: 7.2, master (8.0)
>
>         Attachments: SOLR-11484.patch, SOLR-11484.patch, 
> jenkins.thetaphi.20662.txt
>
>
> This was discovered while auditing jenkins failures from 
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}} 
> (where a test explicitly deletes and then recreates a collection with the 
> same name), but as noted in a comment below, SOLR-11392 is another example of 
> non-obvious test failures that can pop up because of this bug.
> In practice, it can affect any CloudSolrClient user after changes have been 
> made to a collection (to add/move replicas, etc...)
> ----
> Original jira notes...
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}}
> seems to fail with non-trivial frequency, so I grabbed the logs from a recent 
> failure and started trying to follow along with the actions to figure out 
> what exactly is happening....
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/
> {noformat}
>    [junit4] ERROR   20.3s J1 | 
> TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
>    [junit4]    > Throwable #1: 
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from 
> server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: 
> Expected mime type application/octet-stream but got text/html. <html>
>    [junit4]    > <head>
>    [junit4]    > <meta http-equiv="Content-Type" 
> content="text/html;charset=ISO-8859-1"/>
>    [junit4]    > <title>Error 404 </title>
> {noformat}
> The crux of this failure appears to be a genuine bug in how CloudSolrClient 
> uses its cached ClusterState info when doing (direct) updates.  The key bits 
> seem to be:
> * CloudSolrClient does _something_ (update,query,etc...) with a collection 
> causing the current cluster state for the collection to be cached
> * The actual collection changes such that a Solr node/core no longer exists 
> as part of the collection
> * CloudSolrClient is asked to process an UpdateRequest which triggers the 
> code paths for the {{directUpdate()}} method -- which attempts to route the 
> updates directly to a replica of the appropriate shard using the (cached) 
> collection state info
> * CloudSolrClient (may) attempt to send that UpdateRequest to a node/core 
> that doesn't exist, getting a 404 -- which does not (seem to) trigger a state 
> refresh, or retry to find a correct URL to resend the update to.
> Details to follow in comment....
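
The sequence described in the bullets above can be illustrated with a minimal, self-contained sketch. It does not use SolrJ and is not the actual {{CloudSolrClient}} code or the fix committed for this issue; every name in it ({{StaleRoutingSketch}}, {{liveState}}, {{cachedState}}, {{send}}, {{naiveUpdate}}, {{updateWithRefresh}}, and the core URLs) is hypothetical. It only models a client that routes by a cached collection-to-core mapping, gets a 404 once the collection has changed, and recovers if it drops the cached entry and retries once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the stale-cache routing problem; NOT SolrJ code.
final class StaleRoutingSketch {
    // Authoritative state (stands in for the real cluster state): collection -> core URL.
    static final Map<String, String> liveState = new HashMap<>();
    // Client-side cache of that state, filled lazily on first use.
    static final Map<String, String> cachedState = new HashMap<>();

    // Simulated server: 200 if a core currently exists at this URL, otherwise 404.
    static int send(String url) {
        return liveState.containsValue(url) ? 200 : 404;
    }

    // Routes using the cached entry and never refreshes it (the behaviour described above).
    static int naiveUpdate(String collection) {
        String url = cachedState.computeIfAbsent(collection, liveState::get);
        return send(url);
    }

    // Same routing, but a 404 invalidates the cached entry and the request is retried once.
    static int updateWithRefresh(String collection) {
        int status = naiveUpdate(collection);
        if (status == 404) {
            cachedState.remove(collection);   // assume the cached state is stale
            status = naiveUpdate(collection); // re-reads the live state, then resends
        }
        return status;
    }

    public static void main(String[] args) {
        liveState.put("testcollection", "core_old");      // collection is created
        System.out.println(naiveUpdate("testcollection")); // first request caches core_old

        liveState.put("testcollection", "core_new");      // collection deleted and recreated
        System.out.println(naiveUpdate("testcollection")); // stale cache routes to core_old
        System.out.println(updateWithRefresh("testcollection")); // refresh, then retry
    }
}
```

The single retry is a deliberate simplification: it keeps the sketch from looping forever when the collection really is gone, at the cost of ignoring cases where the state changes again between the refresh and the resend.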



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
