[ https://issues.apache.org/jira/browse/SOLR-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-7673:
----------------------------------------
Summary: Race condition in shard splitting can cause operation to hang
indefinitely or sub-shards to never become active (was: Race condition in
shard splitting waiting for sub-shard replicas to go into recovery)
> Race condition in shard splitting can cause operation to hang indefinitely or
> sub-shards to never become active
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-7673
> URL: https://issues.apache.org/jira/browse/SOLR-7673
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10.4, 5.2
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Fix For: 5.3, Trunk
>
> Attachments: Lucene-Solr-Tests-trunk-Java8-69.log, SOLR-7673.patch
>
>
> I found this while digging into the failure on
> https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java8/69/
> {code}
> [junit4] 2> 390032 INFO
> (OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000)
> [n:127.0.0.1:57926_ ] o.a.s.c.o.ReplicaMutator Update state numShards=1
> message={
> [junit4] 2> "core":"testasynccollectioncreation_shard1_0_replica2",
> [junit4] 2> "core_node_name":"core_node5",
> [junit4] 2> "roles":null,
> [junit4] 2> "base_url":"http://127.0.0.1:38702",
> [junit4] 2> "node_name":"127.0.0.1:38702_",
> [junit4] 2> "numShards":"1",
> [junit4] 2> "state":"active",
> [junit4] 2> "shard":"shard1_0",
> [junit4] 2> "collection":"testasynccollectioncreation",
> [junit4] 2> "operation":"state"}
> [junit4] 2> 390033 INFO
> (OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000)
> [n:127.0.0.1:57926_ ] o.a.s.c.o.ZkStateWriter going to update_collection
> /collections/testasynccollectioncreation/state.json version: 18
> [junit4] 2> 390033 INFO
> (RecoveryThread-testasynccollectioncreation_shard1_0_replica2)
> [n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5
> x:testasynccollectioncreation_shard1_0_replica2] o.a.s.u.UpdateLog Took 2 ms
> to seed version buckets with highest version 1503803851028824064
> [junit4] 2> 390033 INFO
> (zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_
> ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
> state:SyncConnected type:NodeDataChanged
> path:/collections/testasynccollectioncreation/state.json for collection
> testasynccollectioncreation has occurred - updating... (live nodes size: 2)
> [junit4] 2> 390033 INFO
> (RecoveryThread-testasynccollectioncreation_shard1_0_replica2)
> [n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5
> x:testasynccollectioncreation_shard1_0_replica2] o.a.s.c.RecoveryStrategy
> Finished recovery process.
> [junit4] 2> 390033 INFO
> (zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_
> ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
> state:SyncConnected type:NodeDataChanged
> path:/collections/testasynccollectioncreation/state.json for collection
> testasynccollectioncreation has occurred - updating... (live nodes size: 2)
> [junit4] 2> 390034 INFO
> (zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_
> ] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to
> ver 19
> [junit4] 2> 390034 INFO
> (zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_
> ] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to
> ver 19
> [junit4] 2> 390298 INFO (qtp1859109869-1664) [n:127.0.0.1:38702_ ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :requeststatus with
> paramsrequestid=1004&action=REQUESTSTATUS&wt=javabin&version=2
> [junit4] 2> 390299 INFO (qtp1859109869-1664) [n:127.0.0.1:38702_ ]
> o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/collections
> params={requestid=1004&action=REQUESTSTATUS&wt=javabin&version=2} status=0
> QTime=1
> [junit4] 2> 390348 INFO (qtp1859109869-1665) [n:127.0.0.1:38702_ ]
> o.a.s.h.a.CoreAdminHandler Checking request status for : 10042599422118342586
> [junit4] 2> 390348 INFO (qtp1859109869-1665) [n:127.0.0.1:38702_ ]
> o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores
> params={qt=/admin/cores&requestid=10042599422118342586&action=REQUESTSTATUS&wt=javabin&version=2}
> status=0 QTime=1
> [junit4] 2> 390348 INFO
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
> [n:127.0.0.1:57926_ ] o.a.s.c.OverseerCollectionProcessor Asking sub shard
> leader to wait for: testasynccollectioncreation_shard1_0_replica2 to be alive
> on: 127.0.0.1:38702_
> [junit4] 2> 390349 INFO
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
> [n:127.0.0.1:57926_ ] o.a.s.c.OverseerCollectionProcessor Creating replica
> shard testasynccollectioncreation_shard1_1_replica2 as part of slice shard1_1
> of collection testasynccollectioncreation on 127.0.0.1:38702_
> [junit4] 2> 390349 INFO
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
> [n:127.0.0.1:57926_ ] o.a.s.c.c.ZkStateReader Load collection config
> from:/collections/testasynccollectioncreation
> [junit4] 2> 390350 INFO
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
> [n:127.0.0.1:57926_ ] o.a.s.c.c.ZkStateReader
> path=/collections/testasynccollectioncreation configName=conf1 specified
> config exists in ZooKeeper
> [junit4] 2> 390352 INFO (qtp381614561-1631) [n:127.0.0.1:57926_ ]
> o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores
> params={nodeName=127.0.0.1:38702_&core=testasynccollectioncreation_shard1_0_replica1&async=10042599424132346542&qt=/admin/cores&coreNodeName=core_node5&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
> status=0 QTime=2
> {code}
> The following sequence of events leads to a deadlock (see the sketch below):
> # testasynccollectioncreation_shard1_0_replica2 (core_node5) becomes active
> # OCP asks the sub-shard leader testasynccollectioncreation_shard1_0_replica1
> (via PREPRECOVERY) to wait until replica2 goes into the recovering state
> # Because replica2 is already active, it never transitions to recovering, so
> the leader's PREPRECOVERY wait never completes and the split never finishes
> At this point, the test just keeps polling the async request status until it
> times out and fails.
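> For illustration only, here is a minimal, self-contained Java sketch of the
> wait pattern involved (this is not Solr code; the class, method, and state
> names are hypothetical stand-ins for the PREPRECOVERY wait and the replica
> state published in state.json):
> {code}
> // Hypothetical sketch, NOT Solr source: models a leader waiting for a replica
> // to reach the "recovering" state after the replica has already gone "active".
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicReference;
>
> public class PrepRecoveryWaitSketch {
>
>   // Stand-in for the replica state the leader observes via cluster-state updates.
>   private static final AtomicReference<String> replicaState = new AtomicReference<>("down");
>
>   // Poll the published state until it equals the target or the timeout expires.
>   static boolean waitForState(String targetState, long timeout, TimeUnit unit)
>       throws InterruptedException {
>     long deadline = System.nanoTime() + unit.toNanos(timeout);
>     while (System.nanoTime() < deadline) {
>       if (targetState.equals(replicaState.get())) {
>         return true; // the replica reached the state we were asked to wait for
>       }
>       Thread.sleep(100); // simplified: re-check periodically instead of on ZK watch events
>     }
>     return false; // timed out: the target state was never observed
>   }
>
>   public static void main(String[] args) throws Exception {
>     // The race: the replica finishes recovery and publishes "active" before the
>     // leader starts waiting for "recovering", so the transition is never seen.
>     replicaState.set("active");
>     boolean sawRecovering = waitForState("recovering", 5, TimeUnit.SECONDS);
>     System.out.println("saw recovering? " + sawRecovering); // prints false after the timeout
>   }
> }
> {code}
> A wait written this way can only succeed if it also accepts an already-active
> replica (or re-reads the current state before blocking); whether the attached
> patch takes exactly that approach is not shown in this log excerpt.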