[ 
https://issues.apache.org/jira/browse/SOLR-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377774#comment-16377774
 ] 

Hoss Man commented on SOLR-11067:
---------------------------------



Since this issue is still marked "Open", I'll post this here instead of creating 
a new jira.

The logic used by REPLACENODE when no target is specified appears to be flawed. 
The logs from jenkins failures of ReplaceNodeNoTargetTest show that sometimes 
the 'source' node is chosen to be its own replacement...

https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-master-Linux/21538/
{noformat}
   [junit4]   2> 1369525 INFO  (qtp1209949969-11440) [n:127.0.0.1:35749_solr    
] o.a.s.h.a.CollectionsHandler Invoked Collection Action :replacenode with 
params 
async=001&action=REPLACENODE&source=127.0.0.1:45303_solr&wt=javabin&version=2 
and sendToOCPQueue=true
   [junit4]   2> 1369526 INFO  (qtp1209949969-11440) [n:127.0.0.1:35749_solr    
] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections 
params={async=001&action=REPLACENODE&source=127.0.0.1:45303_solr&wt=javabin&version=2}
 status=0 QTime=1
...
   [junit4]   2> 1369527 INFO  
(OverseerThreadFactory-4614-thread-2-processing-n:127.0.0.1:33719_solr) 
[n:127.0.0.1:33719_solr    ] o.a.s.c.a.c.ReplaceNodeCmd Going to create replica 
for collection=replacenodetest_coll_notarget shard=shard1 on node=null
...
   [junit4]   2> 1369551 INFO  
(OverseerThreadFactory-4614-thread-2-processing-n:127.0.0.1:33719_solr) 
[n:127.0.0.1:33719_solr    ] o.a.s.c.a.c.AddReplicaCmd Node Identified 
127.0.0.1:45303_solr for creating new replica
   [junit4]   2> 1369553 INFO  
(OverseerStateUpdate-72215357633069068-127.0.0.1:33719_solr-n_0000000000) 
[n:127.0.0.1:33719_solr    ] o.a.s.c.o.SliceMutator createReplica() {
   [junit4]   2>   "operation":"addreplica",
   [junit4]   2>   "collection":"replacenodetest_coll_notarget",
   [junit4]   2>   "shard":"shard1",
   [junit4]   2>   "core":"replacenodetest_coll_notarget_shard1_replica_n21",
   [junit4]   2>   "state":"down",
   [junit4]   2>   "base_url":"http://127.0.0.1:45303/solr",
   [junit4]   2>   "node_name":"127.0.0.1:45303_solr",
   [junit4]   2>   "type":"NRT"} 
{noformat}

NOTE: The test currently fails with an uninformative 
{{java.lang.AssertionError}} (raised when the number of cores on the source 
node is non-0 after the command completes) in roughly 30% of its jenkins runs. 
This is consistent with the idea that the REPLACENODE command is randomly 
picking from _all_ currently active nodes (w/o excluding the one to be 
replaced): there are 6 nodes to choose from, but a total of 10 cores in 
the cluster -- so instead of just failing in 1/6th of the runs, in 2 out of 3 
runs there are 2 cores on the source node and the randomized selection happens 
twice in the test:  (1/3 * 1/6) + (2/3 * (1/6 + 1/6)) ~= 28%
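
The arithmetic in that estimate can be checked with a few lines of Python. This 
is only a sketch of the reasoning, not anything taken from the test or from 
Solr: the variable names are made up for illustration, and the second figure 
rests on the extra assumption that the two selections are independent and 
uniform, which the logs do not confirm.

```python
from fractions import Fraction

NODES = 6  # live nodes in the test cluster, per the comment above

# Chance that one uniformly random pick lands on the source node itself.
p_pick_source = Fraction(1, NODES)

# The estimate above takes the source node to hold 1 core in 1/3 of the
# runs and 2 cores in the other 2/3.
p_one_core = Fraction(1, 3)
p_two_cores = Fraction(2, 3)

# Estimate exactly as written above: the two picks are simply added.
estimate = p_one_core * p_pick_source + p_two_cores * (p_pick_source + p_pick_source)

# Variant that treats the two picks as independent events
# (an assumption, not something the logs confirm).
p_any_of_two = 1 - (1 - p_pick_source) ** 2
independent = p_one_core * p_pick_source + p_two_cores * p_any_of_two

print(estimate, float(estimate))
print(independent, float(independent))
```

The first value is 5/18 and the second is 7/27, so either way the expected 
failure rate lands in the same rough range as what jenkins reports.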

----

I plan to commit some improvements to the test assertions/logging that helped 
me realize that entirely new cores were being added to the 
`node2bdecommissioned`, which led to discovering the (apparent) root cause of 
the failure, but someone who understands the actual `REPLACENODE` code needs 
to figure out "the right" fix for this bug.

SIDE NOTE: what happens if I try `REPLACENODE` w/o a target on a cluster with 
only one node? Presumably that should be a failure case -- but I'm confident we 
don't have a test for it (because if we did then, based on the logs above, it 
would be failing 100% of the time as it kept re-using the source node 
every time).
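
To make the expected behavior concrete, here is a rough sketch of target 
selection that never returns the source node and fails outright when no other 
node is available. It is not the Solr implementation and not a proposed patch: 
`pick_target_node` is a hypothetical helper, and uniform random choice is 
assumed only because that is what the failure rate above suggests.

```python
import random
from typing import Iterable, Optional


def pick_target_node(live_nodes: Iterable[str], source: str,
                     rng: Optional[random.Random] = None) -> str:
    """Hypothetical helper: choose a replacement node that is never `source`."""
    # Drop the node being replaced before making any choice.
    candidates = sorted(n for n in live_nodes if n != source)
    if not candidates:
        # Single-node cluster (or only the source is live): nothing to move to.
        raise ValueError(f"no live node other than {source} to move replicas to")
    return (rng or random).choice(candidates)


nodes = ["127.0.0.1:45303_solr", "127.0.0.1:33719_solr", "127.0.0.1:35749_solr"]
target = pick_target_node(nodes, source="127.0.0.1:45303_solr")
print(target)
```

With only the source node in the list, the same call raises `ValueError` 
instead of quietly re-using the source, which is the failure case a 
single-node test would want to assert.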


> REPLACENODE should make it optional to provide a target node
> ------------------------------------------------------------
>
>                 Key: SOLR-11067
>                 URL: https://issues.apache.org/jira/browse/SOLR-11067
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling, SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Noble Paul
>            Priority: Major
>             Fix For: master (8.0), 7.3
>
>         Attachments: SOLR-11067.patch
>
>
> The REPLACENODE API currently accepts a replacement target and moves all 
> replicas from the source to the given target. We can improve this by having 
> it figure out the right target node for each replica contained in the source.
> This can also then be a thin wrapper over nodeLost event just like how 
> UTILIZENODE (SOLR-9743) can be a wrapper over nodeAdded event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

