[ https://issues.apache.org/jira/browse/SOLR-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377774#comment-16377774 ]
Hoss Man commented on SOLR-11067:
---------------------------------

Since this issue is still marked "Open" I'll post this here instead of creating a new jira...

The logic used by REPLACENODE when no target is specified appears to be flawed.

The logs from jenkins failures of ReplaceNodeNoTargetTest show that sometimes the 'source' node is chosen to be its own replacement...

https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-master-Linux/21538/

{noformat}
[junit4] 2> 1369525 INFO (qtp1209949969-11440) [n:127.0.0.1:35749_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :replacenode with params async=001&action=REPLACENODE&source=127.0.0.1:45303_solr&wt=javabin&version=2 and sendToOCPQueue=true
[junit4] 2> 1369526 INFO (qtp1209949969-11440) [n:127.0.0.1:35749_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={async=001&action=REPLACENODE&source=127.0.0.1:45303_solr&wt=javabin&version=2} status=0 QTime=1
...
[junit4] 2> 1369527 INFO (OverseerThreadFactory-4614-thread-2-processing-n:127.0.0.1:33719_solr) [n:127.0.0.1:33719_solr ] o.a.s.c.a.c.ReplaceNodeCmd Going to create replica for collection=replacenodetest_coll_notarget shard=shard1 on node=null
...
[junit4] 2> 1369551 INFO (OverseerThreadFactory-4614-thread-2-processing-n:127.0.0.1:33719_solr) [n:127.0.0.1:33719_solr ] o.a.s.c.a.c.AddReplicaCmd Node Identified 127.0.0.1:45303_solr for creating new replica
[junit4] 2> 1369553 INFO (OverseerStateUpdate-72215357633069068-127.0.0.1:33719_solr-n_0000000000) [n:127.0.0.1:33719_solr ] o.a.s.c.o.SliceMutator createReplica() {
[junit4] 2> "operation":"addreplica",
[junit4] 2> "collection":"replacenodetest_coll_notarget",
[junit4] 2> "shard":"shard1",
[junit4] 2> "core":"replacenodetest_coll_notarget_shard1_replica_n21",
[junit4] 2> "state":"down",
[junit4] 2> "base_url":"http://127.0.0.1:45303/solr",
[junit4] 2> "node_name":"127.0.0.1:45303_solr",
[junit4] 2> "type":"NRT"}
{noformat}

NOTE: The test currently fails with a vague {{java.lang.AssertionError}} (raised when the number of cores on the source node is non-zero after the command completes) in roughly 30% of its jenkins runs. This is consistent with the idea that the REPLACENODE command is randomly picking from _all_ currently active nodes (w/o excluding the one to be replaced): there are 6 nodes to choose from, but a total of 10 cores in the cluster. So instead of failing in just 1/6th of the runs, in 2 out of 3 runs there are 2 cores on the source node and the randomized selection happens twice in the test: (1/3 * 1/6) + (2/3 * (1/6 + 1/6)) ~= 28%

----

I plan to commit some improvements to the test assertions/logging. They helped me realize that entirely new cores were being added to `node2bdecommissioned`, which led to discovering the (apparent) root cause of the failure, but someone who understands the actual `REPLACENODE` code needs to figure out "the right" fix for this bug.

SIDE NOTE: what happens if I try `REPLACENODE` w/o a target on a cluster with only one node? Presumably that should be a failure case -- but I'm confident we don't have a test for it.
(because if we did, then based on the logs above it would be failing 100% of the time, as it would just keep re-using the source node every time)

> REPLACENODE should make it optional to provide a target node
> ------------------------------------------------------------
>
> Key: SOLR-11067
> URL: https://issues.apache.org/jira/browse/SOLR-11067
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public)
> Components: AutoScaling, SolrCloud
> Reporter: Shalin Shekhar Mangar
> Assignee: Noble Paul
> Priority: Major
> Fix For: master (8.0), 7.3
>
> Attachments: SOLR-11067.patch
>
>
> The REPLACENODE API currently accepts a replacement target and moves all
> replicas from the source to the given target. We can improve this by having
> it figure out the right target node for each replica contained in the source.
> This can also then be a thin wrapper over nodeLost event just like how
> UTILIZENODE (SOLR-9743) can be a wrapper over nodeAdded event.
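
As a rough check of the failure-rate estimate in the comment above, the arithmetic can be reproduced in a few lines of Python. This is only a sketch under the comment's own assumptions: the replacement is picked uniformly at random from 6 nodes (source included), and the source node holds 1 core in 1/3 of runs and 2 cores in 2/3 of runs. The names `NODES` and `failure_rate` are made up for illustration and are not part of Solr or its tests.

```python
from fractions import Fraction

# Assumption taken from the comment: 6 nodes, and the source node is not
# excluded from the random pick.
NODES = 6


def failure_rate(p_one_core, p_two_cores, exact=False):
    """Chance that at least one replica is re-created on the source node."""
    pick_source = Fraction(1, NODES)
    one_pick = pick_source
    if exact:
        # Two independent picks: 1 - P(neither pick hits the source).
        two_picks = 1 - (1 - pick_source) ** 2
    else:
        # The comment's shortcut: simply add the two 1/6 chances.
        two_picks = 2 * pick_source
    return p_one_core * one_pick + p_two_cores * two_picks


approx = failure_rate(Fraction(1, 3), Fraction(2, 3))
exact = failure_rate(Fraction(1, 3), Fraction(2, 3), exact=True)
print(approx, float(approx))
print(exact, float(exact))
```

The shortcut gives 5/18 (about 27.8%), and treating the two picks as independent gives 7/27 (about 25.9%). Both are close to the roughly 30% failure rate reported for the jenkins runs.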