[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694736#comment-15694736 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit 3a3d6afef08e3c49c29cc9a015f4e7cca40cf52d in lucene-solr's branch refs/heads/branch_6x from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a3d6af ]

SOLR-9512: removed unused import

> CloudSolrClient's cluster state cache can break direct updates to leaders
> ---
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Alan Woodward
> Assignee: Noble Paul
> Attachments: SOLR-9512.patch, SOLR-9512.patch, SOLR-9512.patch
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390. The
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with
> directUpdatesToLeadersOnly set to true. CSC caches its view of the
> DocCollection. The leader then goes down, and is reassigned. Next time
> documents are added, CSC checks its cache again, and gets the old view of the
> DocCollection. It then tries to send the update directly to the old, now
> down, leader, and we get ConnectionRefused.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
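The failure sequence in the issue description quoted above can be reproduced in a toy model. The following is a minimal sketch, not SolrJ code: `FakeCluster` and `CachingClient` are hypothetical names, and the model assumes that any node other than the current leader refuses connections.

```java
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the cluster: tracks which node currently leads each shard. */
final class FakeCluster {
    private final Map<String, String> leaders = new HashMap<>();

    void setLeader(String shard, String node) {
        leaders.put(shard, node);
    }

    String leaderOf(String shard) {
        return leaders.get(shard);
    }

    /** Toy rule: only the current leader is reachable; every other node is treated as down. */
    void update(String shard, String node) throws ConnectException {
        if (node == null || !node.equals(leaders.get(shard))) {
            throw new ConnectException("Connection refused: " + node);
        }
    }
}

/** Hypothetical client that caches the leader and never refreshes the cache on failure. */
final class CachingClient {
    private final FakeCluster cluster;
    private final Map<String, String> cachedLeaders = new HashMap<>();

    CachingClient(FakeCluster cluster) {
        this.cluster = cluster;
    }

    void add(String shard) throws ConnectException {
        // The leader is looked up once; later calls keep trusting the cached answer.
        String leader = cachedLeaders.computeIfAbsent(shard, cluster::leaderOf);
        cluster.update(shard, leader);
    }
}
```

In this model the first `add` succeeds, and every `add` after a leader change throws `ConnectException`, because nothing ever evicts the cached entry.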
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694664#comment-15694664 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit 142461b395506efa01ec2509346bae755f1b2726 in lucene-solr's branch refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=142461b ]

SOLR-9512: removed unused import
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694066#comment-15694066 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit 5650939a8d41b7bad584947a2c9dcedf3774b8de in lucene-solr's branch refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5650939 ]

SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to leaders
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694063#comment-15694063 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit d87ffa4bf82c30e9a6f0bbb6b8c0087a5c07f9d6 in lucene-solr's branch refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d87ffa4 ]

SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to leaders
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694027#comment-15694027 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit e309f9058985375076cac0ed982a158dd865b86a in lucene-solr's branch refs/heads/branch_6x from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e309f90 ]

SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to leaders
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587088#comment-15587088 ]

Shalin Shekhar Mangar commented on SOLR-9512:
---

bq. Case 6: do I understand it right that we would keep failing the indexing requests but 'only' until eventually the client manages to reconnect to ZK?

Yes, that is correct.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583374#comment-15583374 ]

Christine Poerschke commented on SOLR-9512:
---

Thanks Shalin and Noble for analysing and summarising this.

The proposed solution for cases 1-5 sounds good to me (though I have not actually looked at the code concerned to see what the implementation of that solution might look like).

Case 6: do I understand it right that we would keep failing the indexing requests but 'only' until eventually the client manages to reconnect to ZK? I agree that asking a random Solr instance for its latest cluster state could be problematic.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576189#comment-15576189 ]

Shalin Shekhar Mangar commented on SOLR-9512:
---

Noble and I discussed this offline. Here is a summary of the problem and the solution.

There are six cases that we need to tackle. Assuming replica x is the leader:

# Case 1: x is disconnected from ZK, y becomes leader
** Currently: x throws an error on indexing; the client fails and keeps sending requests to x, which keep failing. This continues until x reconnects and the client gets a stale state flag in the response.
# Case 2: x is dead, y becomes leader
** Currently: the client gets a ConnectException or NoHttpResponseException (for in-flight requests) and keeps retrying the request against x. This continues until x comes back online.
# Case 3: x is disconnected from ZK, no one is leader
** Currently: the client keeps sending requests to x, which fail because x is disconnected from ZK. This continues until x reconnects and the client gets a stale state flag in the response.
# Case 4: x is dead, no one is leader yet
** Currently: the client gets a ConnectException or NoHttpResponseException (for in-flight requests) and keeps retrying the request against x. This continues until x comes back online.
# Case 5: x is alive but now y is leader
** Currently: the client gets a stale state flag from x and refreshes the cluster state to see y as the new leader. All further indexing requests are sent to y.
# Case 6: the client is disconnected from ZK
** Currently: the client keeps indexing to x. If it receives a stale state error, it tries to refresh the cluster state, fails, and continues to send further requests to x. It keeps failing and keeps trying to read from ZK, stuck in a cycle.

Cases 1-5 are solved by a single solution: on a ConnectException, a NoHttpResponseException, or a "leader disconnected from ZK" error, the client should fetch the state from ZK again. If the client fetches from ZK and does not get a new version, this should be recorded in a flag, and subsequent retries should happen only after N seconds have elapsed or if we know for a fact that the version has changed since the last ZK fetch was made. N could be something as small as 2 seconds or so.

Case 6 is more difficult. Either we can keep failing the indexing requests, or we can ask a random Solr instance to return the latest cluster state. The latter is risky because it can open us up to bugs that are very difficult to debug, so I am inclined to punt on this for now.
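The throttling rule proposed for cases 1-5 in the summary above can be sketched as a small state machine. This is only an illustration under stated assumptions: `StateRefetchThrottle` is a hypothetical class, not part of SolrJ; "retries" is read here as retries of the ZK fetch; and how the client learns that the version "has changed for a fact" is left to the caller as a boolean.

```java
/** Hypothetical helper (not SolrJ code): decides when to re-read collection state from ZK. */
final class StateRefetchThrottle {
    private final long minIntervalMillis;   // the "N seconds" from the summary
    private int cachedVersion;
    private boolean lastFetchUnchanged = false;
    private long lastFetchAtMillis = 0L;

    StateRefetchThrottle(int initialVersion, long minIntervalMillis) {
        this.cachedVersion = initialVersion;
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Called after a leader failure: should the client fetch state from ZK now? */
    boolean shouldFetch(long nowMillis, boolean versionKnownChanged) {
        if (!lastFetchUnchanged) {
            return true;                    // no fetch yet, or the last fetch brought a new version
        }
        if (versionKnownChanged) {
            return true;                    // caller knows the version moved; skip the wait
        }
        return nowMillis - lastFetchAtMillis >= minIntervalMillis;
    }

    /** Records a completed fetch; returns true if the fetched version differs from the cached one. */
    boolean recordFetch(long nowMillis, int fetchedVersion) {
        lastFetchAtMillis = nowMillis;
        lastFetchUnchanged = (fetchedVersion == cachedVersion);
        cachedVersion = fetchedVersion;
        return !lastFetchUnchanged;
    }
}
```

Passing the clock in as `nowMillis` keeps the sketch deterministic; a real client would more likely read a monotonic clock itself.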
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509487#comment-15509487 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit d326adc8bfa33432d50293402a39454d60e070e4 in lucene-solr's branch refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d326adc ]

SOLR-9305, SOLR-9390: Don't use directToLeaders updates in partition tests (see SOLR-9512)
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509483#comment-15509483 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit bae66f7cca8cff796d142eb19585d8e79fae34f8 in lucene-solr's branch refs/heads/branch_6x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bae66f7 ]

SOLR-9305, SOLR-9390: Don't use directToLeaders updates in partition tests (see SOLR-9512)
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506675#comment-15506675 ]

Noble Paul commented on SOLR-9512:
---

Does it make any sense to have the explicit flag {{directUpdatesToLeaders}}? IMHO it should be the only supported behavior. Why would we ever send a request to another node when the shard leader is down?

My proposal is as follows.

* The LBHttpSolrClient is aware of down servers. So, if the leader for the shard is down, we already know it and we can fail fast.
* If we have to fail because the shard leader is dead, we should try to explicitly read the {{state.json}} for that collection, provided the cached state is older than a certain threshold (say 1 sec?). This means the client may fire a request to ZK every second to refresh the state. For such refreshes, check the version first to optimize ZK reads.
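The second bullet of the proposal above combines an age threshold with a version check that precedes any full read. A minimal sketch of that idea follows; `FakeStateStore` and `VersionCheckedCache` are hypothetical names standing in for ZK and the client-side cache, and the counters exist only to make the saved reads visible.

```java
/** Hypothetical stand-in for ZK: counts cheap version checks and full state reads. */
final class FakeStateStore {
    int version = 1;
    String state = "leader=x";
    int versionChecks = 0;
    int fullReads = 0;

    int currentVersion() {
        versionChecks++;
        return version;
    }

    String readState() {
        fullReads++;
        return state;
    }
}

/** Hypothetical cache: refreshes after a leader failure only when the entry is old enough. */
final class VersionCheckedCache {
    private final FakeStateStore store;
    private final long maxAgeMillis;     // the "certain threshold (say 1 sec?)"
    private String cachedState;
    private int cachedVersion;
    private long fetchedAtMillis;

    VersionCheckedCache(FakeStateStore store, long maxAgeMillis, long nowMillis) {
        this.store = store;
        this.maxAgeMillis = maxAgeMillis;
        this.cachedVersion = store.currentVersion();
        this.cachedState = store.readState();
        this.fetchedAtMillis = nowMillis;
    }

    /** Returns the state to use after the shard leader was found to be down. */
    String stateAfterLeaderFailure(long nowMillis) {
        if (nowMillis - fetchedAtMillis < maxAgeMillis) {
            return cachedState;          // entry is recent: do not touch the store at all
        }
        int latest = store.currentVersion();
        if (latest != cachedVersion) {   // only a changed version justifies a full read
            cachedState = store.readState();
            cachedVersion = latest;
        }
        fetchedAtMillis = nowMillis;
        return cachedState;
    }
}
```

With a threshold of one second, a client that fails repeatedly performs at most one version check per second and a full read only when the version has actually moved.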
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506575#comment-15506575 ]

Alan Woodward commented on SOLR-9512:
---

OK, have reverted. For the moment, I'll set directUpdatesToLeaders=false on the two regularly failing test cases.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506551#comment-15506551 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit a4293ce7c4e849b171430a34f36b830a84927a93 in lucene-solr's branch refs/heads/branch_6x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a4293ce ]

Revert "SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down"

This reverts commit f96017d9e10c665e7ab6b9161f2af08efc491946.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506552#comment-15506552 ]

ASF subversion and git services commented on SOLR-9512:
---

Commit bd3fc7f43ff54a174660b7ad51f031d2104f84b5 in lucene-solr's branch refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd3fc7f ]

Revert "SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down"

This reverts commit 3d130097b7768a8d753476ffe26b83db070c8e20.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506019#comment-15506019 ]

Christine Poerschke commented on SOLR-9512:
---

Hi [~romseygeek] and [~noble.paul] - I am only now joining this ticket, since I have been on vacation.

The SOLR-9090 {{directUpdatesToLeadersOnly}} motivation/intention was for the flag to be not a hint but a directive, and for updates to 'fail fast' if there is (temporarily or otherwise) no shard leader. That is: fail fast (and let the caller of the {{CloudSolrClient}} handle alarming and retries as it sees fit), as opposed to sending or retry-sending to a non-leader, which would then forward to the leader (and potentially still fail eventually, only slowly rather than fast).

[~Marvin Justice] and I worked together on SOLR-9090 - Marvin, any thoughts on this ticket here?
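The "directive, not hint" semantics described in the comment above can be illustrated with a small target-selection routine. This is a sketch of the intended behaviour only, not the SolrJ implementation: `UpdateTargets.choose` and the use of `IllegalStateException` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not SolrJ code): picks the URLs an update may be sent to. */
final class UpdateTargets {
    /**
     * @param leaderUrl        URL of the shard leader, or null if no leader is currently known
     * @param otherReplicaUrls URLs of the shard's non-leader replicas
     */
    static List<String> choose(String leaderUrl,
                               List<String> otherReplicaUrls,
                               boolean directUpdatesToLeadersOnly) {
        List<String> targets = new ArrayList<>();
        if (leaderUrl != null) {
            targets.add(leaderUrl);          // the leader always goes first
        }
        if (directUpdatesToLeadersOnly) {
            if (targets.isEmpty()) {
                // Directive semantics: no leader means no send; the caller decides about retries.
                throw new IllegalStateException("No shard leader available; failing fast");
            }
            return targets;
        }
        targets.addAll(otherReplicaUrls);    // hint semantics: replicas serve as fallbacks
        return targets;
    }
}
```

With the flag set, the caller either gets exactly the leader or an immediate exception; without it, a replica may accept the update and forward it to the leader.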
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505956#comment-15505956 ]

Noble Paul commented on SOLR-9512:
---

Let's first discuss what the fix should be, and not rush into solutions.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505951#comment-15505951 ]

Alan Woodward commented on SOLR-9512:
---

OK, but let's deal with that in a separate issue, as I think it's a bit more complicated than just fixing cache invalidation for /update requests. The current invalidation and retry logic happens at the level of the whole request, but CSC is splitting updates up into several sub-requests, and we probably don't want to redo the whole thing if a single subshard has changed leader.
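The concern in the comment above is that a whole-request retry resends sub-requests that already succeeded. One possible shape for a per-shard retry is sketched below; it is not what CloudSolrClient does, and `SubRequestRetry`, the `BiPredicate` sender, and the single-retry policy are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical helper (not SolrJ code): retries only the shard batches that failed. */
final class SubRequestRetry {
    /** Sends every batch once; returns the shards whose send reported failure. */
    static List<String> sendAll(Map<String, List<String>> docsByShard,
                                BiPredicate<String, List<String>> sender) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : docsByShard.entrySet()) {
            if (!sender.test(entry.getKey(), entry.getValue())) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    /** Sends all batches, refreshes state if anything failed, then resends only the failures. */
    static List<String> sendWithOneRetry(Map<String, List<String>> docsByShard,
                                         BiPredicate<String, List<String>> sender,
                                         Runnable refreshState) {
        List<String> failed = sendAll(docsByShard, sender);
        if (failed.isEmpty()) {
            return failed;
        }
        refreshState.run();                  // e.g. re-read the collection state
        Map<String, List<String>> retry = new LinkedHashMap<>();
        for (String shard : failed) {
            retry.put(shard, docsByShard.get(shard));
        }
        return sendAll(retry, sender);       // shards that succeeded are not touched again
    }
}
```

The return value lists the shards that still failed after the retry, so the caller can surface a partial failure instead of an all-or-nothing result.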
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505858#comment-15505858 ]

Noble Paul commented on SOLR-9512:
---

bq. The invalidation clause above is only called when the request was successful,

It shouldn't need to. The response must have the flag to invalidate the collection. If that code is not working as expected, we should fix that.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505849#comment-15505849 ]

Alan Woodward commented on SOLR-9512:
---

bq. because the new leader is not elected yet

The invalidation clause above is only called when the request was successful, i.e. if a new leader is up. If leader election hasn't happened yet, then LBHttpSolrClient will throw an exception and the whole request will fail. The test case should illustrate this.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505419#comment-15505419 ]

Noble Paul commented on SOLR-9512:
----------------------------------

bq. How so?

You invalidate the cache if the first server did not serve the request. That's a problem. When the next request comes, it gets the fresh state, which is exactly the same as the entry that was just invalidated, because the new leader is not elected yet and state.json is not yet updated in ZK.

As we discussed before, the cache must be invalidated when a server says the version is stale.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504502#comment-15504502 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

bq. The patch contains way more changes than you mentioned.

I ... don't think it does? It makes the changes I described above to CloudSolrClient, and adds a test case. What else is there?

bq. The following is not the way we should invalidate the cache.

How so?
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504227#comment-15504227 ]

Noble Paul commented on SOLR-9512:
----------------------------------

The following is not the way we should invalidate the cache.

{code}
if (response.getServer().equals(url) == false) {
  // we didn't hit our first-preference server, which means that our cached
  // collection state is no longer valid
  invalidateCollectionState(collection);
}
{code}
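For readers without the patch to hand, the check quoted above can be modelled in isolation. This is only a sketch under stated assumptions: the class, its map-backed cache, and the method names are hypothetical stand-ins and are not SolrJ code. It shows the behaviour being debated, namely that the cached entry is dropped whenever the server that answered differs from the first-preference URL.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the client-side collection state cache;
// none of these names come from SolrJ.
class FirstPreferenceCheck {
    // collection name -> cached leader URL
    private final Map<String, String> cachedLeader = new HashMap<>();

    void cache(String collection, String leaderUrl) {
        cachedLeader.put(collection, leaderUrl);
    }

    boolean isCached(String collection) {
        return cachedLeader.containsKey(collection);
    }

    // Same shape as the quoted check: if the server that answered is not the
    // one we tried first, treat the cached state as stale and drop it.
    void onResponse(String collection, String firstPreferenceUrl, String servedBy) {
        if (!servedBy.equals(firstPreferenceUrl)) {
            cachedLeader.remove(collection);
        }
    }

    public static void main(String[] args) {
        FirstPreferenceCheck client = new FirstPreferenceCheck();
        client.cache("books", "http://node1:8983/solr");
        // Answered by a different node than the cached leader: the entry is dropped.
        client.onResponse("books", "http://node1:8983/solr", "http://node2:8983/solr");
        System.out.println("still cached: " + client.isCached("books"));
    }
}
```

Note that this check only runs when some server actually answered, which is the point both sides return to later in the thread.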
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504167#comment-15504167 ]

Noble Paul commented on SOLR-9512:
----------------------------------

[~romseygeek] The patch contains way more changes than you mentioned. You committed it without any review from the people who are collaborating with you.

If there are other people collaborating on a ticket, the general protocol is that you submit a patch with the changes explained and give some review time before committing stuff.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503651#comment-15503651 ]

ASF subversion and git services commented on SOLR-9512:
-------------------------------------------------------

Commit f96017d9e10c665e7ab6b9161f2af08efc491946 in lucene-solr's branch refs/heads/branch_6x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f96017d ]

SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503653#comment-15503653 ]

ASF subversion and git services commented on SOLR-9512:
-------------------------------------------------------

Commit 3d130097b7768a8d753476ffe26b83db070c8e20 in lucene-solr's branch refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3d13009 ]

SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503044#comment-15503044 ]

Noble Paul commented on SOLR-9512:
----------------------------------

bq. Looking at HttpSolrCall it appears that it's only used in /select requests,

It must be a serious bug. Without proper invalidation, the caching is useless.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502871#comment-15502871 ]

Noble Paul commented on SOLR-9512:
----------------------------------

bq. The old leader is down, a new leader has been selected, but the cache hasn't updated yet. In this case the update actually succeeds, as it's passed to the next node in the list and then forwarded on to the relevant leader.

This is already taken care of. When SolrJ makes a request to a node, the node responds with a flag to invalidate the cache, so the next request will get the updated state.
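The flag-based invalidation described above can be illustrated with a small model. This is a sketch only, assuming the client sends the state version it has cached and the node answers with a marker when that version is behind. The class, the `staleFlag` helper, and the integer versions are hypothetical and do not reproduce the actual SolrJ or HttpSolrCall code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical model of version-based cache invalidation; not SolrJ code.
class StaleVersionFlag {
    // collection name -> state version the client has cached
    private final Map<String, Integer> cachedVersion = new HashMap<>();

    void cache(String collection, int version) {
        cachedVersion.put(collection, version);
    }

    boolean isCached(String collection) {
        return cachedVersion.containsKey(collection);
    }

    // Assumed server-side check: if the client's version is behind the
    // server's, return the server's version as the "stale" flag.
    static OptionalInt staleFlag(int clientVersion, int serverVersion) {
        return clientVersion < serverVersion
                ? OptionalInt.of(serverVersion)
                : OptionalInt.empty();
    }

    // Client side: evict the entry only when the server flagged it as stale,
    // so the next request re-reads the collection state.
    void onResponse(String collection, OptionalInt flag) {
        if (flag.isPresent()) {
            cachedVersion.remove(collection);
        }
    }
}
```

In this model the eviction is driven by what a live node reports, not by which node happened to answer, so an unchanged state is never thrown away and re-fetched needlessly.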
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502696#comment-15502696 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

Right, there are two scenarios:

1. The leader is down, and there's no replacement voted for yet, in which case things happen much as you describe above.
2. The old leader is down, a new leader has been selected, but the cache hasn't updated yet. In this case the update actually succeeds, as it's passed to the next node in the list and then forwarded on to the relevant leader.

In both cases, we need to invalidate the cache.

Separately, there's a bit of cleanup we can do in the directUpdate() method call: at the moment we have two paths, depending on whether or not we're using parallel updates, and they end up doing things like throwing slightly different exceptions for the same failure types. I'll open up another JIRA for that.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498680#comment-15498680 ]

Noble Paul commented on SOLR-9512:
----------------------------------

So in your solution here is what happens:

# Instead of just passing one server, we pass all the nodes to LBHttpSolrClient (LBHSC). The shard leader should be the first in the list.
# LBHSC knows that the leader is a dead node (or it will soon know that). So it would pick up the next server in the list and make a request there.
# The request would come back with an error (no leader).
# CSC returns the call with an error "no leader".

Is that right?
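The four steps above can be sketched as a small simulation. This is not LBHttpSolrClient; the class, the `dead` set, and the `leaderElected` switch are hypothetical devices used to show the control flow: leader first in the list, dead nodes skipped, and a "no leader" error surfacing from the first live node if election has not finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the leader-first failover flow; not SolrJ code.
class LeaderFirstFailover {
    // Step 1: put the cached leader first, then the remaining replicas in order.
    static List<String> leaderFirst(String leader, List<String> replicas) {
        List<String> urls = new ArrayList<>();
        urls.add(leader);
        for (String replica : replicas) {
            if (!replica.equals(leader)) {
                urls.add(replica);
            }
        }
        return urls;
    }

    // Steps 2-4: skip dead nodes, ask the first live one. If no leader has been
    // elected yet, that node's "no leader" error is returned to the caller.
    // On success, returns the URL of the node that served the request.
    static String send(List<String> urls, Set<String> dead, boolean leaderElected) {
        for (String url : urls) {
            if (dead.contains(url)) {
                continue; // step 2: move on to the next server in the list
            }
            if (!leaderElected) {
                throw new IllegalStateException("no leader"); // steps 3 and 4
            }
            return url; // live node accepted (and would forward) the update
        }
        throw new IllegalStateException("no live servers");
    }
}
```

When a new leader has already been elected, the same call returns the URL of the live node that took the update, which is the second scenario discussed later in the thread.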
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498631#comment-15498631 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

(apologies for the terrible metaphor there, it's early on a Saturday)
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498629#comment-15498629 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

It's kicking the can to a live server, though, which will be able to tell us either a) yes, everything's fine now, you just need to invalidate your cache, or b) there's no leader for this shard, so fail the update. Either way, we get useful information, whereas at the moment we're trying to find out what's happening from a dead server.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498506#comment-15498506 ]

Noble Paul commented on SOLR-9512:
----------------------------------

bq. there is a leader, it's just that we locally have the wrong one cached

When we get an error, we must invalidate the cache. That's part of the solution. But the larger problem is that leader election takes a while, and state.json will have that information only after some time. The solution of sending the doc to another replica in the shard is just kicking the can down the road.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498495#comment-15498495 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

I don't think so - there *is* a leader, it's just that we locally have the wrong one cached. And because we're sending our updates *only* to the previous (and now down) leader, we get a failure, when instead we ought to be sending this update on to the next leader.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498485#comment-15498485 ]

Noble Paul commented on SOLR-9512:
----------------------------------

It's just hiding the problem. The fact that there is no leader for that shard means that the request will fail eventually.
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495751#comment-15495751 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

Having played with this a bit, I think adding extra retry logic to CloudSolrClient isn't the best solution; instead, I think we should make directUpdatesToLeadersOnly a hint, rather than a directive, and just make sure that the leader is the first URL in the list passed to the load-balancer. We can then check in the response whether the leader was in fact the node that served that particular request, and if not, then we invalidate the collection cache.

[~cpoerschke] you worked on SOLR-9090, does this make sense to you?
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493315#comment-15493315 ]

Noble Paul commented on SOLR-9512:
----------------------------------

* Catch the ConnectionRefused exception.
* Check how old the state is. If it is older than a few seconds, invalidate the cache and retry.
* If it is fresh, throw the error back.
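The three bullets above amount to an age-gated single retry, which can be sketched as follows. This is an illustration under assumptions, not the patch: the `Update` interface, the `MAX_AGE_MS` threshold (5 seconds is an arbitrary reading of "a few seconds"), and the `invalidate` callback are all hypothetical. `java.net.ConnectException` stands in for the ConnectionRefused failure.

```java
import java.net.ConnectException;

// Hypothetical sketch of "invalidate and retry only if the cached state is old".
class AgeBasedRetry {
    // A routed update that may fail with a refused connection.
    interface Update {
        void send() throws ConnectException;
    }

    // Assumed threshold for "a few seconds"; the thread does not give a number.
    static final long MAX_AGE_MS = 5_000;

    // Returns true if the cache was invalidated and the update was retried.
    static boolean sendWithRetry(Update update, long stateAgeMs, Runnable invalidate)
            throws ConnectException {
        try {
            update.send();
            return false;
        } catch (ConnectException e) {
            if (stateAgeMs <= MAX_AGE_MS) {
                throw e; // state is fresh: retrying would see the same view
            }
            invalidate.run(); // state is old: drop it so the retry re-reads it
            update.send();    // single retry; a second failure propagates
            return true;
        }
    }
}
```

Gating on age avoids a retry loop against state that was just fetched, at the cost of failing fast during the window in which a new leader has not yet been published.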
[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders
[ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490788#comment-15490788 ]

Alan Woodward commented on SOLR-9512:
-------------------------------------

Possibly related: SOLR-9090, SOLR-6312.

I think the best case here is to catch a ConnectionRefusedException on a routed update and expire the collection cache entry before retrying?