[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-11-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694736#comment-15694736
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit 3a3d6afef08e3c49c29cc9a015f4e7cca40cf52d in lucene-solr's branch 
refs/heads/branch_6x from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a3d6af ]

SOLR-9512: removed unused import


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>Assignee: Noble Paul
> Attachments: SOLR-9512.patch, SOLR-9512.patch, SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-11-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694664#comment-15694664
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit 142461b395506efa01ec2509346bae755f1b2726 in lucene-solr's branch 
refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=142461b ]

SOLR-9512: removed unused import


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>Assignee: Noble Paul
> Attachments: SOLR-9512.patch, SOLR-9512.patch, SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-11-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694066#comment-15694066
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit 5650939a8d41b7bad584947a2c9dcedf3774b8de in lucene-solr's branch 
refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5650939 ]

SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to 
leaders


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>Assignee: Noble Paul
> Attachments: SOLR-9512.patch, SOLR-9512.patch, SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-11-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694063#comment-15694063
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit d87ffa4bf82c30e9a6f0bbb6b8c0087a5c07f9d6 in lucene-solr's branch 
refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d87ffa4 ]

SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to 
leaders


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>Assignee: Noble Paul
> Attachments: SOLR-9512.patch, SOLR-9512.patch, SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-11-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694027#comment-15694027
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit e309f9058985375076cac0ed982a158dd865b86a in lucene-solr's branch 
refs/heads/branch_6x from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e309f90 ]

SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to 
leaders


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>Assignee: Noble Paul
> Attachments: SOLR-9512.patch, SOLR-9512.patch, SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-10-18 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587088#comment-15587088
 ] 

Shalin Shekhar Mangar commented on SOLR-9512:
-

bq. Case 6: do i understand it right that we would keep failing the indexing 
requests but 'only' until eventually the client manages to reconnect to zk?

Yes, that is correct.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-10-17 Thread Christine Poerschke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583374#comment-15583374
 ] 

Christine Poerschke commented on SOLR-9512:
---

Thanks Shalin and Noble for analysing and summarising this.

The proposed cases 1-5 solution sounds good to me (though i have not actually 
looked at the code concerned to see what the implementation of that solution 
might look like).

Case 6: do i understand it right that we would keep failing the indexing 
requests but 'only' until eventually the client manages to reconnect to zk? I 
agree that asking a random solr instance for its latest cluster state could be 
problematic.


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-10-14 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576189#comment-15576189
 ] 

Shalin Shekhar Mangar commented on SOLR-9512:
-

Noble and I discussed this offline. Here is a summary of the problem and the 
solution:

There are five cases that we need to tackle. Assuming replica x is leader:
# Case 1: x is disconnected from zk, y becomes leader
** currently — x throws error on indexing, client fails and keeps trying to 
send requests to x and fail. This will continue until X re-connects and the 
client gets a stale state flag in the response.
# Case 2: x is dead, y becomes leader
** currently - client gets connect exception or NoResponseException (for 
in-flight requests) and client keeps retrying request to x. This will continue 
until x comes back online.
# Case 3: x is disconnected from zk, no one is leader
** currently -- client keeps sending requests to x which fail because x is 
disconnected from leader. This will continue until X re-connects and the client 
gets a stale state flag in the response.
# Case 4: x is dead, no one is leader yet
** currently - client gets connect exception or NoResponseException (for 
in-flight requests) and client keeps retrying request to x. This will continue 
until x comes back online.
# Case 5: x is alive but now y is leader
** currently -- client gets a stale state flag from x and refreshes cluster 
state to see y as the new leader. All further indexing requests are sent to y.
# Case 6: client is disconnected from zk
** currently -- client keeps indexing to x. If it receives a stale state error, 
it will try to refresh cluster state, fail and continue to send further 
requests to x, keep failing and keep trying to read from zk and be stuck in a 
cycle.

Cases 1-5 are solved by a single solution -- On ConnectException, 
NoHttpResponseException, Leader disconnected from zk error, client should fetch 
state from zk again. If client fetches from zk and does not get a new version 
then this should be marked in a flag and subsequent retries should only happen 
after N seconds are elapsed or if we know for a fact that version has changed 
since the last zk fetch was made. N could be something small as 2 seconds or so.

Case 6 is more difficult. Either we can keep failing the indexing requests or 
we can ask a random Solr instance to return the latest cluster state. This is 
kinda dangerous because it can open us up to very difficult to debug bugs so I 
am inclined to punt on this for now.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509487#comment-15509487
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit d326adc8bfa33432d50293402a39454d60e070e4 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d326adc ]

SOLR-9305, SOLR-9390: Don't use directToLeaders updates in partition tests (see 
SOLR-9512)


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509483#comment-15509483
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit bae66f7cca8cff796d142eb19585d8e79fae34f8 in lucene-solr's branch 
refs/heads/branch_6x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bae66f7 ]

SOLR-9305, SOLR-9390: Don't use directToLeaders updates in partition tests (see 
SOLR-9512)


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506675#comment-15506675
 ] 

Noble Paul commented on SOLR-9512:
--

Does it make any sense to have the explicit flag {{directUpdatesToLeaders}} ? 
IMHO it should be the only supported behavior. Why would we ever send a request 
ever to another node when the shard leader is down?

My proposal is as follows.

* The LBHttpSolrClient is aware of down servers. So, if the leader for the 
shard is down we already know it and we can fail fast
* If we have to fail because the shard leader is dead, we should try to 
explicitly read the {{state.json}} for that collection provided the cached 
state is older than a certain threshold (say 1 sec?). This means, the client 
may fire a request to ZK to every second to refresh the state. For such 
refreshes, check the version first to optimize ZK reads

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506575#comment-15506575
 ] 

Alan Woodward commented on SOLR-9512:
-

OK, have reverted.  For the moment, I'll set directUpdatesToLeaders=false on 
the two regularly failing test cases.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506551#comment-15506551
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit a4293ce7c4e849b171430a34f36b830a84927a93 in lucene-solr's branch 
refs/heads/branch_6x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a4293ce ]

Revert "SOLR-9512: CloudSolrClient tries other replicas if a cached leader is 
down"

This reverts commit f96017d9e10c665e7ab6b9161f2af08efc491946.


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506552#comment-15506552
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit bd3fc7f43ff54a174660b7ad51f031d2104f84b5 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd3fc7f ]

Revert "SOLR-9512: CloudSolrClient tries other replicas if a cached leader is 
down"

This reverts commit 3d130097b7768a8d753476ffe26b83db070c8e20.


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Christine Poerschke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506019#comment-15506019
 ] 

Christine Poerschke commented on SOLR-9512:
---

Hi [~romseygeek] and [~noble.paul] - am only now joining this ticket late here 
since i have been on vacation.

The SOLR-9090 {{directUpdatesToLeadersOnly}} motivation/intention was for the 
flag to be not a hint but a directive and for updates to 'fail fast' if there 
is (temporarily or otherwise) no shard leader. Fail fast (and let the caller of 
the {{CloudSolrClient}} handle alarming and retries as it sees fit) as opposed 
to sending or retry-sending to a non-leader which would then forward to the 
leader (and potentially still fail eventually, eventually/not-fast-slowly).

[~Marvin Justice] and I worked together on SOLR-9090 - Marvin, any thoughts on 
this ticket here?

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505956#comment-15505956
 ] 

Noble Paul commented on SOLR-9512:
--

Let's first discuss what is the fix and don't rush into solutions

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505951#comment-15505951
 ] 

Alan Woodward commented on SOLR-9512:
-

OK, but lets deal with that in a separate issue, as I think it's a bit more 
complicated than just fixing cache invalidation for /update requests.  The 
current invalidation and retry logic happens at the level of the whole request, 
but CSC is splitting updates up into several sub-requests, and we probably 
don't want to redo the whole thing if a single subshard has changed leader.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505858#comment-15505858
 ] 

Noble Paul commented on SOLR-9512:
--

bq.The invalidation clause above is only called when the request was successful,

It shouldn't need to. The response must have the flag to invalidate the 
collection. if that code is not working as expected we should fix that. 

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-20 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505849#comment-15505849
 ] 

Alan Woodward commented on SOLR-9512:
-

bq. because the new leader is not elected yet

The invalidation clause above is only called when the request was successful, 
ie if a new leader is up.  If leader election hasn't happened yet, then 
LBHttpSolrClient will throw an exception and the whole request will fail.  The 
test case should illustrate this.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505419#comment-15505419
 ] 

Noble Paul commented on SOLR-9512:
--

bq. How so?

You invalidate the cache if the first server did not serve the request. That's 
a problem. When the next request comes , it gets the fresh state which is 
exactly the same as the entry that was just invalidated because the new leader 
is not elected yet and the state. json is not yet updated in ZK.  As we 
discussed before, the cache must be invalidated when a server says the version 
is stale

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504502#comment-15504502
 ] 

Alan Woodward commented on SOLR-9512:
-

bq. The patch contains way more changes than you mentioned.

I ... don't think it does?  It makes the changes I described above to 
CloudSolrClient, and adds a test case.  What else is there?

bq. The following is not the way we should invalidate the cache.

How so?

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504227#comment-15504227
 ] 

Noble Paul commented on SOLR-9512:
--

The following is not the way we should invalidate the cache. 
{code}
 if (response.getServer().equals(url) == false) {
// we didn't hit our first-preference server, which means that our 
cached
// collection state is no longer valid
invalidateCollectionState(collection);
  }
{code}


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504167#comment-15504167
 ] 

Noble Paul commented on SOLR-9512:
--

[~romseygeek] The patch contains way more changes than you mentioned. You 
committed it without any review from the people who are collaborating with you. 

If there are other people collaborating on a ticket, the general protocol is 
that you submit a patch with the changes explained and give some review time 
before committing stuff.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503651#comment-15503651
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit f96017d9e10c665e7ab6b9161f2af08efc491946 in lucene-solr's branch 
refs/heads/branch_6x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f96017d ]

SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503653#comment-15503653
 ] 

ASF subversion and git services commented on SOLR-9512:
---

Commit 3d130097b7768a8d753476ffe26b83db070c8e20 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3d13009 ]

SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down


> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503044#comment-15503044
 ] 

Noble Paul commented on SOLR-9512:
--

bq.Looking at HttpSolrCall it appears that it's only used in /select requests,

It must be a serious bug. Without proper invalidation,  the caching is useless

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
> Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502871#comment-15502871
 ] 

Noble Paul commented on SOLR-9512:
--

bq. The old leader is down, a new leader has been selected, but the cache 
hasn't updated yet. In this case the update actually succeeds, as it's passed 
to the next node in the list and then forwarded on to the relevant leader.

This is already taken care of. When SolrJ makes a request to the node, it 
responds with a flag to invalidate the cache. So the next request will get the 
updated state

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-19 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502696#comment-15502696
 ] 

Alan Woodward commented on SOLR-9512:
-

Right, there are two scenarios:
1. The leader is down, and there's no replacement voted for yet, in which case 
things happen much as you describe above
2. The old leader is down, a new leader has been selected, but the cache hasn't 
updated yet.  In this case the update actually succeeds, as it's passed to the 
next node in the list and then forwarded on to the relevant leader.
In both cases, we need to invalidate the cache.

Separately, there's a bit of cleanup we can do in the directUpdate() method 
call - at the moment we have two paths, dependent on whether or not we're using 
parallel updates or not, and they end up doing things like throwing slightly 
different exceptions for the same failure types.  I'll open up another JIRA for 
that.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498680#comment-15498680
 ] 

Noble Paul commented on SOLR-9512:
--

So in your solution here is what happens

# Instead of just passing one server , we pass all the nodes to 
LBHttpSolrClient (LBHSC). The shard leader should be the first in the list
# LBHSC knows that the leader is a dead node (or it will soon know that). So it 
would pick up the next server in the list and makes a request there
# The request would come back with an error (no leader)
# CSC returns the call with an error "no leader"

Is that right?

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498631#comment-15498631
 ] 

Alan Woodward commented on SOLR-9512:
-

(apologies for the terrible metaphor there, it's early on a Saturday)

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498629#comment-15498629
 ] 

Alan Woodward commented on SOLR-9512:
-

It's kicking the can to a live server, though, which will be able to tell us 
either a) yes, everything's fine now, you just need to invalidate your cache, 
or b) there's no leader for this shard, so fail the update.  Either way, we get 
useful information, whereas at the moment we're trying to find out what's 
happening from a dead server.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498506#comment-15498506
 ] 

Noble Paul commented on SOLR-9512:
--

bq. there is a leader, it's just that we locally have the wrong one cached

When we get an error , we must invalidate the cache. That's part of the 
solution. But the larger problem is that leader election takes a while and the 
state.json will have that information after sometime. The solution of sending 
the doc to another replica in the shard is just kicking the can down the road

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498495#comment-15498495
 ] 

Alan Woodward commented on SOLR-9512:
-

I don't think so - there *is* a leader, it's just that we locally have the 
wrong one cached.  And because we're sending our updates *only* to the previous 
(and now down) leader, we get a failure, when instead we ought to be sending 
this update on to the next leader.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498496#comment-15498496
 ] 

Alan Woodward commented on SOLR-9512:
-

I don't think so - there *is* a leader, it's just that we locally have the 
wrong one cached.  And because we're sending our updates *only* to the previous 
(and now down) leader, we get a failure, when instead we ought to be sending 
this update on to the next leader.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-17 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498485#comment-15498485
 ] 

Noble Paul commented on SOLR-9512:
--

It's just hiding the problem. The fact that there is no leader for that shard 
means that the request will fail eventually. 

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-16 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495751#comment-15495751
 ] 

Alan Woodward commented on SOLR-9512:
-

Having played with this a bit, I think adding extra retry logic to 
CloudSolrClient isn't the best solution; instead, I think we should make 
directUpdatesToLeaders a hint, rather than a directive, and just make sure that 
the leader is the first URL in the list passed to the load-balancer.  We can 
then check in the response if the leader was in fact the shard that served that 
particular request, and if not, then we invalidate the collection cache.  
[~cpoerschke] you worked on SOLR-9090, does this make sense to you?

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-15 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493315#comment-15493315
 ] 

Noble Paul commented on SOLR-9512:
--

* Catch the Connectionrefused Exception
* Check how old is the state. if it is older than a few seconds invalidate the 
cache and retry
* if it is fresh, throw the error back  

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

2016-09-14 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490788#comment-15490788
 ] 

Alan Woodward commented on SOLR-9512:
-

Possibly related: SOLR-9090, SOLR-6312.

I think the best case here is to catch a ConnectionRefusedException on a routed 
update and expire the collection cache entry before retrying?

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -
>
> Key: SOLR-9512
> URL: https://issues.apache.org/jira/browse/SOLR-9512
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org