[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2018-01-22 Thread: ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334173#comment-16334173 ]

ASF subversion and git services commented on SOLR-11730:


Commit 3c1163cf0a14b2f17e08cc5a31a6bb6dc7659289 in lucene-solr's branch 
refs/heads/branch_7x from [~ab]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3c1163c ]

SOLR-11730 Add a nodeLost benchmark.


> Test NodeLost / NodeAdded dynamics
> ----------------------------------
>
> Key: SOLR-11730
> URL: https://issues.apache.org/jira/browse/SOLR-11730
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public (Default Security Level. Issues are Public)
>  Components: AutoScaling
>Reporter: Andrzej Bialecki 
>Priority: Major
>
> Let's consider a "flaky node" scenario.
> A node goes up and down at short intervals (e.g. due to a flaky network 
> cable). If the frequency of these events coincides with the {{waitFor}} 
> interval in the {{nodeLost}} trigger configuration, the node may never be 
> reported to the autoscaling framework as lost. Similarly, it may never be 
> reported as added back if it's lost again within the {{waitFor}} period of 
> the {{nodeAdded}} trigger.
> Other scenarios are possible here too, depending on timing:
> * the node being constantly reported as lost
> * the node being constantly reported as added
> One possible solution for the autoscaling triggers is for the framework to 
> keep a short-term ({{waitFor * 2}} long?) memory of the node state that the 
> trigger is tracking (see the sketch below), in order to eliminate flaky 
> nodes, i.e. those that transitioned between states more than once within 
> that period.
> A situation like this is detrimental to SolrCloud behavior regardless of 
> autoscaling actions, so it should probably also be addressed at the node 
> level, e.g. by shutting down the Solr node after the number of disconnects 
> in a time window reaches a certain threshold.






[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2018-01-22 Thread: ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334174#comment-16334174 ]

ASF subversion and git services commented on SOLR-11730:


Commit 6752e4c72f8f98c6ddca2669e4ac34aa93b19294 in lucene-solr's branch 
refs/heads/branch_7x from [~ab]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6752e4c ]

SOLR-11730: Collect more stats in the benchmark. Add simulation framework 
package docs.





[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2018-01-08 Thread: ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316378#comment-16316378 ]

ASF subversion and git services commented on SOLR-11730:


Commit a9fec9bf7caee2620d09086efde4a29b245aab7b in lucene-solr's branch 
refs/heads/master from [~ab]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a9fec9b ]

SOLR-11730: Collect more stats in the benchmark. Add simulation framework 
package docs.





[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2018-01-03 Thread: ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309580#comment-16309580 ]

ASF subversion and git services commented on SOLR-11730:


Commit 2da4ed17bae07593233f4e5610ce40a6a07f7c10 in lucene-solr's branch 
refs/heads/master from [~ab]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2da4ed1 ]

SOLR-11730 Add a nodeLost benchmark.





[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2017-12-22 Thread: ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301417#comment-16301417 ]

ASF subversion and git services commented on SOLR-11730:


Commit 0290c95c449d20eadbbd614860d0f739d131a62d in lucene-solr's branch 
refs/heads/branch_7x from [~ab]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0290c95 ]

SOLR-11730: Add simulated tests for nodeAdded / nodeLost dynamics in a large 
cluster.
Plus some other fixes:
* Fix leader election throttle and cluster state versioning in the simulator.
* PolicyHelper was still using a static ThreadLocal field; use ObjectCache 
instead.
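
The ThreadLocal fix matters here because the simulator can run many simulated
nodes, and even several simulated clusters, inside a single JVM sharing one
thread pool, so state keyed by thread leaks across them. Below is a minimal
illustration of the before/after pattern with invented names; it is not the
actual PolicyHelper code:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only (hypothetical names): moving per-thread state into a
// cache owned by the cluster it belongs to.
public class PolicySessionHolder {

  // BEFORE (problematic): two simulated clusters served by the same thread
  // pool would see each other's sessions through a static field like this:
  // private static final ThreadLocal<Object> SESSION = new ThreadLocal<>();

  // AFTER: the session lives in an object cache owned by the (real or
  // simulated) cluster, so lookups are keyed by cluster, not by thread.
  public static Object getSession(Map<String, Object> clusterObjectCache) {
    return clusterObjectCache.computeIfAbsent("policySession", k -> new Object());
  }

  public static void main(String[] args) {
    Map<String, Object> clusterA = new ConcurrentHashMap<>();
    Map<String, Object> clusterB = new ConcurrentHashMap<>();
    // each cluster keeps its own session, regardless of the calling thread
    System.out.println(getSession(clusterA) == getSession(clusterA)); // true
    System.out.println(getSession(clusterA) == getSession(clusterB)); // false
  }
}
{code}

The point of the change is lifetime: the policy session now lives and dies
with the cluster that owns it instead of with whichever thread happened to
compute it.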





[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2017-12-22 Thread: ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301304#comment-16301304 ]

ASF subversion and git services commented on SOLR-11730:


Commit 091f45dd7b4c6685b1e787283ecc029994641f3e in lucene-solr's branch 
refs/heads/master from [~ab]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=091f45d ]

SOLR-11730: Add simulated tests for nodeAdded / nodeLost dynamics in a large 
cluster.
Plus some other fixes:
* Fix leader election throttle and cluster state versioning in the simulator.
* PolicyHelper was still using a static ThreadLocal field; use ObjectCache 
instead.





[jira] [Commented] (SOLR-11730) Test NodeLost / NodeAdded dynamics

2017-12-07 Thread: Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282027#comment-16282027 ]

Andrzej Bialecki commented on SOLR-11730:
------------------------------------------

Simulations indicate that even with significant flakiness the framework may 
not take any action if other events are happening at the same time: even if a 
nodeLost trigger creates an event, that event may be discarded because of the 
cooldown period, and by the time the cooldown has passed the flaky node may 
be back up again, so the event is never regenerated.
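
To make the timing concrete, here is a self-contained sketch of that
interaction (invented numbers, not Solr code): the node goes down at t=0, the
{{nodeLost}} event becomes due after {{waitFor}} = 30s, but an earlier event
has started a cooldown lasting until t=34, and the flaky node comes back at
t=33, so the event is discarded and never regenerated:

{code:java}
public class CooldownDemo {
  public static void main(String[] args) {
    final long waitFor = 30;   // nodeLost waitFor, in seconds
    long downSince = -1;       // time the node went down; -1 = node is live
    long cooldownUntil = 34;   // an earlier event put us in cooldown until t=34

    for (long t = 0; t <= 40; t++) {
      if (t == 0) downSince = 0;    // scripted flakiness: down at t=0 ...
      if (t == 33) downSince = -1;  // ... and back up at t=33

      if (downSince >= 0 && t - downSince >= waitFor) {
        if (t < cooldownUntil) {
          System.out.println("t=" + t + "s: nodeLost generated, discarded (cooldown)");
        } else {
          System.out.println("t=" + t + "s: nodeLost fires");
          break;
        }
      }
    }
    // Prints "discarded" for t=30..32 only: by the time the cooldown ends at
    // t=34 the node is live again, so the event is never generated again.
  }
}
{code}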
