[ 
https://issues.apache.org/jira/browse/MESOS-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-2360:
----------------------------------
    Description: 
Triggered by an issue Alexander stumbled across in 
https://issues.apache.org/jira/browse/MESOS-2355, I wanted to find out why 
the Slave was allowed to send out multiple, parallel registration requests.

When looking at the code, one part got my attention:

{noformat}
  // Retry registration if necessary.
  Duration next = std::min(
      duration * ((double) ::random() / RAND_MAX),
      REGISTER_RETRY_INTERVAL_MAX);
{noformat}

[src/slave/slave.cpp, slave::doReliableRegistration, line 1040 ff]

This allows {{next}} to be equal or very close to 0, since the random factor 
can be anywhere between 0 and 1. Such a near-zero delay causes an immediate 
retry, which might (with a bit of bad luck) again trigger an immediate retry, 
and so on. In these cases the effective delay is determined only by how fast 
libprocess cycles.

Why was this implemented without a lower bound, e.g. 1 second?

While MESOS-2355 got a proper fix for this scenario in a test, I think the 
general issue still needs to be clarified.

Should we add such a floor to prevent pointless, (almost) concurrent 
registration requests?
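
To make the proposal concrete, here is a minimal sketch of a retry delay with a lower bound. It is not the Mesos implementation: {{nextRetryDelay}}, {{randomFraction}} and the plain {{double}} seconds are hypothetical stand-ins for the {{Duration}} arithmetic quoted above, and the floor and maximum are passed in as assumptions (1 second is only the value suggested in this description).

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Hypothetical helper, not part of Mesos: returns a random fraction in
// [0, 1), playing the role of ((double) ::random() / RAND_MAX).
double randomFraction(std::mt19937& generator)
{
  std::uniform_real_distribution<double> distribution(0.0, 1.0);
  return distribution(generator);
}

// Hypothetical helper, not part of Mesos: computes the next retry delay
// in seconds. Without the std::max() step the result could be 0, which
// is the situation described above. Assumes floor <= max.
double nextRetryDelay(
    double duration, double fraction, double floor, double max)
{
  // Same upper clamp as in the quoted snippet.
  const double capped = std::min(duration * fraction, max);

  // Proposed addition: never retry sooner than `floor` seconds.
  return std::max(capped, floor);
}
```

With a floor of 1 second and an illustrative maximum of 60 seconds, a call such as {{nextRetryDelay(10.0, randomFraction(generator), 1.0, 60.0)}} always yields a delay between 1 and 10 seconds, so an unlucky draw can no longer cause an immediate retry.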

  was:
Triggered by an issue Alexander stumbled across in 
https://issues.apache.org/jira/browse/MESOS-2355, I wanted to find out why 
the Slave was allowed to send out multiple, parallel registration requests.

When looking at the code, one part got my attention:

{noformat}
  // Retry registration if necessary.
  Duration next = std::min(
      duration * ((double) ::random() / RAND_MAX),
      REGISTER_RETRY_INTERVAL_MAX);
{noformat}

[src/slave.cpp, slave::doReliableRegistration, line 1040 ff]

This allows {{next}} to be equal or very close to 0, since the random factor 
can be anywhere between 0 and 1. Such a near-zero delay causes an immediate 
retry, which might (with a bit of bad luck) again trigger an immediate retry, 
and so on. In these cases the effective delay is determined only by how fast 
libprocess cycles.

Why was this implemented without a lower bound, e.g. 1 second?

While MESOS-2355 got a proper fix for this scenario in a test, I think the 
general issue still needs to be clarified.

Should we add such a floor to prevent pointless, (almost) concurrent 
registration requests?


> Slave may send multiple, almost concurrent registration requests to the 
> master.
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-2360
>                 URL: https://issues.apache.org/jira/browse/MESOS-2360
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Till Toenshoff
>
> Triggered by an issue Alexander stumbled across in 
> https://issues.apache.org/jira/browse/MESOS-2355, I wanted to find out why 
> the Slave was allowed to send out multiple, parallel registration requests.
> When looking at the code, one part got my attention:
> {noformat}
>   // Retry registration if necessary.
>   Duration next = std::min(
>       duration * ((double) ::random() / RAND_MAX),
>       REGISTER_RETRY_INTERVAL_MAX);
> {noformat}
> [src/slave/slave.cpp, slave::doReliableRegistration, line 1040 ff]
> This allows {{next}} to be equal or very close to 0, since the random factor 
> can be anywhere between 0 and 1. Such a near-zero delay causes an immediate 
> retry, which might (with a bit of bad luck) again trigger an immediate 
> retry, and so on. In these cases the effective delay is determined only by 
> how fast libprocess cycles.
> Why was this implemented without a lower bound, e.g. 1 second?
> While MESOS-2355 got a proper fix for this scenario in a test, I think the 
> general issue still needs to be clarified.
> Should we add such a floor to prevent pointless, (almost) concurrent 
> registration requests?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)