[ 
https://issues.apache.org/jira/browse/AMBARI-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198626#comment-15198626
 ] 

Eric Yang commented on AMBARI-15417:
------------------------------------

We visited this issue before, during the initial design phase when HMS was 
still around, and we know that a complex layer of services that each retry is 
probably not a scalable design.  When the infrastructure layer becomes complex, 
the base layer of services receives such an overwhelming number of retries that 
the services refuse to come up.  Facebook encountered this issue before:

http://highscalability.com/blog/2010/9/30/facebook-and-site-failures-caused-by-complex-weakly-interact.html

This was one of the primary factors behind our proposal to have our own 
deployment system instead of Chef or Puppet, because both of those systems use 
parallelized distributed retries.  Such systems only scale to 1700 nodes, then 
fall over due to the C10k problem:

https://en.wikipedia.org/wiki/C10k_problem

Users can use a recompiled kernel to increase the number of concurrent 
connections, but that is probably not done by the general public.  You are 
welcome to try, but taking a page from the history book would save you a couple 
of years.

In the original design, if the failure rate of a role is less than 20%, the 
system is allowed to proceed with the remaining stack of deployment.  This 
provides the ability to orchestrate deployment while the majority of nodes meet 
requirements.  That logic was not preserved when Ambari started the v1 rewrite.  
I would recommend using this design to enhance the loosely coupled distributed 
system without falling into a retry DDoS attack.
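
As a rough illustration of the 20% rule described above, here is a minimal 
sketch in Python.  It is not the original HMS or Ambari code; the names 
RoleResult, may_proceed and deploy are hypothetical, and the only thing taken 
from the design is the rule itself: a role whose failure rate is below 20% 
lets the deployment move on to the next role, with no retries involved.

```python
from dataclasses import dataclass
from typing import List

# From the design described above: proceed if fewer than 20% of nodes failed.
FAILURE_THRESHOLD = 0.20


@dataclass
class RoleResult:
    """Outcome of deploying one role (hypothetical structure)."""
    role: str
    total_nodes: int
    failed_nodes: int

    @property
    def failure_rate(self) -> float:
        # A role with no target nodes cannot have failed anywhere.
        if self.total_nodes == 0:
            return 0.0
        return self.failed_nodes / self.total_nodes


def may_proceed(result: RoleResult, threshold: float = FAILURE_THRESHOLD) -> bool:
    """True if the role's failure rate is strictly below the threshold."""
    return result.failure_rate < threshold


def deploy(stack: List[RoleResult], threshold: float = FAILURE_THRESHOLD) -> List[str]:
    """Walk the roles in order and stop at the first role that fails too widely.

    Returns the names of the roles that were accepted. Nothing is retried:
    a role either passes the threshold check or halts the deployment.
    """
    accepted = []
    for result in stack:
        if not may_proceed(result, threshold):
            break
        accepted.append(result.role)
    return accepted
```

For example, deploy([RoleResult("A", 10, 1), RoleResult("B", 10, 2)]) would 
accept role A (10% failed) and stop at role B (20% failed, which is not less 
than 20%).  Whether the boundary should be strict is an assumption here, based 
on the "less than 20%" wording.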

> Blueprint should have a flag to allow configuring use of RCO vs Retry method
> ----------------------------------------------------------------------------
>
>                 Key: AMBARI-15417
>                 URL: https://issues.apache.org/jira/browse/AMBARI-15417
>             Project: Ambari
>          Issue Type: Bug
>          Components: blueprints
>    Affects Versions: trunk
>            Reporter: bhuvnesh chaudhary
>
> With Blueprint deploys, the role command order (RCO) is not honored.
> Currently, in order to mitigate failure of a service start due to 
> dependencies on other services, blueprint deploy uses a retry mechanism to 
> ensure that the services are started and their prerequisites are met.
> However, the retry mechanism can in some cases cause the install / start to 
> take a long time and might need additional logic in component-specific 
> installation to handle retries.
> In order to provide flexibility, we should add a flag in blueprints 
> which drives the required behavior (use RCO vs. use retry).
> Say the flag name is use_rco (change to whatever seems better).
> By default, the value of use_rco can be false, and if someone wants to 
> override it they can specify it as true in the blueprint.
> Note: Keeping it as false by default, as that behavior has been in place 
> since Ambari 2.1.0. Hopefully, even if we set this to true by default, it 
> should not impact customers except a few. But we can make this decision 
> based on the community's opinion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
