[ 
https://issues.apache.org/jira/browse/SAMZA-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navina Ramesh reassigned SAMZA-843:
-----------------------------------

    Assignee: Navina Ramesh

> Slow start of Samza jobs with large number of containers
> --------------------------------------------------------
>
>                 Key: SAMZA-843
>                 URL: https://issues.apache.org/jira/browse/SAMZA-843
>             Project: Samza
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Navina Ramesh
>            Assignee: Navina Ramesh
>             Fix For: 0.10.1
>
>
> We have noticed that when a job has large number of containers and is 
> deployed in Yarn, all the containers query the coordinator URL at the same 
> time, causing an almost herd-like effect. It takes a long time for the job to 
> reach a steady state, where all containers start processing messages and none 
> of them are seeing Socket Timeout exception from the Job Coordinator. This 
> effect is amplified further, if the AM machine is already heavily loaded.
> We could fix this in many ways.
> 1. We could have containers wait for random time period before querying the 
> Job Coordinator.
> 2. We could add a cache in the JobServlet so that the JobModel is not 
> refreshed with each request.
> 3. We could make the JobModel as an atomic reference that gets updated only 
> when the AM requires to restart a failed container. It is ok for the 
> containers to get slightly stale JobModel as long as the partition assignment 
> doesn't change. 
> While the above options are good ways to solve this problem, it does bring up 
> the question about why the containers should query the coordinator for the 
> JobModel (which creates a SPOF for the retrieving JobModel) when it can be 
> inferred by consuming from the Coordinator Stream directly. 
> We should consider an architecture where each container has an embedded job 
> coordinator module that only reads partition assignment messages. The 
> embedded job coordinator can act like a "slave" JC to the job coordinator 
> running in the AM. This will be a major architecture change that requires 
> more thought. Just wanted to put down the idea here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to