[
https://issues.apache.org/jira/browse/SAMZA-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Navina Ramesh reassigned SAMZA-843:
-----------------------------------
Assignee: Navina Ramesh
> Slow start of Samza jobs with large number of containers
> --------------------------------------------------------
>
> Key: SAMZA-843
> URL: https://issues.apache.org/jira/browse/SAMZA-843
> Project: Samza
> Issue Type: Improvement
> Affects Versions: 0.10.0
> Reporter: Navina Ramesh
> Assignee: Navina Ramesh
> Fix For: 0.10.1
>
>
> We have noticed that when a job has a large number of containers and is
> deployed in YARN, all the containers query the coordinator URL at the same
> time, causing a thundering-herd-like effect. It takes a long time for the job
> to reach a steady state, where all containers have started processing messages
> and none of them is seeing a Socket Timeout exception from the Job Coordinator.
> This effect is amplified further if the AM machine is already heavily loaded.
> We could fix this in many ways.
> 1. We could have containers wait for a random time period before querying the
> Job Coordinator.
> 2. We could add a cache in the JobServlet so that the JobModel is not
> refreshed with each request.
> 3. We could make the JobModel an atomic reference that gets updated only
> when the AM needs to restart a failed container. It is OK for the
> containers to get a slightly stale JobModel as long as the partition assignment
> doesn't change.
> While the above options are good ways to solve this problem, they do bring up
> the question of why the containers should query the coordinator for the
> JobModel (which creates a SPOF for retrieving the JobModel) when it can be
> inferred by consuming from the Coordinator Stream directly.
> We should consider an architecture where each container has an embedded job
> coordinator module that only reads partition assignment messages. The
> embedded job coordinator can act like a "slave" JC to the job coordinator
> running in the AM. This will be a major architecture change that requires
> more thought. Just wanted to put down the idea here.
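
A rough sketch of how options 1 to 3 from the description might look, to make the ideas concrete. This is not Samza code: the class JobModelStartupSketch, the method randomStartupDelayMs, and the CachedModel holder are all hypothetical names, and the model type is left generic because the real JobModel construction is not shown in this issue.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch only; none of these names are Samza APIs.
final class JobModelStartupSketch {

  private JobModelStartupSketch() {}

  // Option 1: each container picks a random delay in [0, maxDelayMs)
  // before its first request, so requests are spread out over time.
  static long randomStartupDelayMs(long maxDelayMs) {
    if (maxDelayMs <= 0) {
      return 0L;
    }
    return ThreadLocalRandom.current().nextLong(maxDelayMs);
  }

  // Options 2 and 3: build the model once, serve the cached instance to
  // every request, and rebuild only on an explicit refresh (for example,
  // when a failed container has to be restarted).
  static final class CachedModel<T> {
    private final Supplier<T> builder;
    private final AtomicReference<T> current;

    CachedModel(Supplier<T> builder) {
      this.builder = builder;
      this.current = new AtomicReference<>(builder.get());
    }

    // Cheap read path: no rebuild per request.
    T get() {
      return current.get();
    }

    // Rebuild and publish a new model atomically.
    void refresh() {
      current.set(builder.get());
    }
  }
}
```

A container would sleep for randomStartupDelayMs(...) before its first query, and the servlet would call get() per request, leaving refresh() to the AM. Readers may see a slightly stale model between refreshes, which the description considers acceptable as long as the partition assignment does not change.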