Jake Maes created SAMZA-1508:
--------------------------------

             Summary: JobRunner should not return success until the job is healthy
                 Key: SAMZA-1508
                 URL: https://issues.apache.org/jira/browse/SAMZA-1508
             Project: Samza
          Issue Type: Bug
            Reporter: Jake Maes
            Assignee: Jake Maes


It can be frustrating for users when run-app.sh returns success before the job 
is fully running.

This happens because the JobRunner currently waits for JobStatus=RUNNING, but 
in YARN, for example, that status is reported when the AM is launched, not when 
all the containers are launched.
What can go wrong?
1. The job could stay stuck waiting for containers that it can't get because of 
capacity issues or an outage.
2. The job containers may immediately fail due to a runtime error.
In both cases, the user may go on their merry way because run-app.sh returned 
successfully, even though the job is already dead. They may not get alerted for 
some time.
How do we fix?
There are a few ways to fix it, each progressively harder but progressively 
better:
1. Make the JobRunner reach out to the AM and monitor the needed containers 
metric until it reaches 0.
2. Expose a new healthy endpoint in the AM which is only set to true when a 
heartbeat has been received from each of the containers. Have the JobRunner 
wait on this (with a timeout).
3. Expose a hook where users can write custom logic to determine job health
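
As a rough illustration of option 1, the sketch below polls a needed containers value until it reaches 0 or a timeout expires. It is not Samza code: the function name, the `get_needed_containers` callable, and the timeout and interval parameters are all hypothetical, and how the metric would actually be fetched from the AM is left open here.

```python
import time
from typing import Callable


def wait_until_containers_allocated(
    get_needed_containers: Callable[[], int],  # hypothetical: reads the AM metric
    timeout_s: float = 60.0,                   # assumed default, not from the issue
    poll_interval_s: float = 1.0,              # assumed default, not from the issue
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the needed containers count is 0, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        # The job counts as fully launched when no containers are still needed.
        if get_needed_containers() == 0:
            return True
        # Give up once the deadline has passed so the caller can report failure.
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

Under this sketch, run-app.sh would report success only when the call returns True; a False result covers both failure cases listed above, since a job that never gets its containers or keeps losing them never brings the count to 0 within the timeout.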

I think #1 is the most bang for buck and the implementation for #1 can easily 
be extended for #2 later.
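
For comparison, the AM-side bookkeeping behind option 2 might look roughly like the sketch below: a flag that becomes true only after every expected container has sent a heartbeat. The class and method names are hypothetical and do not correspond to any existing Samza API.

```python
from typing import Iterable, Set


class HeartbeatHealthTracker:
    """Hypothetical AM-side tracker: healthy once every container has reported."""

    def __init__(self, expected_container_ids: Iterable[str]) -> None:
        self._expected: Set[str] = set(expected_container_ids)
        self._seen: Set[str] = set()

    def record_heartbeat(self, container_id: str) -> None:
        # Ignore heartbeats from containers that are not part of this job.
        if container_id in self._expected:
            self._seen.add(container_id)

    def is_healthy(self) -> bool:
        # True only when a heartbeat has arrived from each expected container.
        return self._seen == self._expected
```

The JobRunner would then wait on `is_healthy()` with a timeout in the same way it would wait on the needed containers metric in #1, which is why the first implementation extends naturally to the second.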

Other notes:
I don't think this is needed for standalone, since users are directly deploying 
the processors and can monitor the processes directly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
