[ 
https://issues.apache.org/jira/browse/SAMZA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263627#comment-16263627
 ] 

ASF GitHub Bot commented on SAMZA-1508:
---------------------------------------

GitHub user jmakes opened a pull request:

    https://github.com/apache/samza/pull/367

    SAMZA-1508: JobRunner should not return success until the job is healthy

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jmakes/samza samza-1508

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #367
    
----
commit f92ed990b7fabc550e357cf29c5e1a4ccee5af9c
Author: Jacob Maes <[email protected]>
Date:   2017-11-23T00:44:16Z

    SAMZA-1508: JobRunner should not return success until the job is healthy

----


> JobRunner should not return success until the job is healthy
> ------------------------------------------------------------
>
>                 Key: SAMZA-1508
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1508
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>
> It can be frustrating for users when run-app.sh returns success before the 
> job is fully running.
> This happens because the JobRunner currently waits for JobStatus=RUNNING, but 
> in YARN, for example, that status is reported when the AM is launched, not 
> when all the containers are launched.
> What can go wrong?
> 1. The job could stay stuck waiting for containers that it can't get because 
> of capacity issues or an outage.
> 2. The job containers may immediately fail due to a runtime error.
> In both cases, the user may go on their merry way because run-app.sh returned 
> successfully, even though the job is already dead. They may not get alerted 
> for some time.
> How do we fix it?
> There are a few ways to fix it. Each one is progressively harder but 
> progressively better:
> 1. Make the JobRunner reach out to the AM and monitor the needed-containers 
> metric until it reaches 0.
> 2. Expose a new health endpoint in the AM that returns true only once a 
> heartbeat has been received from each of the containers. Have the JobRunner 
> wait on this (with a timeout).
> 3. Expose a hook where users can write custom logic to determine job health.
> I think #1 is the most bang for buck and the implementation for #1 can easily 
> be extended for #2 later.
> Other notes:
> I don't think this is needed for standalone, since users are directly 
> deploying the processors and can monitor the processes directly.
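
For illustration only, option #1 above could be sketched roughly as follows.
This is not the code from the pull request. The class name JobHealthWaiter,
the method waitUntilHealthy, and the IntSupplier standing in for "ask the AM
for the needed-containers metric" are all hypothetical; how the metric is
actually fetched from the AM is left out.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

// Hypothetical helper, not part of Samza: polls a "needed containers"
// reading until it reaches 0 or a timeout expires.
final class JobHealthWaiter {
    private JobHealthWaiter() {}

    /**
     * @param neededContainers assumed stand-in for querying the AM metric
     * @param timeout          how long to wait before giving up
     * @param pollInterval     pause between two consecutive polls
     * @return true if the metric reached 0 in time, false on timeout
     */
    static boolean waitUntilHealthy(IntSupplier neededContainers,
                                    Duration timeout,
                                    Duration pollInterval) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            // Healthy once no more containers are outstanding.
            if (neededContainers.getAsInt() <= 0) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // caller should report failure, not success
            }
            // Never sleep past the deadline.
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
            Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake metric for the demo: 3 containers still needed, one fewer per poll.
        int[] needed = {3};
        boolean healthy = waitUntilHealthy(() -> needed[0]--,
                Duration.ofSeconds(5), Duration.ofMillis(50));
        System.out.println(healthy ? "job healthy" : "timed out waiting for containers");
    }
}
```

Taking the metric as a supplier keeps the wait loop independent of how the AM
is queried, so the same loop could later wait on a health check such as the
one described in option #2 by supplying 0 when healthy and a positive value
otherwise.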



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
