Hi Tom,

This sounds like a bug. ApplicationRunner should return the correct status
once the processor has shut down. We fixed a similar bug in standalone mode
recently. Are you already using Samza 1.0?
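The kind of probe we'd expect to work is a direct check on that status. A
minimal sketch, assuming the Samza 1.0 ApplicationRunner API; the
LivenessCheck / isHealthy names are just illustrative, and you'd pass in the
same runner instance you called run() on:

import org.apache.samza.job.ApplicationStatus;
import org.apache.samza.runtime.ApplicationRunner;

// Illustrative liveness probe: once the app has started, anything other
// than Running should fail the check so k8s can restart the Pod.
public final class LivenessCheck {
  private final ApplicationRunner runner;

  // Pass the same runner instance that run() was called on.
  public LivenessCheck(ApplicationRunner runner) {
    this.runner = runner;
  }

  public boolean isHealthy() {
    // Compare status codes rather than instances: a failed status carries
    // a throwable, so it is not the ApplicationStatus.Running singleton.
    return runner.status().getStatusCode()
        == ApplicationStatus.StatusCode.Running;
  }
}
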
If this is reproducible or happens again, a thread dump (e.g., via jstack)
and the logs would also be very helpful for debugging and for verifying
whether the issue is already fixed.
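
In the meantime, one workaround is to block your main() on the runner and
exit the JVM when processing stops, so the Pod dies with the container and
k8s restarts it. A rough sketch under the same 1.0 API assumption (the empty
topology is a placeholder for your real application):

import joptsimple.OptionSet;
import org.apache.samza.application.StreamApplication;
import org.apache.samza.config.Config;
import org.apache.samza.job.ApplicationStatus;
import org.apache.samza.runtime.ApplicationRunner;
import org.apache.samza.runtime.ApplicationRunners;
import org.apache.samza.util.CommandLine;

public final class Main {
  public static void main(String[] args) {
    CommandLine cmdLine = new CommandLine();
    OptionSet options = cmdLine.parser().parse(args);
    Config config = cmdLine.loadConfig(options);

    // Placeholder topology; describe your real inputs/outputs here.
    StreamApplication app = appDescriptor -> { };

    ApplicationRunner runner =
        ApplicationRunners.getApplicationRunner(app, config);
    runner.run();

    // Blocks until the processors stop, whether cleanly or on failure.
    runner.waitForFinish();

    // Exit non-zero on failure so Kubernetes restarts the Pod.
    boolean succeeded = runner.status().getStatusCode()
        == ApplicationStatus.StatusCode.SuccessfulFinish;
    System.exit(succeeded ? 0 : 1);
  }
}

That doesn't replace a fix for the status bug, but it at least makes the
failure visible to k8s.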

Thanks,
Prateek

On Fri, Mar 22, 2019 at 7:23 AM Tom Davis <t...@recursivedream.com> wrote:

>
> Prateek Maheshwari <prateek...@gmail.com> writes:
>
> > Hi Tom,
> >
> > This would depend on what your k8s container orchestration logic looks
> > like. For example, in YARN, 'status' returns 'not running' after 'start'
> > until all the containers requested from the AM are 'running'. We also
> > leverage YARN to restart containers/job automatically on failures (within
> > some bounds). Additionally, we set up a monitoring alert that goes off if
> > the number of running containers stays lower than the number of expected
> > containers for extended periods of time (~ 5 minutes).
> >
> > Are you saying that you noticed that the LocalApplicationRunner status
> > returns 'running' even if its stream processor / SamzaContainer has
> stopped
> > processing?
> >
>
> Yeah, this is what I mean. We have a health check for the overall
> ApplicationStatus, but if the containers enter a failed state, that
> doesn't result in a shutdown of the runner itself. An example from last
> night: Kafka became unavailable at some point and Samza failed to write
> checkpoints for a while, ultimately leading to container failures. The
> last log line is:
>
> o.a.s.c.SamzaContainer - Shutdown is no-op since the container is already
> in state: FAILED
>
> This doesn't cause the Pod to be killed, though, so we just silently
> stop processing events. How do you determine the number of expected
> containers? Or are you speaking of containers in terms of YARN and not
> Samza processors?
>
> >
> > - Prateek
> >
> > On Fri, Mar 15, 2019 at 7:26 AM Tom Davis <t...@recursivedream.com>
> wrote:
> >
> >> I'm using the LocalApplicationRunner and have added a liveness check
> >> around the `status` method. The app is running in Kubernetes so, in
> >> theory, it could be restarted if exceptions happened during processing.
> >> However, it seems that "container failure" is divorced from "app
> >> failure" because the app continues to run even after all the task
> >> containers have shut down. Is there a better way to check for
> >> application health? Is there a way to shut down the application if all
> >> containers have failed? Should I simply ensure exceptions never escape
> >> operators? Thanks!
> >>
>
