[ https://issues.apache.org/jira/browse/AURORA-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097892#comment-14097892 ]
Bill Farner commented on AURORA-653:
------------------------------------
I suggest you completely decouple the monitoring from Aurora. You should be
defensive and trust no data provided by Aurora.

Instead, you should monitor the applications directly. {{Announcer()}} helps a
lot here. Rather than asking Aurora what it thinks about the status of a job,
you track the instances in your service discovery system (e.g. ZooKeeper).
This gives you visibility into the number of instances that actually made it
down to the executor.

Next, you can poll stats from the processes themselves; we do this by
communicating with a well-known named port, {{http}}. This goes a step further
and validates that your processes are actually doing something useful.

If you have your applications expose a process uptime metric, you can use
resets on that counter to detect a flapping process (this goes yet _another_
step further to watch for flapping within thermos restarts).
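
To make the uptime idea concrete, here is a rough sketch of the reset check.
It assumes you already poll each instance's uptime metric (in seconds) at a
regular interval and keep the samples in order. The function names and the
{{max_resets}} threshold are made up for illustration and are not part of
Aurora or thermos.

```python
from typing import Iterable, Optional


def count_uptime_resets(samples: Iterable[float]) -> int:
    """Count how often a process uptime counter moved backwards.

    `samples` are uptime readings in seconds, oldest first. A reading
    lower than the previous one means the process restarted in between.
    """
    resets = 0
    previous: Optional[float] = None
    for uptime in samples:
        if previous is not None and uptime < previous:
            resets += 1
        previous = uptime
    return resets


def is_flapping(samples: Iterable[float], max_resets: int = 2) -> bool:
    """Treat a process as flapping when it restarted more than `max_resets`
    times within the sampled window (hypothetical threshold)."""
    return count_uptime_resets(samples) > max_resets
```

Two quick restarts between consecutive polls show up as a single reset, so
the poll interval bounds how fast a flap you can see.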
> Make Job.instances value available through the IDL
> --------------------------------------------------
>
> Key: AURORA-653
> URL: https://issues.apache.org/jira/browse/AURORA-653
> Project: Aurora
> Issue Type: Story
> Components: Scheduler
> Reporter: Erik van Roode
>
> Why:
> I would like to be able to determine the health of an app, as in the ratio
> of how many instances are running
> and how many instances were configured to run (Job.instances).
> I can find the "Active" number in various places, but I cannot find the
> "Configured" anywhere.
> As far as I can tell every instance of "Instances/instanceCount" is actually
> the number of active jobs/tasks.
> E.g., JobConfiguration contains instanceCount, but it is not Job.instances.
> I created a job with 5 instances,
> killed one, and instanceCount dropped to 4.