[ 
https://issues.apache.org/jira/browse/AURORA-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097892#comment-14097892
 ] 

Bill Farner commented on AURORA-653:
------------------------------------

I suggest you completely decouple the monitoring from Aurora.  You should be 
defensive and trust no data provided by Aurora.

Instead, you should directly monitor the applications.  {{Announcer()}} helps a 
lot here.  Rather than asking Aurora what it thinks about the status of a job, 
you track the instances in your service discovery system (e.g. ZooKeeper).  
This gives you visibility into the number of instances that actually made it 
down to the executor.  Next, you can poll stats from the processes themselves; 
we do this by communicating with a well-known named port, {{http}}.  This goes 
a step further by validating that your processes are actually doing something 
useful.  If you have your applications expose a process uptime metric, you can 
use resets on that counter to detect a flapping process (this goes yet 
_another_ step further to watch for flapping within Thermos restarts).
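
As a rough illustration of the uptime-reset idea, here is a minimal Python 
sketch.  It assumes you already collect one uptime reading (in seconds) per 
poll for each instance; the function names, the sample readings and the 
{{max_resets}} threshold are all hypothetical and not part of Aurora or Thermos.

```python
from typing import Sequence


def count_uptime_resets(samples: Sequence[float]) -> int:
    """Count how often the uptime value went backwards between polls."""
    resets = 0
    for previous, current in zip(samples, samples[1:]):
        # Uptime only grows while a process lives, so a drop means a restart.
        if current < previous:
            resets += 1
    return resets


def is_flapping(samples: Sequence[float], max_resets: int = 2) -> bool:
    """Treat an instance as flapping once it exceeds max_resets restarts.

    max_resets is an illustrative threshold; tune it to your polling window.
    """
    return count_uptime_resets(samples) > max_resets


# Hypothetical uptime readings (seconds) polled from a single instance.
readings = [30, 90, 150, 12, 72, 5, 65, 8]
print(count_uptime_resets(readings))
print(is_flapping(readings))
```

Note that a restart is only visible this way if the new uptime is lower than 
the previous reading, so the polling interval should be short relative to how 
long a healthy process is expected to stay up.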

> Make Job.instances value available through the IDL
> --------------------------------------------------
>
>                 Key: AURORA-653
>                 URL: https://issues.apache.org/jira/browse/AURORA-653
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>            Reporter: Erik van Roode
>
> Why:
>   I would like to be able to determine the health of an app, as in the ratio 
> of how many instances are running
> and how many instances were configured to run (Job.instances).
>   I can find the "Active" number in various places, but I cannot find the 
> "Configured" anywhere.
> As far as I can tell every instance of "Instances/instanceCount" is actually 
> the number of active jobs/tasks.
>    Eg, JobConfiguration contains instanceCount, but it is not Job.instances. 
> I created a job with 5 instances,
> killed one, and instanceCount dropped to 4.



--
This message was sent by Atlassian JIRA
(v6.2#6252)