[
https://issues.apache.org/jira/browse/AURORA-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kevin Sweeney updated AURORA-224:
---------------------------------
Comment: was deleted
(was: The end-to-end test fails when I attempt to go to the recently-released
mesos-0.17.0
Log output from the scheduler is here https://paste.apache.org/gNPu
The branch I'm using is on origin at
https://git-wip-us.apache.org/repos/asf?p=incubator-aurora.git;a=shortlog;h=refs/heads/kts/mesos-0.17.0
The command to repro the errors is
{noformat}
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
{noformat}
(That will hang and then you can do)
{noformat}
vagrant ssh aurora-scheduler
less /var/log/aurora-scheduler-stderr.log
{noformat}
[~vinodkone] [~jieyu] can you take a look?)
> Make health checking more configurable in updater
> -------------------------------------------------
>
> Key: AURORA-224
> URL: https://issues.apache.org/jira/browse/AURORA-224
> Project: Aurora
> Issue Type: Story
> Components: Client
> Reporter: Kevin Sweeney
> Assignee: Kevin Sweeney
>
> Right now the updater considers an instance that passed its health check once
> but later fails as unconditionally failed [1] and restarts it. During startup
> a service could conceivably respond affirmatively to /health and then later
> timeout its requests. Consider making the behavior of the HTTP health checker
> more configurable during updates.
> [1]
> https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/instance_watcher.py#L91
> {code}
> def maybe_set_instance_unhealthy(instance_id, retriable):
> # An instance that was previously healthy and currently unhealthy has
> failed.
> if instance_id in instance_states:
> log.info('Instance %s is unhealthy' % instance_id)
> instance_states[instance_id].set_healthy(False)
> # If the restart threshold has expired or if the instance cannot be
> retried it is unhealthy.
> elif now > expected_healthy_by or not retriable:
> log.info('Instance %s was not reported healthy within %d seconds' % (
> instance_id, self._restart_threshold))
> instance_states[instance_id] = Instance(finished=True)
> {code}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)