> On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 66
> > <https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66>
> >
> >     FWIW i actually meant to suggest that the snooze has no concept of time 
> > at all.  If the file is there, don't perform health checks.  When you want 
> > to re-enable health checks, delete the file.  Happy to hear what others 
> > think about that.
> 
> Zameer Manji wrote:
>     Doesn't this mean user error can disable health checks forever? I think 
> we should treat disabling health checking as an exceptional state (since the 
> proceses has opted in to health checking) and therefore require user action 
> (increasing mtime) to continue to stay in this state.
> 
> Kevin Sweeney wrote:
>     Presumably we can trust the user here - health checking is after all 
> opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
> greatly simplifies the code on our end.
> 
> Zameer Manji wrote:
>     I'm very worried about introducing a feature which can allow unhealthy 
> instances live forever. Furthermore this information isn't exposed on the 
> Aurora UI or Observer UI so there isn't an easy way to check if an instance 
> has health checking disabled or not. A user can be completely oblivious that 
> health checking of their instance has been disabled.
> 
> David Pan wrote:
>     The motivation behind the health check disabler is for another project I 
> am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
> Dumper needs to disable health checks for the task that is being heap dumped. 
>  Under normal circumstances, the heap dumper can re-enable the health checks. 
>  However, if the heap dumper dies unexpectedly, the health checks will remain 
> disabled forever if we don't implement time control.  We don't want the 
> disabling and enabling of health checks to be a manual process.
> 
> Kevin Sweeney wrote:
>     I still agree with Bill here - the simpler implementation (skip health 
> check if the file exists) is easier to reason about and a cleaner API.
>     
>     A process running in the sandbox could implement some form of timeout 
> logic itself (for example, the application could expose an endpoint that 
> forks the equivalent of "touch .healthchecksnooze; sleep 60; rm 
> .healthchecksnooze") but I like the idea of keeping the executor API here 
> simple to explain and reason about.
> 
> David Pan wrote:
>     Overall, wouldn't it be simpler for the executor API to handle the time 
> control, rather than requiring an identical process in every task's sandbox 
> to handle the time control?
> 
> Joshua Cohen wrote:
>     I feel like we should err on the side of correctness here, rather than 
> simplicity? The dangers of someone accidentally leaving health checks 
> disabled indefinitely (on a service that has opted in to health checks) are 
> not insignificant.

I feel like this discussion warrants a short meeting.  I'll set something up on 
the calendar.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
-----------------------------------------------------------


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26383/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2014, 1:56 a.m.)
> 
> 
> Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The health check disabler allows health checks for a job to be snoozed 
> temporarily by touching a snooze file in the job's sandbox.  The appropriate 
> unit tests were modified/added.
> 
> The corresponding JIRA ticket is the following:
> https://issues.apache.org/jira/browse/AURORA-795
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 4980411c847d12655cbb363404707ebd9f0bd163 
>   src/test/python/apache/aurora/executor/common/BUILD 
> c7f7a003c865d479ba6e3cd7b5349322f884f653 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
> 
> Diff: https://reviews.apache.org/r/26383/diff/
> 
> 
> Testing
> -------
> 
> On vagrant in ~/aurora, I ran
> ./pants src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> David Pan
> 
>

Reply via email to