> On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote: > > src/main/python/apache/aurora/executor/common/health_checker.py, line 66 > > <https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66> > > > > FWIW i actually meant to suggest that the snooze has no concept of time > > at all. If the file is there, don't perform health checks. When you want > > to re-enable health checks, delete the file. Happy to hear what others > > think about that. > > Zameer Manji wrote: > Doesn't this mean user error can disable health checks forever? I think > we should treat disabling health checking as an exceptional state (since the > proceses has opted in to health checking) and therefore require user action > (increasing mtime) to continue to stay in this state. > > Kevin Sweeney wrote: > Presumably we can trust the user here - health checking is after all > opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch > greatly simplifies the code on our end. > > Zameer Manji wrote: > I'm very worried about introducing a feature which can allow unhealthy > instances live forever. Furthermore this information isn't exposed on the > Aurora UI or Observer UI so there isn't an easy way to check if an instance > has health checking disabled or not. A user can be completely oblivious that > health checking of their instance has been disabled. > > David Pan wrote: > The motivation behind the health check disabler is for another project I > am working on, which is the JVM Heap Dumper. Long story short, the JVM Heap > Dumper needs to disable health checks for the task that is being heap dumped. > Under normal circumstances, the heap dumper can re-enable the health checks. > However, if the heap dumper dies unexpectedly, the health checks will remain > disabled forever if we don't implement time control. We don't want the > disabling and enabling of health checks to be a manual process. > > Kevin Sweeney wrote: > I still agree with Bill here - the simpler implementation (skip health > check if the file exists) is easier to reason about and a cleaner API. > > A process running in the sandbox could implement some form of timeout > logic itself (for example, the application could expose an endpoint that > forks the equivalent of "touch .healthchecksnooze; sleep 60; rm > .healthchecksnooze") but I like the idea of keeping the executor API here > simple to explain and reason about. > > David Pan wrote: > Overall, wouldn't it be simpler for the executor API to handle the time > control, rather than requiring an identical process in every task's sandbox > to handle the time control? > > Joshua Cohen wrote: > I feel like we should err on the side of correctness here, rather than > simplicity? The dangers of someone accidentally leaving health checks > disabled indefinitely (on a service that has opted in to health checks) are > not insignificant.
I feel like this discussion warrants a short meeting. I'll set something up on the calendar. - David ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26383/#review55985 ----------------------------------------------------------- On Oct. 9, 2014, 1:56 a.m., David Pan wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/26383/ > ----------------------------------------------------------- > > (Updated Oct. 9, 2014, 1:56 a.m.) > > > Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji. > > > Repository: aurora > > > Description > ------- > > The health check disabler allows health checks for a job to be snoozed > temporarily by touching a snooze file in the job's sandbox. The appropriate > unit tests were modified/added. > > The corresponding JIRA ticket is the following: > https://issues.apache.org/jira/browse/AURORA-795 > > > Diffs > ----- > > src/main/python/apache/aurora/executor/common/health_checker.py > 4980411c847d12655cbb363404707ebd9f0bd163 > src/test/python/apache/aurora/executor/common/BUILD > c7f7a003c865d479ba6e3cd7b5349322f884f653 > src/test/python/apache/aurora/executor/common/test_health_checker.py > aa36415fa891fc523a3a376ffeca5d3cd5ceabec > > Diff: https://reviews.apache.org/r/26383/diff/ > > > Testing > ------- > > On vagrant in ~/aurora, I ran > ./pants src/test/python/apache/aurora/executor:: > > > Thanks, > > David Pan > >