[
https://issues.apache.org/jira/browse/ACCUMULO-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Elser updated ACCUMULO-3569:
---------------------------------
Fix Version/s: (was: 1.7.0)
1.8.0
> Automatically restart accumulo processes intelligently
> ------------------------------------------------------
>
> Key: ACCUMULO-3569
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3569
> Project: Accumulo
> Issue Type: Bug
> Components: scripts
> Reporter: John Vines
> Fix For: 1.8.0
>
> Attachments:
> 0001-ACCUMULO-3569-initial-pass-at-integrating-auto-resta.patch
>
>
> On occasion a process will die. Some causes are critical, whereas others
> are momentary blips; not every cause warrants keeping the server down and
> requiring human attention.
> With that, I would like to propose a watcher process: an optional
> component that wraps the calls to the various processes (tserver, master,
> etc.). The watcher can monitor the processes, collect their exit codes,
> read their logs, etc., and make intelligent decisions about how to
> respond. That response would include coarse detection of failure types
> (discussed below) and a configurable policy for how many restart attempts
> should be made in a given window before giving up entirely.
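> A minimal sketch of what such a watcher loop might look like. All names
> and limits here (MAX_FAILURES, WINDOW_MS, the 5-second pause) are
> hypothetical defaults for illustration and would come from configuration;
> none of this is an existing Accumulo API:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> // Hypothetical wrapper: restarts the wrapped command until it fails
> // too often inside a sliding time window, then gives up entirely.
> public class ProcessWatcher {
>   static final int MAX_FAILURES = 5;             // illustrative default
>   static final long WINDOW_MS = 10 * 60 * 1000L; // illustrative default
>
>   public static void main(String[] args) throws Exception {
>     Deque<Long> failures = new ArrayDeque<>();
>     while (true) {
>       Process p = new ProcessBuilder(args).inheritIO().start();
>       int exit = p.waitFor();
>       if (exit == 0) {
>         return; // standard shutdown: do not restart
>       }
>       long now = System.currentTimeMillis();
>       failures.addLast(now);
>       // Drop failures that have aged out of the window.
>       while (now - failures.peekFirst() > WINDOW_MS) {
>         failures.removeFirst();
>       }
>       if (failures.size() > MAX_FAILURES) {
>         System.err.println("Too many failures in window; giving up");
>         return; // leave the process down for an administrator
>       }
>       Thread.sleep(5000); // brief pause before restarting
>     }
>   }
> }
> {code}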
> As for failure types, a few archetypal ones recur regularly, and I think
> they are prime candidates for an initial approach (a classification sketch
> follows these cases)-
> Zookeeper lock lost - this can happen for a variety of reasons, mostly
> related to network issues or server (tserver or zk node) congestion. These
> are some of the most common errors and are typically transient. However,
> if they occur with great frequency, that is a sign of a larger issue that
> needs to be handled by an administrator.
> Jvm OOM - these really occur in two situations: a system that is simply
> misconfigured and dies shortly after startup, and a system that gets
> slammed in just the right way that objects in our code and/or the iterator
> stack push the JVM just over its limits. The former will fail again
> quickly each time it is restarted, whereas the latter occurs rarely and
> will want attention, but doesn't warrant keeping the node offline in the
> meantime.
> Standard shutdown - this is the case where the watcher should not
> intervene at all, because we want the process to go down. Just a design
> consideration.
> Unexpected exceptions - a catch-all for everything else. We could attempt
> to enumerate them, but they are less common. The watcher would be
> configured with less tolerance for these; still, a server going down due
> to a random software bug shouldn't remove that server from the cluster
> unless it happens repeatedly (because then it's a sign of a
> hardware/system issue). We should provide the ability to keep those
> resources available.
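> As a sketch of the coarse detection mentioned above: the log patterns and
> the exit-code mapping below are assumptions for illustration, not
> Accumulo's actual exit behavior; a real implementation would key off the
> processes' real logs and exit codes:
> {code:java}
> // Hypothetical coarse classification of a dead process into the four
> // failure types described above.
> public class FailureClassifier {
>   enum FailureType { LOCK_LOST, OOM, CLEAN_SHUTDOWN, UNEXPECTED }
>
>   static FailureType classify(int exitCode, String logTail) {
>     if (exitCode == 0) {
>       return FailureType.CLEAN_SHUTDOWN; // standard shutdown: never restart
>     }
>     if (logTail.contains("OutOfMemoryError")) {
>       return FailureType.OOM;            // restart, but flag for an admin
>     }
>     if (logTail.contains("lost lock")) { // assumed log pattern
>       return FailureType.LOCK_LOST;      // usually transient: restart freely
>     }
>     return FailureType.UNEXPECTED;       // catch-all: least tolerance
>   }
> }
> {code}
> Each type would then map to its own restart budget in the watcher's window
> (e.g. more allowed restarts for LOCK_LOST than for UNEXPECTED), matching
> the per-type tolerances described above.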
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)