[
https://issues.apache.org/jira/browse/ACCUMULO-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Elser updated ACCUMULO-3569:
---------------------------------
Fix Version/s: (was: 1.7.0)
1.8.0
> Automatically restart accumulo processes intelligently
> ------------------------------------------------------
>
> Key: ACCUMULO-3569
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3569
> Project: Accumulo
> Issue Type: Bug
> Components: scripts
> Reporter: John Vines
> Fix For: 1.8.0
>
> Attachments:
> 0001-ACCUMULO-3569-initial-pass-at-integrating-auto-resta.patch
>
>
> On occasion a process will die. Some causes are critical, whereas others
> are momentary blips; not every cause warrants keeping the server down and
> requiring human attention.
> With that, I would like to propose a watcher process: an optional
> component that wraps the calls to the various processes (tserver, master,
> etc.). The watcher can monitor the processes, collect their exit codes,
> read their logs, etc., and make intelligent decisions about how to
> respond. That response would include coarse detection of failure types
> (discussed below) and a configurable policy for how many restart attempts
> should be made in a given window before giving up entirely.
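> A minimal sketch of what such a watcher loop might look like. All names
> and limits here (MAX_FAILURES, WINDOW_MS, the 5-second pause) are
> hypothetical defaults for illustration and would come from configuration;
> none of this is an existing Accumulo API:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> // Hypothetical wrapper: restarts the wrapped command until it fails
> // too often inside a sliding time window, then gives up entirely.
> public class ProcessWatcher {
>   static final int MAX_FAILURES = 5;             // illustrative default
>   static final long WINDOW_MS = 10 * 60 * 1000L; // illustrative default
>
>   public static void main(String[] args) throws Exception {
>     Deque<Long> failures = new ArrayDeque<>();
>     while (true) {
>       Process p = new ProcessBuilder(args).inheritIO().start();
>       int exit = p.waitFor();
>       if (exit == 0) {
>         return; // standard shutdown: do not restart
>       }
>       long now = System.currentTimeMillis();
>       failures.addLast(now);
>       // Drop failures that have aged out of the window.
>       while (now - failures.peekFirst() > WINDOW_MS) {
>         failures.removeFirst();
>       }
>       if (failures.size() > MAX_FAILURES) {
>         System.err.println("Too many failures in window; giving up");
>         return; // leave the process down for an administrator
>       }
>       Thread.sleep(5000); // brief pause before restarting
>     }
>   }
> }
> {code}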
> As for failure types, a few archetypal ones recur regularly, and I think
> they are prime candidates for an initial approach (a classification sketch
> follows these cases)-
> Zookeeper lock lost - this can happen for a variety of reasons, mostly
> related to network issues or server (tserver or zk node) congestion. These
> are some of the most common errors and are typically transient. However,
> if they occur with great frequency, that is a sign of a larger issue that
> needs to be handled by an administrator.
> Jvm OOM - these really occur in two situations: a system that is simply
> misconfigured and dies shortly after startup, and a system that gets
> slammed in just the right way that objects in our code and/or the iterator
> stack push the JVM just over its limits. The former will fail again
> quickly each time it is restarted, whereas the latter occurs rarely and
> will want attention, but doesn't warrant keeping the node offline in the
> meantime.
> Standard shutdown - this is the case where the watcher should not
> intervene at all, because we want the process to go down. Just a design
> consideration.
> Unexpected exceptions - a catch-all for everything else. We could attempt
> to enumerate them, but they are less common. The watcher would be
> configured with less tolerance for these; still, a server going down due
> to a random software bug shouldn't remove that server from the cluster
> unless it happens repeatedly (because then it's a sign of a
> hardware/system issue). We should provide the ability to keep those
> resources available.
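> As a sketch of the coarse detection mentioned above: the log patterns and
> the exit-code mapping below are assumptions for illustration, not
> Accumulo's actual exit behavior; a real implementation would key off the
> processes' real logs and exit codes:
> {code:java}
> // Hypothetical coarse classification of a dead process into the four
> // failure types described above.
> public class FailureClassifier {
>   enum FailureType { LOCK_LOST, OOM, CLEAN_SHUTDOWN, UNEXPECTED }
>
>   static FailureType classify(int exitCode, String logTail) {
>     if (exitCode == 0) {
>       return FailureType.CLEAN_SHUTDOWN; // standard shutdown: never restart
>     }
>     if (logTail.contains("OutOfMemoryError")) {
>       return FailureType.OOM;            // restart, but flag for an admin
>     }
>     if (logTail.contains("lost lock")) { // assumed log pattern
>       return FailureType.LOCK_LOST;      // usually transient: restart freely
>     }
>     return FailureType.UNEXPECTED;       // catch-all: least tolerance
>   }
> }
> {code}
> Each type would then map to its own restart budget in the watcher's window
> (e.g. more allowed restarts for LOCK_LOST than for UNEXPECTED), matching
> the per-type tolerances described above.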
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)