On Wed, 2015-10-28 at 11:55 -0700, Frank Filz wrote:
> We have had various discussions over the years as to how to best handle out
> of memory conditions.
> 
> In the meantime, our code is littered with attempts to handle the situation,
> however, it is not clear to me these really solve anything. If we don't have
> 100% recoverability, likely we just delay the crash. Even if we manage to
> avoid crashing, we may wobble along not really handling things well, causing
> retry storms and such (that just dig us in deeper). Another possibility is
> we return an error to the client that gets translated into EIO or some other
> error the application isn't prepared to handle.
> 
> If instead, we just aborted, the HA systems most of us run under would
> restart Ganesha. The clients would see some delay, but there should be no
> visible errors to the clients. Depending on how well grace period/state
> recovery is implemented (and in particular how well it's integrated with
> other file servers such as CIFS/SMB or across a cluster), there could be
> some openings for lock violation (someone is able to steal a lock from one
> of our clients while Ganesha is down).
> 
> Aborting would have several advantages. First, it would immediately clear up
> any memory leaks. Second, if there was some transient activity that resulted
> in high memory utilization, that might also be cleared up. Third, it would
> avoid retry storms and such that might just aggravate the low memory
> condition. In addition, it would force the sysadmin to deal with a workload
> that overloaded the server, possibly by adding additional nodes in a
> clustered environment, or adding memory to the server.
> 
> No matter what we decide to do, another thing we need to look at is more
> memory throttling. Cache inode has a limit on the number of inodes. This is
> helpful, but is incomplete. Other candidates for memory throttling would be:
> 
> Number of clients
> Number of state (opens, locks, delegations, layouts) (per client and/or
> global)
> Size of ACLs and number of ACLs cached
> 
> I'm sure there's more, discuss.
> 
> Frank
> 
Regardless of what is decided about how to react to out-of-memory
conditions, we must always check for and detect them, quickly and reliably.
It is not acceptable to silently ignore such a condition and risk
crashing or corrupting other memory areas.
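
As a minimal sketch of what "always check" could look like, assume a
hypothetical wrapper (checked_calloc and CHECKED_CALLOC are illustrative
names, not existing Ganesha functions) that every allocation goes through.
The check lives in one place; what happens after detection (abort here, or
an error return instead) is exactly the policy still under discussion:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Ganesha: allocate zeroed memory and
 * never hand a NULL pointer back to the caller. */
static void *checked_calloc(size_t count, size_t size,
			    const char *file, int line)
{
	void *p;

	/* calloc() may legitimately return NULL for a zero-sized request,
	 * so ask for one byte to keep "NULL means failure" unambiguous. */
	if (count == 0 || size == 0) {
		count = 1;
		size = 1;
	}

	p = calloc(count, size);
	if (p == NULL) {
		/* Detection point. Aborting is only one possible reaction;
		 * it is used here to keep the sketch short. */
		fprintf(stderr, "%s:%d: out of memory (%zu x %zu bytes)\n",
			file, line, count, size);
		abort();
	}
	return p;
}

/* Record the call site so the failure can be located afterwards. */
#define CHECKED_CALLOC(count, size) \
	checked_calloc((count), (size), __FILE__, __LINE__)
```

A caller would then write something like
"int *v = CHECKED_CALLOC(4, sizeof(int));" and could use v without a
separate NULL check.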

Besides, I hope the argument that an early abort would be advantageous for
memory leak conditions is a joke.
Memory leaks must be fixed regardless of whether we abort or try to
recover.
I think the major goal must be system stability and reliability.
A system that is restarted for every slightly unusual situation is not a
reliable one.

Swen

> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Nfs-ganesha-devel mailing list
> Nfs-ganesha-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
> 



