Re: [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling

Frank Filz Wed, 28 Oct 2015 21:34:32 -0700

So my question is what allocations do we attempt to recover from? And what
does that recovery look like? And how do we make sure Ganesha is actually
running in a sane way if we do recover? And are we just kicking the can to
the next allocation which chooses not to recover? It seems like if we are
going to try and keep running, we should do so in almost all cases, using
abort only for those cases that are just way too complex to recover from
(for example, there is an out of memory condition in unlock if lock owners
aren't supported where it can become impossible to get to a correct set of
locks).

Frank

From: Marc Eshel [mailto:es...@us.ibm.com] 
Sent: Wednesday, October 28, 2015 7:38 PM
To: Frank Filz <ffilz...@mindspring.com>
Cc: nfs-ganesha-devel@lists.sourceforge.net
Subject: Re: [Nfs-ganesha-devel] Topic for discussion - Out of Memory
Handling

I don't believe that we need to restart Ganesha on every out-of-memory
condition, for many reasons, but I will agree that we can have two types of
calls: one that can accept a no-memory return code and one that terminates
Ganesha if the call is not successful.
Marc. 
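
To make the two call types concrete, here is a minimal sketch in C. The names try_alloc, alloc_or_abort and ALLOC_OR_ABORT are hypothetical and chosen only for illustration; they are not existing Ganesha functions. The first variant hands the failure back to the caller, the second logs and aborts so that an HA framework can restart the daemon.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: may return NULL; the caller must handle that. */
static void *try_alloc(size_t size)
{
	return malloc(size);
}

/* Hypothetical helper: never returns NULL; aborts the process instead. */
static void *alloc_or_abort(size_t size, const char *file, int line)
{
	void *p;

	assert(file != NULL);

	/* malloc(0) may legally return NULL; avoid treating that as OOM. */
	if (size == 0)
		size = 1;

	p = malloc(size);
	if (p == NULL) {
		fprintf(stderr, "out of memory: %zu bytes at %s:%d\n",
			size, file, line);
		abort();
	}
	return p;
}

/* Records the call site so the abort message points at the caller. */
#define ALLOC_OR_ABORT(size) alloc_or_abort((size), __FILE__, __LINE__)
```

With this split, a call site states its intent by which variant it uses: paths that have a real recovery story call try_alloc and check for NULL, and everything else uses ALLOC_OR_ABORT and carries no error-handling code at all.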

From:        "Frank Filz" <ffilz...@mindspring.com>
To:        <nfs-ganesha-devel@lists.sourceforge.net>
Date:        10/28/2015 11:55 AM
Subject:        [Nfs-ganesha-devel] Topic for discussion - Out of Memory
Handling

  _____  

We have had various discussions over the years as to how to best handle out
of memory conditions.

In the meantime, our code is littered with attempts to handle the situation;
however, it is not clear to me that these really solve anything. If we don't
have 100% recoverability, we likely just delay the crash. Even if we manage to
avoid crashing, we may wobble along not really handling things well, causing
retry storms and such (that just dig us in deeper). Another possibility is
we return an error to the client that gets translated into EIO or some other
error the application isn't prepared to handle.

If instead, we just aborted, the HA systems most of us run under would
restart Ganesha. The clients would see some delay, but there should be no
visible errors to the clients. Depending on how well grace period/state
recovery is implemented (and in particular how well it's integrated with
other file servers such as CIFS/SMB or across a cluster), there could be
some openings for lock violation (someone is able to steal a lock from one
of our clients while Ganesha is down).

Aborting would have several advantages. First, it would immediately clear up
any memory leaks. Second, if there was some transient activity that resulted
in high memory utilization, that might also be cleared up. Third, it would
avoid retry storms and such that might just aggravate the low memory
condition. In addition, it would force the sysadmin to deal with a workload
that overloaded the server, possibly by adding additional nodes in a
clustered environment, or adding memory to the server.

No matter what we decide to do, another thing we need to look at is more
memory throttling. Cache inode has a limit on the number of inodes. This is
helpful, but is incomplete. Other candidates for memory throttling would be:

Number of clients
Amount of state (opens, locks, delegations, layouts), per client and/or
global
Size of ACLs and number of ACLs cached

I'm sure there's more, discuss.
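
As a rough illustration of what throttling one of the candidates above could look like, here is a minimal sketch of a capped counter using C11 atomics. The struct res_limit type and the res_limit_acquire/res_limit_release functions are hypothetical names made up for this example, not existing Ganesha code; one instance would be kept per throttled resource (clients, state per client, cached ACLs, and so on).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit tracker: one instance per throttled resource. */
struct res_limit {
	size_t max;           /* configured ceiling */
	atomic_size_t in_use; /* objects currently accounted for */
};

/* Reserve one slot; returns false when the ceiling has been reached. */
static bool res_limit_acquire(struct res_limit *rl)
{
	size_t cur = atomic_load(&rl->in_use);

	while (cur < rl->max) {
		/* On failure, cur is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&rl->in_use, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Give a slot back once the object has been freed. */
static void res_limit_release(struct res_limit *rl)
{
	size_t prev = atomic_fetch_sub(&rl->in_use, 1);

	assert(prev > 0); /* a release without a matching acquire is a bug */
	(void)prev;
}
```

The caller would acquire a slot before allocating the object and refuse the request when acquire returns false, for example by telling the client to retry later, so the refusal happens at a well-defined point rather than at whichever malloc happens to fail first.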

Frank

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

------------------------------------------------------------------------------
_______________________________________________
Nfs-ganesha-devel mailing list
Nfs-ganesha-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
