Well, it’s not necessarily an actual vote… I just want to make sure folks 
articulate their points.



Node reboot is definitely outside Ganesha’s scope. I think that is left up 
to the server admin (for non-clustered servers) or to the clustering 
solution (for clustered servers).



Agreed, there are many reasons for running out of memory. We definitely need 
to work on finding any leaks we might (and probably do) have. We also need 
reasonable ways to limit unbounded table growth, with metrics and tuning 
knobs so people can figure out what their workload actually requires and 
tune appropriately.
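
To make that concrete, here is a rough sketch of a bounded table with a 
tunable limit and exported metrics (the names are made up for illustration, 
not existing Ganesha code):

#include <stdbool.h>
#include <stdint.h>

/* Sketch of a bounded table: a tunable high-water mark plus metric
 * counters, so admins can observe actual usage and tune the limit.
 * A real version would use atomics or hold the table's lock. */
struct bounded_table {
	uint64_t entries;     /* current size, exported as a metric */
	uint64_t hiwat;       /* tunable limit from the config file */
	uint64_t rejections;  /* metric: how often we hit the cap */
};

/* Call before inserting; returns false when the table is full so the
 * caller can evict, queue, or return a resource error instead of
 * growing without bound. */
static bool table_may_grow(struct bounded_table *t)
{
	if (t->entries >= t->hiwat) {
		t->rejections++;
		return false;
	}
	t->entries++;
	return true;
}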



Frank



From: Marc Eshel [mailto:es...@us.ibm.com]
Sent: Monday, November 2, 2015 11:56 AM
To: Frank Filz <ffilz...@mindspring.com>
Cc: nfs-ganesha-devel@lists.sourceforge.net
Subject: RE: [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling



Yes, it looks like I am outvoted; memory management is complicated. Let me 
first say that under no condition should we reboot the node; any action 
should be limited to the Ganesha process. When we fail to get heap memory, 
then yes, kill the process. It would be nice at that point to get as much 
information as possible to debug the problem; it could be a leak or memory 
corruption, so we might need some memory in reserve to collect that 
information. We should manage the Ganesha cache in a way that will not cause 
it to run out of memory, so if we are allocating memory to extend a cache, 
we should not abort before trying to reduce the cache size.
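Roughly something like this (just a sketch; the eviction helper is 
hypothetical):

#include <stdlib.h>

/* Hypothetical helper: evicts one LRU cache entry and returns the
 * number of entries it freed, 0 when the cache is already empty. */
extern size_t cache_evict_lru_entry(void);

/* Sketch: when extending a cache, shrink the cache before giving up. */
void *cache_alloc(size_t size)
{
	void *p = malloc(size);

	while (p == NULL && cache_evict_lru_entry() > 0)
		p = malloc(size);   /* retry after freeing cache memory */

	if (p == NULL)
		abort();            /* nothing left to reclaim: log and die */
	return p;
}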
Marc.



From:        "Frank Filz" <ffilz...@mindspring.com>
To:        Marc Eshel/Almaden/IBM@IBMUS
Cc:        <nfs-ganesha-devel@lists.sourceforge.net>
Date:        11/02/2015 11:24 AM
Subject:        RE: [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling





There seems to be overwhelming support for log-and-abort on out of memory, 
but before I just say “you’re outvoted”, I’d like to understand which ENOMEM 
situations you feel are worth trying to recover from rather than aborting. 
I’m especially interested in what you think might be going on in the system 
that would raise an ENOMEM but from which we would recover quickly enough to 
stop getting ENOMEM (because if we handle the error but just keep getting 
ENOMEM for a long period of time, nothing is accomplished).

In the meantime, I’d rather look at where we can productively throttle memory 
usage so we never actually get ENOMEM in the first place.

Frank

From: Marc Eshel [mailto:es...@us.ibm.com]
Sent: Wednesday, October 28, 2015 7:38 PM
To: Frank Filz <ffilz...@mindspring.com>
Cc: nfs-ganesha-devel@lists.sourceforge.net
Subject: Re: [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling

I don't believe that we need to restart Ganesha on every out-of-memory 
condition, for many reasons, but I will agree that we can have two types of 
calls: one that can accept a no-memory return code, and one that terminates 
Ganesha if the allocation is not successful.
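Something like this, as a sketch (Ganesha's actual allocation wrappers may 
look different):

#include <stdio.h>
#include <stdlib.h>

/* Type 1: the caller is prepared for NULL and can degrade gracefully,
 * for example by declining to grow a cache. */
static inline void *try_malloc(size_t size)
{
	return malloc(size);
}

/* Type 2: the allocation is required for correctness; on failure, log
 * what we can and terminate so HA can restart the process cleanly. */
static inline void *must_malloc(size_t size)
{
	void *p = malloc(size);

	if (p == NULL) {
		fprintf(stderr, "fatal: malloc(%zu) failed\n", size);
		abort();
	}
	return p;
}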
Marc.



From:        "Frank Filz" <ffilz...@mindspring.com>
To:        <nfs-ganesha-devel@lists.sourceforge.net>
Date:        10/28/2015 11:55 AM
Subject:        [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling






We have had various discussions over the years as to how to best handle out
of memory conditions.

In the meantime, our code is littered with attempts to handle the situation; 
however, it is not clear to me that these really solve anything. If we don't 
have 100% recoverability, we likely just delay the crash. Even if we manage 
to avoid crashing, we may wobble along, not really handling things well, 
causing retry storms and such (which just dig us in deeper). Another 
possibility is that we return an error to the client that gets translated 
into EIO or some other error the application isn't prepared to handle.

If instead we just aborted, the HA systems most of us run under would 
restart Ganesha. The clients would see some delay, but there should be no 
visible errors to the clients. Depending on how well grace period/state 
recovery is implemented (and in particular how well it's integrated with 
other file servers such as CIFS/SMB or across a cluster), there could be 
some openings for lock violations (someone could steal a lock from one of 
our clients while Ganesha is down).

Aborting would have several advantages. First, it would immediately clear up
any memory leaks. Second, if there was some transient activity that resulted
in high memory utilization, that might also be cleared up. Third, it would
avoid retry storms and such that might just aggravate the low memory
condition. In addition, it would force the sysadmin to deal with a workload 
that overloaded the server, possibly by adding nodes in a clustered 
environment or adding memory to the server.

No matter what we decide to do, another thing we need to look at is more
memory throttling. Cache inode has a limit on the number of inodes. This is
helpful, but is incomplete. Other candidates for memory throttling would be:

Number of clients
Number of state objects (opens, locks, delegations, layouts), per client 
and/or global
Size of ACLs and number of ACLs cached

I'm sure there's more, discuss.
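
As an illustration, a per-client state limit could look something like this 
(a sketch only; the names, the config default, and the choice of 
NFS4ERR_RESOURCE as the throttling error are all illustrative):

#include <stdbool.h>
#include <stdint.h>

struct client_rec {
	uint32_t state_count;  /* opens + locks + delegations + layouts */
};

static uint32_t state_limit_per_client = 4096;  /* config tunable */

/* Admission control: returns true if the client may create another
 * state object; otherwise the caller refuses the request (e.g. with
 * NFS4ERR_RESOURCE) instead of allocating our way toward ENOMEM. */
static bool state_admit(struct client_rec *clnt)
{
	if (clnt->state_count >= state_limit_per_client)
		return false;
	clnt->state_count++;
	return true;
}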

Frank












