On Thu, 2015-10-29 at 08:44 +0100, Swen Schillig wrote:
> On Wed, 2015-10-28 at 21:33 -0700, Frank Filz wrote:
> > So my question is: which allocations do we attempt to recover from?
> > What does that recovery look like? How do we make sure Ganesha is
> > actually running in a sane way if we do recover? And are we just
> > kicking the can to the next allocation which chooses not to recover?
> > It seems like if we are going to try and keep running, we should do
> > so in almost all cases, using abort only for those cases that are
> > just way too complex to recover from (for example, an out-of-memory
> > condition in unlock, where lock owners aren't supported, can make it
> > impossible to get to a correct set of locks).
> >
> > Frank
>
> That sounds like a good strategy to me.
> +1
>
> Swen

Besides, memory allocations or other operations that could possibly end in blocking or otherwise fatal situations must be avoided while holding a lock anyway. We must try to prevent such no-way-out situations.
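A minimal sketch of the pattern Swen describes: do the allocation *before* taking the lock, so an allocation failure can never strand us in a no-way-out situation inside the critical section. The names here (entry_t, list_head, list_lock, add_entry) are illustrative, not actual Ganesha symbols.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct entry {
    struct entry *next;
    int value;
} entry_t;

static entry_t *list_head;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 on success, -1 if memory could not be obtained.  The
 * failure path never touches the lock, so there is nothing to unwind
 * while holding it. */
int add_entry(int value)
{
	entry_t *e = malloc(sizeof(*e));	/* allocate first, unlocked */

	if (e == NULL)
		return -1;			/* fail cleanly; lock untouched */

	e->value = value;

	pthread_mutex_lock(&list_lock);		/* critical section is alloc-free */
	e->next = list_head;
	list_head = e;
	pthread_mutex_unlock(&list_lock);
	return 0;
}
```

The critical section then contains only pointer manipulation, which cannot fail, so the lock is always released on every path.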
Swen

> > From: Marc Eshel [mailto:es...@us.ibm.com]
> > Sent: Wednesday, October 28, 2015 7:38 PM
> > To: Frank Filz <ffilz...@mindspring.com>
> > Cc: nfs-ganesha-devel@lists.sourceforge.net
> > Subject: Re: [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling
> >
> > I don't believe that we need to restart Ganesha on every failed memory
> > allocation, for many reasons, but I will agree that we can have two
> > types of calls: one that can accept a no-memory return code, and one
> > that terminates Ganesha if the call is not successful.
> >
> > Marc.
> >
> > From: "Frank Filz" <ffilz...@mindspring.com>
> > To: <nfs-ganesha-devel@lists.sourceforge.net>
> > Date: 10/28/2015 11:55 AM
> > Subject: [Nfs-ganesha-devel] Topic for discussion - Out of Memory Handling
> >
> > We have had various discussions over the years as to how best to
> > handle out-of-memory conditions.
> >
> > In the meantime, our code is littered with attempts to handle the
> > situation; however, it is not clear to me that these really solve
> > anything. If we don't have 100% recoverability, we likely just delay
> > the crash. Even if we manage to avoid crashing, we may wobble along
> > not really handling things well, causing retry storms and such (that
> > just dig us in deeper). Another possibility is that we return an
> > error to the client that gets translated into EIO or some other error
> > the application isn't prepared to handle.
> >
> > If instead we just aborted, the HA systems most of us run under would
> > restart Ganesha. The clients would see some delay, but there should
> > be no visible errors to the clients.
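Marc's "two types of calls" could be sketched as a pair of allocation wrappers, one for each recovery policy. The names try_alloc/must_alloc are hypothetical, chosen only for illustration (they are not the actual Ganesha wrappers).

```c
#include <stdio.h>
#include <stdlib.h>

/* Type 1: callers of this variant are prepared to handle a NULL
 * (no-memory) return and recover on their own. */
static inline void *try_alloc(size_t n)
{
	return malloc(n);
}

/* Type 2: callers of this variant cannot recover; on failure we
 * terminate so the HA layer can restart the daemon cleanly. */
static inline void *must_alloc(size_t n)
{
	void *p = malloc(n);

	if (p == NULL) {
		fprintf(stderr, "fatal: allocation of %zu bytes failed\n", n);
		abort();
	}
	return p;
}
```

Splitting the API this way makes the recovery decision explicit at every call site, instead of scattering ad hoc NULL checks of varying quality through the code.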
> > Depending on how well grace period/state recovery is implemented
> > (and in particular how well it's integrated with other file servers
> > such as CIFS/SMB, or across a cluster), there could be some openings
> > for lock violation (someone is able to steal a lock from one of our
> > clients while Ganesha is down).
> >
> > Aborting would have several advantages. First, it would immediately
> > clear up any memory leaks. Second, if there was some transient
> > activity that resulted in high memory utilization, that might also
> > be cleared up. Third, it would avoid retry storms and such that
> > might just aggravate the low-memory condition. In addition, it would
> > force the sysadmin to deal with a workload that overloaded the
> > server, possibly by adding additional nodes in a clustered
> > environment, or adding memory to the server.
> >
> > No matter what we decide to do, another thing we need to look at is
> > more memory throttling. Cache inode has a limit on the number of
> > inodes. This is helpful, but incomplete. Other candidates for memory
> > throttling would be:
> >
> > Number of clients
> > Number of states (opens, locks, delegations, layouts), per client
> > and/or global
> > Size of ACLs and number of ACLs cached
> >
> > I'm sure there's more; discuss.
> >
> > Frank
> >
> > ---
> > This email has been checked for viruses by Avast antivirus software.
> > https://www.avast.com/antivirus
> >
> > _______________________________________________
> > Nfs-ganesha-devel mailing list
> > Nfs-ganesha-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
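The throttling candidates Frank lists (clients, states, ACLs) all reduce to capping a counted resource. A sketch of one such cap, with a hypothetical name and limit, might look like this:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CLIENTS 1024	/* illustrative limit, would be configurable */

static atomic_uint client_count;

/* Try to claim a client slot; refuse new work instead of allocating
 * without bound.  Returns false when the cap has been reached. */
bool client_get_slot(void)
{
	unsigned cur = atomic_load(&client_count);

	while (cur < MAX_CLIENTS) {
		if (atomic_compare_exchange_weak(&client_count, &cur,
						 cur + 1))
			return true;	/* slot claimed */
		/* cur was refreshed by the failed CAS; retry */
	}
	return false;			/* at capacity: throttle the caller */
}

/* Release a previously claimed slot. */
void client_put_slot(void)
{
	atomic_fetch_sub(&client_count, 1);
}
```

Refusing admission at a known limit turns an unpredictable OOM abort into a bounded, reportable condition the sysadmin can size the server against.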