On Thu, 2015-10-29 at 08:44 +0100, Swen Schillig wrote:
> On Wed, 2015-10-28 at 21:33 -0700, Frank Filz wrote:
> > So my question is what allocations do we attempt to recover from? And
> > what does that recovery look like? And how do we make sure Ganesha is
> > actually running in a sane way if we do recover? And are we just
> > kicking the can to the next allocation which chooses not to recover?
> > It seems like if we are going to try and keep running, we should do so
> > in almost all cases, using abort only for those cases that are just
> > way too complex to recover from (for example, there is an out of
> > memory condition in unlock if lock owners aren’t supported where it
> > can become impossible to get to a correct set of locks).
> > 
> >  
> > 
> > Frank
> > 
> That sounds like a good strategy to me.
> +1
> 
> Swen
Besides, memory allocations or other operations that could block or
otherwise end in a fatal situation must be avoided while holding a lock
anyway. We must try to prevent such "no way out" situations.
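
A minimal sketch of that idea, assuming a hypothetical mutex-protected
owner list (struct owner_list, owner_list_init and owner_list_add are
made-up names for illustration, not Ganesha code): the allocation that
can fail is done before the lock is taken, so the critical section has
no failure path at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical singly linked list of lock owners, protected by a mutex. */
struct owner_entry {
	int owner_id;
	struct owner_entry *next;
};

struct owner_list {
	pthread_mutex_t mtx;
	struct owner_entry *head;
	size_t count;
};

/* Hypothetical initializer. Returns 0 or a pthread error number. */
static int owner_list_init(struct owner_list *list)
{
	assert(list != NULL);
	list->head = NULL;
	list->count = 0;
	return pthread_mutex_init(&list->mtx, NULL);
}

/*
 * Add an owner. The allocation, which can fail, happens before the
 * mutex is taken. The critical section only links a ready-made node.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
static int owner_list_add(struct owner_list *list, int owner_id)
{
	struct owner_entry *entry;

	assert(list != NULL);

	entry = malloc(sizeof(*entry));
	if (entry == NULL)
		return -1; /* nothing is locked, nothing to undo */

	entry->owner_id = owner_id;

	pthread_mutex_lock(&list->mtx);
	entry->next = list->head;
	list->head = entry;
	list->count++;
	pthread_mutex_unlock(&list->mtx);

	return 0;
}
```

If the allocation fails here, the caller gets an error while the list is
still untouched and unlocked, which is the easy case to recover from.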

Swen
> >  
> > 
> > From: Marc Eshel [mailto:es...@us.ibm.com] 
> > Sent: Wednesday, October 28, 2015 7:38 PM
> > To: Frank Filz <ffilz...@mindspring.com>
> > Cc: nfs-ganesha-devel@lists.sourceforge.net
> > Subject: Re: [Nfs-ganesha-devel] Topic for discussion - Out of Memory
> > Handling
> > 
> > 
> >  
> > 
> > I don't believe that we need to restart Ganesha on every out of memory
> > condition, for many reasons, but I will agree that we can have two
> > types of calls: one that can accept a no-memory return code and one
> > that terminates Ganesha if the call is not successful.
> > Marc. 
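
The two call types described above could look roughly like this. This is
only a sketch: alloc_may_fail and alloc_or_abort are hypothetical names
chosen for illustration, not existing Ganesha functions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical "may fail" variant: returns zeroed memory, or NULL when
 * no memory is available. The caller must check for and handle NULL.
 */
static void *alloc_may_fail(size_t size)
{
	assert(size > 0);
	return calloc(1, size);
}

/*
 * Hypothetical "must succeed" variant: never returns NULL. On failure
 * it logs what was being allocated and aborts the process, leaving the
 * restart to whatever supervises the server.
 */
static void *alloc_or_abort(size_t size, const char *what)
{
	void *p;

	assert(size > 0);

	p = calloc(1, size);
	if (p == NULL) {
		fprintf(stderr, "out of memory allocating %s (%zu bytes)\n",
			what, size);
		abort();
	}
	return p;
}
```

Callers that have a clean way to back out would use the first variant and
return an error; callers that cannot unwind safely would use the second.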
> > 
> > 
> > 
> > From:        "Frank Filz" <ffilz...@mindspring.com> 
> > To:        <nfs-ganesha-devel@lists.sourceforge.net> 
> > Date:        10/28/2015 11:55 AM 
> > Subject:        [Nfs-ganesha-devel] Topic for discussion - Out of
> > Memory Handling 
> > 
> >                                    
> > ______________________________________________________________________
> > 
> > 
> > 
> > We have had various discussions over the years as to how to best
> > handle out of memory conditions.
> > 
> > In the meantime, our code is littered with attempts to handle the
> > situation. However, it is not clear to me these really solve anything.
> > If we don't have 100% recoverability, likely we just delay the crash.
> > Even if we manage to avoid crashing, we may wobble along not really
> > handling things well, causing retry storms and such (that just dig us
> > in deeper). Another possibility is we return an error to the client
> > that gets translated into EIO or some other error the application
> > isn't prepared to handle.
> > 
> > If instead, we just aborted, the HA systems most of us run under would
> > restart Ganesha. The clients would see some delay, but there should be
> > no visible errors to the clients. Depending on how well grace
> > period/state recovery is implemented (and in particular how well it's
> > integrated with other file servers such as CIFS/SMB or across a
> > cluster), there could be some openings for lock violation (someone is
> > able to steal a lock from one of our clients while Ganesha is down).
> > 
> > Aborting would have several advantages. First, it would immediately
> > clear up any memory leaks. Second, if there was some transient
> > activity that resulted in high memory utilization, that might also be
> > cleared up. Third, it would avoid retry storms and such that might
> > just aggravate the low memory condition. In addition, it would force
> > the sysadmin to deal with a workload that overloaded the server,
> > possibly by adding additional nodes in a clustered environment, or
> > adding memory to the server.
> > 
> > No matter what we decide to do, another thing we need to look at is
> > more memory throttling. Cache inode has a limit on the number of
> > inodes. This is helpful, but is incomplete. Other candidates for
> > memory throttling would be:
> > 
> > Number of clients
> > Number of state (opens, locks, delegations, layouts) (per client and/or global)
> > Size of ACLs and number of ACLs cached
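
One way a per-resource limit like the ones listed above could be enforced
is a simple counted cap that is checked before allocating. The sketch
below assumes C11 atomics are available. struct res_limit and its
functions are hypothetical names for illustration, not existing Ganesha
code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limiter: caps how many objects of one kind may exist. */
struct res_limit {
	atomic_size_t in_use;
	size_t max;
};

static void res_limit_init(struct res_limit *rl, size_t max)
{
	atomic_init(&rl->in_use, 0);
	rl->max = max;
}

/* Try to reserve one slot. Returns false if the limit is reached. */
static bool res_limit_acquire(struct res_limit *rl)
{
	size_t cur = atomic_load(&rl->in_use);

	while (cur < rl->max) {
		/* On failure, cur is updated to the current value. */
		if (atomic_compare_exchange_weak(&rl->in_use, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Give a slot back once the object has been freed. */
static void res_limit_release(struct res_limit *rl)
{
	size_t prev = atomic_fetch_sub(&rl->in_use, 1);

	assert(prev > 0);
	(void)prev; /* silences unused warning when NDEBUG is set */
}
```

A caller would acquire a slot before creating, say, a new client record,
and refuse the request (rather than allocate) when acquire returns false.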
> > 
> > I'm sure there are more; discuss.
> > 
> > Frank
> > 
> > 
> > ---
> > This email has been checked for viruses by Avast antivirus software.
> > https://www.avast.com/antivirus
> > 
> > 
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Nfs-ganesha-devel mailing list
> > Nfs-ganesha-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel