On Fri, Nov 07, 2014 at 09:59:32AM +0100, Niels de Vos wrote:
> On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> > On Thu, 6 Nov 2014 22:02:29 +0100 Niels de Vos <nde...@redhat.com> wrote:
> > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > > > On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > > > > On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > > > > > On Mon, 3 Nov 2014 13:57:08 +0100 Jakub Hrozek <jhro...@redhat.com> wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > we had a short discussion on $SUBJECT with Simo on IRC already, but there are multiple people involved from multiple timezones, so I think a mailing list thread will be easier to track.
> > > > > > > >
> > > > > > > > Can we add another memory cache file to SSSD that would track initgroups/getgrouplist results for the NSS responder? I realize initgroups is a somewhat different operation than getpw{uid,nam} and getgr{gid,nam}, but what if the new memcache was only used by the NSS responder and at the same time invalidated when initgroups is initiated by the PAM responder, to ensure the memcache stays up-to-date?
> > > > > > >
> > > > > > > Can you describe the use case before jumping into a proposed solution?
> > > > > >
> > > > > > Many getgrouplist() or initgroups() calls in quick succession. One user is GlusterFS -- I'm not quite sure what the reason is there, maybe Vijay can elaborate.
> > > > >
> > > > GlusterFS servers invoke getgrouplist() to identify the gids associated with the user on whose behalf an rpc request has been sent over the wire. There is a gid caching layer in GlusterFS, and getgrouplist() only gets called on a gid cache miss. In the worst case, getgrouplist() can be invoked for every rpc request that GlusterFS receives, and that seems to be what happens in a deployment where we found sssd being kept busy. I am not certain about the sequence of operations that causes the cache to be missed.
> > > >
> > > > Adding Niels, who is more familiar with the gid resolution & caching features in GlusterFS.
> > >
> > > Just to add some background information on the getgrouplist() usage. GlusterFS uses several processes that can call getgrouplist():
> > > - NFS-server, a single process per system
> > > - brick, a process per exported filesystem/directory, potentially several per system
> > >
> > > [Here, a Gluster environment has many systems (vm/physical). Each system normally runs the NFS-server and a number of brick processes. The layout of the volume is important, but it is very common to have one or more distributed volumes that use multiple bricks on the same system (and many other systems).]
> > >
> > > The need for resolving the groups of a user comes in when users belong to many groups. The RPC protocols cannot carry a huge list of groups, so the resolving can be done on the server side when the protocol hits its limits (> 16 for NFS, approx. > 93 for GlusterFS).
> > >
> > > Upon using a Gluster volume, certain operations are sent to all the bricks (i.e. some directory related operations).
> > > I can imagine that a network share which is used by many users triggers many getgrouplist() calls in different brick processes at (almost) the same time.
> > >
> > > For reference, the usage of getgrouplist() in the brick process can be found here:
> > > - https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> > >
> > > The gid_resolve() function gets called in case the brick process should resolve the groups (and ignore the list of groups from the protocol). It uses the gidcache functions from a private library:
> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > >
> > > The default time for the gidcache to expire is 2 seconds. Users should be able to configure this to 30 seconds (or anything else) with:
> > >
> > > # gluster volume set <VOLUME> server.gid-timeout 30
> > >
> > > I think this should explain the use-case sufficiently, but let me know if there are any remaining questions. It might well be possible to make this code more sssd friendly. I'm sure that we as Gluster developers are open to any suggestions.
> >
> > TBH this looks a little bit strange; other filesystems (as well as the kernel) create a credentials token when a user first authenticates and keep these credentials attached to the user session for the duration. Why does GlusterFS keep hammering the system, requesting the same information again and again?
>
> The GlusterFS protocol itself is very much stateless, similar to NFSv3. We need all the groups of the user on the server side (brick) to let the backing filesystem (mostly XFS) perform the permission checking. In the current GlusterFS protocol there is no user authentication. (Well, there has been work done on adding support for SSL; maybe that could be used for tracking sessions on a per-client, not per-user, basis.)
>
> Just for clarity, a GlusterFS client (like a fuse-mount, or the samba/vfs_glusterfs module) is used by many different users. The client builds the connection to the volume. After that, all users with access to the fuse-mount or samba-share are using the same client connection.
>
> By default the client sends a list of groups in each RPC request, and the server side trusts the list the client provides. However, for environments where these lists are too small to hold all the groups, there is an option to do the group resolving on the server side. This is the "server.manage-gids" volume option, which acts very much like the "rpc.mountd --manage-gids" functionality for NFS.
>
> > Keep in mind that the use of getgrouplist() is an inherently costly operation. Even adding caches, the system cannot cache for long because it needs to return updated results eventually. Only the application knows when a user session terminates and/or the list needs to be refreshed, so "caching" for this type of operation should be done mostly on the application side.
>
> I assume that your "application side" here is the brick process that runs on the same system as sssd. As mentioned above, the brick processes do cache the result of getgrouplist(). It may well be possible that the default expiry of 2 seconds is too short for many environments, but users can change that timeout easily with the "server.gid-timeout" volume option.
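For anyone on the list who hasn't looked at the linked gidcache code, what the brick does on a miss boils down to a getgrouplist() call plus a small per-uid cache with a short TTL. A rough, purely illustrative sketch of that pattern (hypothetical names and sizes, not the actual GlusterFS code):

/*
 * Minimal illustration only: resolve the groups of a uid with
 * getgrouplist() and keep the result in a tiny per-uid cache with a
 * configurable TTL, similar in spirit to server.gid-timeout.
 */
#define _DEFAULT_SOURCE
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>

#define CACHE_SLOTS 64          /* hypothetical fixed-size cache */

struct gid_entry {
    uid_t   uid;
    time_t  fetched;            /* when the lookup was done */
    int     ngroups;
    gid_t  *groups;
};

static struct gid_entry cache[CACHE_SLOTS];
static time_t cache_ttl = 2;    /* default 2s, like server.gid-timeout */

/* Resolve groups for uid, consulting the cache first. Returns the number
 * of groups or -1 on error; *out points to memory owned by the cache. */
int resolve_groups(uid_t uid, gid_t **out)
{
    struct gid_entry *slot = &cache[uid % CACHE_SLOTS];
    time_t now = time(NULL);

    if (slot->groups && slot->uid == uid && now - slot->fetched < cache_ttl) {
        *out = slot->groups;            /* cache hit: no getgrouplist() */
        return slot->ngroups;
    }

    struct passwd *pw = getpwuid(uid);  /* need name + primary gid */
    if (!pw)
        return -1;

    int ngroups = 32;
    gid_t *groups = malloc(ngroups * sizeof(gid_t));
    if (!groups)
        return -1;

    if (getgrouplist(pw->pw_name, pw->pw_gid, groups, &ngroups) == -1) {
        /* buffer too small; ngroups now holds the required size */
        gid_t *bigger = realloc(groups, ngroups * sizeof(gid_t));
        if (!bigger) {
            free(groups);
            return -1;
        }
        groups = bigger;
        getgrouplist(pw->pw_name, pw->pw_gid, groups, &ngroups);
    }

    free(slot->groups);                 /* evict whatever was in the slot */
    slot->uid = uid;
    slot->fetched = now;
    slot->ngroups = ngroups;
    slot->groups = groups;

    *out = groups;
    return ngroups;
}

Every miss here turns into an initgroups-style lookup through NSS, so a short TTL combined with many brick processes translates fairly directly into load on sssd_nss.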
I guess that might be a viable option to work around the problem for the user who initially reported it, but it doesn't quite align with what I saw in the logs: the sssd_nss logs showed 4000 initgroups requests over two minutes from maybe about 10 users.

> From my understanding of this thread, we (the Gluster Community) have two things to do:
>
> 1. Clearly document the side-effects that can be caused by enabling the "server.manage-gids" option, and suggest increasing the "server.gid-timeout" value (maybe change the default?).
>
> 2. Think about improving the GlusterFS protocol(s) and introduce some kind of credentials token that is linked with the groups of a user. Token expiry should invalidate the group-cache. One option would be to use Kerberos like NFS does (RPCSEC_GSS).
>
> Does this all make sense to others too? I'm adding gluster-devel@ to CC so that others can chime in and this topic won't be forgotten.
>
> Thanks,
> Niels

And on the SSSD side, we need to think about an initgroups cache. So far I filed ticket https://fedorahosted.org/sssd/ticket/2485 listing the two options Simo outlined earlier. GlusterFS is not the only project that requested faster initgroups caching; Alexander's slapi-nis would also benefit from the new cache. (Although with slapi-nis we also have a somewhat conflicting RFE to stop using the NSS interfaces and go to SSSD directly, but that's something for us to solve.)
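To make the idea a little more concrete, here is a very rough sketch of what such a cache could look like. This is purely illustrative, with hypothetical names and layout; it is not SSSD code and not the design from the ticket. The point is only the split: the NSS responder answers from a cached initgroups result while it is valid, and the PAM responder drops the entry whenever it performs a fresh initgroups.

#include <string.h>
#include <time.h>
#include <sys/types.h>

struct initgr_rec {
    char    user[256];
    gid_t   gids[128];
    int     ngids;
    time_t  expire;             /* absolute expiry time */
    int     valid;
};

/* NSS responder path: answer from the cache while the entry is valid. */
int initgr_cache_get(struct initgr_rec *rec, const char *user,
                     gid_t *gids, int *ngids)
{
    if (!rec->valid || strcmp(rec->user, user) != 0 ||
        time(NULL) >= rec->expire)
        return -1;              /* miss: fall back to the full lookup */

    memcpy(gids, rec->gids, rec->ngids * sizeof(gid_t));
    *ngids = rec->ngids;
    return 0;
}

/* PAM responder path: a real initgroups just happened, so any cached
 * result for this user is stale by definition. */
void initgr_cache_invalidate(struct initgr_rec *rec, const char *user)
{
    if (rec->valid && strcmp(rec->user, user) == 0)
        rec->valid = 0;
}

The PAM-triggered invalidation is what would keep the NSS-side copy from going stale across logins, which is the main concern with caching initgroups results at all.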