Dear all,

We are currently in the process of upgrading our CES cluster to 5.0.3-3, but
we have doubts about how to proceed.
Considering that the CES cluster is in production and heavily used, our plan
is to add a new node running 5.0.3-3 to the cluster, which is currently at
5.0.2.1.

We would like to proceed cautiously: at first the new node would not take any
IPs, and on just one day per week (which we would declare to be “at risk”) we
would move some IPs to it. After a few weeks of testing, if we saw no
problems, we would upgrade the rest of the cluster.

However, reading this documentation [1], it seems that we cannot have
multiple GPFS/SMB versions in the same cluster. In that case we could not
have a testing/acceptance phase and would have to make the full jump blindly.
Can someone confirm or deny this?

Thanks,
Ivano

[1] https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1ins_updatingsmb.htm

On 04.10.19, 12:55, "[email protected] on behalf of
Malahal R Naineni" <[email protected]> wrote:

    You can use 5.0.3.3. There is no fix for the sssd issue yet, though. I
    will work with the Ganesha upstream community on it pretty soon.
     
    Regards, Malahal.
    
    ----- Original message -----
    From: Leonardo Sala <[email protected]>
    To: gpfsug main discussion list <[email protected]>, 
"Malahal R Naineni" <[email protected]>, <[email protected]>
    Cc:
    Subject: [EXTERNAL] Re: [gpfsug-discuss] Filesystem access issues via CES 
NFS
    Date: Fri, Oct 4, 2019 12:02 PM
     
    Dear Malahal,
    Thanks for the answer. Concerning SSSD: we are also using it; should we
    use 5.0.2-PTF3? We would like to avoid 5.0.2.2, as it has issues with
    recent RHEL 7.6 kernels [*] that affect us. Do you suggest using 5.0.3.3
    instead?
    cheers
    leo
     
    [*] https://www.ibm.com/support/pages/ibm-spectrum-scale-gpfs-releases-42313-or-later-and-5022-or-later-have-issues-where-kernel-crashes-rhel76-0
    Paul Scherrer Institut
    Dr. Leonardo Sala
    Group Leader High Performance Computing
    Deputy Section Head Science IT
    Science IT
    WHGA/106
    5232 Villigen PSI
    Switzerland
    
    Phone: +41 56 310 3369
    [email protected]
    www.psi.ch
    On 03.10.19 19:15, Malahal R Naineni wrote:
    >> @Malahal: Looks like you have written the netgroup caching code, feel
    >> free to ask for further details if required.
     
    Hi Ulrich, Ganesha uses the innetgr() call for netgroup information, and
    sssd has too many issues in its implementation. Red Hat said that they
    are going to fix the sssd synchronization issues in RHEL 8. It is on my
    plate to serialize the innetgr() calls in Ganesha to match the kernel NFS
    server's usage. I would expect the sssd issue to produce EACCES/EPERM
    kinds of errors, though, not EINVAL.
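
    A minimal sketch of what serializing innetgr() behind a single lock could
    look like (an illustration only, not the actual Ganesha change; the
    wrapper name is made up):

    /* Illustrative only: funnel all innetgr() lookups through one mutex so
     * that worker threads never query the glibc/sssd netgroup code in
     * parallel, matching the usage pattern described above. */
    #include <netdb.h>
    #include <pthread.h>

    static pthread_mutex_t innetgr_lock = PTHREAD_MUTEX_INITIALIZER;

    int innetgr_serialized(const char *netgroup, const char *host,
                           const char *user, const char *domain)
    {
        int rc;

        pthread_mutex_lock(&innetgr_lock);
        rc = innetgr(netgroup, host, user, domain); /* 1 = match, 0 = no match */
        pthread_mutex_unlock(&innetgr_lock);
        return rc;
    }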
     
    If you are using sssd, you are most likely hitting an sssd issue. Ganesha
    has a host-IP cache fix in 5.0.2 PTF3. If you are using netgroups, please
    make sure you use Ganesha version V2.5.3-ibm030.01 (shipped with 5.0.2
    PTF3, but it can be used with Scale 5.0.1 or later).
     
    Regards, Malahal.
    
     
    
    ----- Original message -----
    From: Ulrich Sibiller <[email protected]>
    Sent by: [email protected]
    To: [email protected]
    Cc:
    Subject: Re: [gpfsug-discuss] Filesystem access issues via CES NFS
    Date: Thu, Dec 13, 2018 7:32 PM
     
    On 23.11.2018 14:41, Andreas Mattsson wrote:
    > Yes, this is repeating.
    >
    > We’ve ascertained that it has nothing to do at all with file operations
    > on the GPFS side.
    >
    > Randomly throughout the filesystem mounted via NFS, ls or file access
    > will give
    >
    > ”
    >  > ls: reading directory /gpfs/filessystem/test/testdir: Invalid argument
    > “
    >
    > Trying again later might work on that folder, but might fail somewhere
    > else.
    >
    > We have tried exporting the same filesystem via a standard kernel NFS
    > instead of the CES Ganesha-NFS, and then the problem doesn’t exist.
    >
    > So it is definitely related to the Ganesha NFS server, or its
    > interaction with the file system.
    >
    > Will see if I can get a tcpdump of the issue.
    
    We see this, too. We cannot trigger it. Fortunately I have managed to
    capture some logs with debugging enabled. I have now dug into the ganesha
    2.5.3 code and I think the netgroup caching is the culprit.
    
    Here is some FULL_DEBUG output:
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] export_check_access :EXPORT :M_DBG :Check for address 1.2.3.4 for export id 1 path /gpfsexport
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] client_match :EXPORT :M_DBG :Match V4: 0xcf7fe0 NETGROUP_CLIENT: netgroup1 (options=421021e2root_squash   , RWrw, 3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=    -2, anon_gid=    -2, sys)
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] nfs_ip_name_get :DISP :F_DBG :Cache get hit for 1.2.3.4->client1.domain
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] client_match :EXPORT :M_DBG :Match V4: 0xcfe320 NETGROUP_CLIENT: netgroup2 (options=421021e2root_squash   , RWrw, 3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=    -2, anon_gid=    -2, sys)
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] nfs_ip_name_get :DISP :F_DBG :Cache get hit for 1.2.3.4->client1.domain
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] client_match :EXPORT :M_DBG :Match V4: 0xcfe380 NETGROUP_CLIENT: netgroup3 (options=421021e2root_squash   , RWrw, 3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=    -2, anon_gid=    -2, sys)
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] nfs_ip_name_get :DISP :F_DBG :Cache get hit for 1.2.3.4->client1.domain
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] export_check_access :EXPORT :M_DBG :EXPORT          (options=03303002              ,     ,    ,       ,               , -- Deleg,                ,                )
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] export_check_access :EXPORT :M_DBG :EXPORT_DEFAULTS (options=42102002root_squash   , ----, 3--, ---, TCP, ----, Manage_Gids   ,         , anon_uid=    -2, anon_gid=    -2, sys)
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] export_check_access :EXPORT :M_DBG :default options (options=03303002root_squash   , ----, 34-, UDP, TCP, ----, No Manage_Gids, -- Deleg, anon_uid=    -2, anon_gid=    -2, none, sys)
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] export_check_access :EXPORT :M_DBG :Final options   (options=42102002root_squash   , ----, 3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=    -2, anon_gid=    -2, sys)
    2018-12-13 11:53:41 : epoch 0009008d : server1 : gpfs.ganesha.nfsd-258762[work-250] nfs_rpc_execute :DISP :INFO :DISP: INFO: Client ::ffff:1.2.3.4 is not allowed to access Export_Id 1 /gpfsexport, vers=3, proc=18
    
    The client "client1" is definitely a member of the "netgroup1". But the 
NETGROUP_CLIENT lookups for
    "netgroup2" and "netgroup3" can only happen if the netgroup caching code 
reports that "client1" is
    NOT a member of "netgroup1".
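
    As a quick cross-check outside of Ganesha's cache, innetgr() can be
    called directly; a small, hypothetical test program along these lines
    (not part of Ganesha) reports the membership as glibc/sssd sees it:

    /* Ask glibc/sssd directly whether a host is in a netgroup, bypassing
     * Ganesha entirely.  Build with e.g.: cc -o netgrcheck netgrcheck.c */
    #include <netdb.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <netgroup> <host>\n", argv[0]);
            return 2;
        }
        /* user and domain are NULL, so only the host field is matched;
         * innetgr() returns 1 on a match and 0 otherwise. */
        int match = innetgr(argv[1], argv[2], NULL, NULL);
        printf("%s %s a member of %s\n", argv[2],
               match ? "IS" : "is NOT", argv[1]);
        return match ? 0 : 1;
    }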
    
    I have also opened a support case at IBM for this.
    
    @Malahal: Looks like you have written the netgroup caching code, feel
    free to ask for further details if required.
    
    Kind regards,
    
    Ulrich Sibiller
    
    --
    Dipl.-Inf. Ulrich Sibiller           science + computing ag
    System Administration                    Hagellocher Weg 73
                                         72070 Tuebingen, Germany
                               
    https://atos.net/de/deutschland/sc
    --
    Science + Computing AG
    Vorstandsvorsitzender/Chairman of the board of management:
    Dr. Martin Matzke
    Vorstand/Board of Management:
    Matthias Schempp, Sabine Hohenstein
    Vorsitzender des Aufsichtsrats/
    Chairman of the Supervisory Board:
    Philippe Miltin
    Aufsichtsrat/Supervisory Board:
    Martin Wibbe, Ursula Morgenstern
    Sitz/Registered Office: Tuebingen
    Registergericht/Registration Court: Stuttgart
    Registernummer/Commercial Register No.: HRB 382196

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
