Ah. You, my friend, have been struck by a smooth criminal. And by smooth criminal I mean systemd. I ran into this last week and spent many hours banging my head against the wall trying to figure it out.

systemd by default limits a unit's cgroup to 512 tasks (the DefaultTasksMax setting), and since every thread counts as a task, that's most likely what you're running into.
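You can see the accounting for yourself with systemctl (a quick check; I'm assuming the daemon runs from a unit called gpfs.service here, adjust to whatever your install actually uses):

    # current vs. maximum task count accounted to the unit
    systemctl show -p TasksCurrent -p TasksMax gpfs.service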

Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then reboot (and yes, I mean reboot; changing it live doesn't seem possible, thanks to the infinite wisdom of the systemd developers).
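For reference, the stanza looks like this (just a sketch; the [Manager] section header is already present in the stock file):

    # /etc/systemd/system.conf
    [Manager]
    DefaultTasksMax=infinity

If you'd rather not lift the limit globally, a per-unit drop-in that sets TasksMax=infinity under [Service] should also work, though I haven't tried that route on SLES12 myself.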

The pid limit of a given slice/unit cgroup may already be overridden to something more reasonable than the 512 default. If, for example, you were logging in and starting it via ssh, the limit may be different than if it's started from the gpfs.service unit, because mmfsd effectively runs in a different cgroup in each case. You can verify that with the sketch below.
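Something like this will show you which cgroup mmfsd actually landed in and the pid limit that applies (cgroup v1 layout as on SLES12; the path under /sys/fs/cgroup/pids is just an example, substitute whatever the first command prints):

    # which cgroup(s) is mmfsd in?
    cat /proc/$(pgrep -o mmfsd)/cgroup
    # pid limit for that cgroup
    cat /sys/fs/cgroup/pids/system.slice/gpfs.service/pids.max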

Hope that helps!

-Aaron

On 3/16/18 10:25 AM, [email protected] wrote:
Hello GPFS Team,

We are observing strange behavior of GPFS during startup on a SLES12 node.

In our test cluster, we reinstalled the VLP1 node with SLES 12 SP3 as a base, and when GPFS starts for the first time on this node, it complains about too few NSD threads:

..

2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.2.3.7   Built: Feb 15 2018 11:38:38} ...

2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ...

2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ...

..

2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ...

2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ...

2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ...

2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 more threads, exceeds max thread count 1024

2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down.

2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not initialize network shared disks

2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11

2018-03-16_13:11:30.701+0100: runmmfs starting

Removing old /var/adm/ras/mmfs.log.* files:

2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd

2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup

GPFS then enters a loop and tries to respawn mmfsd periodically:

2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd

It seems that this issue can be worked around by running mmshutdown; when we then manually run mmstartup, the problem is gone.

We are running GPFS 4.2.3.7, and all nodes except VLP1 are running SLES11 SP4. Only on VLP1 did we install SLES12 SP3.

The test cluster looks as below:

Node  Daemon node name  IP address       Admin node name  Designation
-----------------------------------------------------------------------
   1   VLP0.cs-intern    192.168.101.210  VLP0.cs-intern   quorum-manager-snmp_collector
   2   VLP1.cs-intern    192.168.101.211  VLP1.cs-intern   quorum-manager
   3   TBP0.cs-intern    192.168.101.215  TBP0.cs-intern   quorum
   4   IDP0.cs-intern    192.168.101.110  IDP0.cs-intern
   5   IDP1.cs-intern    192.168.101.111  IDP1.cs-intern
   6   IDP2.cs-intern    192.168.101.112  IDP2.cs-intern
   7   IDP3.cs-intern    192.168.101.113  IDP3.cs-intern
   8   ICP0.cs-intern    192.168.101.10   ICP0.cs-intern
   9   ICP1.cs-intern    192.168.101.11   ICP1.cs-intern
  10   ICP2.cs-intern    192.168.101.12   ICP2.cs-intern
  11   ICP3.cs-intern    192.168.101.13   ICP3.cs-intern
  12   ICP4.cs-intern    192.168.101.14   ICP4.cs-intern
  13   ICP5.cs-intern    192.168.101.15   ICP5.cs-intern

We have enabled traces and reproduced the issue as follows:

1. When the GPFS daemon was in a respawn loop, we started traces; all files from this period can be found in the uploaded archive under the 1_nsd_threads_problem directory.

2. We manually stopped the respawn loop on VLP1 by executing mmshutdown and started GPFS manually with mmstartup. All traces from this run can be found in the archive under the 2_mmshutdown_mmstartup directory.

All data related to this problem has been uploaded to our ftp:

ftp.ts.fujitsu.com/CS-Diagnose/IBM (fe_cs_oem, 12Monkeys), file item435_nsd_threads.tar.gz

Could you please have a look at this problem?

Best regards,

Tomasz Wolski





--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
