Hi Aaron, Thanks for the hint! :)
Best regards,
Tomasz Wolski

> -----Original Message-----
> From: [email protected] [mailto:gpfsug-discuss-[email protected]] On Behalf Of Aaron Knister
> Sent: Friday, March 16, 2018 3:52 PM
> To: [email protected]
> Subject: Re: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads
>
> Ah. You, my friend, have been struck by a smooth criminal. And by smooth
> criminal I mean systemd. I ran into this last week and spent many hours
> banging my head against the wall trying to figure it out.
>
> systemd by default limits cgroups to I think 512 tasks and since a thread
> counts as a task that's likely what you're running into.
>
> Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then
> reboot (and yes, I mean reboot. Changing it live doesn't seem possible
> because of the infinite wisdom of the systemd developers).
>
> The pid limit of a given slice/unit cgroup may already be overridden to
> something more reasonable than the 512 default so if, for example, you
> were logging in and starting it via ssh the limit may be different than if
> it's started from the gpfs.service unit because mmfsd effectively is
> running in different cgroups in each case.
>
> Hope that helps!
>
> -Aaron
>
> On 3/16/18 10:25 AM, [email protected] wrote:
> > Hello GPFS Team,
> >
> > We are observing strange behavior of GPFS during startup on a SLES12 node.
> >
> > In our test cluster, we reinstalled the VLP1 node with SLES 12 SP3 as a
> > base and when GPFS starts for the first time on this node, it
> > complains about too few NSD threads:
> >
> > ..
> > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.2.3.7 Built: Feb 15 2018 11:38:38} ...
> > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ...
> > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ...
> > ..
> > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ...
> > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ...
> > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ...
> > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 more threads, exceeds max thread count 1024_*
> > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down.
> > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not initialize network shared disks
> > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11
> > 2018-03-16_13:11:30.701+0100: runmmfs starting
> > Removing old /var/adm/ras/mmfs.log.* files:
> > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd
> > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup
> >
> > GPFS enters a loop and tries to respawn mmfsd periodically:
> >
> > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd_*
> >
> > It seems that this issue can be resolved by running mmshutdown. Later,
> > when we manually run mmstartup, the problem is gone.
> >
> > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running
> > SLES11 SP4. Only on VLP1 did we install SLES12 SP3.
> >
> > The test cluster looks as below:
> >
> >  Node  Daemon node name  IP address       Admin node name  Designation
> > -----------------------------------------------------------------------------------------
> >     1  VLP0.cs-intern    192.168.101.210  VLP0.cs-intern   quorum-manager-snmp_collector
> >     2  VLP1.cs-intern    192.168.101.211  VLP1.cs-intern   quorum-manager
> >     3  TBP0.cs-intern    192.168.101.215  TBP0.cs-intern   quorum
> >     4  IDP0.cs-intern    192.168.101.110  IDP0.cs-intern
> >     5  IDP1.cs-intern    192.168.101.111  IDP1.cs-intern
> >     6  IDP2.cs-intern    192.168.101.112  IDP2.cs-intern
> >     7  IDP3.cs-intern    192.168.101.113  IDP3.cs-intern
> >     8  ICP0.cs-intern    192.168.101.10   ICP0.cs-intern
> >     9  ICP1.cs-intern    192.168.101.11   ICP1.cs-intern
> >    10  ICP2.cs-intern    192.168.101.12   ICP2.cs-intern
> >    11  ICP3.cs-intern    192.168.101.13   ICP3.cs-intern
> >    12  ICP4.cs-intern    192.168.101.14   ICP4.cs-intern
> >    13  ICP5.cs-intern    192.168.101.15   ICP5.cs-intern
> >
> > We have enabled traces and reproduced the issue as follows:
> >
> > 1. When GPFS daemon was in a respawn loop, we have started traces, all
> >    files from this period you can find in uploaded archive under
> >    *_1_nsd_threads_problem_* directory
> >
> > 2. We have manually stopped the “respawn” loop on VLP1 by executing
> >    mmshutdown and start GPFS manually by mmstartup. All traces from this
> >    execution can be found in archive file under
> >    *_2_mmshutdown_mmstartup_* directory
> >
> > All data related to this problem is uploaded to our ftp to file:
> >
> > ftp.ts.fujitsu.com/CS-Diagnose/IBM
> > <ftp://ftp.ts.fujitsu.com/CS-Diagnose/IBM>, (fe_cs_oem, 12Monkeys)
> > item435_nsd_threads.tar.gz
> >
> > Could you please have a look at this problem?
> >
> > Best regards,
> >
> > Tomasz Wolski
> >
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
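
For anyone who wants to check whether the cgroup task limit described in this thread applies to their own mmfsd, here is a minimal sketch in Python. It is an illustration, not something taken from the GPFS or systemd documentation. It assumes the pids controller is mounted at /sys/fs/cgroup/pids (cgroup v1) or that the unified hierarchy lives at /sys/fs/cgroup (cgroup v2). All function names are made up for this example.

```python
from pathlib import Path
from typing import Optional


def pids_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup path that the pids controller applies to.

    Handles cgroup v1 lines such as "11:pids:/system.slice/gpfs.service"
    and the cgroup v2 line "0::/system.slice/gpfs.service".
    """
    unified = None
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip anything that is not "id:controllers:path"
        _hierarchy_id, controllers, path = parts
        if "pids" in controllers.split(","):
            return path  # cgroup v1 pids hierarchy wins if present
        if controllers == "":
            unified = path  # cgroup v2 unified hierarchy as fallback
    return unified


def parse_pids_max(text: str) -> Optional[int]:
    """Parse the content of a pids.max file; None means unlimited ("max")."""
    value = text.strip()
    return None if value == "max" else int(value)


def headroom(pids_max_text: str, pids_current_text: str) -> Optional[int]:
    """Tasks still available in the cgroup, or None if the limit is unlimited."""
    limit = parse_pids_max(pids_max_text)
    if limit is None:
        return None
    return limit - int(pids_current_text.strip())


def read_pids_limit(pid: str = "self") -> Optional[str]:
    """Read pids.max for the cgroup of the given PID, or None if not found.

    The two mount points below are assumptions about the host layout.
    """
    cgroup_path = pids_cgroup_path(Path(f"/proc/{pid}/cgroup").read_text())
    if cgroup_path is None:
        return None
    for root in ("/sys/fs/cgroup/pids", "/sys/fs/cgroup"):
        candidate = Path(root + cgroup_path) / "pids.max"
        if candidate.is_file():
            return candidate.read_text().strip()
    return None
```

Calling `read_pids_limit()` with the PID of mmfsd (as a string) once for a daemon started from the gpfs.service unit and once for a daemon started from an ssh session should show whether the two cases really end up under different limits, as suggested above. A result of `None` only means no pids.max file was found at the assumed locations.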
