After some puzzling debugging on our new Broadwell servers, all of which slowly 
became brick-like after getting stuck starting GPFS, we discovered that 
this was already a known issue in the FAQ.  Adding "nosmap" to the kernel 
command line in grub disables SMAP, which otherwise treats the kernel-userspace 
memory interactions of GPFS as a reason to slowly grind all cores to a 
standstill, apparently spinning on stuck locks(?).  (Big thanks go to Red Hat 
for turning us on to the answer when we opened a case.)

From https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html, 
section 3.2:

   Note:  In order for IBM Spectrum Scale on RHEL 7 to run on the Haswell processor
*       Disable the Supervisor Mode Access Prevention (smap) kernel parameter
*       Reboot the RHEL 7 node before using GPFS


Some observations worth noting:

1.      We've been running for a year with Haswell processors and have hundreds 
of Haswell RHEL7 nodes which do not exhibit this problem.  So maybe this only 
really affects Broadwell CPUs?

2.      It would be very nice for Spectrum Scale to take a peek at /proc/cpuinfo 
and /proc/cmdline before starting up, and refuse to break the host when it has 
affected processors and kernel without "nosmap".  Instead, an error message 
describing the fix would have made my day.

3.      I'm going to have to start using a script to diff the FAQ for these 
gotchas, unless anyone knows of a better way to subscribe just to updates to 
this doc.
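
In the meantime, the check described in observation 2 could be sketched locally 
as something like the following.  This is only a rough sketch, not anything 
Spectrum Scale ships: the function names are made up for illustration, and it 
assumes the CPU feature shows up as the "smap" flag in /proc/cpuinfo and that 
the workaround appears as a standalone "nosmap" token in /proc/cmdline.  It does 
not check the OS or kernel version.

```python
def has_smap(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists 'smap'."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Lines look like "flags<TAB><TAB>: fpu vme ... smap ..."
            _, _, flags = line.partition(":")
            if "smap" in flags.split():
                return True
    return False


def has_nosmap(cmdline_text):
    """Return True if 'nosmap' is a token on the kernel command line."""
    return "nosmap" in cmdline_text.split()


def smap_risk(cpuinfo_text, cmdline_text):
    """True when the CPU advertises SMAP and 'nosmap' was not passed at boot."""
    return has_smap(cpuinfo_text) and not has_nosmap(cmdline_text)


def check_host(cpuinfo_path="/proc/cpuinfo", cmdline_path="/proc/cmdline"):
    """Read the live /proc files and report whether this host looks at risk."""
    with open(cpuinfo_path) as f:
        cpuinfo_text = f.read()
    with open(cmdline_path) as f:
        cmdline_text = f.read()
    return smap_risk(cpuinfo_text, cmdline_text)
```

A wrapper could call check_host() before starting GPFS and bail out with a 
message pointing at the FAQ entry if it returns True.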

Thanks,
Paul Sanchez

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
