On 24/06/2025 19:00, Truong Vu wrote:

There is an undocumented option for this purpose. You can issue mmdelnode -f on the bad node. This cleans up leftover configuration and stops/starts services if needed.

Thanks, one to remember then. Though to be honest having used GPFS now
for 18+ years that is the first time I have needed it.

Specifically dual EPYC 9555
If tsgskkm is hung, you may hit a known gskit issue. Can you
manually apply the workaround and see if it works?

Insert the following lines into the file /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt:

ICC_SHIFT=3
ICC_TRNG=TRNG_ALT4

Insert the following line into the file /usr/lpp/mmfs/lib/gsk8/N/icc/icclib/ICCSIG.txt:

ICC_TRNG=TRNG_ALT4
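
For anyone needing to roll this out to more than one node, the edit above can be sketched as a small idempotent Python helper. This is only a sketch: the paths and settings are those quoted in the message, ensure_lines is a hypothetical name, and it assumes that each setting goes on its own line and that appending at the end of ICCSIG.txt is acceptable (the message only says "insert").

```python
from pathlib import Path

# Paths and settings as quoted in the message above. Treating ICC_SHIFT and
# ICC_TRNG as two separate lines is an assumption.
WORKAROUND = {
    "/usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt": ["ICC_SHIFT=3", "ICC_TRNG=TRNG_ALT4"],
    "/usr/lpp/mmfs/lib/gsk8/N/icc/icclib/ICCSIG.txt": ["ICC_TRNG=TRNG_ALT4"],
}


def ensure_lines(path, wanted):
    """Append each wanted line to path unless an identical line is already there.

    Returns the list of lines that were actually added.
    """
    p = Path(path)
    text = p.read_text()
    present = {line.strip() for line in text.splitlines()}
    missing = [line for line in wanted if line not in present]
    if missing:
        # Start on a fresh line if the file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with p.open("a") as fh:
            fh.write(prefix + "\n".join(missing) + "\n")
    return missing


if __name__ == "__main__":
    for path, lines in WORKAROUND.items():
        added = ensure_lines(path, lines)
        print(f"{path}: added {added if added else 'nothing'}")
```

Running it a second time adds nothing, so it should be safe to repeat after an upgrade if the files get replaced. Taking a copy of each ICCSIG.txt first would be prudent.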


That did the trick.

A quick Google shows this was an issue back in 2020 (on this list) with GPFS 4.2 on AMD EPYC. There is also this APAR from 2023:

https://www.ibm.com/support/pages/apar/IJ43790

The suggested fix there is a little different too.

However, I already have some AMD EPYC 7513 based servers on the system running 5.1.9-6 (to be upgraded real soon now to 5.2.2-1) which, according to lscpu, are CPU family 25. I have no recollection of doing anything special, and I don't see the fix in the files.

Can you post lscpu output?


See below. My educated guess is that this is CPU family 26, and whatever fix IBM introduced for CPU family 25 doesn't work on Zen 5 CPUs.


JAB.


Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   256
  On-line CPU(s) list:    0-255
Vendor ID:                AuthenticAMD
  BIOS Vendor ID:         AMD
  Model name:             AMD EPYC 9555 64-Core Processor
    BIOS Model name:      AMD EPYC 9555 64-Core Processor
    CPU family:           26
    Model:                2
    Thread(s) per core:   2
    Core(s) per socket:   64
    Socket(s):            2
    Stepping:             1
    Frequency boost:      enabled
    CPU(s) scaling MHz:   72%
    CPU max MHz:          4409.3750
    CPU min MHz:          1500.0000
    BogoMIPS:             6390.74
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca avx512_vp2intersect flush_l1d debug_swap amd_lbr_pmc_freeze
Virtualization features:
  Virtualization:         AMD-V
Caches (sum of all):
  L1d:                    6 MiB (128 instances)
  L1i:                    4 MiB (128 instances)
  L2:                     128 MiB (128 instances)
  L3:                     512 MiB (16 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-63,128-191
  NUMA node1 CPU(s):      64-127,192-255
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
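
To spot which nodes in a mixed cluster might be affected, the "CPU family" field can be pulled out of lscpu output like the above. The following is a rough sketch, not anything IBM documents: cpu_family and needs_manual_check are hypothetical names, and the threshold of 26 rests only on what was observed in this thread (family 25 nodes working untouched, the family 26 node hanging in tsgskkm).

```python
import re
import subprocess


def cpu_family(lscpu_text):
    """Return the integer after 'CPU family:' in lscpu output, or None."""
    # Anchoring at line start avoids matching a 'BIOS CPU family:' line.
    match = re.search(r"^\s*CPU family:\s*(\d+)\s*$", lscpu_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_manual_check(family):
    # Assumption drawn from this thread only: family 25 nodes worked without
    # manual changes, the family 26 node needed the ICCSIG.txt workaround.
    return family is not None and family >= 26


if __name__ == "__main__":
    out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
    fam = cpu_family(out)
    print(f"CPU family {fam}: check for workaround = {needs_manual_check(fam)}")
```

This assumes lscpu prints its labels in English; on a node with a different locale the field name may differ.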


--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
