Over the Christmas holiday, two nodes on our cluster froze up with the errors below:

Dec 27 09:43:13 dqshtc14 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [pvfs2-client-co:4867]
Dec 27 09:43:13 dqshtc14 kernel: Modules linked in: pvfs2(U) mpt2sas scsi_transport_sas raid_class mptctl mptbase ipmi_devintf dell_rbu autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding 8021q garp stp llc ipv6 power_meter sg shpchp bnx2x libcrc32c mdio dcdbas microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support ext4 mbcache jbd2 sd_mod crc_t10dif ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Our cluster has been running long enough, and under a heavy enough load, that I would have expected to see this already if it were a systemic problem. After some Googling and reading, we found many reports of this type of error on a variety of Linux distros, none involving PVFS. None of those reports offered a solution, though.

Has anyone in the PVFS community seen these errors before? Is this a bug in the PVFS client, in the kernel, or something else?

We are running RHEL 6.4, kernel 2.6.32, OrangeFS 2.8.7.

Thank you!
-Roger

-----------------------------------------------------------
Roger V. Moye
Systems Analyst III
XSEDE Campus Champion
University of Texas - MD Anderson Cancer Center
Division of Quantitative Sciences
Pickens Academic Tower - FCT4.6109
Houston, Texas
(713) 792-2134
-----------------------------------------------------------
