Hello,

We've got a problem here that we hope someone can help us with. We have a few 1.8.5 OSS nodes which seem to get locked up Lustre-wise on our tcp clients from time to time. This is a recent phenomenon. We are not sure, but we think it may be related to a particular workload. Our o2ib clients don't seem to have any trouble.

'lfs df' shows "Resource temporarily unavailable" for all OSTs on the affected OSS on all tcp clients when this happens. When we look on the OSS itself, we see socknal_sd and ll_ost_io processes consuming cycles:

>   PID USER PR NI VIRT RES SHR S %CPU %MEM     TIME+ COMMAND
> 10954 root 16  0    0   0   0 R 66.6  0.0 515:48.58 socknal_sd02
> 11370 root 16  0    0   0   0 R 64.9  0.0 241:21.83 ll_ost_io_91
> 10959 root 19  0    0   0   0 R 49.7  0.0 111:53.27 socknal_sd07

There are plenty of cycles free on each core of the OSS, though. We do see that plenty of Lustre logs were dumped as well, after service threads were inactive for 20 minutes. I haven't been able to learn much from 'lctl debug_file' yet.

Further, we can see from 'netstat -t' that the Recv-Q count is increasing on the client connections, never decreasing. The Send-Q count is zero for all but two clients, where it seems to be a constant non-zero value (a few to several hundred K).

Anyway, it seems like the socknal and/or ll_ost_io_91 processes above are just stuck doing nothing productive. Syslog messages aren't telling me why.

Has anyone seen anything like this? We know that after rebooting the OSS our tcp clients will start working again.

Thanks,
Craig Prescott
UF HPC Center

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
