Hi all,

in our 1.6.7.2 cluster (Debian, kernel 2.6.22), two servers with 2 and 3 OSTs respectively have started to block, in the sense that commands like "lfs df" have to wait for about 30 s when they reach these OSTs in the list. Some of our clients do not have this problem; some have these contact(?) problems with one server, some with the other, and the behavior varies over time. I have run "lfs df" five times without a problem, and only the sixth run stalled.
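To put numbers on the intermittent stalls, something like the following sketch could be run on an affected client. It only times repeated runs of a command and flags the slow ones; the helper name `time_runs` and the 5 s threshold are hypothetical choices for illustration, not anything provided by Lustre.

```python
import subprocess
import time


def time_runs(cmd, runs=10, slow_after=5.0):
    """Run cmd repeatedly and return a list of (elapsed_seconds, is_slow) tuples.

    slow_after is an arbitrary threshold in seconds (an assumption; the
    stalls described above are around 30 s, so any small value would do).
    """
    results = []
    for _ in range(runs):
        start = time.monotonic()
        # Output is discarded; only the wall-clock duration matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        elapsed = time.monotonic() - start
        results.append((elapsed, elapsed > slow_after))
    return results


# Example call on a client (requires the lfs utility to be installed):
#   for i, (t, slow) in enumerate(time_runs(["lfs", "df"]), start=1):
#       print(f"run {i}: {t:.1f}s" + ("  <-- slow" if slow else ""))
```

Running this from several clients at the same time would show whether the stalls line up across clients or are specific to each one.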
What really distinguishes these lame OSS machines from all the others is that each has one ll_ost_123 thread that takes up one CPU core entirely. Since our servers have 8 cores and 8 GB RAM each, I didn't think this would actually impede Lustre operations. Btw, I have "options ost oss_num_threads=256" in the modprobe configuration on these servers. There is no entry in the clients' logs connected with this behavior.

One of these OSSs had 2 of its 3 OSTs attached somewhat later than the first one. Hence, the younger 2 appear later in a listing of OSTs such as the one you get from "lfs df". None of the clients stalls on these two OSTs. I conclude that I am not dealing with network problems.

Now for the OSS logs: there are indeed 'new' error messages.

Nov 3 18:49:58 OSS kernel: Lustre: 13086:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer client...@tcp detected: 4; the oldest (ffff81010fc15000) timed out 0 secs ago
Nov 3 18:55:32 OSS kernel: LustreError: 13323:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 r...@ffff8102005cbc00 x155576/t0 o106->@NET_0x200000a0c4487_UUID:15/16 lens 232/296 e 0 to 1 dl 1257270939 ref 2 fl Rpc:/2/0 rc 0/0
Nov 3 18:55:32 OSS kernel: LustreError: 13323:0:(events.c:66:request_out_callback()) Skipped 68485395 previous similar messages

Does status -5 mean Linux error code -5 = I/O error? Silent disk corruption? Of course I don't have any other indications of hard disk failure. There was a power outage, however. But that was already a week ago, and we did not see this behavior before today.

Is there anything I can do to get rid of these annoying ll_ost threads in the running system? Of course I'm not sure they are the root of the problem...

Regards,
Thomas
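As a quick check of the errno reading of "status -5", here is a minimal Python sketch. It assumes the status field in the log line follows the usual kernel convention of negative errno values; that is an assumption about this message, not something the log itself confirms.

```python
import errno
import os

status = -5          # value copied from the "status -5" field in the OSS log
code = abs(status)   # assumption: the field is a negated errno value

name = errno.errorcode[code]     # symbolic errno name for that number
description = os.strerror(code)  # platform's human-readable message

print(f"{status} -> {name}: {description}")
```

On Linux, errno 5 is EIO, the generic I/O error. The same code is used for failed network sends as well as for disk errors, so by itself it does not point to disk corruption.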
