Hi all,

I've encountered a LustreError that might have triggered an unwanted failover of an MGS/MGD HA pair of servers. I'm not sure about the failover connection, but I have found no trace of this error via Google, so it might be worth considering. It occurred in this form only on the two occasions when the heartbeat monitoring failed shortly afterwards:
  kern.log.1:Jul 20 06:47:19 kernel: LustreError: 27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0
  kern.log.1:Jul 20 06:47:41 kernel: LustreError: 27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0

There was no Lustre log activity the day before that, the last entry before it being the eviction of a client at Jul 25 19:31:09.

The system is running Lustre 1.6.3, kernel 2.6.22, Debian Etch.

There are some more 'acquire timeout' messages dating from Jul 24 and 25, however not for 'key 0' but for keys 4209, 4409, ..., whatever this may mean. There were no "fatal" consequences then.

On Jul 27, the same happened again:

  kern.log:Jul 27 06:47:17 kernel: LustreError: 24327:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0
  kern.log:Jul 27 06:47:37 kernel: LustreError: 22381:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0

This time it took heartbeat only seconds to lose its IP:

  lrmd[10373]: 2008/07/27_06:47:31 WARN: IPaddr:monitor process (PID 26903) timed out (try 1). Killing with signal SIGTERM (15).

On another system running Lustre 1.6.5, without any heartbeat errors, it was:

  kern.log:Jul 27 06:47:20 kernel: LustreError: 4627:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
  kern.log:Jul 27 06:47:37 kernel: LustreError: 3581:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0

Of course these temporal coincidences look verrrrry suspicious. So far, I have no idea what kind of weird script might be running at these times and causing all the trouble; still, I'm already looking forward to next Sunday ;-)

But it would be nice if somebody could explain these Lustre errors, and perhaps assure me that they are entirely harmless and cannot possibly have any influence on the stability of the system.
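To make the coincidence easier to check on other servers, here is a rough sketch of how the 'acquire timeout' lines could be grouped by weekday and minute. It is only an illustration: the names `parse_timeouts`, `weekly_slots` and `SAMPLE` are made up for this example, the year 2008 is assumed because syslog timestamps carry no year, and the optional hostname field in the pattern is a guess at what a full syslog line looks like.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the LustreError lines quoted in this post. The hostname between
# the timestamp and "kernel:" is optional (it is missing in the excerpts).
PATTERN = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?:\S+ )?kernel: "
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\(upcall_cache\.c:\d+:upcall_cache_get_entry\(\)\) "
    r"acquire timeout exceeded for key (?P<key>\d+)"
)

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# The four 'key 0' lines from the Lustre 1.6.3 system, copied from above.
SAMPLE = [
    "kern.log.1:Jul 20 06:47:19 kernel: LustreError: 27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0",
    "kern.log.1:Jul 20 06:47:41 kernel: LustreError: 27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0",
    "kern.log:Jul 27 06:47:17 kernel: LustreError: 24327:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0",
    "kern.log:Jul 27 06:47:37 kernel: LustreError: 22381:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0",
]


def parse_timeouts(lines, year=2008):
    """Return (timestamp, pid, key) for every 'acquire timeout' line.

    The year is an assumption supplied by the caller, since the syslog
    timestamp does not include one.
    """
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not an upcall_cache timeout line
        when = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
        events.append((when, int(match["pid"]), int(match["key"])))
    return events


def weekly_slots(events):
    """Count events per (weekday, HH:MM) slot to expose weekly patterns."""
    return Counter(
        (DAYS[when.weekday()], when.strftime("%H:%M")) for when, _pid, _key in events
    )


if __name__ == "__main__":
    for slot, count in weekly_slots(parse_timeouts(SAMPLE)).items():
        print(slot, count)
```

Fed the output of `grep 'acquire timeout' kern.log*`, this should show whether the 'key 0' messages always fall into the same weekly slot, which would point at a scheduled job rather than at Lustre itself.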
Thanks,
Thomas
