On Jan 22, 2009 14:05 -0600, Jeremy Mann wrote:
> We have been running Lustre for a few years now and today was the first
> time I came upon something I haven't seen before. The Lustre partition was
> mounted and I could access files within it, however the minute I started
> opening the large files, it became unstable and hung. The system load shot
> up to 33 (on the headnode client) and Lustre was using approximately 6 GB
> of memory. I stopped all of our services that write into the Lustre
> partition and unmounted /lustre. Tailing the logs during this process, I
> saw:
>
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 710099 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 710099 previous similar messages
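[An aside on the return code: rc -108 is ESHUTDOWN, which here means the
client kept retrying lock-cancel RPCs on a connection that had already been
shut down; that is why the "Skipped N previous similar messages" counts grew
so large. The errno mapping can be confirmed from the kernel headers; the
header path below is typical for Linux but may vary by distro:

    $ grep ESHUTDOWN /usr/include/asm-generic/errno.h
    #define ESHUTDOWN       108     /* Cannot send after transport endpoint shutdown */
]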
With so many skipped messages, it appears this node is in a tight loop for
some reason. Is this client mounted on the same node as the MDS, perhaps?
That isn't an excuse for hitting such a problem, but it might explain why it
was in such a tight loop that it was DOS-ing your filesystem.

> Over and over again. A few minutes later, Lustre unmounted and freed up
> the 6 GB of memory it was using. I didn't see anything wrong with our OSTs
> and remounted the Lustre partition on the headnode, and now everything is
> back to normal. I'm wondering what could have caused this in the first
> place?
>
> Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

If it is 1.6.5.1, it might be the statahead bug. Please check the archives
for the many discussions of workarounds. There was also a recent patch (not
yet in any release) to make the dynamic lock LRU sizing code use less CPU,
which may have contributed to this problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
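[For reference, the workaround most often cited in those archive threads is
to disable statahead on the affected client. A minimal sketch, assuming the
/proc tunable present on 1.6.x clients; verify the exact path on your
system:

    # check the current setting on each mounted Lustre filesystem
    cat /proc/fs/lustre/llite/*/statahead_max
    # 0 disables statahead entirely; a nonzero value re-enables it later
    for f in /proc/fs/lustre/llite/*/statahead_max; do
        echo 0 > "$f"
    done

This only needs to be done on the client(s) showing the problem, and the
setting typically does not persist across remounts.]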
