We have been running Lustre for a few years now, and today was the first time I came upon something I hadn't seen before. The Lustre partition was mounted and I could access files within it; however, the minute I started opening large files, it became unstable and hung. The system load shot up to 33 (on the headnode client) and Lustre was using approximately 6 GB of memory. I stopped all of our services that write into the Lustre partition and unmounted /lustre. Tailing the logs during this process, I saw:
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 308135 previous similar messages
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 308135 previous similar messages
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 710099 previous similar messages
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 710099 previous similar messages

These messages repeated over and over. A few minutes later, Lustre finished unmounting and freed the 6 GB of memory it had been using. I didn't see anything wrong with our OSTs, so I remounted the Lustre partition on the headnode, and now everything is back to normal. I'm wondering what could have caused this in the first place?

Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

--
Jeremy Mann
[email protected]
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
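One hint worth noting about the log excerpt above: Lustre reports negated kernel errno values, so "rc -108" should correspond to errno 108, which on Linux is ESHUTDOWN ("Cannot send after transport endpoint shutdown") -- i.e. the client's connection to the target had been shut down when it tried to send the lock-cancel RPCs. Assuming a Linux box, the mapping can be checked with a quick Python snippet:

```python
import errno
import os

# Lustre log messages report negated kernel errno values,
# so "rc -108" in the ldlm_cli_cancel_req() lines means errno 108.
rc = -108
code = -rc

print(errno.errorcode[code])  # symbolic name, e.g. ESHUTDOWN on Linux
print(os.strerror(code))      # human-readable description
```

The symbolic name and description come from the host's errno tables, so this reflects the kernel's meaning for the value, not anything Lustre-specific.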
