Oleg Drokin wrote:
>
> What would be useful here is if you can enable dlm tracing
> (echo +dlm_trace >/proc/sys/lnet/debug) on some of those 1.6 nodes
> (also if you are running with no debug enabled at all, also enable
> rpc_trace and info levels) and also enable "dump on eviction" feature.
> (echo 1 >/proc/sys/lustre/dump_on_eviction).
> Then when next eviction happens, there would be some useful debug data
> dumped on the client, that you can attach to a bugzilla bug along with
> server-side eviction message (processed with "lctl dl" command first).
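
For anyone applying the same advice across several clients, the two echo commands quoted above could be wrapped in a small shell function along the following lines. This is only a sketch: the function name `enable_eviction_debug` and its optional first argument (a replacement for /proc/sys, useful for a dry run against a scratch directory) are hypothetical and not part of Lustre. The /proc paths and flag names are copied verbatim from the quoted advice and have not been checked against any particular Lustre release.

```shell
#!/bin/sh
# Sketch only: wraps the commands from the quoted advice.
# enable_eviction_debug and its proc_root argument are hypothetical helpers,
# not Lustre tools; paths and flag names are taken verbatim from the quote.

enable_eviction_debug() {
    # First argument: root of the sysctl tree (defaults to the real /proc/sys).
    proc_root="${1:-/proc/sys}"
    if [ "$#" -gt 0 ]; then
        shift
    fi

    debug_file="$proc_root/lnet/debug"
    dump_file="$proc_root/lustre/dump_on_eviction"

    # Refuse to continue if either control file is missing or not writable.
    for f in "$debug_file" "$dump_file"; do
        if [ ! -w "$f" ]; then
            echo "cannot write $f (are the Lustre modules loaded, are you root?)" >&2
            return 1
        fi
    done

    # Always add dlm_trace; any remaining arguments are extra debug flags,
    # e.g. rpc_trace and info when no debugging was enabled beforehand.
    for flag in dlm_trace "$@"; do
        echo "+$flag" > "$debug_file" || return 1
    done

    # Ask the client to dump its debug log when it is evicted.
    echo 1 > "$dump_file" || return 1
}
```

On a client with no debugging enabled, this would be run as root as `enable_eviction_debug /proc/sys rpc_trace info`; on a client that already has some debugging enabled, `enable_eviction_debug` alone adds only dlm_trace.
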

OK, will do. The main problem is reproducing the error: our users have
unreasonably insisted that we run their jobs using known-good 1.4
clients and even if I grab their code to run on isolated test nodes
_most_ runs are fine.

>> We are also seeing some userspace file operations fail with the error
>> "No locks available". These don't generate any logging on the client
>> so I don't have exact timing. It's possible that they are associated
>> with further "### lock callback timer expired" server logs.
>
> This error code typically means an application attempting to do some
> i/o and Lustre has no lock for the i/o area for some reason anymore
> (it is normally obtained once read or write path is entered), and that
> could be related to evictions too (locks are revoked at eviction time).

I should have mentioned that we are also seeing many errors of the form

"LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received cancel for unknown lock cookie."

Checking back, these would seem to pre-date the introduction of 1.6
clients and even after we upgraded clients I can see them associated
with both 1.4 and 1.6 clients. They may indicate something else relevant
about the filesystems or workload.

Cheers,
Simon.
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
