We sometimes see the same problem when running NAMD on Lustre. The console log suggests that a file lock expired, but I don't know why.
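In case it helps anyone compare notes, here is a rough sketch of how one might scan a syslog file for the two kinds of messages quoted below (client eviction and lock callback timeout). The patterns are copied from those quoted lines; the function name `scan_log` and the default path are my own assumptions, and other Lustre versions may word these messages differently.

```python
import re

# Patterns taken from the log lines quoted in this thread; other Lustre
# versions may phrase these messages differently (assumption).
PATTERNS = {
    "evicted": re.compile(r"This client was evicted by (\S+?);"),
    "lock_timeout": re.compile(r"lock callback timer expired after (\d+)s"),
}


def scan_log(lines):
    """Return (kind, detail, line) tuples for every line matching a pattern.

    `lines` is any iterable of strings, e.g. an open log file.
    `detail` is the evicting target name or the timeout in seconds.
    """
    hits = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((kind, match.group(1), line.rstrip("\n")))
    return hits


# Hypothetical usage on a client or OSS node:
#   with open("/var/log/messages", errors="replace") as log:
#       for kind, detail, line in scan_log(log):
#           print(kind, detail, line, sep="\t")
```

Running it on both the client and the OSS logs and comparing timestamps should show whether an eviction lines up with the time of the failed write.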
On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek <[email protected]> wrote:
> Hi Richard,
>
> If the cause of the I/O errors is Lustre there will be some message in the
> logs. I am seeing similar problem with some applications that run on our
> cluster. The symptoms are always the same, just before application crashes
> with I/O error node gets evicted with a message like that:
> LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
> progress operations using this service will fail.
>
> The OSS that mounts the OST from the above message has following line in the
> log:
> LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
> callback timer expired after 101s: evicting client at 10.143....@tcp ns:
> filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2
> lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
> [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote:
> 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
>
> Can you please check your logs for similar messages?
>
> Best regards
>
> Wojciech
>
> On 22 July 2010 23:43, Andreas Dilger <[email protected]> wrote:
>>
>> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
>> > I have a problem with the Scalable molecular dynamics software NAMD. It
>> > write restart files once in a while. But sometime the binary write
>> > crashes. The when it crashes is not constant. The only constant thing is
>> > it happens when it writes on our Lustre file system. When it write on
>> > something else, it is fine. I can't seem find any errors in any of the
>> > /var/log/messages. Anyone had any problems with NAMD?
>>
>> Rarely has anyone complained about Lustre not providing error messages
>> when there is a problem, so if there is nothing in /var/log/messages on
>> either the client or the server then it is hard to know whether it is a
>> Lustre problem or not...
>>
>> If possible, you could try running the application under strace (limited
>> to the IO calls, or it would be much too much data) to see which system call
>> the error is coming from.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Technical Lead
>> Oracle Corporation Canada Inc.
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
