A server can evict a client for many kinds of reasons, maybe a network error, maybe a ptlrpcd bug, but in my experience the only time I actually see the I/O error is when running NAMD on a Lustre filesystem. I do see other "evict" events sometimes, but none of them results in an I/O error. So besides the "evict client", there may be something else causing the "I/O error".
On Fri, Jul 23, 2010 at 6:54 PM, Wojciech Turek <[email protected]> wrote:
> There is a similar thread on this mailing list:
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients#
> Also there is a bug open which reports a similar problem:
> https://bugzilla.lustre.org/show_bug.cgi?id=23190
>
> On 23 July 2010 10:02, Larry <[email protected]> wrote:
>>
>> We sometimes have the same problem when running NAMD on Lustre. The
>> console log suggests a file lock expired, but I don't know why.
>>
>> On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek <[email protected]> wrote:
>> > Hi Richard,
>> >
>> > If the cause of the I/O errors is Lustre, there will be some message in
>> > the logs. I am seeing a similar problem with some applications that run
>> > on our cluster. The symptoms are always the same: just before the
>> > application crashes with an I/O error, the node gets evicted with a
>> > message like this:
>> >
>> > LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
>> > progress operations using this service will fail.
>> >
>> > The OSS that mounts the OST from the above message has the following
>> > line in its log:
>> >
>> > LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
>> > callback timer expired after 101s: evicting client at 10.143....@tcp ns:
>> > filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2
>> > lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
>> > [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20
>> > remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
>> >
>> > Can you please check your logs for similar messages?
>> >
>> > Best regards
>> >
>> > Wojciech
>> >
>> > On 22 July 2010 23:43, Andreas Dilger <[email protected]> wrote:
>> >>
>> >> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
>> >> > I have a problem with the Scalable molecular dynamics software NAMD.
>> >> > It writes restart files once in a while, but sometimes the binary
>> >> > write crashes. When it crashes is not constant; the only constant
>> >> > thing is that it happens when it writes on our Lustre file system.
>> >> > When it writes to something else, it is fine. I can't seem to find
>> >> > any errors in any of the /var/log/messages. Has anyone had any
>> >> > problems with NAMD?
>> >>
>> >> Rarely has anyone complained about Lustre not providing error messages
>> >> when there is a problem, so if there is nothing in /var/log/messages on
>> >> either the client or the server then it is hard to know whether it is a
>> >> Lustre problem or not...
>> >>
>> >> If possible, you could try running the application under strace
>> >> (limited to the I/O calls, or it would be much too much data) to see
>> >> which system call the error is coming from.
>> >>
>> >> Cheers, Andreas
>> >> --
>> >> Andreas Dilger
>> >> Lustre Technical Lead
>> >> Oracle Corporation Canada Inc.
>> >>
>> >> _______________________________________________
>> >> Lustre-discuss mailing list
>> >> [email protected]
>> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
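Andreas's strace suggestion is about narrowing the failure down to a specific system call and errno. The same information can also be captured in-process: a minimal sketch (plain Python, nothing NAMD- or Lustre-specific; the path and payload here are made up for illustration) showing where an eviction-induced EIO would typically surface, namely on the write() or fsync() of the restart file:

```python
import errno
import os

def write_restart(path, data):
    """Write a restart file, reporting which syscall failed and with what errno."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as e:
        return "open failed: " + errno.errorcode[e.errno]
    try:
        os.write(fd, data)
        os.fsync(fd)  # force data to the server; a stale lock often surfaces here
    except OSError as e:
        # On a client evicted mid-write, this is where EIO would show up.
        return "write/fsync failed: " + errno.errorcode[e.errno]
    finally:
        os.close(fd)
    return "ok"

print(write_restart("/tmp/restart.coor", b"\x00" * 4096))  # prints "ok" on a healthy filesystem
```

This is the same breakdown strace would give (which call, which errno), just reported by the application itself, which can be handy when attaching strace to a large parallel job is impractical.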
