A server can evict a client for many kinds of reasons, maybe a network error, maybe a ptlrpcd bug, but in my experience the only time I actually see the I/O error is when running NAMD on a Lustre filesystem. I do see other "evict" events sometimes, but none of them results in an I/O error. So besides the "evict client", there may be something else causing the "I/O error".
On Fri, Jul 23, 2010 at 6:54 PM, Wojciech Turek <[email protected]> wrote:
> There is a similar thread on this mailing list:
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients#
> Also there is a bug open which reports a similar problem:
> https://bugzilla.lustre.org/show_bug.cgi?id=23190
>
> On 23 July 2010 10:02, Larry <[email protected]> wrote:
>>
>> We sometimes have the same problem when running NAMD on Lustre. The
>> console log suggests a file lock expired, but I don't know why.
>>
>> On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek <[email protected]> wrote:
>> > Hi Richard,
>> >
>> > If the cause of the I/O errors is Lustre, there will be some message in
>> > the logs. I am seeing a similar problem with some applications that run
>> > on our cluster. The symptoms are always the same: just before the
>> > application crashes with an I/O error, the node gets evicted with a
>> > message like this:
>> >
>> > LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
>> > progress operations using this service will fail.
>> >
>> > The OSS that mounts the OST from the above message has the following
>> > line in its log:
>> >
>> > LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
>> > callback timer expired after 101s: evicting client at 10.143....@tcp ns:
>> > filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2
>> > lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
>> > [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20
>> > remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
>> >
>> > Can you please check your logs for similar messages?
>> >
>> > Best regards
>> >
>> > Wojciech
>> >
>> > On 22 July 2010 23:43, Andreas Dilger <[email protected]> wrote:
>> >>
>> >> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
>> >> > I have a problem with the Scalable molecular dynamics software NAMD.
>> >> > It writes restart files once in a while, but sometimes the binary
>> >> > write crashes. When it crashes is not constant; the only constant
>> >> > thing is that it happens when it writes on our Lustre file system.
>> >> > When it writes to something else, it is fine. I can't seem to find
>> >> > any errors in any of the /var/log/messages. Has anyone had any
>> >> > problems with NAMD?
>> >>
>> >> Rarely has anyone complained about Lustre not providing error messages
>> >> when there is a problem, so if there is nothing in /var/log/messages on
>> >> either the client or the server then it is hard to know whether it is a
>> >> Lustre problem or not...
>> >>
>> >> If possible, you could try running the application under strace
>> >> (limited to the I/O calls, or it would be much too much data) to see
>> >> which system call the error is coming from.
>> >>
>> >> Cheers, Andreas
>> >> --
>> >> Andreas Dilger
>> >> Lustre Technical Lead
>> >> Oracle Corporation Canada Inc.
>> >>
>> >> _______________________________________________
>> >> Lustre-discuss mailing list
>> >> [email protected]
>> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
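Andreas's strace suggestion is about narrowing the failure down to a specific system call and errno. The same information can also be captured in-process: a minimal sketch (plain Python, nothing NAMD- or Lustre-specific; the path and payload here are made up for illustration) showing where an eviction-induced EIO would typically surface, namely on the write() or fsync() of the restart file:

```python
import errno
import os

def write_restart(path, data):
    """Write a restart file, reporting which syscall failed and with what errno."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as e:
        return "open failed: " + errno.errorcode[e.errno]
    try:
        os.write(fd, data)
        os.fsync(fd)  # force data to the server; a stale lock often surfaces here
    except OSError as e:
        # On a client evicted mid-write, this is where EIO would show up.
        return "write/fsync failed: " + errno.errorcode[e.errno]
    finally:
        os.close(fd)
    return "ok"

print(write_restart("/tmp/restart.coor", b"\x00" * 4096))  # prints "ok" on a healthy filesystem
```

This is the same breakdown strace would give (which call, which errno), just reported by the application itself, which can be handy when attaching strace to a large parallel job is impractical.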
