Re: [Pvfs2-users] PVFS server crashing randomly over InfiniBand

Matthieu Dorier Sun, 27 Mar 2011 05:32:09 -0700

Here is GDB's answer:

(gdb) list *0x46f55a
0x46f55a is in error (src/io/bmi/bmi_ib/util.c:31).
26        va_start(ap, fmt);
27        vsprintf(s, fmt, ap);
28        va_end(ap);
29        gossip_err("Error: %s.\n", s);
30        gossip_backtrace();
31        exit(1);
32    }
33
34    void __attribute__((noreturn,format(printf,1,2))) __hidden
35    error_errno(const char *fmt, ...)
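
The listing suggests that the address from the backtrace sits inside BMI's error() routine, which formats the message, logs it, prints a backtrace and calls exit(1). In other words, the server appears to stop itself once it hits the unknown mop_id rather than segfaulting. The sketch below shows that pattern in standalone C. It is only an illustration, not the PVFS source: format_message and fatal are hypothetical names, fprintf stands in for gossip_err, the backtrace step is left out, and vsnprintf is swapped in for the vsprintf seen in the listing so that a long message cannot overrun the buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: printf-style formatting into a fixed buffer.
 * vsnprintf never writes more than size bytes, NUL included, so an
 * over-long message is truncated instead of overrunning buf. */
int format_message(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0);
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n; /* length the complete message would have had */
}

/* Hypothetical stand-in for error(): log the message, then terminate
 * the process. The real routine also prints a backtrace first. */
void fatal(const char *msg)
{
    fprintf(stderr, "Error: %s.\n", msg);
    exit(1);
}
```

A caller would format the message with format_message and hand the buffer to fatal; any process that reports errors this way exits with status 1 instead of dumping core.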


Matthieu

2011/3/27 Michael Moore <[email protected]>

> Hi Matthieu,
>
> If you could print the source code line associated with that crash address,
> that would help get us started. Something like:
> gdb <path to pvfs2-server binary>
> list *0x46f55a
>
> Then with that and the info from Kyle we can work on getting it resolved.
>
> As a side note, if you have the opportunity you should upgrade your
> installation to 2.8.3 (under the name OrangeFS at orangefs.org), which has
> additional functionality and bug fixes, although I don't believe any of the
> fixes are applicable to this issue.
>
> Michael
>
>
> On Sat, Mar 26, 2011 at 5:35 PM, Kyle Schochenmaier <[email protected]> wrote:
>
>> Hi Matthieu -
>>
>> The last time I worked on this we ran into this problem, and I think we
>> narrowed it down to a mopid reuse issue. We tried to insert some thread
>> locking mechanisms into the mopid 'cache', but I don't think it ever got
>> resolved. This was years ago and only occurred under very heavy load of
>> relatively small messages.
>>
>> That would be the place to start, I would imagine.
>>
>> Cheers,
>> Kyle Schochenmaier
>>
>>
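
Kyle's description above (an ID being reused while an older message still refers to it, with locking around the mopid 'cache') can be illustrated with a small standalone sketch. Nothing here is taken from the PVFS source: op_table, op_register and op_take are hypothetical names, and the generation counter is just one common way to make a recycled slot distinguishable from its previous use, not necessarily what the BMI InfiniBand code does or needs. The sketch keeps lookup and removal inside a single locked section so that two threads cannot both claim the same entry.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define OP_SLOTS 64

/* Hypothetical table of in-flight operations (not the PVFS structure). */
struct op_table {
    pthread_mutex_t lock;
    uint32_t generation[OP_SLOTS]; /* bumped each time a slot is released */
    void *payload[OP_SLOTS];       /* NULL means the slot is free */
};

void op_table_init(struct op_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < OP_SLOTS; i++) {
        t->generation[i] = 0;
        t->payload[i] = NULL;
    }
}

/* An ID packs the slot index (low 32 bits) and its generation (high 32 bits). */
static uint64_t make_id(size_t slot, uint32_t gen)
{
    return ((uint64_t)gen << 32) | (uint64_t)slot;
}

/* Store payload in a free slot; returns 0 and sets *id_out, or -1 if full. */
int op_register(struct op_table *t, void *payload, uint64_t *id_out)
{
    int rc = -1;

    assert(payload != NULL && id_out != NULL);
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < OP_SLOTS; i++) {
        if (t->payload[i] == NULL) {
            t->payload[i] = payload;
            *id_out = make_id(i, t->generation[i]);
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}

/* Look up and remove in one locked step; NULL for unknown or stale IDs. */
void *op_take(struct op_table *t, uint64_t id)
{
    size_t slot = (size_t)(id & 0xffffffffu);
    uint32_t gen = (uint32_t)(id >> 32);
    void *payload = NULL;

    if (slot >= OP_SLOTS)
        return NULL;
    pthread_mutex_lock(&t->lock);
    if (t->payload[slot] != NULL && t->generation[slot] == gen) {
        payload = t->payload[slot];
        t->payload[slot] = NULL;
        t->generation[slot]++; /* any remaining copy of the old ID is now stale */
    }
    pthread_mutex_unlock(&t->lock);
    return payload;
}
```

With this layout a stale ID fails the lookup cleanly instead of matching whatever operation now occupies the slot, which is the situation a "not found" error is meant to catch.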
>> On Sat, Mar 26, 2011 at 4:21 PM, Matthieu Dorier <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to evaluate the performance of my PVFS installation over an
>>> InfiniBand network, but from time to time a server crashes with this trace
>>> in the log:
>>>
>>> [E 03/26 21:58] Error: encourage_recv_incoming: mop_id 12952a0 in RTS_DONE message not found.
>>> [E 03/26 21:58]     [bt] /usr/sbin/pvfs2-server(error+0xca) [0x46f55a]
>>> [E 03/26 21:58]     [bt] /usr/sbin/pvfs2-server [0x46c88c]
>>> [E 03/26 21:58]     [bt] /usr/sbin/pvfs2-server [0x46e485]
>>> [E 03/26 21:58]     [bt] /usr/sbin/pvfs2-server(BMI_testunexpected+0x384) [0x421004]
>>> [E 03/26 21:58]     [bt] /usr/sbin/pvfs2-server [0x41cf4a]
>>> [E 03/26 21:58]     [bt] /lib/libpthread.so.0 [0x7f6422ff0fc7]
>>> [E 03/26 21:58]     [bt] /lib/libc.so.6(clone+0x6d) [0x7f642295164d]
>>>
>>> I've seen in the mailing list archives that other users have reported this
>>> kind of error, but I didn't find an answer that solves the problem. Any
>>> idea how to solve it?
>>>
>>> If it can be of any use: I'm working with 16 PVFS servers (each acting as
>>> both I/O server and metadata server), and I'm benchmarking with the IOR
>>> program. For now I have 648 processes, each writing 8MB to a shared file,
>>> with a transfer size that corresponds to the strip size (64KB).
>>>
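
For a sense of scale, the numbers in the paragraph above can be worked through in a few lines of C. The sketch assumes binary units (8MB = 8 MiB, 64KB = 64 KiB), one request per transfer, and strips spread evenly over all 16 servers. None of these is stated in the report, and the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of the benchmark run. */
struct workload {
    uint64_t processes;         /* writers, 648 in the report */
    uint64_t bytes_per_process; /* 8MB each, assumed to mean 8 MiB */
    uint64_t transfer_size;     /* 64KB, assumed to mean 64 KiB */
    uint64_t servers;           /* 16 PVFS servers */
};

uint64_t transfers_per_process(const struct workload *w)
{
    assert(w->transfer_size > 0);
    return w->bytes_per_process / w->transfer_size;
}

uint64_t total_transfers(const struct workload *w)
{
    return w->processes * transfers_per_process(w);
}

uint64_t total_bytes(const struct workload *w)
{
    return w->processes * w->bytes_per_process;
}

/* Assumes every server receives the same share of the transfers. */
uint64_t transfers_per_server(const struct workload *w)
{
    assert(w->servers > 0);
    return total_transfers(w) / w->servers;
}

void print_summary(const struct workload *w)
{
    printf("transfers per process: %llu\n",
           (unsigned long long)transfers_per_process(w));
    printf("total transfers:       %llu\n",
           (unsigned long long)total_transfers(w));
    printf("transfers per server:  %llu\n",
           (unsigned long long)transfers_per_server(w));
    printf("total data (MiB):      %llu\n",
           (unsigned long long)(total_bytes(w) >> 20));
}
```

With the reported values this gives 128 transfers per process, 82944 transfers in total and, under the even-spread assumption, 5184 per server, for 5184 MiB of data overall.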
>>> Thank you,
>>>
>>> Matthieu
>>>
>>> --
>>> Matthieu Dorier
>>> ENS Cachan, Brittany (Computer Science dpt.)
>>> IRISA Rennes, Office E324
>>> http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/
>>>
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>>
>>
>


-- 
Matthieu Dorier
ENS Cachan, Brittany (Computer Science dpt.)
IRISA Rennes, Office E324
http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/
