Thanks! It may take me a while to track down where to go from here, but I'll let you know what I find. Any input from others who have seen this issue would be great.
Michael

On Sun, Mar 27, 2011 at 8:28 AM, Matthieu Dorier <[email protected]> wrote:

> Here is GDB's answer:
>
> (gdb) list *0x46f55a
> 0x46f55a is in error (src/io/bmi/bmi_ib/util.c:31).
> 26          va_start(ap, fmt);
> 27          vsprintf(s, fmt, ap);
> 28          va_end(ap);
> 29          gossip_err("Error: %s.\n", s);
> 30          gossip_backtrace();
> 31          exit(1);
> 32      }
> 33
> 34      void __attribute__((noreturn,format(printf,1,2))) __hidden
> 35      error_errno(const char *fmt, ...)
>
> Matthieu
>
>
> 2011/3/27 Michael Moore <[email protected]>
>
>> Hi Matthieu,
>>
>> If you could print the source code line associated with that crash address,
>> that will help get us started. Something like:
>> gdb <path to pvfs2-server binary>
>> list *0x46f55a
>>
>> Then with that and the info from Kyle we can work on getting it resolved.
>>
>> As a side note, if you have the opportunity you should upgrade your
>> installation to 2.8.3 (under the name OrangeFS at orangefs.org), which has
>> additional functionality and bug fixes, although I don't believe any of the
>> fixes are applicable to this issue.
>>
>> Michael
>>
>>
>> On Sat, Mar 26, 2011 at 5:35 PM, Kyle Schochenmaier
>> <[email protected]> wrote:
>>
>>> Hi Matthieu -
>>>
>>> The last time I worked on this we ran into this problem, and I think we
>>> narrowed it down to a mopid reuse issue. We tried to insert some thread
>>> locking mechanisms into the mopid 'cache', but I don't think it ever got
>>> resolved. This was years ago and only occurred under very heavy load of
>>> relatively small messages.
>>>
>>> That would be the place to start, I would imagine.
>>>
>>> Cheers,
>>> Kyle Schochenmaier
>>>
>>>
>>> On Sat, Mar 26, 2011 at 4:21 PM, Matthieu Dorier <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to evaluate the performance of my PVFS installation over an
>>>> InfiniBand network, but from time to time a server crashes with this trace
>>>> in the log:
>>>>
>>>> [E 03/26 21:58] Error: encourage_recv_incoming: mop_id 12952a0 in RTS_DONE message not found.
>>>> [E 03/26 21:58]         [bt] /usr/sbin/pvfs2-server(error+0xca) [0x46f55a]
>>>> [E 03/26 21:58]         [bt] /usr/sbin/pvfs2-server [0x46c88c]
>>>> [E 03/26 21:58]         [bt] /usr/sbin/pvfs2-server [0x46e485]
>>>> [E 03/26 21:58]         [bt] /usr/sbin/pvfs2-server(BMI_testunexpected+0x384) [0x421004]
>>>> [E 03/26 21:58]         [bt] /usr/sbin/pvfs2-server [0x41cf4a]
>>>> [E 03/26 21:58]         [bt] /lib/libpthread.so.0 [0x7f6422ff0fc7]
>>>> [E 03/26 21:58]         [bt] /lib/libc.so.6(clone+0x6d) [0x7f642295164d]
>>>>
>>>> I've seen that some other users reported this kind of error in some
>>>> archives of the mailing list, but I didn't find any answer that solves the
>>>> problem. Any idea how to solve this problem?
>>>>
>>>> If it can be of any use: I'm working with 16 PVFS servers (IO server and
>>>> metadata server at the same time), and I'm benchmarking with the IOR
>>>> program. For now I have 648 processes writing 8MB each in a shared file
>>>> with a transfer size that corresponds to the strip size (64KB).
>>>>
>>>> Thank you,
>>>>
>>>> Matthieu
>>>>
>>>> --
>>>> Matthieu Dorier
>>>> ENS Cachan, Brittany (Computer Science dpt.)
>>>> IRISA Rennes, Office E324
>>>> http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> [email protected]
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>
>
> --
> Matthieu Dorier
> ENS Cachan, Brittany (Computer Science dpt.)
> IRISA Rennes, Office E324
> http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/
>
