Hi Mi - What network interface are you using? I saw some similar panics years ago when running GAMESS over bmi-openib, where mop_ids were getting reused incorrectly.

IIRC, GAMESS has a very basic I/O pattern: n simultaneous reads on n files, followed by a final write? I'm assuming you make it through the read/compute process prior to the panic?
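For what it's worth, something like this minimal MPI-IO sketch is the pattern I have in mind (not GAMESS's actual code; the mount point, file names, and sizes are just placeholders):

/* Sketch of the pattern described above: n ranks each read their own
 * file concurrently, then the master rank does the final write.
 * Paths and sizes are placeholder assumptions, not GAMESS's own. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    int len = 1 << 20;                  /* 1 MiB per op, arbitrary */
    char path[64];
    char *buf = malloc(len);
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Phase 1: n simultaneous reads, one private file per rank. */
    snprintf(path, sizeof(path), "/mnt/pvfs2/scratch.%d", rank);
    MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_read(fh, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* ... compute phase ... */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2: final write from the master rank only. */
    if (rank == 0) {
        MPI_File_open(MPI_COMM_SELF, "/mnt/pvfs2/result.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write(fh, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

If only rank 0 is writing, that would fit with the panic showing up on the master node alone.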
Kyle Schochenmaier

On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
> Hi Michael,
>
> Yes, the panic is reproducible; I got it about 10 minutes after this
> application (GAMESS) started. I ran GAMESS on 16 cores on 2 nodes. It
> was an MPI program, so I think only the master node is doing the writing,
> and hence the panic only occurred on the master node.
>
> When this happens, the pvfs2-server daemon disappears. Comparing
> timestamps in the logs, it looks like the pvfs2-server errored first and
> then the client got the "job time out" problem.
>
> I am attaching the logs from both server and client and the config file
> on the server. We have 3 identical pvfs2-server nodes (pvfs01-03), but
> it seems the problem only happens on pvfs02.
>
> Your advice is greatly appreciated.
>
> Thanks,
>
> Mi
>
> On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
> > Hi Mi,
> >
> > I don't think there have been any applicable commits to Orange-Branch
> > since 06/28 that would address this issue. Is the panic consistently
> > reproducible? If so, what workload leads to the panic? A single client
> > with writes to a single file? I'll look at the logs to see if anything
> > stands out; otherwise I may need to reproduce the issue locally to
> > track down what's going on.
> >
> > Thanks for reporting the issue,
> > Michael
> >
> > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
> > Hi,
> >
> > I checked out the code from the CVS branch on 6/28. I don't see an
> > immediate kernel panic any more, but I still got a kernel panic after
> > some intensive writes to the file system (please see the attached
> > screenshot).
> >
> > This is what is in the server log:
> >
> > pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> >
> > And this is the client log:
> >
> > [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
> > [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > [E 16:10:19.371943] Receive immediately failed: Connection refused
> > [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > [E 16:10:21.382888] Receive immediately failed: Connection refused
> >
> > We have pvfs01-03 running the pvfs2-server. Both client and server are
> > on CentOS 5 x86_64, kernel version 2.6.18-238.9.1.el5.
> >
> > Any advice?
> >
> > Thanks,
> >
> > Mi
> >
> > On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
> > > That did resolve the problem. Thanks.
> > >
> > > On 7/7/2011 11:19 AM, Michael Moore wrote:
> > > > Hi Ted,
> > > >
> > > > There was a regression when adding support for newer kernels that
> > > > made it into the 2.8.4 release. I believe that's the issue you're
> > > > seeing (a kernel panic immediately on modprobe/insmod). The next
> > > > release will include that fix. Until then, if you can check out the
> > > > latest version of the code from CVS, it should resolve the issue.
> > > > The CVS branch is Orange-Branch; full directions for CVS checkout
> > > > are at http://www.orangefs.org/support/
> > > >
> > > > We are currently running the kernel module with the latest code on
> > > > CentOS 5 and SL 6 systems. Let me know how it goes.
> > > >
> > > > For anyone interested, the commit to resolve the issue was:
> > > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
> > > >
> > > > Michael
> > > >
> > > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
> > > >> I have built the kernel module from orangefs-2.8.4 source against
> > > >> a 64-bit 2.6.18-238.12.1 Linux kernel source, and against a 32-bit
> > > >> 2.6.18-238.9.1 source. In both cases, the kernel hung when the
> > > >> module was inserted with insmod. The first did report "kernel:
> > > >> Oops: 0000 [1] SMP". The distributions are Scientific Linux 5.x,
> > > >> which is rpm-based and similar to CentOS.
> > > >>
> > > >> Are there kernels for this scenario for which the build is known
> > > >> to work? The server build and install went fine, but I would like
> > > >> to configure some clients to access orangefs through a mount point.
> > > >>
> > > >> Thanks.
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
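P.S. If it helps with local reproduction, a single-client intensive-write loop along these lines is roughly what I'd try first. This is just a sketch; the mount point and sizes are placeholder assumptions, not a confirmed reproducer for the panic:

/* Sketch of a single-client intensive-write workload against a PVFS2
 * kernel-module mount. Mount point and sizes are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   (4 * 1024 * 1024)   /* 4 MiB per write, arbitrary */
#define NBLOCKS 2048                /* ~8 GiB total, arbitrary */

int main(void)
{
    const char *path = "/mnt/pvfs2/write_test.dat";  /* hypothetical mount */
    char *buf = malloc(BLOCK);
    int fd, i;

    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xab, BLOCK);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Sustained sequential writes to a single file from one client. */
    for (i = 0; i < NBLOCKS; i++) {
        if (write(fd, buf, BLOCK) != BLOCK) {
            perror("write");
            return 1;
        }
    }
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    free(buf);
    return 0;
}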
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
