Hi Mi -

What network interface are you using?
I saw some similar panics years ago when running GAMESS over bmi-openib,
where mop_ids were getting reused incorrectly.

IIRC GAMESS has a very basic I/O pattern: n simultaneous reads on n files,
followed by a final write?
I'm assuming you make it through the read/compute process prior to the
panic?
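
If it helps to reproduce, the pattern I have in mind looks roughly like the
sketch below (a hypothetical stand-in, not actual GAMESS code; the mount
point, file names, counts, and sizes are all made up):

    /* n simultaneous reads on n files, followed by a single final write. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NFILES 16
    #define CHUNK  (1 << 20)            /* 1 MiB per I/O request */

    static void *reader(void *arg)      /* one thread per input file */
    {
        char path[64], *buf = malloc(CHUNK);
        snprintf(path, sizeof(path), "/mnt/pvfs2/in.%ld", (long)arg);
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            while (read(fd, buf, CHUNK) > 0)
                ;                       /* drain the whole file */
            close(fd);
        }
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NFILES];
        for (long i = 0; i < NFILES; i++)
            pthread_create(&t[i], NULL, reader, (void *)i);
        for (int i = 0; i < NFILES; i++)
            pthread_join(t[i], NULL);   /* all reads finish first */

        /* final write, issued only after the read phase completes */
        char *buf = malloc(CHUNK);
        memset(buf, 'x', CHUNK);
        int fd = open("/mnt/pvfs2/out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (int i = 0; i < 256 && fd >= 0; i++)   /* ~256 MiB total */
            write(fd, buf, CHUNK);
        if (fd >= 0) close(fd);
        free(buf);
        return 0;
    }

Running something like that against the pvfs2 mount while tailing the server
log might show whether the read phase alone triggers the mop_id error.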

Kyle Schochenmaier


On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:

> Hi Michael,
>
> Yes, the panic is reproducible; I got it about 10 minutes after this
> application (GAMESS) started. I ran GAMESS on 16 cores across 2 nodes. It
> is an MPI program, so I think only the master node does the writing,
> hence the panic occurred only on the master node.
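>
> To illustrate the shape I mean (a hypothetical sketch, not the actual
> GAMESS code; the output path is made up):
>
>     /* All ranks compute; only rank 0 (the master) writes the result. */
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         long local = rank + 1;      /* stand-in for the compute phase */
>         long total = 0;
>         MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0,
>                    MPI_COMM_WORLD);
>
>         if (rank == 0) {            /* master rank does all the writing */
>             FILE *f = fopen("/mnt/pvfs2/result.dat", "w");
>             if (f) {
>                 fprintf(f, "total = %ld\n", total);
>                 fclose(f);
>             }
>         }
>         MPI_Finalize();
>         return 0;
>     }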
>
> When this happens, the pvfs2-server daemon disappears. Comparing
> timestamps in the logs, it looks like pvfs2-server errored first and the
> client then got the "job time out" problem.
>
> I am attaching the logs from both the server and the client, plus the
> config file from the server. We have 3 identical pvfs2-server nodes
> (pvfs01-03), but it seems the problem only happens on pvfs02.
>
> Your advice is greatly appreciated.
>
> Thanks,
>
> Mi
> On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
> > Hi Mi,
> >
> > I don't think there have been any applicable commits since 06/28 to
> > Orange-Branch that would address this issue. Is the panic consistently
> > reproducible? If so, what workload leads to the panic? Single client
> > with writes to a single file? I'll look at the logs to see if anything
> > stands out; otherwise I may need to reproduce the issue locally to
> > track down what's going on.
> >
> > Thanks for reporting the issue,
> > Michael
> >
> > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
> >         Hi,
> >
> >         I checked out the code from the CVS branch on 6/28. I no
> >         longer see an immediate kernel panic, but I still got a kernel
> >         panic after some intensive writes to the file system (please
> >         see the attached screenshot).
> >
> >         This is what is in the server log:
> >         pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> >         pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> >
> >
> >
> >         And this is the client log:
> >
> >         [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
> >         [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> >         [E 16:10:19.371943] Receive immediately failed: Connection refused
> >         [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> >         [E 16:10:21.382888] Receive immediately failed: Connection refused
> >
> >         We have pvfs01-03 running pvfs2-server. Both the client and the
> >         servers are on CentOS 5 x86_64, kernel version 2.6.18-238.9.1.el5.
> >
> >         Any advice?
> >
> >         Thanks,
> >
> >         Mi
> >
> >
> >
> >         On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
> >         > That did resolve the problem. Thanks.
> >         >
> >         > On 7/7/2011 11:19 AM, Michael Moore wrote:
> >         > > Hi Ted,
> >         > >
> >         > > There was a regression when adding support for newer kernels that made
> >         > > it into the 2.8.4 release. I believe that's the issue you're seeing (a
> >         > > kernel panic immediately on modprobe/insmod). The next release will
> >         > > include that fix. Until then, if you can check out the latest version
> >         > > of the code from CVS, it should resolve the issue. The CVS branch is
> >         > > Orange-Branch; full directions for CVS checkout are at
> >         > > http://www.orangefs.org/support/
> >         > >
> >         > > We are currently running the kernel module with the latest code on
> >         > > CentOS 5 and SL 6 systems. Let me know how it goes.
> >         > >
> >         > > For anyone interested, the commit to resolve the issue was:
> >         > >
> >         > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
> >         > > Michael
> >         > >
> >         > >
> >         > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
> >         > >
> >         > >> I have built the kernel module from orangefs-2.8.4 source against a
> >         > >> 64-bit 2.6.18-238.12.1 Linux kernel source, and against a 32-bit
> >         > >> 2.6.18-238.9.1 source. In both cases, the kernel hung when the module
> >         > >> was inserted with insmod. The first did report "kernel: Oops: 0000 [1]
> >         > >> SMP". The distributions are Scientific Linux 5.x, which is RPM-based
> >         > >> and similar to CentOS.
> >         > >>
> >         > >> Are there kernels for this scenario for which the build is known to
> >         > >> work? The server build and install went fine, but I would like to
> >         > >> configure some clients to access orangefs through a mount point.
> >         > >>
> >         > >> Thanks.
> >         > >>
> >         > >
> >
> >
> --
>
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
