Hi Mi,

I haven't had a chance to reproduce the issue you're seeing yet. However, the panic seems to point in the direction of this patch. Could you apply the attached patch and let me know if it resolves or changes the panic you're seeing?

For reference, if you need it: from the top-level pvfs2 directory of the source (if checked out from CVS) or the orangefs directory (if using the release), run

    patch -p0 < ~/<path to attached patch>

If applying against the release, the patch will apply with a small offset.
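If it helps, here is the same thing spelled out as a sketch, assuming the attachment is saved as ~/kernel-lock-op-init.patch and the source tree sits in ~/pvfs2 (both paths are just examples; adjust them to your layout):

    cd ~/pvfs2                                          # or the orangefs directory from the release tarball
    patch -p0 --dry-run < ~/kernel-lock-op-init.patch   # confirm the hunks apply before changing anything
    patch -p0 < ~/kernel-lock-op-init.patch             # apply for real; expect a small offset on the release

The --dry-run pass costs nothing and will tell you right away if -p0 or the paths are wrong.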
Let me know how it goes.

Thanks,
Michael

On Wed, Jul 13, 2011 at 11:39 AM, Mi Zhou <[email protected]> wrote:
> Hi Kyle,
>
> Thanks for sharing your experience.
> Yes, we are using openib. I'm not very familiar with GAMESS itself, just
> running it to benchmark our current file system, so I am not sure what
> stage it got to before the panic. I'll have our users look at the logs.
> I'm attaching the log here as well.
>
> Have you ever found a solution or workaround for that?
>
> Thanks,
>
> Mi
>
> On Wed, 2011-07-13 at 09:50 -0500, Kyle Schochenmaier wrote:
> > Hi Mi -
> >
> > What network interface are you using? I saw some similar panics years
> > ago when running GAMESS over bmi-openib where mop_ids were getting
> > reused incorrectly.
> >
> > IIRC GAMESS has a very basic I/O pattern: n simultaneous reads on n
> > files, followed by a final write? I'm assuming you make it through
> > the read/compute process prior to the panic?
> >
> > Kyle Schochenmaier
> >
> > On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
> > > Hi Michael,
> > >
> > > Yes, the panic is reproducible; I got it about 10 minutes after this
> > > application (GAMESS) started. I ran GAMESS on 16 cores on 2 nodes.
> > > It was an MPI program, so I think only the master node is doing the
> > > writing, and hence the panic only occurred on the master node.
> > >
> > > When this happens, the pvfs2-server daemon disappears. Comparing
> > > timestamps in the logs, it looks like the pvfs2-server errored first
> > > and then the client got the "job_time out" problem.
> > >
> > > I am attaching the logs from both server and client and the config
> > > file from the server. We have 3 identical pvfs2-server nodes
> > > (pvfs01-03), but it seems the problem only happens on pvfs02.
> > >
> > > Your advice is greatly appreciated.
> > >
> > > Thanks,
> > >
> > > Mi
> > >
> > > On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
> > > > Hi Mi,
> > > >
> > > > I don't think there have been any applicable commits to
> > > > Orange-Branch since 06/28 that would address this issue. Is the
> > > > panic consistently reproducible? If so, what workload leads to the
> > > > panic? A single client with writes to a single file? I'll look at
> > > > the logs to see if anything stands out; otherwise I may need to
> > > > reproduce the issue locally to track down what's going on.
> > > >
> > > > Thanks for reporting the issue,
> > > > Michael
> > > >
> > > > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > I checked out the code from the CVS branch on 6/28. I don't see
> > > > > an immediate kernel panic any more, but I still got a kernel
> > > > > panic after some intensive writes to the file system (please see
> > > > > the attached screenshot).
> > > > >
> > > > > This is what is in the server log:
> > > > >
> > > > > pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> > > > >
> > > > > And this is the client log:
> > > > >
> > > > > [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
> > > > > [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > > > > [E 16:10:19.371943] Receive immediately failed: Connection refused
> > > > > [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > > > > [E 16:10:21.382888] Receive immediately failed: Connection refused
> > > > >
> > > > > We have pvfs01-03 running the pvfs2-server. Both client and
> > > > > server are on CentOS 5 x86_64, kernel version 2.6.18-238.9.1.el5.
> > > > >
> > > > > Any advice?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Mi
> > > > >
> > > > > On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
> > > > > > That did resolve the problem. Thanks.
> > > > > >
> > > > > > On 7/7/2011 11:19 AM, Michael Moore wrote:
> > > > > > > Hi Ted,
> > > > > > >
> > > > > > > There was a regression when adding support for newer kernels
> > > > > > > that made it into the 2.8.4 release. I believe that's the
> > > > > > > issue you're seeing (a kernel panic immediately on
> > > > > > > modprobe/insmod). The next release will include that fix.
> > > > > > > Until then, if you can check out the latest version of the
> > > > > > > code from CVS, it should resolve the issue. The CVS branch
> > > > > > > is Orange-Branch; full directions for CVS checkout are at
> > > > > > > http://www.orangefs.org/support/
> > > > > > >
> > > > > > > We are currently running the kernel module with the latest
> > > > > > > code on CentOS 5 and SL 6 systems. Let me know how it goes.
> > > > > > >
> > > > > > > For anyone interested, the commit to resolve the issue was:
> > > > > > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
> > > > > > >
> > > > > > > Michael
> > > > > > >
> > > > > > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
> > > > > > > > I have built the kernel module from orangefs-2.8.4 source
> > > > > > > > against a 64-bit 2.6.18-238.12.1 Linux kernel source, and
> > > > > > > > against a 32-bit 2.6.18-238.9.1 source. In both cases, the
> > > > > > > > kernel hung when the module was inserted with insmod. The
> > > > > > > > first did report "kernel: Oops: 0000 [1] SMP". The
> > > > > > > > distributions are Scientific Linux 5.x, which is rpm-based
> > > > > > > > and similar to CentOS.
> > > > > > > >
> > > > > > > > Are there kernels for this scenario for which the build is
> > > > > > > > known to work? The server build and install went fine, but
> > > > > > > > I would like to configure some clients to access orangefs
> > > > > > > > through a mount point.
> > > > > > > >
> > > > > > > > Thanks.
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
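For anyone landing on the older part of the thread above (the insmod hang Ted hit with the stock 2.8.4 module), the fix was to rebuild from the Orange-Branch CVS code. A rough sketch of that checkout-and-rebuild cycle is below; the CVSROOT and module name are placeholders (take the real checkout directions from http://www.orangefs.org/support/), and the configure flag and make targets are the usual PVFS2 ones, so double-check them against the quickstart for your tree:

    # <CVSROOT> is a placeholder -- use the checkout directions from http://www.orangefs.org/support/
    cvs -d <CVSROOT> checkout -r Orange-Branch pvfs2
    cd pvfs2
    ./configure --with-kernel=/lib/modules/$(uname -r)/build   # point at your kernel build tree
    make                 # userspace servers, libraries, and utilities
    make kmod            # builds the pvfs2 kernel module
    make install
    make kmod_install    # installs pvfs2.ko; then depmod -a and modprobe pvfs2 to load it

On CentOS/Scientific Linux the kernel build tree referenced by --with-kernel comes from the matching kernel-devel package.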
[Attachment: kernel-lock-op-init.patch]
