My apologies, that was the wrong version of the patch (notably, it was missing some semicolons). Please try this one.
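To give a rough sense of what the attached patch is aimed at, here is a
minimal sketch (hypothetical names only, not the patch contents) of the
general class of bug a lock/op-init fix addresses: an op's spinlock has to
be initialized before the op becomes visible to other threads, otherwise
the first lock acquisition operates on uninitialized memory and can oops
or hang the kernel.

    /* Hypothetical illustration only -- not the contents of
     * kernel-lock-op-init.patch. */
    #include <linux/spinlock.h>
    #include <linux/list.h>
    #include <linux/slab.h>

    struct example_op {
            spinlock_t lock;        /* protects this op's state */
            struct list_head list;  /* linkage on a shared request queue */
            int state;
    };

    static struct example_op *example_op_alloc(gfp_t gfp)
    {
            struct example_op *op = kzalloc(sizeof(*op), gfp);

            if (!op)
                    return NULL;
            /* Initialize the lock and list linkage here, at allocation
             * time, before any other thread can see the op. */
            spin_lock_init(&op->lock);
            INIT_LIST_HEAD(&op->list);
            return op;
    }

Initializing in the allocator rather than at each use site keeps every op
that escapes the allocator in a consistent state.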
Thanks,
Michael

On Thu, Jul 14, 2011 at 1:22 PM, Michael Moore <[email protected]> wrote:
> Hi Mi,
>
> I haven't had a chance to reproduce the issue you're seeing yet. However,
> the panic seems to point in the direction of this patch. Could you apply
> the attached patch and let me know if it resolves or changes the panic
> you're seeing? For reference, in case you need it:
>
> From the top-level pvfs2 directory of the source (if from CVS) or the
> orangefs directory (if from the release), run
> patch -p0 < ~/<path to attached patch>. If applying against the release,
> it will apply with a small offset. Let me know how it goes.
>
> Thanks,
> Michael
>
> On Wed, Jul 13, 2011 at 11:39 AM, Mi Zhou <[email protected]> wrote:
>> Hi Kyle,
>>
>> Thanks for sharing your experience. Yes, we are using openib. I'm not
>> very familiar with GAMESS itself; I'm just running it to benchmark our
>> current file system, so I am not sure what stage it reached before the
>> panic. I'll have our users look at the logs. I'm attaching the log here
>> as well.
>>
>> Have you ever found a solution or workaround for that?
>>
>> Thanks,
>> Mi
>>
>> On Wed, 2011-07-13 at 09:50 -0500, Kyle Schochenmaier wrote:
>>> Hi Mi -
>>>
>>> What network interface are you using? I saw some similar panics years
>>> ago when running GAMESS over bmi-openib, where mop_ids were getting
>>> reused incorrectly.
>>>
>>> IIRC, GAMESS has a very basic I/O pattern: n simultaneous reads on n
>>> files, followed by a final write? I'm assuming you make it through the
>>> read/compute phase prior to the panic?
>>>
>>> Kyle Schochenmaier
>>>
>>> On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
>>>> Hi Michael,
>>>>
>>>> Yes, the panic is reproducible; I got it about 10 minutes after the
>>>> application (GAMESS) started. I ran GAMESS on 16 cores on 2 nodes. It
>>>> is an MPI program, so I think only the master node does the writing,
>>>> and hence the panic only occurred on the master node.
>>>>
>>>> When this happens, the pvfs2-server daemon disappears. Comparing
>>>> timestamps in the logs, it looks like pvfs2-server errored first and
>>>> the client then got the "job time out" problem.
>>>>
>>>> I am attaching the logs from both server and client and the config
>>>> file from the server. We have 3 identical pvfs2-server nodes
>>>> (pvfs01-03), but the problem seems to happen only on pvfs02.
>>>>
>>>> Your advice is greatly appreciated.
>>>>
>>>> Thanks,
>>>> Mi
>>>>
>>>> On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
>>>>> Hi Mi,
>>>>>
>>>>> I don't think there have been any applicable commits to Orange-Branch
>>>>> since 06/28 that would address this issue. Is the panic consistently
>>>>> reproducible? If so, what workload leads to the panic? A single
>>>>> client with writes to a single file? I'll look at the logs to see if
>>>>> anything stands out; otherwise I may need to reproduce the issue
>>>>> locally to track down what's going on.
>>>>>
>>>>> Thanks for reporting the issue,
>>>>> Michael
>>>>>
>>>>> On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I checked out the code from the CVS branch on 6/28. I don't see an
>>>>>> immediate kernel panic any more, but I still got a kernel panic
>>>>>> after some intensive writes to the file system (please see the
>>>>>> attached screenshot).
>>>>>>
>>>>>> This is what is in the server log:
>>>>>>
>>>>>> pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming:
>>>>>>   mop_id 17bb01c0 in RTS_DONE message not found.
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
>>>>>> pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>>>>>>
>>>>>> And this is the client log:
>>>>>>
>>>>>> [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling
>>>>>>   bmi operation, job_id: 173191.
>>>>>> [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to
>>>>>>   server pvfs02:3337: Connection refused.
>>>>>> [E 16:10:19.371943] Receive immediately failed: Connection refused
>>>>>> [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to
>>>>>>   server pvfs02:3337: Connection refused.
>>>>>> [E 16:10:21.382888] Receive immediately failed: Connection refused
>>>>>>
>>>>>> We have pvfs01-03 running pvfs2-server. Both client and server are
>>>>>> on CentOS 5 x86_64, kernel version 2.6.18-238.9.1.el5.
>>>>>>
>>>>>> Any advice?
>>>>>>
>>>>>> Thanks,
>>>>>> Mi
>>>>>>
>>>>>> On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
>>>>>>> That did resolve the problem. Thanks.
>>>>>>>
>>>>>>> On 7/7/2011 11:19 AM, Michael Moore wrote:
>>>>>>>> Hi Ted,
>>>>>>>>
>>>>>>>> There was a regression when adding support for newer kernels that
>>>>>>>> made it into the 2.8.4 release. I believe that's the issue you're
>>>>>>>> seeing (a kernel panic immediately on modprobe/insmod). The next
>>>>>>>> release will include that fix. Until then, if you can check out
>>>>>>>> the latest version of the code from CVS, it should resolve the
>>>>>>>> issue. The CVS branch is Orange-Branch; full directions for CVS
>>>>>>>> checkout are at http://www.orangefs.org/support/
>>>>>>>>
>>>>>>>> We are currently running the kernel module with the latest code
>>>>>>>> on CentOS 5 and SL 6 systems. Let me know how it goes.
>>>>>>>>
>>>>>>>> For anyone interested, the commit to resolve the issue was:
>>>>>>>> http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
>>>>>>>>> I have built the kernel module from orangefs-2.8.4 source
>>>>>>>>> against a 64-bit 2.6.18-238.12.1 Linux kernel source, and
>>>>>>>>> against a 32-bit 2.6.18-238.9.1 source. In both cases, the
>>>>>>>>> kernel hung when the module was inserted with insmod. The first
>>>>>>>>> did report "kernel: Oops: 0000 [1] SMP". The distributions are
>>>>>>>>> Scientific Linux 5.x, which is rpm-based and similar to CentOS.
>>>>>>>>>
>>>>>>>>> Are there kernels for this scenario for which the build is
>>>>>>>>> known to work? The server build and install went fine, but I
>>>>>>>>> would like to configure some clients to access orangefs through
>>>>>>>>> a mount point.
>>>>>>>>>
>>>>>>>>> Thanks.
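P.S. For anyone following Kyle's mop_id comment above, here is a minimal
standalone sketch (plain C with made-up names, not the actual bmi-openib
code) of why recycling a message-operation id while the peer still holds
the old one leads to exactly the kind of "mop_id ... not found" lookup
failure shown in the server log:

    #include <stdio.h>
    #include <stdint.h>

    #define TABLE_SIZE 8

    struct op {
            uint64_t id;    /* id handed to the peer */
            int in_flight;  /* still referenced by an unfinished exchange */
    };

    static struct op table[TABLE_SIZE];
    static uint64_t next_id = 1;

    /* Buggy variant: always hands out slot 0, even though the previous
     * exchange using that slot has not completed. */
    static uint64_t alloc_id_buggy(void)
    {
            table[0].id = next_id++;
            table[0].in_flight = 1;
            return table[0].id;
    }

    static struct op *lookup(uint64_t id)
    {
            for (int i = 0; i < TABLE_SIZE; i++)
                    if (table[i].in_flight && table[i].id == id)
                            return &table[i];
            return NULL;
    }

    int main(void)
    {
            uint64_t old_id = alloc_id_buggy(); /* peer references old_id */
            (void)alloc_id_buggy();             /* slot recycled too early */

            /* The peer's completion (RTS_DONE) finally arrives for old_id: */
            if (!lookup(old_id))
                    fprintf(stderr, "mop_id %llx in RTS_DONE message not found\n",
                            (unsigned long long)old_id);
            return 0;
    }

The usual fix for this class of bug is to defer marking a slot reusable
until the final completion message for its old id has been consumed.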
kernel-lock-op-init.patch
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
