Mi,

There are a couple of issues here: one is the kernel panic and the other is the server segfault. Based on the logs, the clocks on the two hosts appear to be out of sync. Have you observed, or can you infer, the pattern of failure (e.g., server crash -> client I/O failure -> kernel panic)? I'm still looking at the kernel panic, trying to better understand which code paths are involved.
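In case it helps with lining the two logs up before inferring the ordering, here is a quick way to measure the skew. This is only a sketch: it assumes passwordless ssh from the client node to pvfs02, and the log paths are placeholders for wherever yours actually live.

    # Rough clock offset between the client and pvfs02 (ssh latency pads
    # the number slightly, which is fine for skews of a second or more).
    ssh pvfs02 'date +%s.%N'; date +%s.%N

    # First error line on each side, to compare after correcting for the
    # measured offset.
    grep -m1 '^\[E' /path/to/pvfs2-client.log
    ssh pvfs02 "grep -m1 '^\[E' /path/to/pvfs2-server.log"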
Thanks,
Michael

On Mon, Jul 18, 2011 at 10:57 AM, Mi Zhou <[email protected]> wrote:
> Michael,
>
> Thanks for looking into this.
>
> I applied the patch and recompiled the kernel module, but still got a kernel panic after a while. The errors on the client and server seem to be different than before, though:
>
> Client log:
>
> [E 23:48:00.514201] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
> [D 23:48:00.535157] [INFO]: Mapping pointer 0x2ad023ca8000 for I/O.
> [D 23:48:00.536534] [INFO]: Mapping pointer 0x3756000 for I/O.
> [E 23:55:53.162291] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 107661.
> [E 23:55:53.162340] fp_multiqueue_cancel: flow proto cancel called on 0x3bf4d78
> [E 23:55:53.162349] fp_multiqueue_cancel: I/O error occurred
> [E 23:55:53.162358] handle_io_error: flow proto error cleanup started on 0x3bf4d78: Operation cancelled (possibly due to timeout)
> [E 23:55:53.162400] handle_io_error: flow proto 0x3bf4d78 canceled 1 operations, will clean up.
> [E 23:55:53.167283] mem_to_bmi_callback_fn: I/O error occurred
> [E 23:55:53.167293] handle_io_error: flow proto 0x3bf4d78 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 23:55:55.201498] Child process with pid 17377 was killed by an uncaught signal 6
> [E 23:55:55.203703] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
>
> Server log:
>
> [E 07/18/2011 01:41:23] Error: ib_check_cq: unknown send state (unknown) (0) of sq 0x25be0e0.
> [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x447180]
> [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> [E 07/18/2011 01:41:23] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> [E 07/18/2011 01:41:23] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>
> Thanks,
>
> Mi
>
> On Thu, 2011-07-14 at 12:53 -0500, Michael Moore wrote:
> > My apologies, that was the wrong version of the patch (notably missing ;s). Try this one, please.
> >
> > Thanks,
> > Michael
> >
> > On Thu, Jul 14, 2011 at 1:22 PM, Michael Moore <[email protected]> wrote:
> > Hi Mi,
> >
> > I haven't had a chance to reproduce the issue you're seeing yet. However, the panic seems to point in the direction of this patch. Could you apply the attached patch and let me know if it resolves or changes the panic you're seeing? For reference, if you need it:
> >
> > From the top-level pvfs2 directory of the source (if from CVS) or the orangefs directory (if the release), run patch -p0 < ~/<path to attached patch>. If applying against the release, it will apply with a little bit of an offset. Let me know how it goes.
> >
> > Thanks,
> > Michael
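(A quick aside for anyone following along: the patch steps above boil down to the following. A sketch only; the patch filename is a placeholder for wherever you saved the attachment.)

    # From the top of a CVS checkout; for the 2.8.4 release, use the
    # top-level orangefs directory instead.
    cd pvfs2
    patch -p0 < ~/<path to attached patch>
    # Against the release the hunks apply with a small offset; that is
    # expected and harmless.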
> > On Wed, Jul 13, 2011 at 11:39 AM, Mi Zhou <[email protected]> wrote:
> > Hi Kyle,
> >
> > Thanks for sharing your experience. Yes, we are using openib. I'm not very familiar with GAMESS itself; I'm just running it to benchmark against our current file system, so I am not sure what stage it reached before the panic. I'll have our users look at the logs. I'm attaching the log here as well.
> >
> > Have you ever found a solution or workaround for that?
> >
> > Thanks,
> > Mi
> >
> > On Wed, 2011-07-13 at 09:50 -0500, Kyle Schochenmaier wrote:
> > > Hi Mi -
> > >
> > > What network interface are you using? I saw some similar panics years ago when running GAMESS over bmi-openib, where mop_ids were getting reused incorrectly.
> > >
> > > IIRC GAMESS has a very basic I/O pattern: n simultaneous reads on n files, followed by a final write? I'm assuming you make it through the read/compute process prior to the panic?
> > >
> > > Kyle Schochenmaier
> > >
> > > On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
> > > Hi Michael,
> > >
> > > Yes, the panic is reproducible; I got it about 10 minutes after this application (GAMESS) started. I ran GAMESS on 16 cores on 2 nodes. It is an MPI program, so I think only the master node is doing the writing, and hence the panic only occurred on the master node.
> > >
> > > When this happens, the pvfs2-server daemon disappears. Comparing timestamps in the logs, it looks like pvfs2-server errored first and then the client got the "job time out" problem.
> > >
> > > I am attaching the logs on both server and client, and the config file on the server. We have 3 identical pvfs2-server nodes (pvfs01-03), but the problem seems to happen only on pvfs02.
> > >
> > > Your advice is greatly appreciated.
> > >
> > > Thanks,
> > > Mi
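One thing to add while we look at those logs: since pvfs2-server is the first thing to die, a core file from pvfs02 would help pin down the segfault. Roughly, and only as a sketch, start the server with core dumps enabled, using whatever config file you normally pass it:

    # On pvfs02, in the shell that starts the server:
    ulimit -c unlimited
    cat /proc/sys/kernel/core_pattern      # shows where a core will land
    /opt/pvfs/pvfs/sbin/pvfs2-server <your server config file>

    # After the next crash, load the core and run 'bt' at the gdb prompt:
    gdb /opt/pvfs/pvfs/sbin/pvfs2-server <core file>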
> > > On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
> > > > Hi Mi,
> > > >
> > > > I don't think there have been any applicable commits to Orange-Branch since 06/28 that would address this issue. Is the panic consistently reproducible? If so, what workload leads to the panic? A single client with writes to a single file? I'll look at the logs to see if anything stands out; otherwise I may need to reproduce the issue locally to track down what's going on.
> > > >
> > > > Thanks for reporting the issue,
> > > > Michael
> > > >
> > > > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I checked out the code from the CVS branch on 6/28. I don't see an immediate kernel panic any more, but I still got a kernel panic after some intensive writes to the file system (please see the attached screenshot).
> > > >
> > > > This is what is in the server log:
> > > > pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> > > >
> > > > And this is the client log:
> > > >
> > > > [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
> > > > [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > > > [E 16:10:19.371943] Receive immediately failed: Connection refused
> > > > [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > > > [E 16:10:21.382888] Receive immediately failed: Connection refused
> > > >
> > > > We have pvfs01-03 running the pvfs2-server. Both client and server are on CentOS 5 x86_64, kernel version 2.6.18-238.9.1.el5.
> > > >
> > > > Any advice?
> > > >
> > > > Thanks,
> > > > Mi
> > > >
> > > > On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
> > > > > That did resolve the problem. Thanks.
> > > > >
> > > > > On 7/7/2011 11:19 AM, Michael Moore wrote:
> > > > > > Hi Ted,
> > > > > >
> > > > > > There was a regression when adding support for newer kernels that made it into the 2.8.4 release. I believe that's the issue you're seeing (a kernel panic immediately on modprobe/insmod). The next release will include that fix. Until then, if you can check out the latest version of the code from CVS, it should resolve the issue. The CVS branch is Orange-Branch; full directions for CVS checkout are at http://www.orangefs.org/support/
> > > > > >
> > > > > > We are currently running the kernel module with the latest code on CentOS 5 and SL 6 systems. Let me know how it goes.
> > > > > >
> > > > > > For anyone interested, the commit to resolve the issue was:
> > > > > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
> > > > > >
> > > > > > Michael
> > > > > >
> > > > > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
> > > > > >> I have built the kernel module from orangefs-2.8.4 source against a 64-bit 2.6.18-238.12.1 Linux kernel source, and against a 32-bit 2.6.18-238.9.1 source. In both cases, the kernel hung when the module was inserted with insmod. The first did report "kernel: Oops: 0000 [1] SMP". The distributions are Scientific Linux 5.x, which is rpm-based and similar to CentOS.
> > > > > >>
> > > > > >> Are there kernels for this scenario for which the build is known to work? The server build and install went fine, but I would like to configure some clients to access orangefs through a mount point.
> > > > > >>
> > > > > >> Thanks.
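(For reference on Ted's mount-point question, the client-side steps on 2.8.x look roughly like the following. This is from memory, so double-check against the directions on orangefs.org/support; the kernel source path, module path, install prefix, and mount URI are placeholders for your setup.)

    # Build and install the kernel module alongside the usual build.
    ./configure --with-kernel=<path to configured kernel source tree>
    make && make install
    make kmod && make kmod_install

    # Load the module, start the client daemon, then mount.
    insmod <path to built pvfs2.ko>
    /opt/pvfs/pvfs/sbin/pvfs2-client
    mount -t pvfs2 tcp://<server>:<port>/<fs name> /mnt/pvfs2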
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
