Michael,

I reran the program yesterday afternoon and killed it when I started to
see errors in the application log. Neither the server nor the client had
crashed at that point. When I looked at it this morning, I saw the kernel
panic on the client. The server is still running. Here is the client log.
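For what it's worth, one way to quantify the clock skew Michael mentions
below, without touching either clock (the NTP host name is a placeholder):

    # offset of the local clock against an NTP server, query only
    ntpdate -q ntp.example.org
    # quick side-by-side of the two hosts' wall clocks
    date; ssh pvfs02 date

Running both on the client and on pvfs02 should show how far apart the
timestamps in the two logs really are.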
Thanks,
Mi

On Mon, 2011-07-18 at 10:08 -0500, Michael Moore wrote:
> Mi,
>
> There are a couple of issues here: one is the kernel panic and the
> other is the server segfault. The clocks between the two hosts appear
> to be out of sync based on the logs. Can you infer, or have you
> observed, what the pattern of failure is (e.g. server crash -> client
> I/O failure -> kernel panic)? I'm still looking at the kernel panic,
> just trying to better understand which code paths are involved.
>
> Thanks,
> Michael
>
> On Mon, Jul 18, 2011 at 10:57 AM, Mi Zhou <[email protected]> wrote:
> > Michael,
> >
> > Thanks for looking into this.
> >
> > I applied the patch and recompiled the kernel module, but still got
> > a kernel panic after a while. The errors on the client and server
> > seem to be different than before, though:
> >
> > Client log:
> >
> > [E 23:48:00.514201] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
> > [D 23:48:00.535157] [INFO]: Mapping pointer 0x2ad023ca8000 for I/O.
> > [D 23:48:00.536534] [INFO]: Mapping pointer 0x3756000 for I/O.
> > [E 23:55:53.162291] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 107661.
> > [E 23:55:53.162340] fp_multiqueue_cancel: flow proto cancel called on 0x3bf4d78
> > [E 23:55:53.162349] fp_multiqueue_cancel: I/O error occurred
> > [E 23:55:53.162358] handle_io_error: flow proto error cleanup started on 0x3bf4d78: Operation cancelled (possibly due to timeout)
> > [E 23:55:53.162400] handle_io_error: flow proto 0x3bf4d78 canceled 1 operations, will clean up.
> > [E 23:55:53.167283] mem_to_bmi_callback_fn: I/O error occurred
> > [E 23:55:53.167293] handle_io_error: flow proto 0x3bf4d78 error cleanup finished: Operation cancelled (possibly due to timeout)
> > [E 23:55:55.201498] Child process with pid 17377 was killed by an uncaught signal 6
> > [E 23:55:55.203703] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
> >
> > Server log:
> >
> > [E 07/18/2011 01:41:23] Error: ib_check_cq: unknown send state (unknown) (0) of sq 0x25be0e0.
> > [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> > [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x447180]
> > [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> > [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> > [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> > [E 07/18/2011 01:41:23] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > [E 07/18/2011 01:41:23] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> >
> > Thanks,
> > Mi
> >
> > On Thu, 2011-07-14 at 12:53 -0500, Michael Moore wrote:
> > > My apologies, that was the wrong version of the patch (notably
> > > missing ;s). Try this one, please.
> > >
> > > Thanks,
> > > Michael
> > >
> > > On Thu, Jul 14, 2011 at 1:22 PM, Michael Moore <[email protected]> wrote:
> > > > Hi Mi,
> > > >
> > > > I haven't had a chance to reproduce the issue you're seeing yet.
> > > > However, the panic seems to point in the direction of this
> > > > patch. Could you apply the attached patch and let me know if it
> > > > resolves or changes the panic you're seeing? For reference, if
> > > > you need it:
> > > >
> > > > From the top-level pvfs2 directory of the source (if from CVS)
> > > > or the orangefs directory (if the release), run
> > > > patch -p0 < ~/<path to attached patch>. If applying against the
> > > > release, it will apply with a little bit of an offset. Let me
> > > > know how it goes.
> > > > Thanks,
> > > > Michael
> > > >
> > > > On Wed, Jul 13, 2011 at 11:39 AM, Mi Zhou <[email protected]> wrote:
> > > > > Hi Kyle,
> > > > >
> > > > > Thanks for sharing your experience. Yes, we are using openib.
> > > > > I'm not very familiar with GAMESS itself, just running it to
> > > > > benchmark our current file system, so I am not sure what stage
> > > > > it got to before the panic. I'll have our users look at the
> > > > > logs. I'm attaching the log here as well.
> > > > >
> > > > > Have you ever found a solution or workaround for that?
> > > > >
> > > > > Thanks,
> > > > > Mi
> > > > >
> > > > > On Wed, 2011-07-13 at 09:50 -0500, Kyle Schochenmaier wrote:
> > > > > > Hi Mi -
> > > > > >
> > > > > > What network interface are you using? I saw some similar
> > > > > > panics years ago when running GAMESS over bmi-openib, where
> > > > > > mop_ids were getting reused incorrectly.
> > > > > >
> > > > > > IIRC GAMESS has a very basic I/O pattern: n simultaneous
> > > > > > reads on n files, followed by a final write? I'm assuming
> > > > > > you make it through the read/compute process prior to the
> > > > > > panic?
> > > > > >
> > > > > > Kyle Schochenmaier
> > > > > >
> > > > > > On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
> > > > > > > Hi Michael,
> > > > > > >
> > > > > > > Yes, the panic is reproducible; I got it about 10 minutes
> > > > > > > after this application (GAMESS) started. I ran GAMESS on
> > > > > > > 16 cores on 2 nodes. It was an MPI program, so I think
> > > > > > > only the master node is doing the writing, and hence the
> > > > > > > panic only occurred on the master node.
> > > > > > >
> > > > > > > When this happens, the pvfs2-server daemon disappears.
> > > > > > > Comparing timestamps in the logs, it looks like the
> > > > > > > pvfs2-server errored first and then the client got the
> > > > > > > "job time out" problem.
> > > > > > >
> > > > > > > I am attaching the logs on both server and client and the
> > > > > > > config file on the server. We have 3 identical
> > > > > > > pvfs2-server nodes (pvfs01-03), but it seems the problem
> > > > > > > only happens on pvfs02.
> > > > > > >
> > > > > > > Your advice is greatly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Mi
> > > > > > >
> > > > > > > On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
> > > > > > > > Hi Mi,
> > > > > > > >
> > > > > > > > I don't think there have been any applicable commits
> > > > > > > > since 06/28 to Orange-Branch that would address this
> > > > > > > > issue. Is the panic consistently reproducible? If so,
> > > > > > > > what workload leads to the panic? Single client with
> > > > > > > > writes to a single file? I'll look at the logs to see if
> > > > > > > > anything stands out; otherwise I may need to locally
> > > > > > > > reproduce the issue to track down what's going on.
> > > > > > > >
> > > > > > > > Thanks for reporting the issue,
> > > > > > > > Michael
> > > > > > > >
> > > > > > > > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I checked out the code from the cvs branch on 6/28. I
> > > > > > > > > don't see an immediate kernel panic any more, but
> > > > > > > > > still got a kernel panic after some intensive writes
> > > > > > > > > to the file system (please see attached screen shot).
> > > > > > > > >
> > > > > > > > > This is what is in the server log:
> > > > > > > > >
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > > > > > > > > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> > > > > > > > >
> > > > > > > > > And this is the client log:
> > > > > > > > >
> > > > > > > > > [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
> > > > > > > > > [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > > > > > > > > [E 16:10:19.371943] Receive immediately failed: Connection refused
> > > > > > > > > [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > > > > > > > > [E 16:10:21.382888] Receive immediately failed: Connection refused
> > > > > > > > >
> > > > > > > > > We have pvfs01-03 running the pvfs-server. Both client
> > > > > > > > > and servers are on CentOS 5 x86_64, kernel version
> > > > > > > > > 2.6.18-238.9.1.el5.
> > > > > > > > >
> > > > > > > > > Any advice?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Mi
> > > > > > > > >
> > > > > > > > > On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
> > > > > > > > > > That did resolve the problem. Thanks.
> > > > > > > > > >
> > > > > > > > > > On 7/7/2011 11:19 AM, Michael Moore wrote:
> > > > > > > > > > > Hi Ted,
> > > > > > > > > > >
> > > > > > > > > > > There was a regression when adding support for
> > > > > > > > > > > newer kernels that made it into the 2.8.4 release.
> > > > > > > > > > > I believe that's the issue you're seeing (a kernel
> > > > > > > > > > > panic immediately on modprobe/insmod). The next
> > > > > > > > > > > release will include that fix. Until then, if you
> > > > > > > > > > > can check out the latest version of the code from
> > > > > > > > > > > CVS, it should resolve the issue. The CVS branch is
> > > > > > > > > > > Orange-Branch; full directions for CVS checkout are
> > > > > > > > > > > at http://www.orangefs.org/support/
> > > > > > > > > > >
> > > > > > > > > > > We are currently running the kernel module with the
> > > > > > > > > > > latest code on CentOS 5 and SL 6 systems. Let me
> > > > > > > > > > > know how it goes.
> > > > > > > > > > >
> > > > > > > > > > > For anyone interested, the commit to resolve the
> > > > > > > > > > > issue was:
> > > > > > > > > > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
> > > > > > > > > > >
> > > > > > > > > > > Michael
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
> > > > > > > > > > > > I have built the kernel module from
> > > > > > > > > > > > orangefs-2.8.4 source against a 64-bit
> > > > > > > > > > > > 2.6.18-238.12.1 linux kernel source, and against
> > > > > > > > > > > > a 32-bit 2.6.18-238.9.1 source. In both cases,
> > > > > > > > > > > > the kernel hung when the module was inserted
> > > > > > > > > > > > with insmod.
> > > > > > > > > > > > The first did report "kernel: Oops: 0000 [1]
> > > > > > > > > > > > SMP". The distributions are Scientific Linux
> > > > > > > > > > > > 5.x, which is rpm-based and similar to CentOS.
> > > > > > > > > > > >
> > > > > > > > > > > > Are there kernels for this scenario for which
> > > > > > > > > > > > the build is known to work? The server build
> > > > > > > > > > > > and install went fine, but I would like to
> > > > > > > > > > > > configure some clients to access orangefs
> > > > > > > > > > > > through a mount point.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks.

--
Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312
Memphis, TN 38105
901.595.5771
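For anyone following the thread, Michael's patch instructions above boil
down to roughly the sequence below. This is a sketch only: the configure
and kmod steps assume the tree was configured with --with-kernel pointing
at the matching kernel source, so double-check the targets against your
own tree.

    cd pvfs2                          # top-level dir from CVS; use 'orangefs' for the release tarball
    patch -p0 < ~/<path to attached patch>
    ./configure --with-kernel=/lib/modules/`uname -r`/build   # only if not yet configured
    make && make kmod                 # rebuild userspace and the kernel module
    make install && make kmod_install # install; typically needs root

The client log from yesterday's run follows.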
[E 14:43:34.809187] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 14:43:34.829408] [INFO]: Mapping pointer 0x2b046490f000 for I/O.
[D 14:43:34.830770] [INFO]: Mapping pointer 0x10e1f000 for I/O.
[E 14:56:43.233779] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 218845.
[E 14:56:43.233815] fp_multiqueue_cancel: flow proto cancel called on 0x111ef988
[E 14:56:43.233824] fp_multiqueue_cancel: I/O error occurred
[E 14:56:43.233831] handle_io_error: flow proto error cleanup started on 0x111ef988: Operation cancelled (possibly due to timeout)
[E 14:56:43.233869] handle_io_error: flow proto 0x111ef988 canceled 1 operations, will clean up.
[E 14:56:43.234092] bmi_to_mem_callback_fn: I/O error occurred
[E 14:56:43.234100] handle_io_error: flow proto 0x111ef988 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 14:56:43.234112] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:03:38.939127] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 321779.
[E 15:03:38.939162] fp_multiqueue_cancel: flow proto cancel called on 0x111ef988
[E 15:03:38.939171] fp_multiqueue_cancel: I/O error occurred
[E 15:03:38.939178] handle_io_error: flow proto error cleanup started on 0x111ef988: Operation cancelled (possibly due to timeout)
[E 15:03:38.939218] handle_io_error: flow proto 0x111ef988 canceled 1 operations, will clean up.
[E 15:03:38.939426] bmi_to_mem_callback_fn: I/O error occurred
[E 15:03:38.939434] handle_io_error: flow proto 0x111ef988 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:03:38.939449] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:09:46.777800] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 408208.
[E 15:09:46.777832] fp_multiqueue_cancel: flow proto cancel called on 0x1110e148
[E 15:09:46.777840] fp_multiqueue_cancel: I/O error occurred
[E 15:09:46.777848] handle_io_error: flow proto error cleanup started on 0x1110e148: Operation cancelled (possibly due to timeout)
[E 15:09:46.777887] handle_io_error: flow proto 0x1110e148 canceled 1 operations, will clean up.
[E 15:09:46.778094] mem_to_bmi_callback_fn: I/O error occurred
[E 15:09:46.778104] handle_io_error: flow proto 0x1110e148 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:09:48.804003] Child process with pid 4153 was killed by an uncaught signal 6
[E 15:09:48.806178] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 15:09:48.825021] [INFO]: Mapping pointer 0x2b785c3c6000 for I/O.
[D 15:09:48.826315] [INFO]: Mapping pointer 0x1826b000 for I/O.
[E 15:18:19.513574] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 180966.
[E 15:18:19.513613] fp_multiqueue_cancel: flow proto cancel called on 0x185dff28
[E 15:18:19.513621] fp_multiqueue_cancel: I/O error occurred
[E 15:18:19.513629] handle_io_error: flow proto error cleanup started on 0x185dff28: Operation cancelled (possibly due to timeout)
[E 15:18:19.513667] handle_io_error: flow proto 0x185dff28 canceled 1 operations, will clean up.
[E 15:18:19.513890] mem_to_bmi_callback_fn: I/O error occurred
[E 15:18:19.513899] handle_io_error: flow proto 0x185dff28 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:18:19.517715] Child process with pid 4380 was killed by an uncaught signal 6
[E 15:18:19.519877] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 15:18:19.538804] [INFO]: Mapping pointer 0x2b4c69624000 for I/O.
[D 15:18:19.540125] [INFO]: Mapping pointer 0x1dd93000 for I/O.
[E 15:25:18.646816] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 96951.
[E 15:25:18.646849] fp_multiqueue_cancel: flow proto cancel called on 0x1df150d8
[E 15:25:18.646857] fp_multiqueue_cancel: I/O error occurred
[E 15:25:18.646865] handle_io_error: flow proto error cleanup started on 0x1df150d8: Operation cancelled (possibly due to timeout)
[E 15:25:18.646903] handle_io_error: flow proto 0x1df150d8 canceled 1 operations, will clean up.
[E 15:25:18.647148] bmi_to_mem_callback_fn: I/O error occurred
[E 15:25:18.647157] handle_io_error: flow proto 0x1df150d8 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:25:18.647170] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:25:18.650983] Child process with pid 4410 was killed by an uncaught signal 6
[E 15:25:18.653123] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 15:25:18.671839] [INFO]: Mapping pointer 0x2b1b47547000 for I/O.
[D 15:25:18.673229] [INFO]: Mapping pointer 0xeedd000 for I/O.
[E 15:30:46.272575] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 37436.
[E 15:30:46.272630] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 37437.
[E 15:30:46.276215] Child process with pid 4428 was killed by an uncaught signal 6
[E 15:30:46.278400] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 15:30:46.297483] [INFO]: Mapping pointer 0x2ba621943000 for I/O.
[D 15:30:46.298816] [INFO]: Mapping pointer 0x1304a000 for I/O.
[E 15:37:18.560857] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 78656.
[E 15:37:18.560895] fp_multiqueue_cancel: flow proto cancel called on 0x133c8548
[E 15:37:18.560903] fp_multiqueue_cancel: I/O error occurred
[E 15:37:18.560910] handle_io_error: flow proto error cleanup started on 0x133c8548: Operation cancelled (possibly due to timeout)
[E 15:37:18.560945] handle_io_error: flow proto 0x133c8548 canceled 1 operations, will clean up.
[E 15:37:18.561165] bmi_to_mem_callback_fn: I/O error occurred
[E 15:37:18.561174] handle_io_error: flow proto 0x133c8548 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:37:18.561186] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:44:37.728727] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 223270.
[E 15:44:37.728758] fp_multiqueue_cancel: flow proto cancel called on 0x133b43d8
[E 15:44:37.728766] fp_multiqueue_cancel: I/O error occurred
[E 15:44:37.728774] handle_io_error: flow proto error cleanup started on 0x133b43d8: Operation cancelled (possibly due to timeout)
[E 15:44:37.728811] handle_io_error: flow proto 0x133b43d8 canceled 1 operations, will clean up.
[E 15:44:37.729030] mem_to_bmi_callback_fn: I/O error occurred
[E 15:44:37.729039] handle_io_error: flow proto 0x133b43d8 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:44:39.761953] Child process with pid 4443 was killed by an uncaught signal 6
[E 15:44:39.764109] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 15:44:39.782909] [INFO]: Mapping pointer 0x2ba57ebb4000 for I/O.
[D 15:44:39.784222] [INFO]: Mapping pointer 0xb79a000 for I/O.
[E 15:53:22.065270] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 207856.
[E 15:53:22.065333] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 207857.
[E 15:53:22.069428] Child process with pid 4478 was killed by an uncaught signal 6
[E 15:53:22.071609] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 15:53:22.090658] [INFO]: Mapping pointer 0x2abb7282a000 for I/O.
[D 15:53:22.091974] [INFO]: Mapping pointer 0x11203000 for I/O.
[E 16:01:40.833736] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 184765.
[E 16:01:40.833765] fp_multiqueue_cancel: flow proto cancel called on 0x1157a718
[E 16:01:40.833774] fp_multiqueue_cancel: I/O error occurred
[E 16:01:40.833782] handle_io_error: flow proto error cleanup started on 0x1157a718: Operation cancelled (possibly due to timeout)
[E 16:01:40.833820] handle_io_error: flow proto 0x1157a718 canceled 1 operations, will clean up.
[E 16:01:40.834037] mem_to_bmi_callback_fn: I/O error occurred
[E 16:01:40.834046] handle_io_error: flow proto 0x1157a718 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 16:01:42.868685] Child process with pid 4503 was killed by an uncaught signal 6
[E 16:01:42.870839] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 16:01:42.889938] [INFO]: Mapping pointer 0x2ba6f51f3000 for I/O.
[D 16:01:42.891225] [INFO]: Mapping pointer 0x14fcd000 for I/O.
[E 16:06:43.923989] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 179.
[E 16:06:43.924270] mem_to_bmi_callback_fn: I/O error occurred
[E 16:06:43.924282] handle_io_error: flow proto error cleanup started on 0x15331aa8:
[E 16:06:43.924290] handle_io_error: flow proto 0x15331aa8 canceled 0 operations, will clean up.
[E 16:06:43.924297] handle_io_error: flow proto 0x15331aa8 error cleanup finished:
[E 16:06:43.924319] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry:
[E 16:06:43.924328] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed:
[E 16:06:43.924335] *** Non-BMI failure.
[E 16:11:45.055178] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 38397.
[E 16:11:45.055458] mem_to_bmi_callback_fn: I/O error occurred
[E 16:11:45.055470] handle_io_error: flow proto error cleanup started on 0x15318c48:
[E 16:11:45.055478] handle_io_error: flow proto 0x15318c48 canceled 0 operations, will clean up.
[E 16:11:45.055486] handle_io_error: flow proto 0x15318c48 error cleanup finished:
[E 16:16:47.052531] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 47026.
[E 16:16:47.052591] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 47030.
[E 16:16:47.052807] bmi_to_mem_callback_fn: I/O error occurred
[E 16:16:47.052821] handle_io_error: flow proto error cleanup started on 0x15318c48:
[E 16:16:47.052830] handle_io_error: flow proto 0x15318c48 canceled 0 operations, will clean up.
[E 16:16:47.052838] handle_io_error: flow proto 0x15318c48 error cleanup finished:
[E 16:16:47.056391] Child process with pid 4550 was killed by an uncaught signal 6
[E 16:16:47.058595] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 16:16:47.077816] [INFO]: Mapping pointer 0x2afb89c1f000 for I/O.
[D 16:16:47.079133] [INFO]: Mapping pointer 0x4ae1000 for I/O.
[E 16:17:07.057884] mem_to_bmi_callback_fn: I/O error occurred
[E 16:17:07.057919] handle_io_error: flow proto error cleanup started on 0x4e40c28:
[E 16:17:07.057928] handle_io_error: flow proto 0x4e40c28 canceled 0 operations, will clean up.
[E 16:17:07.057936] handle_io_error: flow proto 0x4e40c28 error cleanup finished:
[E 16:22:07.741920] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3423.
[E 16:22:07.741985] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3426.
[E 16:22:07.741995] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3429.
[E 16:22:07.742029] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry:
[E 16:22:07.742039] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed:
[E 16:22:07.742046] *** Non-BMI failure.
[E 16:22:07.742063] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry:
[E 16:22:07.742071] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed:
[E 16:22:07.742078] *** Non-BMI failure.
[E 16:22:07.742089] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry:
[E 16:22:07.742097] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed:
[E 16:22:07.742104] *** Non-BMI failure.
[E 16:22:07.747556] Child process with pid 4588 was killed by an uncaught signal 6
[E 16:22:07.749757] PVFS Client Daemon Started. Version 2.8.4-orangefs-2011-07-18-043429
[D 16:22:07.768792] [INFO]: Mapping pointer 0x2b914df81000 for I/O.
[D 16:22:07.770095] [INFO]: Mapping pointer 0xc8b8000 for I/O.
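Since the panic itself has only been captured as a screen shot so far,
setting up netconsole on the client before the next run would preserve
the full panic trace; a sketch, with the IPs, interface, and MAC all
placeholders to fill in:

    # on the client: forward kernel messages over UDP to a log host
    modprobe netconsole netconsole=6665@<client-ip>/eth0,6666@<loghost-ip>/<loghost-mac>
    # on the log host: record whatever arrives (netcat flag syntax varies by flavor)
    nc -u -l 6666 | tee pvfs2-client-panic.log

A captured trace would show whether the panic is inside the pvfs2 kernel
module's I/O path or somewhere else entirely.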
