Michael,

I reran the program yesterday afternoon and killed it when I started to see
errors in the application log. Neither the server nor the client had crashed
at that point. When I looked at it this morning, I saw the kernel panic on
the client. The server is still running. Here is the client log.

Thanks,

Mi 

On Mon, 2011-07-18 at 10:08 -0500, Michael Moore wrote:
> Mi,
> 
> There are a couple issues here, one is the kernel panic and the other
> is the server segfault. The clocks between the two hosts appear to be
> out of sync based on the logs. Can you infer or have you observed what
> the pattern of failure is (e.g. server crash -> client I/O failure ->
> kernel panic)?  I'm still looking at the kernel panic just trying to
> better understand which code paths are involved.
> 
> Thanks,
> Michael
> 
> On Mon, Jul 18, 2011 at 10:57 AM, Mi Zhou <[email protected]> wrote:
>         Michael,
>         
>         Thanks for looking into this.
>         
>         I applied the patch and recompiled the kernel module, but still got
>         a kernel panic after a while. However, the errors on the client and
>         server seem to be different than before:
>         
>         Client log:
>         
>         [E 23:48:00.514201] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
>         [D 23:48:00.535157] [INFO]: Mapping pointer 0x2ad023ca8000 for I/O.
>         [D 23:48:00.536534] [INFO]: Mapping pointer 0x3756000 for I/O.
>         [E 23:55:53.162291] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 107661.
>         [E 23:55:53.162340] fp_multiqueue_cancel: flow proto cancel called on 0x3bf4d78
>         [E 23:55:53.162349] fp_multiqueue_cancel: I/O error occurred
>         [E 23:55:53.162358] handle_io_error: flow proto error cleanup started on 0x3bf4d78: Operation cancelled (possibly due to timeout)
>         [E 23:55:53.162400] handle_io_error: flow proto 0x3bf4d78 canceled 1 operations, will clean up.
>         [E 23:55:53.167283] mem_to_bmi_callback_fn: I/O error occurred
>         [E 23:55:53.167293] handle_io_error: flow proto 0x3bf4d78 error cleanup finished: Operation cancelled (possibly due to timeout)
>         [E 23:55:55.201498] Child process with pid 17377 was killed by an uncaught signal 6
>         [E 23:55:55.203703] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
>         
>         
>         Server log:
>         
>         [E 07/18/2011 01:41:23] Error: ib_check_cq: unknown send state (unknown) (0) of sq 0x25be0e0.
>         [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
>         [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x447180]
>         [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
>         [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
>         [E 07/18/2011 01:41:23] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
>         [E 07/18/2011 01:41:23] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
>         [E 07/18/2011 01:41:23] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>         
>         
>         Thanks,
>         
>         Mi
>         
>         
>         
>         On Thu, 2011-07-14 at 12:53 -0500, Michael Moore wrote:
>         > My apologies, that was the wrong version of the patch (notably
>         > missing semicolons). Try this one, please.
>         >
>         > Thanks,
>         > Michael
>         >
>         > On Thu, Jul 14, 2011 at 1:22 PM, Michael Moore
>         <[email protected]>
>         > wrote:
>         >         Hi Mi,
>         >
>         >         I haven't had a chance to reproduce the issue you're seeing
>         >         yet. However, the panic seems to point in the direction of
>         >         this patch. Could you apply the attached patch and let me
>         >         know if it resolves or changes the panic you're seeing?
>         >
>         >         For reference, if you need it: from the top-level pvfs2
>         >         directory of the source (if from CVS) or the orangefs
>         >         directory (if the release), run patch -p0 <
>         >         ~/<path to attached path>. If applying against the release,
>         >         it will apply with a little bit of an offset. Let me know
>         >         how it goes.
>         >
>         >         Thanks,
>         >         Michael
>         >
>         >
>         >
>         >         On Wed, Jul 13, 2011 at 11:39 AM, Mi Zhou
>         <[email protected]>
>         >         wrote:
>         >                 Hi Kyle,
>         >
>         >                 Thanks for sharing your experience.
>         >                 Yes, we are using openib. I'm not very familiar
>         >                 with GAMESS itself; I'm just running it to
>         >                 benchmark our current file system, so I am not sure
>         >                 what stage it reached before the panic. I'll have
>         >                 our users look at the logs. I'm attaching the log
>         >                 here as well.
>         >
>         >                 Have you ever found a solution or workaround
>         for that?
>         >
>         >                 Thanks,
>         >
>         >                 Mi
>         >
>         >
>         >                 On Wed, 2011-07-13 at 09:50 -0500, Kyle
>         Schochenmaier
>         >                 wrote:
>         >                 > Hi Mi -
>         >                 >
>         >                 > What network interface are you using? I saw some
>         >                 > similar panics years ago when running GAMESS over
>         >                 > bmi-openib, where mop_ids were getting reused
>         >                 > incorrectly.
>         >                 >
>         >                 > IIRC GAMESS has a very basic I/O pattern: n
>         >                 > simultaneous reads on n files, followed by a
>         >                 > final write? I'm assuming you make it through the
>         >                 > read/compute process prior to the panic?
>         >                 >
>         >                 > Kyle Schochenmaier
>         >                 >
>         >                 >
>         >                 > On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou
>         >                 <[email protected]> wrote:
>         >                 >         Hi Michael,
>         >                 >
>         >                 >         Yes, the panic is reproducible; I got it
>         >                 >         about 10 minutes after this application
>         >                 >         (GAMESS) started. I ran GAMESS on 16
>         >                 >         cores on 2 nodes. It is an MPI program,
>         >                 >         so I think only the master node is doing
>         >                 >         the writing, and hence the panic only
>         >                 >         occurred on the master node.
>         >                 >
>         >                 >         When this happens, the pvfs2-server
>         >                 >         daemon disappears. Comparing timestamps
>         >                 >         in the logs, it looks like the
>         >                 >         pvfs2-server errored first and then the
>         >                 >         client got the "job time out" problem.
>         >                 >
>         >                 >         I am attaching the logs from both server
>         >                 >         and client, and the config file on the
>         >                 >         server. We have 3 identical pvfs2-server
>         >                 >         nodes (pvfs01-03), but it seems the
>         >                 >         problem only happens on pvfs02.
>         >                 >
>         >                 >         Your advice is greatly appreciated.
>         >                 >
>         >                 >         Thanks,
>         >                 >
>         >                 >         Mi
>         >                 >
>         >                 >         On Tue, 2011-07-12 at 20:54 -0500,
>         Michael
>         >                 Moore wrote:
>         >                 >         > Hi Mi,
>         >                 >         >
>         >                 >         > I don't think there have been any
>         >                 >         > applicable commits to Orange-Branch
>         >                 >         > since 06/28 that would address this
>         >                 >         > issue. Is the panic consistently
>         >                 >         > reproducible? If so, what workload
>         >                 >         > leads to the panic? A single client
>         >                 >         > with writes to a single file? I'll look
>         >                 >         > at the logs to see if anything stands
>         >                 >         > out; otherwise I may need to reproduce
>         >                 >         > the issue locally to track down what's
>         >                 >         > going on.
>         >                 >         >
>         >                 >         > Thanks for reporting the issue,
>         >                 >         > Michael
>         >                 >         >
>         >                 >         > On Tue, Jul 12, 2011 at 5:43 PM,
>         Mi Zhou
>         >                 >         <[email protected]> wrote:
>         >                 >         >         Hi,
>         >                 >         >
>         >                 >         >         I checked out the code from
>         >                 >         >         the CVS branch on 6/28. I don't
>         >                 >         >         see an immediate kernel panic
>         >                 >         >         any more, but I still got a
>         >                 >         >         kernel panic after some
>         >                 >         >         intensive writes to the file
>         >                 >         >         system (please see the attached
>         >                 >         >         screenshot).
>         >                 >         >
>         >                 >         >         This is what is in the server log:
>         >                 >         >
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
>         >                 >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>         >                 >         >
>         >                 >         >         And this is the client log:
>         >                 >         >
>         >                 >         >         [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
>         >                 >         >         [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
>         >                 >         >         [E 16:10:19.371943] Receive immediately failed: Connection refused
>         >                 >         >         [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
>         >                 >         >         [E 16:10:21.382888] Receive immediately failed: Connection refused
>         >                 >         >
>         >                 >         >         We have pvfs01-03 running the
>         >                 >         >         pvfs2-server. Both client and
>         >                 >         >         server are on CentOS 5 x86_64,
>         >                 >         >         kernel version
>         >                 >         >         2.6.18-238.9.1.el5.
>         >                 >         >
>         >                 >         >         Any advice?
>         >                 >         >
>         >                 >         >         Thanks,
>         >                 >         >
>         >                 >         >         Mi
>         >                 >         >
>         >                 >         >
>         >                 >         >
>         >                 >         >         On Thu, 2011-07-07 at
>         12:33 -0500,
>         >                 Ted Hesselroth
>         >                 >         wrote:
>         >                 >         >         > That did resolve the problem.
>         >                 >         >         > Thanks.
>         >                 >         >         >
>         >                 >         >         > On 7/7/2011 11:19 AM,
>         Michael
>         >                 Moore wrote:
>         >                 >         >         > > Hi Ted,
>         >                 >         >         > >
>         >                 >         >         > > There was a regression when adding support for newer kernels
>         >                 >         >         > > that made it into the 2.8.4 release. I believe that's the
>         >                 >         >         > > issue you're seeing (a kernel panic immediately on
>         >                 >         >         > > modprobe/insmod). The next release will include that fix.
>         >                 >         >         > > Until then, if you can check out the latest version of the
>         >                 >         >         > > code from CVS, it should resolve the issue. The CVS branch is
>         >                 >         >         > > Orange-Branch; full directions for CVS checkout are at
>         >                 >         >         > > http://www.orangefs.org/support/
>         >                 >         >         > >
>         >                 >         >         > > We are currently running the kernel module with the latest
>         >                 >         >         > > code on CentOS 5 and SL 6 systems. Let me know how it goes.
>         >                 >         >         > >
>         >                 >         >         > > For anyone interested, the commit to resolve the issue was:
>         >                 >         >         > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
>         >                 >         >         > >
>         >                 >         >         > > Michael
>         >                 >         >         > >
>         >                 >         >         > >
>         >                 >         >         > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
>         >                 >         >         > >
>         >                 >         >         > >> I have built the kernel module from orangefs-2.8.4 source
>         >                 >         >         > >> against a 64-bit 2.6.18-238.12.1 linux kernel source, and
>         >                 >         >         > >> against a 32-bit 2.6.18-238.9.1 source. In both cases, the
>         >                 >         >         > >> kernel hung when the module was inserted with insmod. The
>         >                 >         >         > >> first did report "kernel: Oops: 0000 [1] SMP". The
>         >                 >         >         > >> distributions are Scientific Linux 5.x, which is rpm-based
>         >                 >         >         > >> and similar to CentOS.
>         >                 >         >         > >>
>         >                 >         >         > >> Are there kernels for this scenario for which the build is
>         >                 >         >         > >> known to work? The server build and install went fine, but
>         >                 >         >         > >> I would like to configure some clients to access orangefs
>         >                 >         >         > >> through a mount point.
>         >                 >         >         > >>
>         >                 >         >         > >> Thanks.
>         >                 >         >         > >>
>         >                 >         >         > >>
>         >                 >         >         > >> _______________________________________________
>         >                 >         >         > >> Pvfs2-users mailing list
>         >                 >         >         > >> [email protected]
>         >                 >         >         > >> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>         >                 >         >         > >
>         >                 >         >
>         >                 >         >         >
>         >                 >         >         --
>         >                 >         >         Mi Zhou
>         >                 >         >         System Integration Engineer
>         >                 >         >         Information Sciences
>         >                 >         >         St. Jude Children's Research Hospital
>         >                 >         >         262 Danny Thomas Pl. MS 312
>         >                 >         >         Memphis, TN 38105
>         >                 >         >         901.595.5771
>         >                 >         >
>         >                 >         >         Email Disclaimer: www.stjude.org/emaildisclaimer
>         >                 >         >
>         >                 >
>         >                 >
>         >                 >
>         >                 >
>         >
>         >                 >
>         >                 >
>         >
>         >
>         >
>         >
>         >
>         
>         
> 
-- 

Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312 
Memphis, TN 38105
901.595.5771
[E 14:43:34.809187] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 14:43:34.829408] [INFO]: Mapping pointer 0x2b046490f000 for I/O.
[D 14:43:34.830770] [INFO]: Mapping pointer 0x10e1f000 for I/O.
[E 14:56:43.233779] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 218845.
[E 14:56:43.233815] fp_multiqueue_cancel: flow proto cancel called on 0x111ef988
[E 14:56:43.233824] fp_multiqueue_cancel: I/O error occurred
[E 14:56:43.233831] handle_io_error: flow proto error cleanup started on 0x111ef988: Operation cancelled (possibly due to timeout)
[E 14:56:43.233869] handle_io_error: flow proto 0x111ef988 canceled 1 operations, will clean up.
[E 14:56:43.234092] bmi_to_mem_callback_fn: I/O error occurred
[E 14:56:43.234100] handle_io_error: flow proto 0x111ef988 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 14:56:43.234112] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:03:38.939127] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 321779.
[E 15:03:38.939162] fp_multiqueue_cancel: flow proto cancel called on 0x111ef988
[E 15:03:38.939171] fp_multiqueue_cancel: I/O error occurred
[E 15:03:38.939178] handle_io_error: flow proto error cleanup started on 0x111ef988: Operation cancelled (possibly due to timeout)
[E 15:03:38.939218] handle_io_error: flow proto 0x111ef988 canceled 1 operations, will clean up.
[E 15:03:38.939426] bmi_to_mem_callback_fn: I/O error occurred
[E 15:03:38.939434] handle_io_error: flow proto 0x111ef988 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:03:38.939449] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:09:46.777800] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 408208.
[E 15:09:46.777832] fp_multiqueue_cancel: flow proto cancel called on 0x1110e148
[E 15:09:46.777840] fp_multiqueue_cancel: I/O error occurred
[E 15:09:46.777848] handle_io_error: flow proto error cleanup started on 0x1110e148: Operation cancelled (possibly due to timeout)
[E 15:09:46.777887] handle_io_error: flow proto 0x1110e148 canceled 1 operations, will clean up.
[E 15:09:46.778094] mem_to_bmi_callback_fn: I/O error occurred
[E 15:09:46.778104] handle_io_error: flow proto 0x1110e148 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:09:48.804003] Child process with pid 4153 was killed by an uncaught signal 6
[E 15:09:48.806178] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 15:09:48.825021] [INFO]: Mapping pointer 0x2b785c3c6000 for I/O.
[D 15:09:48.826315] [INFO]: Mapping pointer 0x1826b000 for I/O.
[E 15:18:19.513574] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 180966.
[E 15:18:19.513613] fp_multiqueue_cancel: flow proto cancel called on 0x185dff28
[E 15:18:19.513621] fp_multiqueue_cancel: I/O error occurred
[E 15:18:19.513629] handle_io_error: flow proto error cleanup started on 0x185dff28: Operation cancelled (possibly due to timeout)
[E 15:18:19.513667] handle_io_error: flow proto 0x185dff28 canceled 1 operations, will clean up.
[E 15:18:19.513890] mem_to_bmi_callback_fn: I/O error occurred
[E 15:18:19.513899] handle_io_error: flow proto 0x185dff28 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:18:19.517715] Child process with pid 4380 was killed by an uncaught signal 6
[E 15:18:19.519877] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 15:18:19.538804] [INFO]: Mapping pointer 0x2b4c69624000 for I/O.
[D 15:18:19.540125] [INFO]: Mapping pointer 0x1dd93000 for I/O.
[E 15:25:18.646816] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 96951.
[E 15:25:18.646849] fp_multiqueue_cancel: flow proto cancel called on 0x1df150d8
[E 15:25:18.646857] fp_multiqueue_cancel: I/O error occurred
[E 15:25:18.646865] handle_io_error: flow proto error cleanup started on 0x1df150d8: Operation cancelled (possibly due to timeout)
[E 15:25:18.646903] handle_io_error: flow proto 0x1df150d8 canceled 1 operations, will clean up.
[E 15:25:18.647148] bmi_to_mem_callback_fn: I/O error occurred
[E 15:25:18.647157] handle_io_error: flow proto 0x1df150d8 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:25:18.647170] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:25:18.650983] Child process with pid 4410 was killed by an uncaught signal 6
[E 15:25:18.653123] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 15:25:18.671839] [INFO]: Mapping pointer 0x2b1b47547000 for I/O.
[D 15:25:18.673229] [INFO]: Mapping pointer 0xeedd000 for I/O.
[E 15:30:46.272575] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 37436.
[E 15:30:46.272630] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 37437.
[E 15:30:46.276215] Child process with pid 4428 was killed by an uncaught signal 6
[E 15:30:46.278400] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 15:30:46.297483] [INFO]: Mapping pointer 0x2ba621943000 for I/O.
[D 15:30:46.298816] [INFO]: Mapping pointer 0x1304a000 for I/O.
[E 15:37:18.560857] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 78656.
[E 15:37:18.560895] fp_multiqueue_cancel: flow proto cancel called on 0x133c8548
[E 15:37:18.560903] fp_multiqueue_cancel: I/O error occurred
[E 15:37:18.560910] handle_io_error: flow proto error cleanup started on 0x133c8548: Operation cancelled (possibly due to timeout)
[E 15:37:18.560945] handle_io_error: flow proto 0x133c8548 canceled 1 operations, will clean up.
[E 15:37:18.561165] bmi_to_mem_callback_fn: I/O error occurred
[E 15:37:18.561174] handle_io_error: flow proto 0x133c8548 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:37:18.561186] io_datafile_complete_operations: flow failed, retrying from msgpair
[E 15:44:37.728727] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 223270.
[E 15:44:37.728758] fp_multiqueue_cancel: flow proto cancel called on 0x133b43d8
[E 15:44:37.728766] fp_multiqueue_cancel: I/O error occurred
[E 15:44:37.728774] handle_io_error: flow proto error cleanup started on 0x133b43d8: Operation cancelled (possibly due to timeout)
[E 15:44:37.728811] handle_io_error: flow proto 0x133b43d8 canceled 1 operations, will clean up.
[E 15:44:37.729030] mem_to_bmi_callback_fn: I/O error occurred
[E 15:44:37.729039] handle_io_error: flow proto 0x133b43d8 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 15:44:39.761953] Child process with pid 4443 was killed by an uncaught signal 6
[E 15:44:39.764109] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 15:44:39.782909] [INFO]: Mapping pointer 0x2ba57ebb4000 for I/O.
[D 15:44:39.784222] [INFO]: Mapping pointer 0xb79a000 for I/O.
[E 15:53:22.065270] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 207856.
[E 15:53:22.065333] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 207857.
[E 15:53:22.069428] Child process with pid 4478 was killed by an uncaught signal 6
[E 15:53:22.071609] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 15:53:22.090658] [INFO]: Mapping pointer 0x2abb7282a000 for I/O.
[D 15:53:22.091974] [INFO]: Mapping pointer 0x11203000 for I/O.
[E 16:01:40.833736] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 184765.
[E 16:01:40.833765] fp_multiqueue_cancel: flow proto cancel called on 0x1157a718
[E 16:01:40.833774] fp_multiqueue_cancel: I/O error occurred
[E 16:01:40.833782] handle_io_error: flow proto error cleanup started on 0x1157a718: Operation cancelled (possibly due to timeout)
[E 16:01:40.833820] handle_io_error: flow proto 0x1157a718 canceled 1 operations, will clean up.
[E 16:01:40.834037] mem_to_bmi_callback_fn: I/O error occurred
[E 16:01:40.834046] handle_io_error: flow proto 0x1157a718 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 16:01:42.868685] Child process with pid 4503 was killed by an uncaught signal 6
[E 16:01:42.870839] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 16:01:42.889938] [INFO]: Mapping pointer 0x2ba6f51f3000 for I/O.
[D 16:01:42.891225] [INFO]: Mapping pointer 0x14fcd000 for I/O.
[E 16:06:43.923989] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 179.
[E 16:06:43.924270] mem_to_bmi_callback_fn: I/O error occurred
[E 16:06:43.924282] handle_io_error: flow proto error cleanup started on 0x15331aa8: 
[E 16:06:43.924290] handle_io_error: flow proto 0x15331aa8 canceled 0 operations, will clean up.
[E 16:06:43.924297] handle_io_error: flow proto 0x15331aa8 error cleanup finished: 
[E 16:06:43.924319] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry: 
[E 16:06:43.924328] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed: 
[E 16:06:43.924335] *** Non-BMI failure.
[E 16:11:45.055178] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 38397.
[E 16:11:45.055458] mem_to_bmi_callback_fn: I/O error occurred
[E 16:11:45.055470] handle_io_error: flow proto error cleanup started on 0x15318c48: 
[E 16:11:45.055478] handle_io_error: flow proto 0x15318c48 canceled 0 operations, will clean up.
[E 16:11:45.055486] handle_io_error: flow proto 0x15318c48 error cleanup finished: 
[E 16:16:47.052531] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 47026.
[E 16:16:47.052591] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 47030.
[E 16:16:47.052807] bmi_to_mem_callback_fn: I/O error occurred
[E 16:16:47.052821] handle_io_error: flow proto error cleanup started on 0x15318c48: 
[E 16:16:47.052830] handle_io_error: flow proto 0x15318c48 canceled 0 operations, will clean up.
[E 16:16:47.052838] handle_io_error: flow proto 0x15318c48 error cleanup finished: 
[E 16:16:47.056391] Child process with pid 4550 was killed by an uncaught signal 6
[E 16:16:47.058595] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 16:16:47.077816] [INFO]: Mapping pointer 0x2afb89c1f000 for I/O.
[D 16:16:47.079133] [INFO]: Mapping pointer 0x4ae1000 for I/O.
[E 16:17:07.057884] mem_to_bmi_callback_fn: I/O error occurred
[E 16:17:07.057919] handle_io_error: flow proto error cleanup started on 0x4e40c28: 
[E 16:17:07.057928] handle_io_error: flow proto 0x4e40c28 canceled 0 operations, will clean up.
[E 16:17:07.057936] handle_io_error: flow proto 0x4e40c28 error cleanup finished: 
[E 16:22:07.741920] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3423.
[E 16:22:07.741985] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3426.
[E 16:22:07.741995] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3429.
[E 16:22:07.742029] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry: 
[E 16:22:07.742039] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed: 
[E 16:22:07.742046] *** Non-BMI failure.
[E 16:22:07.742063] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry: 
[E 16:22:07.742071] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed: 
[E 16:22:07.742078] *** Non-BMI failure.
[E 16:22:07.742089] Warning: msgpair failed to ib://pvfs02:3337,tcp://pvfs02:3336, will retry: 
[E 16:22:07.742097] *** msgpairarray_completion_fn: msgpair to server ib://pvfs02:3337,tcp://pvfs02:3336 failed: 
[E 16:22:07.742104] *** Non-BMI failure.
[E 16:22:07.747556] Child process with pid 4588 was killed by an uncaught signal 6
[E 16:22:07.749757] PVFS Client Daemon Started.  Version 2.8.4-orangefs-2011-07-18-043429
[D 16:22:07.768792] [INFO]: Mapping pointer 0x2b914df81000 for I/O.
[D 16:22:07.770095] [INFO]: Mapping pointer 0xc8b8000 for I/O.
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users