My apologies; that was the wrong version of the patch (notably, it was missing semicolons). Please try this one.
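
In case it's useful, here is a tiny, self-contained illustration of what the -p0 strip level does (all file names below are made up for the example):

```shell
# Hypothetical demo of patch -p0: the relative paths in the diff header
# are used as-is from the current directory, so run patch from the
# tree's top-level directory (pvfs2 or orangefs, per the directions below).
mkdir -p demo/src
printf 'hello\n' > demo/src/greeting.txt
cat > demo/fix.patch <<'EOF'
--- src/greeting.txt
+++ src/greeting.txt
@@ -1 +1 @@
-hello
+hello, patched
EOF
(cd demo && patch -p0 < fix.patch)
cat demo/src/greeting.txt    # now reads "hello, patched"
```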

Thanks,
Michael

On Thu, Jul 14, 2011 at 1:22 PM, Michael Moore <[email protected]> wrote:

> Hi Mi,
>
> I haven't had a chance to reproduce the issue you're seeing yet. However,
> the panic seems to point in the direction of this patch. Could you apply the
> attached patch and let me know if it resolves or changes the panic you're
> seeing? For reference, if you need it:
>
> From the top-level pvfs2 directory of the source (if from CVS) or the
> orangefs directory (if from the release), run
> patch -p0 < ~/<path to attached patch>. If applying against the
> release, it will apply with a small offset. Let me know how it goes.
>
> Thanks,
> Michael
>
>
> On Wed, Jul 13, 2011 at 11:39 AM, Mi Zhou <[email protected]> wrote:
>
>> Hi Kyle,
>>
>> Thanks for sharing your experience.
>> Yes, we are using openib. I'm not very familiar with GAMESS itself; I'm
>> just running it to benchmark our current file system, so I am not sure
>> what stage it reached before the panic. I'll have our users look at the
>> logs. I'm attaching the log here as well.
>>
>> Have you ever found a solution or workaround for that?
>>
>> Thanks,
>>
>> Mi
>>
>> On Wed, 2011-07-13 at 09:50 -0500, Kyle Schochenmaier wrote:
>> > Hi Mi -
>> >
>> >
>> > What network interface are you using?
>> > I saw some similar panics years ago when running GAMESS over
>> > bmi-openib where mop_ids were getting reused incorrectly.
>> >
>> >
>> > IIRC GAMESS has a very basic I/O pattern: n simultaneous reads on n
>> > files, followed by a final write?
>> > I'm assuming you make it through the read/compute process prior to
>> > the panic?
>> >
>> >
>> > Kyle Schochenmaier
>> >
>> >
>> > On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
>> >         Hi Michael,
>> >
>> >         Yes, the panic is reproducible; I got it about 10 minutes
>> >         after this application (GAMESS) started. I ran GAMESS on 16
>> >         cores on 2 nodes. It is an MPI program, so I think only the
>> >         master node is doing the writing, and hence the panic only
>> >         occurred on the master node.
>> >
>> >         When this happens, the pvfs2-server daemon disappears.
>> >         Comparing timestamps in the logs, it looks like the
>> >         pvfs2-server errored first and then the client got the
>> >         "job time out" problem.
>> >
>> >         I am attaching the logs from both server and client and the
>> >         config file from the server. We have 3 identical
>> >         pvfs2-server nodes (pvfs01-03), but it seems the problem
>> >         only happens on pvfs02.
>> >
>> >         Your advice is greatly appreciated.
>> >
>> >         Thanks,
>> >
>> >         Mi
>> >
>> >         On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
>> >         > Hi Mi,
>> >         >
>> >         > I don't think there have been any applicable commits since
>> >         > 06/28 to Orange-Branch that would address this issue. Is
>> >         > the panic consistently reproducible? If so, what workload
>> >         > leads to the panic? A single client with writes to a single
>> >         > file? I'll look at the logs to see if anything stands out;
>> >         > otherwise I may need to reproduce the issue locally to
>> >         > track down what's going on.
>> >         >
>> >         > Thanks for reporting the issue,
>> >         > Michael
>> >         >
>> >         > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou
>> >         <[email protected]> wrote:
>> >         >         Hi,
>> >         >
>> >         >         I checked out the code from the CVS branch on
>> >         >         6/28. I don't see an immediate kernel panic any
>> >         >         more, but I still got a kernel panic after some
>> >         >         intensive writes to the file system (please see
>> >         >         the attached screenshot).
>> >         >
>> >         >         This is what is in the server log:
>> >         >         pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
>> >         >         pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>> >         >
>> >         >
>> >         >
>> >         >         And this is the client log:
>> >         >
>> >         >         [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
>> >         >         [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
>> >         >         [E 16:10:19.371943] Receive immediately failed: Connection refused
>> >         >         [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
>> >         >         [E 16:10:21.382888] Receive immediately failed: Connection refused
>> >         >
>> >         >         We have pvfs01-03 running pvfs2-server. Both
>> >         >         client and server are on CentOS 5 x86_64, kernel
>> >         >         version 2.6.18-238.9.1.el5.
>> >         >
>> >         >         Any advice?
>> >         >
>> >         >         Thanks,
>> >         >
>> >         >         Mi
>> >         >
>> >         >
>> >         >
>> >         >         On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth
>> >         wrote:
>> >         >         > That did resolve the problem. Thanks.
>> >         >         >
>> >         >         > On 7/7/2011 11:19 AM, Michael Moore wrote:
>> >         >         > > Hi Ted,
>> >         >         > >
>> >         >         > > There was a regression when adding support for newer
>> >         >         > > kernels that made it into the 2.8.4 release. I believe
>> >         >         > > that's the issue you're seeing (a kernel panic
>> >         >         > > immediately on modprobe/insmod). The next release will
>> >         >         > > include that fix. Until then, if you can check out the
>> >         >         > > latest version of the code from CVS, it should resolve
>> >         >         > > the issue. The CVS branch is Orange-Branch; full
>> >         >         > > directions for CVS checkout are at
>> >         >         > > http://www.orangefs.org/support/
>> >         >         > >
>> >         >         > > We are currently running the kernel module with the
>> >         >         > > latest code on CentOS 5 and SL 6 systems. Let me know
>> >         >         > > how it goes.
>> >         >         > >
>> >         >         > > For anyone interested, the commit to resolve the issue
>> >         >         > > was:
>> >         >         > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
>> >         >         > >
>> >         >         > > Michael
>> >         >         > >
>> >         >         > >
>> >         >         > > On Thu, Jul 7, 2011 at 11:36 AM, Ted
>> >         >         Hesselroth<[email protected]>  wrote:
>> >         >         > >
>> >         >         > >> I have built the kernel module from orangefs-2.8.4
>> >         >         > >> source against a 64-bit 2.6.18-238.12.1 Linux kernel
>> >         >         > >> source, and against a 32-bit 2.6.18-238.9.1 source.
>> >         >         > >> In both cases, the kernel hung when the module was
>> >         >         > >> inserted with insmod. The first did report "kernel:
>> >         >         > >> Oops: 0000 [1] SMP". The distributions are Scientific
>> >         >         > >> Linux 5.x, which is rpm-based and similar to CentOS.
>> >         >         > >>
>> >         >         > >> Are there kernels for this scenario for which the
>> >         >         > >> build is known to work? The server build and install
>> >         >         > >> went fine, but I would like to configure some clients
>> >         >         > >> to access orangefs through a mount point.
>> >         >         > >>
>> >         >         > >> Thanks.
>> >         >         > >>
>> >         >         > >>
>> >         >         > >> _______________________________________________
>> >         >         > >> Pvfs2-users mailing list
>> >         >         > >> [email protected]
>> >         >         > >> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>> >         >         > >
>> >         >
>> >         >         > _______________________________________________
>> >         >         > Pvfs2-users mailing list
>> >         >         > [email protected]
>> >         >
>> >         >         >
>> >         >
>> >         http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>> >         >         >
>> >         >
>> >         >         --
>> >         >
>> >         >         Mi Zhou
>> >         >         System Integration Engineer
>> >         >         Information Sciences
>> >         >         St. Jude Children's Research Hospital
>> >         >         262 Danny Thomas Pl. MS 312
>> >         >         Memphis, TN 38105
>> >         >         901.595.5771
>> >         >
>> >         >         Email Disclaimer:  www.stjude.org/emaildisclaimer
>> >         >
>> >
>> >         --
>> >
>> >
>> >         Mi Zhou
>> >         System Integration Engineer
>> >         Information Sciences
>> >         St. Jude Children's Research Hospital
>> >         262 Danny Thomas Pl. MS 312
>> >         Memphis, TN 38105
>> >         901.595.5771
>> >
>> >
>> >         _______________________________________________
>> >         Pvfs2-users mailing list
>> >         [email protected]
>> >         http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>> >
>> >
>> >
>> --
>>
>> Mi Zhou
>> System Integration Engineer
>> Information Sciences
>> St. Jude Children's Research Hospital
>> 262 Danny Thomas Pl. MS 312
>> Memphis, TN 38105
>> 901.595.5771
>>
>
>

Attachment: kernel-lock-op-init.patch
Description: Binary data

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
