Hi Kyle,
Thanks for sharing your experience.
Yes, we are using openib. I'm not very familiar with GAMESS itself; I'm
just running it to benchmark our current file system, so I'm not sure
what stage it had reached before the panic. I'll have our users look at
the logs. I'm attaching the log here as well.
Have you ever found a solution or workaround for that?
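
In case it helps, here is the kind of rough, standalone reproduction I
was planning to try on our side, loosely following the read-then-write
pattern you describe. This is only a sketch under my own assumptions,
not GAMESS's actual IO; the scratch path, file count, and sizes below
are placeholders for a directory on the PVFS2 mount:

    #!/bin/sh
    # Rough approximation only: write n scratch files, read them back
    # concurrently, then do one large final write. Paths and sizes are
    # placeholders, not values taken from the real GAMESS job.
    SCRATCH=/hpcf/scratch/iotest
    mkdir -p "$SCRATCH"
    for i in $(seq 1 16); do
        dd if=/dev/zero of="$SCRATCH/read.$i" bs=1M count=512 2>/dev/null
    done
    # n simultaneous reads
    for i in $(seq 1 16); do
        dd if="$SCRATCH/read.$i" of=/dev/null bs=1M 2>/dev/null &
    done
    wait
    # single large final write
    dd if=/dev/zero of="$SCRATCH/final.out" bs=1M count=4096 2>/dev/null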
Thanks,
Mi
On Wed, 2011-07-13 at 09:50 -0500, Kyle Schochenmaier wrote:
> Hi Mi -
>
>
> What network interface are you using?
> I saw some similar panics years ago when running GAMESS over
> bmi-openib where mop_ids were getting reused incorrectly.
>
>
> IIRC GAMESS has a very basic IO pattern: n simultaneous reads on n
> files, followed by a final write?
> I'm assuming you make it through the read/compute process prior to
> the panic?
>
>
> Kyle Schochenmaier
>
>
> On Wed, Jul 13, 2011 at 9:41 AM, Mi Zhou <[email protected]> wrote:
> Hi Michael,
>
> Yes, the panic is reproducible; I got it about 10 minutes after the
> application (GAMESS) started. I ran GAMESS on 16 cores across 2
> nodes. It is an MPI program, so I think only the master node does the
> writing, which is why the panic only occurred on the master node.
>
> When this happens, the pvfs2-server daemon disappears. Comparing
> timestamps in the logs, it looks like the pvfs2-server errored first
> and then the client got the "job time out" problem.
>
> I am attaching the logs from both the server and the client, and the
> config file from the server. We have 3 identical pvfs2-server nodes
> (pvfs01-03), but it seems the problem only happens on pvfs02.
>
> Your advice is greatly appreciated.
>
> Thanks,
>
> Mi
>
> On Tue, 2011-07-12 at 20:54 -0500, Michael Moore wrote:
> > Hi Mi,
> >
> > I don't think there have been any applicable commits since 06/28 to
> > Orange-Branch that would address this issue. Is the panic
> > consistently reproducible? If so, what workload leads to the panic?
> > Single client with writes to a single file? I'll look at the logs to
> > see if anything stands out, otherwise I may need to locally
> > reproduce the issue to track down what's going on.
> >
> > Thanks for reporting the issue,
> > Michael
> >
> > On Tue, Jul 12, 2011 at 5:43 PM, Mi Zhou <[email protected]> wrote:
> > Hi,
> >
> > I checked out the code from the CVS branch on 6/28. I don't see an
> > immediate kernel panic any more, but I still got a kernel panic
> > after some intensive writes to the file system (please see the
> > attached screenshot).
> >
> > This is what is in the server log:
> > pvfs02: [E 07/12/2011 16:05:30] Error: encourage_recv_incoming: mop_id 17bb01c0 in RTS_DONE message not found.
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d3a]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f30]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448a79]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445263]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b27c]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > pvfs02: [E 07/12/2011 16:05:30] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> >
> >
> >
> > And this is the client log:
> >
> > [E 16:10:17.359727] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 173191.
> > [E 16:10:19.371926] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > [E 16:10:19.371943] Receive immediately failed: Connection refused
> > [E 16:10:21.382875] Warning: ib_tcp_client_connect: connect to server pvfs02:3337: Connection refused.
> > [E 16:10:21.382888] Receive immediately failed: Connection refused
> >
> > We have pvfs01-03 running the pvfs2-server. Both client and server
> > are on CentOS 5 x86_64, kernel version 2.6.18-238.9.1.el5.
> >
> > Any advice?
> >
> > Thanks,
> >
> > Mi
> >
> >
> >
> > On Thu, 2011-07-07 at 12:33 -0500, Ted Hesselroth wrote:
> > > That did resolve the problem. Thanks.
> > >
> > > On 7/7/2011 11:19 AM, Michael Moore wrote:
> > > > Hi Ted,
> > > >
> > > > There was a regression when adding support for newer kernels
> > > > that made it into the 2.8.4 release. I believe that's the issue
> > > > you're seeing (a kernel panic immediately on modprobe/insmod).
> > > > The next release will include that fix. Until then, if you can
> > > > check out the latest version of the code from CVS, it should
> > > > resolve the issue. The CVS branch is Orange-Branch; full
> > > > directions for CVS checkout are at http://www.orangefs.org/support/
> > > >
> > > > We are currently running the kernel module with the latest code
> > > > on CentOS 5 and SL 6 systems. Let me know how it goes.
> > > >
> > > > For anyone interested, the commit to resolve the issue was:
> > > > http://www.pvfs.org/fisheye/changelog/~br=Orange-Branch/PVFS/?cs=Orange-Branch:mtmoore:20110530154853
> > > >
> > > > Michael
> > > >
> > > >
> > > > On Thu, Jul 7, 2011 at 11:36 AM, Ted Hesselroth <[email protected]> wrote:
> > > >
> > > >> I have built the kernel module from orangefs-2.8.4 source
> > > >> against a 64-bit 2.6.18-238.12.1 Linux kernel source, and
> > > >> against a 32-bit 2.6.18-238.9.1 source. In both cases, the
> > > >> kernel hung when the module was inserted with insmod. The first
> > > >> did report "kernel: Oops: 0000 [1] SMP". The distributions are
> > > >> Scientific Linux 5.x, which is rpm-based and similar to CentOS.
> > > >>
> > > >> Are there kernels for this scenario for which the build is
> > > >> known to work? The server build and install went fine, but I
> > > >> would like to configure some clients to access orangefs through
> > > >> a mount point.
> > > >>
> > > >> Thanks.
> > > >>
> > > >>
> > > >
> >
> > >
> >
> >
>
>
>
>
>
>
--
Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312
Memphis, TN 38105
901.595.5771
-catch_rsh /nfs_exports/apps/sge/ge6.2u5/default/spool/node033/active_jobs/1472833.1/pe_hostfile
node033
node034
SGE has assigned the following compute nodes to this run:
node033
node034
----- GAMESS execution script -----
This job is running on host node033
under operating system Linux at Tue Jul 12 16:03:17 CDT 2011
Available scratch disk space (Kbyte units) at beginning of the job is
Filesystem 1K-blocks Used Available Use% Mounted on
ib://pvfs03:3337/pvfs2-fs
26369925120 17895424 26352029696 1% /hpcf/scratch
cp AGCT.inp /hpcf/scratch/1472833.1.test/AGCT.F05
unset echo
setenv ERICFMT /hpcf/apps/gamess/gamess-063011/auxdata/ericfmt.dat
setenv MCPPATH /hpcf/apps/gamess/gamess-063011/auxdata/MCP
setenv BASPATH /hpcf/apps/gamess/gamess-063011/auxdata/BASES
setenv QUANPOL /hpcf/apps/gamess/gamess-063011/auxdata/QUANPOL
setenv MAKEFP /hpcf/scratch/1472833.1.test/AGCT.efp
setenv GAMMA /hpcf/scratch/1472833.1.test/AGCT.gamma
setenv TRAJECT /hpcf/scratch/1472833.1.test/AGCT.trj
setenv RESTART /hpcf/scratch/1472833.1.test/AGCT.rst
setenv INPUT /hpcf/scratch/1472833.1.test/AGCT.F05
setenv PUNCH /hpcf/scratch/1472833.1.test/AGCT.dat
setenv AOINTS /hpcf/scratch/1472833.1.test/AGCT.F08
setenv MOINTS /hpcf/scratch/1472833.1.test/AGCT.F09
setenv DICTNRY /hpcf/scratch/1472833.1.test/AGCT.F10
setenv DRTFILE /hpcf/scratch/1472833.1.test/AGCT.F11
setenv CIVECTR /hpcf/scratch/1472833.1.test/AGCT.F12
setenv CASINTS /hpcf/scratch/1472833.1.test/AGCT.F13
setenv CIINTS /hpcf/scratch/1472833.1.test/AGCT.F14
setenv WORK15 /hpcf/scratch/1472833.1.test/AGCT.F15
setenv WORK16 /hpcf/scratch/1472833.1.test/AGCT.F16
setenv CSFSAVE /hpcf/scratch/1472833.1.test/AGCT.F17
setenv FOCKDER /hpcf/scratch/1472833.1.test/AGCT.F18
setenv WORK19 /hpcf/scratch/1472833.1.test/AGCT.F19
setenv DASORT /hpcf/scratch/1472833.1.test/AGCT.F20
setenv DFTINTS /hpcf/scratch/1472833.1.test/AGCT.F21
setenv DFTGRID /hpcf/scratch/1472833.1.test/AGCT.F22
setenv JKFILE /hpcf/scratch/1472833.1.test/AGCT.F23
setenv ORDINT /hpcf/scratch/1472833.1.test/AGCT.F24
setenv EFPIND /hpcf/scratch/1472833.1.test/AGCT.F25
setenv PCMDATA /hpcf/scratch/1472833.1.test/AGCT.F26
setenv PCMINTS /hpcf/scratch/1472833.1.test/AGCT.F27
setenv SVPWRK1 /hpcf/scratch/1472833.1.test/AGCT.F26
setenv SVPWRK2 /hpcf/scratch/1472833.1.test/AGCT.F27
setenv COSCAV /hpcf/scratch/1472833.1.test/AGCT.F26
setenv COSDATA /hpcf/scratch/1472833.1.test/AGCT.cosmo
setenv COSPOT /hpcf/scratch/1472833.1.test/AGCT.pot
setenv MLTPL /hpcf/scratch/1472833.1.test/AGCT.F28
setenv MLTPLT /hpcf/scratch/1472833.1.test/AGCT.F29
setenv DAFL30 /hpcf/scratch/1472833.1.test/AGCT.F30
setenv SOINTX /hpcf/scratch/1472833.1.test/AGCT.F31
setenv SOINTY /hpcf/scratch/1472833.1.test/AGCT.F32
setenv SOINTZ /hpcf/scratch/1472833.1.test/AGCT.F33
setenv SORESC /hpcf/scratch/1472833.1.test/AGCT.F34
setenv GCILIST /hpcf/scratch/1472833.1.test/AGCT.F37
setenv HESSIAN /hpcf/scratch/1472833.1.test/AGCT.F38
setenv QMMMTEI /hpcf/scratch/1472833.1.test/AGCT.F39
setenv SOCCDAT /hpcf/scratch/1472833.1.test/AGCT.F40
setenv AABB41 /hpcf/scratch/1472833.1.test/AGCT.F41
setenv BBAA42 /hpcf/scratch/1472833.1.test/AGCT.F42
setenv BBBB43 /hpcf/scratch/1472833.1.test/AGCT.F43
setenv MCQD50 /hpcf/scratch/1472833.1.test/AGCT.F50
setenv MCQD51 /hpcf/scratch/1472833.1.test/AGCT.F51
setenv MCQD52 /hpcf/scratch/1472833.1.test/AGCT.F52
setenv MCQD53 /hpcf/scratch/1472833.1.test/AGCT.F53
setenv MCQD54 /hpcf/scratch/1472833.1.test/AGCT.F54
setenv MCQD55 /hpcf/scratch/1472833.1.test/AGCT.F55
setenv MCQD56 /hpcf/scratch/1472833.1.test/AGCT.F56
setenv MCQD57 /hpcf/scratch/1472833.1.test/AGCT.F57
setenv MCQD58 /hpcf/scratch/1472833.1.test/AGCT.F58
setenv MCQD59 /hpcf/scratch/1472833.1.test/AGCT.F59
setenv MCQD60 /hpcf/scratch/1472833.1.test/AGCT.F60
setenv MCQD61 /hpcf/scratch/1472833.1.test/AGCT.F61
setenv MCQD62 /hpcf/scratch/1472833.1.test/AGCT.F62
setenv MCQD63 /hpcf/scratch/1472833.1.test/AGCT.F63
setenv MCQD64 /hpcf/scratch/1472833.1.test/AGCT.F64
setenv NMRINT1 /hpcf/scratch/1472833.1.test/AGCT.F61
setenv NMRINT2 /hpcf/scratch/1472833.1.test/AGCT.F62
setenv NMRINT3 /hpcf/scratch/1472833.1.test/AGCT.F63
setenv NMRINT4 /hpcf/scratch/1472833.1.test/AGCT.F64
setenv NMRINT5 /hpcf/scratch/1472833.1.test/AGCT.F65
setenv NMRINT6 /hpcf/scratch/1472833.1.test/AGCT.F66
setenv DCPHFH2 /hpcf/scratch/1472833.1.test/AGCT.F67
setenv DCPHF21 /hpcf/scratch/1472833.1.test/AGCT.F68
setenv ELNUINT /hpcf/scratch/1472833.1.test/AGCT.F67
setenv NUNUINT /hpcf/scratch/1472833.1.test/AGCT.F68
setenv GVVPT /hpcf/scratch/1472833.1.test/AGCT.F69
setenv NUMOIN /hpcf/scratch/1472833.1.test/AGCT.F69
setenv NUMOCAS /hpcf/scratch/1472833.1.test/AGCT.F70
setenv NUELMO /hpcf/scratch/1472833.1.test/AGCT.F71
setenv NUELCAS /hpcf/scratch/1472833.1.test/AGCT.F72
setenv RIVMAT /hpcf/scratch/1472833.1.test/AGCT.F51
setenv RIT2A /hpcf/scratch/1472833.1.test/AGCT.F52
setenv RIT3A /hpcf/scratch/1472833.1.test/AGCT.F53
setenv RIT2B /hpcf/scratch/1472833.1.test/AGCT.F54
setenv RIT3B /hpcf/scratch/1472833.1.test/AGCT.F55
setenv GMCREF /hpcf/scratch/1472833.1.test/AGCT.F70
setenv GMCO2R /hpcf/scratch/1472833.1.test/AGCT.F71
setenv GMCROC /hpcf/scratch/1472833.1.test/AGCT.F72
setenv GMCOOC /hpcf/scratch/1472833.1.test/AGCT.F73
setenv GMCCC0 /hpcf/scratch/1472833.1.test/AGCT.F74
setenv GMCHMA /hpcf/scratch/1472833.1.test/AGCT.F75
setenv GMCEI1 /hpcf/scratch/1472833.1.test/AGCT.F76
setenv GMCEI2 /hpcf/scratch/1472833.1.test/AGCT.F77
setenv GMCEOB /hpcf/scratch/1472833.1.test/AGCT.F78
setenv GMCEDT /hpcf/scratch/1472833.1.test/AGCT.F79
setenv GMCERF /hpcf/scratch/1472833.1.test/AGCT.F80
setenv GMCHCR /hpcf/scratch/1472833.1.test/AGCT.F81
setenv GMCGJK /hpcf/scratch/1472833.1.test/AGCT.F82
setenv GMCGAI /hpcf/scratch/1472833.1.test/AGCT.F83
setenv GMCGEO /hpcf/scratch/1472833.1.test/AGCT.F84
setenv GMCTE1 /hpcf/scratch/1472833.1.test/AGCT.F85
setenv GMCTE2 /hpcf/scratch/1472833.1.test/AGCT.F86
setenv GMCHEF /hpcf/scratch/1472833.1.test/AGCT.F87
setenv GMCMOL /hpcf/scratch/1472833.1.test/AGCT.F88
setenv GMCMOS /hpcf/scratch/1472833.1.test/AGCT.F89
setenv GMCWGT /hpcf/scratch/1472833.1.test/AGCT.F90
setenv GMCRM2 /hpcf/scratch/1472833.1.test/AGCT.F91
setenv GMCRM1 /hpcf/scratch/1472833.1.test/AGCT.F92
setenv GMCR00 /hpcf/scratch/1472833.1.test/AGCT.F93
setenv GMCRP1 /hpcf/scratch/1472833.1.test/AGCT.F94
setenv GMCRP2 /hpcf/scratch/1472833.1.test/AGCT.F95
setenv GMCVEF /hpcf/scratch/1472833.1.test/AGCT.F96
setenv GMCDIN /hpcf/scratch/1472833.1.test/AGCT.F97
setenv GMC2SZ /hpcf/scratch/1472833.1.test/AGCT.F98
setenv GMCCCS /hpcf/scratch/1472833.1.test/AGCT.F99
unset echo
This is a GDDI run, keeping various output files on local disks
setenv OUTPUT /hpcf/scratch/1472833.1.test/AGCT.F06
setenv PUNCH /hpcf/scratch/1472833.1.test/AGCT.F07
unset echo
setenv I_MPI_WAIT_MODE enable
setenv I_MPI_PIN disable
setenv I_MPI_DEBUG 0
setenv I_MPI_STATS 0
setenv I_MPI_DAT_LIBRARY libdat2.so
setenv I_MPI_FABRICS dapl
unset echo
scp /hpcf/scratch/1472833.1.test/AGCT.F05 node034:/hpcf/scratch/1472833.1.test/AGCT.F05
MPI kickoff will run GAMESS on 16 cores in 2 nodes.
The binary to be executed is /hpcf/apps/gamess/gamess-063011/gamess.01.x
MPI will run 16 compute processes and 16 data servers,
placing 8 of each process type onto each node.
The scratch disk space on each node is /hpcf/scratch/1472833.1.test, with free space
Filesystem 1K-blocks Used Available Use% Mounted on
ib://pvfs03:3337/pvfs2-fs
26369925120 17895424 26352029696 1% /hpcf/scratch
mpdboot --rsh=ssh -n 2 -f /hpcf/scratch/1472833.1.test/AGCT.nodes.mpd
mpiexec -configfile /hpcf/scratch/1472833.1.test/AGCT.processes.mpd
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 3 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 7 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 1 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 5 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 6 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 4 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 10 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 12 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 9 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 13 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 14 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 11 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 8 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 15 NGROUPS= 1 MYGROUP= 0
GDDI IS ABOUT TO SWITCH GROUPS: MEGLOB= 2 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 2 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 5 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 15 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 3 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 7 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 10 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 1 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 13 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 4 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 12 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 11 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 14 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 9 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 8 NGROUPS= 1 MYGROUP= 0
GDDI HAS SWITCHED GROUPS: MEGLOB= 6 NGROUPS= 1 MYGROUP= 0
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users