Jen,

Can you run Maui under gdb? (See section 14.1.4 of the online docs.)
When the failure occurs, please issue 'where' and send us the output.
We will also attempt to reproduce this locally.

Dave

On Fri, 2005-12-16 at 14:51 -0500, Aquarijen wrote:
> Hi All,
>
> I am not sure if this is a gold question or a maui question - so I am
> posting to both - I hope that is ok...
> Sorry for so many questions lately! So, I made sure that no users on
> the test cluster have usernames beginning with a number. I have gold
> running and I have accounts, projects, machines and users set up with
> 100000000 deposited to each gold account.
> If I configure maui to use gold as its AM, maui pretty much instantly
> dies. I am using maui 3.2.6p13 and gold version 2.0.0.4. I cleared
> out the checkpoint file. I shut everything down and cleared the
> queue. I then started gold, then maui, then pbs_server and then the
> pbs_moms. Maui dies. I've tried this in different orders, too. Maui
> dies if I have the AMCFG line included.
>
> Here is my simple maui.cfg:
>
> # maui.cfg 3.2
> SERVERHOST b05l02
> ADMIN1 root tippensjl
> RMCFG[base] TYPE=PBS
> JOBAGGREGATIONTIME 00:00:10
> RMPOLLINTERVAL 00:00:30
> DOWNNODEDELAYTIME 72:00:00
> SERVERPORT 42559
> SERVERMODE NORMAL
> LOGFILE maui.log
> LOGFILEMAXSIZE 100000000
> LOGLEVEL 9
> QUEUETIMEWEIGHT[0] 10
> FSPOLICY DEDICATEDPS
> FSDEPTH 7
> FSINTERVAL 24:00:00
> FSWEIGHT 1
> FSDECAY 0.80
> BACKFILLPOLICY ON
> BACKFILLTYPE BESTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
> NODEACCESSPOLICY SHARED
> JOBMAXSTARTTIME 2:00:00
> JOBMAXOVERRUN 0:30:00
> AMCFG[bank] TYPE=GOLD HOST=b05l02 PORT=7112 SOCKETPROTOCOL=HTTP
> WIRE-PROTOCOL=XML CHARGEPOLICY=DEBITALLWC JOBFAILUREACTION=NONE
> FLUSHINTERVAL=12:00:00 TIMEOUT=15
>
> And here is my maui-private.cfg:
> CLIENTCFG[AM:bank] CSKEY=sss CSALGO=HMAC
>
> And here is the last little bit of my maui.log. I have loglevel turned up to
> 9.
>
> 12/16 14:32:42 MUserAdd(UName,UP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MCPRestore(USER,tippensjl,Optr)
> 12/16 14:32:42 INFO: no checkpoint entry for object 'USER
> tippensjl '
> 12/16 14:32:42 INFO: user tippensjl added
> 12/16 14:32:42 INFO: PBS attribute 'job_state' value: 'Q' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'queue' value: 'workq' (r: NULL)
> 12/16 14:32:42 MReqSetAttr(44,RQ,ReqClass,Value,1,2)
> 12/16 14:32:42 INFO: job flags for job 44: 0
> 12/16 14:32:42 MJobSetAttr(44,GAttr,Value,1,5)
> 12/16 14:32:42 MUMAGetBM(JFeature,PREEMPTEE,3)
> 12/16 14:32:42 INFO: attribute 'PREEMPTEE' cleared for job 44
> 12/16 14:32:42 MJobGetPAL(44,RPAL,PAL,NULL)
> 12/16 14:32:42 INFO: PBS attribute 'server' value: 'b05l02' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Checkpoint' value: 'u' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'ctime' value: '1134761206' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Error_Path' value:
> 'b05l02:/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111/jen-b5.e44'
> (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Hold_Types' value: 'n' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Join_Path' value: 'n' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Keep_Files' value: 'n' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Mail_Points' value: 'ae' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Mail_Users' value:
> '[EMAIL PROTECTED]' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'mtime' value: '1134761206' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Output_Path' value:
> 'b05l02:/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111/jen-b5.o44'
> (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Priority' value: '0' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'qtime' value: '1134761206' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Rerunable' value: 'True' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Resource_List' value:
> '10000:00:00' (r: cput)
> 12/16 14:32:42 INFO: PBS attribute 'Resource_List' value: '1' (r: ncpus)
> 12/16 14:32:42 INFO: PBS attribute 'Resource_List' value:
> '30:ppn=2' (r: neednodes)
> 12/16 14:32:42 __MPBSGetTaskList(44,30:ppn=2,NULL,0)
> 12/16 14:32:42 MReqSetAttr(44,RQ,ReqNodeFeature,Value,1,2)
> 12/16 14:32:42 INFO: 0 host task(s) located for job
> 12/16 14:32:42 INFO: PBS attribute 'Resource_List' value: '30'
> (r: nodect)
> 12/16 14:32:42 INFO: PBS attribute 'Resource_List'
> value: '30:ppn=2' (r: nodes)
> 12/16 14:32:42 INFO: processing node request line '30:ppn=2'
> 12/16 14:32:42 __MPBSGetTaskList(44,30:ppn=2,NULL,0)
> 12/16 14:32:42 MReqSetAttr(44,RQ,ReqNodeFeature,Value,1,2)
> 12/16 14:32:42 INFO: 0 host task(s) located for job
> 12/16 14:32:42 INFO: PBS attribute 'Resource_List' value:
> '10000:00:00' (r: walltime)
> 12/16 14:32:42 INFO: PBS attribute 'Shell_Path_List' value:
> '/bin/bash' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'substate' value: '10' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'Variable_List' value:
> 'PBS_O_HOME=/home/2vt,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=tippensjl,PBS_O_PATH=/opt/intel/cce/9.0/bin:/opt/intel/fce/9.0/bin:/usr/kerberos/bin:/opt/mpich-ch_p4-icc-1.2.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/home/2vt/bin,PBS_O_MAIL=/var/spool/mail/tippensjl,PBS_O_SHELL=/bin/bash,PBS_O_HOST=b05l02,PBS_O_WORKDIR=/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111,MODULE_VERSION_STACK=3.1.6,MANPATH=/opt/intel/cce/9.0/man:/opt/intel/fce/9.0/man:/opt/mpich-ch_p4-icc-1.2.7/man:/opt/modules/default/man:/usr/share/man:/usr/man:/usr/local/share/man:/usr/local/man:/usr/X11R6/man:/opt/pbs/man:/opt/env-switcher/man:/opt/kernel_picker/man:/opt/pvm3/man,HOSTNAME=b05l02,PVM_RSH=ssh,_MODULESBEGINENV_=/home/2vt/.modulesbeginenv,SHELL=/bin/bash,TERM=xterm,HISTSIZE=1000,TMPDIR=/home/2vt/.tmpdir,MODULE_SHELL=sh,OLDPWD=/home/2vt,MODULE_OSCAR_USER=tippensjl,USER=tippensjl,LD_LIBRARY_PATH=/opt/intel/mkl72/lib/em64t:/opt/intel/cce/9.0/lib:/opt/intel/fce/9.0/lib,LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:,ENV=/home/2vt/.bashrc,OSCAR_HOME=/opt/oscar,PVM_ROOT=/opt/pvm3,PVM_ARCH=LINUX,MODULE_VERSION=3.1.6,MAIL=/var/spool/mail/tippensjl,PATH=/opt/intel/cce/9.0/bin:/opt/intel/fce/9.0/bin:/usr/kerberos/bin:/opt/mpich-ch_p4-icc-1.2.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/home/2vt/bin,INPUTRC=/etc/inputrc,PWD=/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111,_LMFILES_=/opt/modules/oscar-modulefiles/default-manpath/1.0.1:/opt/modules/oscar-modulefiles/torque/1.2.0p5:/opt/env-switcher/share/env-switcher/mpi/mpich-ch_p4-icc-1.2.7:/opt/modules/oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/kernel_picker/1.4.1.3:/opt/modules/oscar-modulefiles/pvm/3.4.5+4:/opt/modules/modulefiles/oscar-modules/1.0.5:/opt/modules/modulefiles/iforte/9.0:/opt/modules/modulefiles/icce/9.0:/opt/modules/modulefiles/mkl-em64t/7.2,LANG=en_US.UTF-8,MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-modulefiles:/opt/modules/version:/opt/modules/$MODULE_VERSION/modulefiles:/opt/modules/modulefiles:,LOADEDMODULES=default-manpath/1.0.1:torque/1.2.0p5:mpi/mpich-ch_p4-icc-1.2.7:switcher/1.0.13:kernel_picker/1.4.1.3:pvm/3.4.5+4:oscar-modules/1.0.5:iforte/9.0:icce/9.0:mkl-em64t/7.2,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=1,HOME=/home/2vt,LOGNAME=tippensjl,MODULESHOME=/opt/modules/3.1.6,LESSOPEN=|/usr/bin/lesspipe.sh
> %s,G_BROKEN_FILENAMES=1,_=/opt/pbs/bin/qsub,PBS_O_QUEUE=workq' (r:
> NULL)
> 12/16 14:32:42 INFO: PBS attribute 'euser' value: 'tippensjl' (r: NULL)
> 12/16 14:32:42 MUserAdd(UName,UP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 INFO: PBS attribute 'egroup' value: 'tippensjl' (r: NULL)
> 12/16 14:32:42 MGroupAdd(GName,GP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MCPRestore(GROUP,tippensjl,Optr)
> 12/16 14:32:42 INFO: no checkpoint entry for object 'GROUP
> tippensjl '
> 12/16 14:32:42 INFO: group tippensjl added
> 12/16 14:32:42 INFO: PBS attribute 'queue_rank' value: '41' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'queue_type' value: 'E' (r: NULL)
> 12/16 14:32:42 INFO: PBS attribute 'etime' value: '1134761206' (r: NULL)
> 12/16 14:32:42 MJobSetCreds(44,tippensjl,tippensjl,)
> 12/16 14:32:42 MUserAdd(UName,UP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MGroupAdd(GName,GP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO: hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MJobGetAccount(44,A)
> 12/16 14:32:42 MAMAccountGetDefault(tippensjl,AName,RIndex)
> 12/16 14:32:42 MSSSDoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
> 12/16 14:32:42 MSysEMSubmit(EM,scheduler,comcom,scheduler,allocation-manager;)
> 12/16 14:32:42 INFO: EM disabled
> 12/16 14:32:42 MSUConnect(S,TRUE,EMsg)
> 12/16 14:32:42 INFO: trying to connect to 192.168.79.231 (Port: 7112)
> 12/16 14:32:42 INFO: successful connect to TCP server (sd: 10)
> 12/16 14:32:42 MSUSendData(S,15000000,FALSE,FALSE)
> 12/16 14:32:42 MSecGetChecksum(Buf,185,Checksum,HMAC64,CSKey)
> 12/16 14:32:42 MSecHMACGetDigest(sss,3,<Body actor="root"><Request
> action="Query" actor="root"><Object>User</Object><Where
> name="Special">False</Where><Get name="Name"></Get><Get
> name="DefaultProject"></Get></Request></Body>,185,CSString,20,DigestString,TRUE,TRUE)
> 12/16 14:32:42 __MSecSHA1Init(context)
> 12/16 14:32:42 __MSecSHA1Transform(context)
>
> And that's it - it just dies. I have the feeling that this is
> something fairly easy that I didn't set up correctly... Just can't
> seem to find what it is - I'm pretty new at this... Oh, yeah, I am
> using torque 2.0.0p2 if that makes a difference.
>
> Thank you for any help you can give - I'm pulling my hair out. :-O :)
>
> -Jen
>
> Jennifer Tippens
> Unix Admin, ORNL Institutional Cluster
> Oak Ridge National Lab
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
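
For the gdb run requested at the top of this thread, a non-interactive session could be prepared roughly as sketched below. This is only a sketch under assumptions: the path in MAUI_BIN is a hypothetical placeholder (no install path is given in this thread), and the script just writes a gdb command file and prints the command line, it does not start Maui. If Maui detaches into the background on your build, section 14.1.4 of the docs is the place to check for how to keep it attached to the debugger.

```shell
#!/bin/sh
# Sketch only: prepare a gdb command file that starts Maui and prints a
# backtrace once the process stops (for example on a crash).
# MAUI_BIN is a hypothetical install path; point it at your own maui binary.
MAUI_BIN="${MAUI_BIN:-/usr/local/maui/sbin/maui}"

# Temporary file holding the gdb commands.
GDB_CMDS="$(mktemp)"

# 'run' starts the program; 'where' prints the call stack after it stops.
cat > "$GDB_CMDS" <<'EOF'
run
where
EOF

# Print the command instead of executing it, so nothing is started by accident.
# -batch makes gdb exit after the commands finish; -x reads commands from a file.
echo "gdb -batch -x $GDB_CMDS $MAUI_BIN"
```

Running the printed command as the user that normally starts Maui, with the output redirected to a file, should leave the 'where' backtrace at the end of that file, ready to send to the list.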
