On Feb 6, 2012, at 3:53 PM, Brian O'Connor wrote:

> Hi Joseph,
>
> Thanks very much for your email.
>
> We actually just had a failure (reboot) of our Globus box just a
> couple hours ago. So this gets at my question below about how to
> clean up after a failure. When the machine rebooted I now see a ton
> of globus-job-managers running as my "seqware" user (the one that
> originally submitted the globus jobs).
> [seqware@sqwprod ~]$ ps aux | grep globus-job-manager | grep seqware | wc -l
> 1837
>
> So there are 1837 of these daemons running.
>

That's probably condor-g restarting job managers automatically. :)

> I can no longer submit a cluster job using:
>
> globus-job-run sqwprod/jobmanager-sge /bin/hostname
>
> It just hangs.
>
> I think this is because there is a lock file:
>
> ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock
>
> My questions are: 1) What's the proper way to reset here: kill all the
> globus-job-managers, remove the lock, and allow the job manager to
> respawn? 2) Why doesn't globus-job-manager (or the gateway) look at
> sge.4572dcea.pid and realize the previous globus-job-manager is dead?
> Shouldn't it detect this, clean up its state, and launch a single
> replacement?
>
> Thanks for your help. I really appreciate it!

If the home filesystem is a shared filesystem, perhaps there might be
some issue with lock state getting mixed up with the reboot? I thought
5.2.0 would put the lock file in /var/lib/globus/gram_job_state/$LOGNAME.
You might get success by adding -globus-job-dir /var/lib/globus/gram_job_state
to /etc/globus/globus-gram-job-manager.conf to force that.

Joe
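
As a minimal sketch, assuming /etc/globus/globus-gram-job-manager.conf is
simply a list of command-line options handed to globus-job-manager at
startup (any options already in the file would stay as they are), the
added line would be:

    -globus-job-dir /var/lib/globus/gram_job_state

Since each job manager reads this file when it starts, job managers that
are already running presumably would not pick the change up until they
are restarted.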
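For Brian's first question (resetting by hand), a rough illustrative
sketch of the steps he describes, assuming the stale lock is the one
named above, that the pkill pattern matches only the leftover job
managers, and that no in-flight jobs still need them:

    # Stop the job managers left over from before the reboot.
    pkill -u seqware -f globus-job-manager

    # Remove the stale lock file so a fresh job manager can take over.
    rm ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock

    # The next submission (or condor-g, which restarts job managers on
    # its own) should then spawn a single new job manager.
    globus-job-run sqwprod/jobmanager-sge /bin/hostname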
