On Feb 6, 2012, at 3:53 PM, Brian O'Connor wrote:

> Hi Joseph,
>
> Thanks very much for your email.
>
> We actually just had a failure (reboot) of our Globus box just a
> couple hours ago. So this gets at my question below about how to
> clean up after a failure. When the machine rebooted I now see a ton
> of globus-job-managers running as my "seqware" user (the one that
> originally submitted the globus jobs).
> [seqware@sqwprod ~]$ ps aux | grep globus-job-manager | grep seqware | wc -l
> 1837
>
> So there are 1837 of these daemons running.
>

That's probably condor-g restarting job managers automatically. :)

> I can no longer submit a cluster job using:
>
> globus-job-run sqwprod/jobmanager-sge /bin/hostname
>
> It just hangs.
>
> I think this is because there is a lock file:
>
> ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock
>
> My questions are: 1) What's the proper way to reset here: kill all the
> globus-job-managers, remove the lock, and allow the job manager to
> respawn? 2) Why doesn't globus-job-manager (or the gateway) look at
> sge.4572dcea.pid and realize the previous globus-job-manager is dead?
> Shouldn't it detect this, clean up its state, and launch a single
> replacement?
>
> Thanks for your help. I really appreciate it!

If the home filesystem is a shared filesystem, perhaps there might be
some issue with lock state getting mixed up with the reboot? I thought
5.2.0 would put the lock file in /var/lib/globus/gram_job_state/$LOGNAME.
You might get success by adding -globus-job-dir /var/lib/globus/gram_job_state
to /etc/globus/globus-gram-job-manager.conf to force that.

Joe
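
As a minimal sketch, assuming /etc/globus/globus-gram-job-manager.conf is
simply a list of command-line options handed to globus-job-manager at
startup (any options already in the file would stay as they are), the
added line would be:

    -globus-job-dir /var/lib/globus/gram_job_state

Since each job manager reads this file when it starts, job managers that
are already running presumably would not pick the change up until they
are restarted.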
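For Brian's first question (resetting by hand), a rough illustrative
sketch of the steps he describes, assuming the stale lock is the one
named above, that the pkill pattern matches only the leftover job
managers, and that no in-flight jobs still need them:

    # Stop the job managers left over from before the reboot.
    pkill -u seqware -f globus-job-manager

    # Remove the stale lock file so a fresh job manager can take over.
    rm ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock

    # The next submission (or condor-g, which restarts job managers on
    # its own) should then spawn a single new job manager.
    globus-job-run sqwprod/jobmanager-sge /bin/hostname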
