Yuriy wrote:
> Cannot reproduce it anymore... I submitted jobs with/without
> delegation, with/without streaming, with globus-delegate for
> credential and without, and none of them were killed... In fact I
> cannot see any user jobs dying for about a week now. Maybe it is
> related to the state of the container?
>
> Is there anything in the logs that could indicate the moment that
> some credential was removed and the reason for it?
Not by default. You can set the log level for the delegation service
to debug (log4j.category.org.globus.delegation.service=DEBUG in
$GLOBUS_LOCATION/container-log4j.properties); the log then tells you
that a delegation resource is being destroyed, but unfortunately
it does not tell you the id/name of the resource.
As far as I know, the reason for removal can be:
- an explicit call to destroy by a client
- a client/service tries to access the credential and it has expired.
I don't think there is a general periodic sweep that destroys expired
persisted delegation resources.
>
> The persisted/../DelegationResource/ folder (this is where credentials
> are stored, right?)
right.
> contains 1200 files, most of the related jobs are
> probably dead. Is there any way to decipher those files and see what is
> inside?
>
Delegated credentials are serialized Java objects (DelegationResource objects).
I attached a small program that reads all serialized delegated credentials
from the persistence directory and prints information about each of them.
Set the variable "persistenceDirName" to the persistence directory of the
delegated credentials before you compile it.
Compile it:
- source ${GLOBUS_LOCATION}/etc/globus-devel-env.sh (assuming bash/bourne shell)
- javac CheckDelegationResources.java (assuming java 1.4+)
Run it:
- java CheckDelegationResources
This program won't win a beauty contest; extend it as you need.
Hope this helps.
-Martin
> Cheers,
> Yuriy
>
>
> On Mon, Aug 10, 2009 at 08:24:35AM -0500, Martin Feller wrote:
>> What very probably happens is that a credential delegated to the
>> server has expired. In that case it is removed on the server side,
>> and jobs that still refer to such a (no longer existing) credential
>> fail with the error message you pasted.
>>
>> How do you delegate the credential that is being used by jobs:
>> * Do you let globusrun-ws delegate for you?
>> * Do you delegate a credential, e.g. using globus-credential-delegate
>> and refer to the credential in your job description or let globusrun-ws
>> pick up the epr of the manually delegated credential?
>>
>> You can debug this e.g. like this:
>> * Submit jobs that do not require a delegated credential and see if the
>> same problem still occurs. From your description I'd say that those jobs
>> will not fail.
>> * Delegate a credential that is valid for, say, 60h, using
>> globus-credential-delegate and refer to that credential in your jobs.
>> (globusrun-ws options: -Jf, -Sf) and check if the jobs still fail after
>> 24h.
>>
>> Maybe worth noting: sometimes people delegate although they don't really
>> need to delegate, i.e. the job does not need a job credential and no
>> staging is performed.
>>
>> -Martin
>>
>> Yuriy wrote:
>>> Hi,
>>>
>>> Some of the jobs submitted to torque via GRAM are killed after about
>>> 24 hours in the queue, all with a similar message in the globus logs:
>>>
>>> 2009-07-10 11:32:16,052 INFO exec.StateMachine
>>> [RunQueueThread_5,logJobFailed:3250] Job
>>> 74bd3c60-6c17-11de-9a06-9ba1d1ebd14a failed. Description: Couldn't obtain a
>>> delegated credential. Cause: org.globus.exec.generated.FaultType: Couldn't
>>> obtain a delegated credential. caused by [0:
>>> org.oasis.wsrf.faults.BaseFaultType: Error getting delegation resource
>>> [Caused by: org.globus.wsrf.NoSuchResourceException]]
>>>
>>> torque reports exit status = 271 (exceeds resource limit or killed by
>>> user), none of the "problematic" jobs seem to exceed any
>>> limits. Moreover, we had a lot of jobs that ran for longer than 24 hours
>>> and completed successfully (sometimes users just re-submitted jobs
>>> with the same description and using exactly the same tools and they
>>> completed without any problems).
>>>
>>> All problematic jobs were submitted with the globusrun-ws tool.
>>>
>>> Could anyone explain what is going on here?
>>>
>>>
>>> Currently we use the globus version from VDT 1.10; we started with VDT 1.6.
>>> Judging from the logs, we have had the same problem for over a year, but not
>>> many people are affected and most just re-submit without
>>> reporting.
>>>
>>> Cheers,
>>> Yuriy
>>>
>>
>>
import java.io.File;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.Calendar;
import java.util.Date;

/**
 * Prints caller DN, local name, termination time and expiry status of
 * every serialized delegation resource in the persistence directory.
 */
public class CheckDelegationResources {

    public static void main(String[] args) throws Exception {
        // Fill in path to persistence directory of delegated credentials
        String persistenceDirName = "";
        File persistenceDir = new File(persistenceDirName);
        if (!persistenceDir.isDirectory()) {
            System.err.println(persistenceDirName
                + " does not exist or is not a directory");
            System.exit(1);
        }
        // list() returns null if the directory cannot be read
        String[] resources = persistenceDir.list();
        if (resources == null) {
            System.err.println("Unable to list " + persistenceDirName);
            System.exit(1);
        }
        for (int i = 0; i < resources.length; i++) {
            printInfo(new File(persistenceDir, resources[i]));
        }
    }

    public static void printInfo(File delegationResource) throws Exception {
        String path = delegationResource.getAbsolutePath();
        if (!delegationResource.exists()) {
            throw new Exception("File " + path + " does not exist");
        }
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(delegationResource);
            ObjectInputStream ois = new ObjectInputStream(fis);
            // The fields must be read in the order they were serialized
            String callerDN = (String) ois.readObject();
            String localName = (String) ois.readObject();
            ois.readObject(); // resource descriptor path, not printed
            Calendar terminationTime = (Calendar) ois.readObject();
            // ignore the rest
            Date date = terminationTime.getTime();
            // Print the file path, underlined with '#' characters
            System.out.println(path);
            for (int i = 0; i < path.length(); i++) {
                System.out.print("#");
            }
            System.out.println("");
            System.out.println("caller DN: " + callerDN);
            System.out.println("local name: " + localName);
            System.out.println("termination time: " + date.toString());
            System.out.println("expired: " + date.before(new Date()));
            System.out.println("");
        } catch (Exception e) {
            throw new Exception("Unable to load delegation resource " + path, e);
        } finally {
            if (fis != null) {
                try {
                    fis.close();
                } catch (Exception ee) {
                    // nothing useful can be done if close fails
                }
            }
        }
    }
}
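
With around 1200 files in the persistence directory, a summary may be more useful than one block of output per file. The following is only a sketch of such an extension, not part of the original attachment. It assumes the same serialization order that the program above reads (caller DN, local name, resource descriptor path, termination time). The class name ExpiredDelegationCounter and its methods are made up for illustration, and it needs Java 7 or later because of try-with-resources. A file without a termination time is counted as not expired, and a file that cannot be read in this layout is counted as unreadable instead of aborting the run.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Calendar;
import java.util.Date;

// Hypothetical helper: counts expired, valid and unreadable delegation resources.
public class ExpiredDelegationCounter {

    // Returns the termination time, or null if none was stored.
    // ASSUMPTION: same field order as in CheckDelegationResources.
    public static Date readTerminationTime(File f)
            throws IOException, ClassNotFoundException {
        try (FileInputStream fis = new FileInputStream(f);
             ObjectInputStream ois = new ObjectInputStream(fis)) {
            ois.readObject(); // caller DN
            ois.readObject(); // local name
            ois.readObject(); // resource descriptor path
            Calendar terminationTime = (Calendar) ois.readObject();
            return terminationTime == null ? null : terminationTime.getTime();
        }
    }

    // A resource without a termination time is treated as not expired.
    public static boolean isExpired(File f, Date now)
            throws IOException, ClassNotFoundException {
        Date termination = readTerminationTime(f);
        return termination != null && termination.before(now);
    }

    // Returns {expired, valid, unreadable} for the regular files in dir.
    public static int[] count(File dir, Date now) {
        int[] counts = new int[3];
        File[] files = dir.listFiles();
        if (files == null) {
            return counts; // not a directory, or not readable
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            try {
                counts[isExpired(f, now) ? 0 : 1]++;
            } catch (Exception e) {
                counts[2]++; // not in the assumed layout, or I/O error
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java ExpiredDelegationCounter <persistence dir>");
            System.exit(1);
        }
        int[] c = count(new File(args[0]), new Date());
        System.out.println("expired: " + c[0]);
        System.out.println("valid: " + c[1]);
        System.out.println("unreadable: " + c[2]);
    }
}
```

Compile it the same way as the program above and pass the persistence directory as the only argument. Since it only reads the files, it should be safe to run while the container is up.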