I just found this page: http://www.globus.org/api/c-globus-4.0/globus_scheduler_event_generator/html/group__seg__api.html
So it *is* a failure code... wow. Okay, now can someone point me to a list
of such failure codes so I can choose something more generic or appropriate
than what we're currently using?

Thanks again,
Adam

On Fri, Apr 4, 2008 at 10:39 AM, Adam Bazinet <[EMAIL PROTECTED]> wrote:

> Thanks for the replies, guys. I think it is the case that there is an
> actual underlying job failure (which is my fault for not finding earlier)
> and that the error message is a complete red herring. Basically, we have a
> SEG that we wrote from scratch for BOINC jobs, and in the case that a job
> fails for some reason the following code gets executed:
>
> //job failed
> time_t ts;
> globus_scheduler_event_failed(ts, wu_name, 5);
> fprintf(fp, "%s : Set %s to failed\n",asctime(t), wu_name);
> break;
>
> Now, it's my guess that the "Invalid executable path" error generated
> before is just some kind of default error message, mainly because I don't
> see any place in our code where we specify the error message, unless that
> "5" as the last argument to globus_scheduler_event_failed means something
> specific. Does anyone know what that last argument means? I looked quickly
> here
> (http://www.globus.org/toolkit/docs/development/3.9.4/execution/wsgram/developer/scheduler-tutorial-seg.html)
> but didn't see it.
>
> Thanks,
> Adam
>
> On Wed, Apr 2, 2008 at 9:54 AM, Stuart Martin <[EMAIL PROTECTED]> wrote:
>
> > Adam,
> >
> > If we are dealing with a tmp dir, a cluster, and many jobs/lots of
> > activity on the cluster, this could be a problem with the shared file
> > system. Maybe occasionally, the compute host where the job is run cannot
> > see the tmp dir / executable?
> >
> > -Stu
> >
> > On Apr 1, 2008, at Apr 1, 11:41 PM, [EMAIL PROTECTED] wrote:
> >
> > > Hi Adam,
> > >
> > > i can't say much about it right now, but at first glance it looks
> > > to me that the application causes the problem.
> > > I have to add that i don't know BOINC and probably i didn't understand
> > > all details.
> > >
> > > Can you describe the role of hmmpfam a bit more:
> > > * this is not the main executable, right?
> > > * it is called by the main executable somehow under certain
> > >   conditions?
> > > * if so: you said that hmmpfam actually should not be used at all.
> > >   in what situation could the executable call hmmpfam (i love
> > >   that word ... :-) ).
> > >
> > > Martin
> > >
> > > > Hi,
> > > >
> > > > We are experiencing a strange problem that is causing jobs to fail,
> > > > albeit somewhat randomly and infrequently. At any given time, we may
> > > > have 100 active GRAM jobs on a given resource. All of these jobs
> > > > submit fine and there are usually no immediate failures. However,
> > > > every so often, one will fail, and this can be days after it was
> > > > submitted, with the following error:
> > > >
> > > > [EMAIL PROTECTED]:/export/grid_files/260600020.09477316738932795> globusrun-ws -status -j jobEPR.txt
> > > > Current job state: Failed
> > > > globusrun-ws: Job failed: Invalid executable path
> > > > "/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".
> > > > ProcessDied
> > > >
> > > > Now, the resource we are submitting to is unique in that we are not
> > > > actually transferring in the hmmpfam executable; that is just a dummy
> > > > path, and our custom BOINC job manager does not attempt to make use
> > > > of it, as BOINC executables live elsewhere. So until recently, the
> > > > executable specified on that path never existed, and so the error
> > > > *kinda* made sense; what didn't make sense is why it happened
> > > > randomly.
> > > > In an attempt to make this problem go away, I now have the BOINC job
> > > > manager create a dummy executable on that path when the job is
> > > > submitted, but it doesn't look like that has helped because the error
> > > > is still popping up. *Now* the error message certainly doesn't make
> > > > sense if taken at face value, because that has been a "valid path",
> > > > technically speaking, for the lifetime of the job in question -- yet
> > > > the job still failed. If it helps, I'll attach debug output below,
> > > > though I wasn't able to glean any additional information from it.
> > > > Does anyone have a guess as to why this would happen so randomly and
> > > > infrequently, or happen in the first place?
> > > >
> > > > This one is costing us big time because when a job fails, Globus
> > > > deletes all the output collected thus far, and these are large
> > > > batches of work.
> > > >
> > > > Thanks!
> > > > Adam
> > > >
> > > > [EMAIL PROTECTED]:/export/grid_files/260600020.09477316738932795> globusrun-ws -debug -status -j jobEPR.txt
> > > >
> > > > === REQUEST MESSAGE (length 816) (time 1206627634.506184000) ===
> > > > <ns00:Envelope xmlns:ns00="http://schemas.xmlsoap.org/soap/envelope/">
> > > > <ns00:Header></ns00:Header>
> > > > <ns00:Body>
> > > > <ns01:GetMultipleResourceProperties xmlns:ns01="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd">
> > > > <ns01:ResourceProperty xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/types">ns02:state</ns01:ResourceProperty>
> > > > <ns01:ResourceProperty xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/types">ns02:holding</ns01:ResourceProperty>
> > > > <ns01:ResourceProperty xmlns:ns03="http://www.globus.org/namespaces/2004/10/gram/job/faults">ns03:fault</ns01:ResourceProperty>
> > > > <ns01:ResourceProperty xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/types">ns02:exitCode</ns01:ResourceProperty>
> > > > </ns01:GetMultipleResourceProperties>
> > > > </ns00:Body>
> > > > </ns00:Envelope>
> > > > ----------------------------------------------
> > > >
> > > > === RESPONSE MESSAGE (length 6399) (time 1206627634.546965000) ===
> > > > <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
> > > > <soapenv:Header>
> > > > <wsa:MessageID soapenv:mustUnderstand="0">uuid:f6d14500-fc08-11dc-a90a-8a2384ad991f</wsa:MessageID>
> > > > <wsa:To soapenv:mustUnderstand="0">http://schemas.xmlsoap.org/ws/2004/03/addressing/role/anonymous</wsa:To>
> > > > <wsa:Action soapenv:mustUnderstand="0">http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties/GetMultipleResourcePropertiesResponse</wsa:Action>
> > > > <wsa:From soapenv:mustUnderstand="0" xmlns:ns4="http://www.globus.org/namespaces/2004/10/gram/job">
> > > > <wsa:Address>https://128.8.120.35:8443/wsrf/services/ManagedExecutableJobService</wsa:Address>
> > > > <wsa:ReferenceProperties><ns4:ResourceID xmlns:ns4="http://www.globus.org/namespaces/2004/10/gram/job">a1253c20-fa72-11dc-a908-8a2384ad991f</ns4:ResourceID></wsa:ReferenceProperties>
> > > > </wsa:From>
> > > > <wsa:RelatesTo RelationshipType="wsa:Reply" soapenv:mustUnderstand="0">uuid:f6d04560-fc08-11dc-b962-000f1f66888a</wsa:RelatesTo>
> > > > </soapenv:Header>
> > > > <soapenv:Body>
> > > > <GetMultipleResourcePropertiesResponse xmlns="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd">
> > > > <ns1:state xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job/types">Failed</ns1:state>
> > > > <ns2:holding xmlns:ns2="http://www.globus.org/namespaces/2004/10/gram/job/types">false</ns2:holding>
> > > > <ns3:fault xmlns:ns3="http://www.globus.org/namespaces/2004/10/gram/job/faults">
> > > > <ns3:invalidPathFault>
> > > > <ns4:Timestamp xmlns:ns4="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">2008-03-27T06:09:37.203Z</ns4:Timestamp>
> > > > <ns5:Originator xmlns:ns5="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">
> > > > <wsa:Address>https://128.8.120.35:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
> > > > <wsa:ReferenceProperties><ns6:ResourceID xmlns:ns6="http://www.globus.org/namespaces/2004/10/gram/job">a1253c20-fa72-11dc-a908-8a2384ad991f</ns6:ResourceID></wsa:ReferenceProperties>
> > > > <wsa:ReferenceParameters/>
> > > > </ns5:Originator>
> > > > <ns7:Description xmlns:ns7="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">Invalid executable path "/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".</ns7:Description>
> > > > <ns8:FaultCause xmlns:ns8="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">
> > > > <ns8:Timestamp>2008-03-27T06:09:37.203Z</ns8:Timestamp>
> > > > <ns8:ErrorCode dialect="http://www.globus.org/fault/stacktrace">
> > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > >     at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
> > > >     at java.lang.Class.newInstance0(Class.java:350)
> > > >     at java.lang.Class.newInstance(Class.java:303)
> > > >     at org.globus.exec.utils.FaultUtils.makeFault(FaultUtils.java:485)
> > > >     at org.globus.exec.utils.FaultUtils.createInvalidPathFault(FaultUtils.java:129)
> > > >     at org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(StateMachine.java:3184)
> > > >     at org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState(StateMachine.java:1652)
> > > >     at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown Source)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >     at java.lang.reflect.Method.invoke(Method.java:585)
> > > >     at org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:328)
> > > >     at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> > > > </ns8:ErrorCode>
> > > > <ns8:Description>org.globus.exec.generated.InvalidPathFaultType</ns8:Description>
> > > > </ns8:FaultCause>
> > > > <ns9:FaultCause xmlns:ns9="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">
> > > > <ns9:Timestamp>2008-03-27T06:09:37.203Z</ns9:Timestamp>
> > > > <ns9:Description>ProcessDied</ns9:Description>
> > > > <ns9:FaultCause>
> > > > <ns9:Timestamp>2008-03-27T06:09:37.207Z</ns9:Timestamp>
> > > > <ns9:ErrorCode dialect="http://www.globus.org/fault/stacktrace">java.lang.Exception: ProcessDied
> > > >     at org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(StateMachine.java:3127)
> > > >     at org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState(StateMachine.java:1652)
> > > >     at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown Source)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >     at java.lang.reflect.Method.invoke(Method.java:585)
> > > >     at org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:328)
> > > >     at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> > > > </ns9:ErrorCode>
> > > > <ns9:Description>java.lang.Exception</ns9:Description>
> > > > </ns9:FaultCause>
> > > > <ns9:FaultCause>
> > > > <ns9:Timestamp>2008-03-27T06:09:37.207Z</ns9:Timestamp>
> > > > <ns9:ErrorCode dialect="http://www.globus.org/fault/stacktrace">
> > > >     at org.globus.wsrf.utils.FaultHelper.toBaseFault(FaultHelper.java:282)
> > > >     at org.globus.exec.utils.FaultUtils.makeFault(FaultUtils.java:505)
> > > >     at org.globus.exec.utils.FaultUtils.createInvalidPathFault(FaultUtils.java:129)
> > > >     at org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(StateMachine.java:3184)
> > > >     at org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState(StateMachine.java:1652)
> > > >     at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown Source)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >     at java.lang.reflect.Method.invoke(Method.java:585)
> > > >     at org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:328)
> > > >     at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> > > > </ns9:ErrorCode>
> > > > <ns9:Description>org.oasis.wsrf.faults.BaseFaultType</ns9:Description>
> > > > </ns9:FaultCause>
> > > > </ns9:FaultCause>
> > > > <ns3:stateWhenFailureOccurred>Active</ns3:stateWhenFailureOccurred>
> > > > <ns3:command>submit</ns3:command>
> > > > <ns3:gt2ErrorCode>5</ns3:gt2ErrorCode>
> > > > <ns3:attribute>executable</ns3:attribute>
> > > > <ns3:path>/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam</ns3:path>
> > > > </ns3:invalidPathFault>
> > > > </ns3:fault>
> > > > <ns10:exitCode xmlns:ns10="http://www.globus.org/namespaces/2004/10/gram/job/types">5</ns10:exitCode>
> > > > </GetMultipleResourcePropertiesResponse>
> > > > </soapenv:Body>
> > > > </soapenv:Envelope>
> > > > ----------------------------------------------
> > > >
> > > > Current job state: Failed
> > > > globusrun-ws: Job failed: Invalid executable path
> > > > "/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".
> > > > ProcessDied
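
On the question of what the last argument to globus_scheduler_event_failed
means: the debug output in this thread reports gt2ErrorCode 5 together with
an invalidPathFault on the "executable" attribute, which suggests (an
inference from this thread, not a confirmed fact) that the value passed by
the SEG comes back as a GT2 GRAM error code. Below is a minimal C sketch of
the failure branch with the bare 5 replaced by a named constant and the
timestamp assigned before use (as quoted, ts is passed without being set).
The names SEG_FAILURE_CODE_FROM_THREAD and report_failed are hypothetical,
and globus_scheduler_event_failed is stubbed here only so the sketch compiles
without the toolkit; its return type in the stub is an assumption. Which code
is the right generic one should be checked against the GRAM error code list
shipped with the installed toolkit.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical name for the value the thread's code passes today. The debug
 * output in the thread reports it back as gt2ErrorCode 5. */
enum { SEG_FAILURE_CODE_FROM_THREAD = 5 };

/* Stand-in for the real SEG call so this sketch builds without Globus.
 * The (timestamp, job id, failure code) order follows the call quoted in
 * the thread; the int return type is an assumption made for this stub. */
static int globus_scheduler_event_failed(time_t timestamp, const char *jobid,
                                         int failure_code)
{
    (void)timestamp;
    return (jobid != NULL && failure_code > 0) ? 0 : -1;
}

/* Report a failed work unit and log which failure code was sent. */
static int report_failed(FILE *fp, const char *wu_name, int failure_code)
{
    assert(fp != NULL);
    assert(wu_name != NULL);

    time_t ts = time(NULL);          /* assign the timestamp before use */
    struct tm *tm_now = localtime(&ts);
    char when[32];

    if (tm_now == NULL ||
        strftime(when, sizeof when, "%Y-%m-%d %H:%M:%S", tm_now) == 0) {
        snprintf(when, sizeof when, "unknown time");
    }

    int rc = globus_scheduler_event_failed(ts, wu_name, failure_code);

    /* Logging the code makes it possible to match this event against the
     * fault that globusrun-ws later prints. */
    fprintf(fp, "%s : Set %s to failed (failure code %d)\n",
            when, wu_name, failure_code);
    return rc;
}
```

Inside the SEG's failure branch the call would then read
report_failed(fp, wu_name, SEG_FAILURE_CODE_FROM_THREAD), with the constant
swapped for whichever code turns out to describe the failure best.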
