grid1 submit -> grid2 GRAM server -> grid4 execute
$ globus-job-submit grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)

$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)
I don't get the job ID back, so I cannot trace it with globus-job-status, although I can see that it polls until the job's execute node becomes available and then deletes all files.
Although the error message sounds as if the request never reached the server or the execute node, the job does reach grid2: I can see it in the gatekeeper and jobmanager logs. The jobmanager log shows it was sent to grid4 for execution, but nothing came back. Finally, the Condor StartLog shows that the job executed successfully on grid4.
On grid1, I see a GRAM log for this request for a very short time before it is cleaned up. It leaves no file in /home/yoichi/.globus/job/grid2.ramscommunity.org, either. So I can't tell what's going on.
If I run globus-job-run, the GRAM log remains, so the attached GRAM log is from it, not from globus-job-submit. It may show the same error that globus-job-submit hits, or a different one, since their behaviours seem to differ.
Any suggestions? The last option I can try is to install the Globus Toolkit on the Condor execute node, which I believe should not be necessary.
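One thing I have not ruled out (a diagnostic sketch, not a known fix): if GLOBUS_TCP_PORT_RANGE is not exported in the shell that runs the client commands, the client's GASS listeners pick arbitrary ephemeral ports. Both variable names below are standard Globus environment variables; the 40000,41000 range is the one my firewalls already allow (see DETAILS below).

```shell
# Pin the client-side ports on grid1 so that GASS/callback listeners and
# outbound source ports fall inside the range the firewalls permit.
export GLOBUS_TCP_PORT_RANGE=40000,41000    # inbound listeners (GASS server, callbacks)
export GLOBUS_TCP_SOURCE_RANGE=40000,41000  # source ports for outbound connections
echo "client port range: $GLOBUS_TCP_PORT_RANGE"
```

These need to be set in the same shell (or login environment) that invokes globus-job-run / globus-job-submit, not only in the server-side xinetd config.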
Thanks,
Yoichi

------------------------------------------------------------------------------------------------------------------------
DETAILS
------------------------------------------------------------------------------------------------------------------------

Firewall

Firewall on grid1:
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7512 (myproxy-server) - should not be open
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:pcsync-https (8443)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:pcsync-https (8443)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:40000:41000
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:webcache (8080)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:webcache (8080)
Firewall on grid2:
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:myproxy-server (7512)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:pcsync-https (8443)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:pcsync-https (8443)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:40000:41000
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsigatekeeper (2119)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsigatekeeper (2119)

Firewall on grid4: (ports newly opened)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:40000:41000
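One detail worth flagging against these rules: the GRAM log further down shows the client's GASS URL on port 36421 and the callback contact on port 56799, both outside the 40000:41000 range opened here. If that is what is happening, the jobmanager's connections back to grid1 would be dropped. A quick hedged check (the two port numbers are taken from the logs in this message, and the range from the rules above):

```shell
# Report whether a given port falls inside the firewall-permitted range.
in_range() {
    [ "$1" -ge 40000 ] && [ "$1" -le 41000 ] \
        && echo "port $1: allowed" \
        || echo "port $1: BLOCKED by 40000:41000 rules"
}
in_range 36421   # client GASS server port, from the Pre-parsed RSL
in_range 56799   # callback contact port, from the GRAM log
```

If both report BLOCKED, that would support the theory that the client side is not honouring GLOBUS_TCP_PORT_RANGE.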
------------------------------------------------------------------------------------------------------------------------
GridFTP

xinetd configuration on grid4: (newly created)

$ vi /etc/xinetd.d/gridftp
service gsiftp
{
    instances       = 100
    socket_type     = stream
    wait            = no
    user            = root
    env            += GLOBUS_LOCATION=/usr/local/globus
    env            += LD_LIBRARY_PATH=/usr/local/globus/lib
    env            += GLOBUS_TCP_PORT_RANGE=40000,41000
    server          = /usr/local/globus/sbin/globus-gridftp-server
    server_args     = -i
    log_on_success += DURATION
    disable         = no
}

------------------------------------------------------------------------------------------------------------------------
Gatekeeper log

On grid2:

$ vi /usr/local/globus/var/globus-gatekeeper.log
...
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 6: globus-gatekeeper pid=9827 starting at Tue Oct 21 13:13:04 2008
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 6: Got connection 137.111.246.175 at Tue Oct 21 13:13:04 2008
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 5: Requested service: jobmanager-condor
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 5: Authorized as local user: yoichi
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 5: Authorized as local uid: 500
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 5: and local gid: 500
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 0: executing /usr/local/globus/libexec/globus-job-manager
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Tue Oct 21 13:13:04 2008 PID: 9827 -- Notice: 0: Child 9828 started

------------------------------------------------------------------------------------------------------------------------
jobmanager log

on grid2: (this is actually from globus-job-submit earlier; somehow globus-job-run does not make any entry)
$ vi /usr/local/globus/var/globus-condor.log
<c>
    <a n="MyType"><s>SubmitEvent</s></a>
    <a n="EventTypeNumber"><i>0</i></a>
    <a n="MyType"><s>SubmitEvent</s></a>
    <a n="EventTime"><s>2008-10-21T13:02:12</s></a>
    <a n="Cluster"><i>40</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="SubmitHost"><s><137.111.246.176:9646></s></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTime"><s>2008-10-21T13:02:15</s></a>
    <a n="Cluster"><i>40</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s><137.111.246.250:9649></s></a>
</c>
<c>
    <a n="MyType"><s>JobTerminatedEvent</s></a>
    <a n="EventTypeNumber"><i>5</i></a>
    <a n="MyType"><s>JobTerminatedEvent</s></a>
    <a n="EventTime"><s>2008-10-21T13:02:15</s></a>
    <a n="Cluster"><i>40</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="TerminatedNormally"><b v="t"/></a>
    <a n="ReturnValue"><i>0</i></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>0.000000000000000E+00</r></a>
    <a n="ReceivedBytes"><r>0.000000000000000E+00</r></a>
    <a n="TotalSentBytes"><r>0.000000000000000E+00</r></a>
    <a n="TotalReceivedBytes"><r>0.000000000000000E+00</r></a>
</c>

------------------------------------------------------------------------------------------------------------------------
Condor log

on grid4: (the job from globus-job-submit seems to have executed successfully, but globus-job-run does not appear)
$ cat StartLog
...
10/21 13:02:12 match_info called
10/21 13:02:12 Received match <137.111.246.250:9649>#1224479079#12#...
10/21 13:02:12 State change: match notification protocol successful
10/21 13:02:12 Changing state: Unclaimed -> Matched
10/21 13:02:12 Request accepted.
10/21 13:02:12 Remote owner is [EMAIL PROTECTED]
10/21 13:02:12 State change: claiming protocol successful
10/21 13:02:12 Changing state: Matched -> Claimed
10/21 13:02:15 Got activate_claim request from shadow (<137.111.246.176:9657>)
10/21 13:02:15 Remote job ID is 40.0
10/21 13:02:15 Got universe "VANILLA" (5) from request classad
10/21 13:02:15 State change: claim-activation protocol successful
10/21 13:02:15 Changing activity: Idle -> Busy
10/21 13:02:15 Called deactivate_claim_forcibly()
10/21 13:02:15 Starter pid 29441 exited with status 0
10/21 13:02:15 State change: starter exited
10/21 13:02:15 Changing activity: Busy -> Idle
10/21 13:02:15 State change: received RELEASE_CLAIM command
10/21 13:02:15 Changing state and activity: Claimed/Idle -> Preempting/Vacating
10/21 13:02:15 State change: No preempting claim, returning to owner
10/21 13:02:15 Changing state and activity: Preempting/Vacating -> Owner/Idle
10/21 13:02:15 State change: IS_OWNER is false
10/21 13:02:15 Changing state: Owner -> Unclaimed
10/21 13:15:20 State change: RunBenchmarks is TRUE
10/21 13:15:20 Changing activity: Idle -> Benchmarking
10/21 13:15:26 State change: benchmarks completed
10/21 13:15:26 Changing activity: Benchmarking -> Idle

The StarterLog also shows that globus-job-submit ran successfully. There is no sign of globus-job-run.
$ cat StarterLog
...
10/21 13:02:15 Using config source: /nfs/software/condor/7.0.4/etc/condor_config
10/21 13:02:15 Using local config sources:
10/21 13:02:15    /scratch/condor/condor_config.local
10/21 13:02:15 DaemonCore: Command Socket at <137.111.246.250:9622>
10/21 13:02:15 Done setting resource limits
10/21 13:02:15 Communicating with shadow <137.111.246.176:9645>
10/21 13:02:15 Submitting machine is "grid2.ramscommunity.org"
10/21 13:02:15 setting the orig job name in starter
10/21 13:02:15 setting the orig job iwd in starter
10/21 13:02:15 Job 40.0 set to execute immediately
10/21 13:02:15 Starting a VANILLA universe job with ID: 40.0
10/21 13:02:15 IWD: /home/yoichi
10/21 13:02:15 Output file: /home/yoichi/.globus/job/grid2.ramscommunity.org/9796.1224554532/stdout
10/21 13:02:15 Error file: /home/yoichi/.globus/job/grid2.ramscommunity.org/9796.1224554532/stderr
10/21 13:02:15 About to exec /bin/hostname
10/21 13:02:15 Create_Process succeeded, pid=29442
10/21 13:02:15 Process exited, pid=29442, status=0
10/21 13:02:15 Got SIGQUIT. Performing fast shutdown.
10/21 13:02:15 ShutdownFast all jobs.

------------------------------------------------------------------------------------------------------------------------
GRAM log

(This is from globus-job-run, not from globus-job-submit. globus-job-submit polls condor until it becomes available, but deletes the files immediately after condor has executed the request, so it is impossible to see the GRAM log or output file for it.)
On grid1: gram log

$ cat gram_job_mgr_9828.log
10/21 13:13:04 JM: TARGET_GLOBUS_LOCATION = /usr/local/globus
10/21 13:13:04 JM: Security context imported
10/21 13:13:04 JM: Adding new callback contact (url=https://grid1.ramscommunity.org:56799/, mask=1048575)
10/21 13:13:04 JM: Added successfully
10/21 13:13:04 Pre-parsed RSL string: &("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
10/21 13:13:04 <<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL
10/21 13:13:04 <<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL (canonical)
10/21 13:13:04 JM: Evaluating RSL Value
10/21 13:13:04 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL
10/21 13:13:04 JM: Evaluating RSL Value
10/21 13:13:04 JM: Evaluated RSL Value to https://grid1.ramscommunity.org:36421
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
10/21 13:13:04 <<<<<Job RSL
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL
10/21 13:13:04 <<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = "https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" = "https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL (post-eval)
Adding default RSL of proxy_timeout = 60
Adding default RSL of dry_run = no
Adding default RSL of gram_my_job = collective
Adding default RSL of job_type = multiple
Adding default RSL of count = 1
Adding default RSL of stdin = /dev/null
Adding default RSL of directory = $(HOME)
10/21 13:13:04 <<<<<Job RSL (post-validation)
&("directory" = $("HOME") )("stdin" = "/dev/null" )("count" = "1" )("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = "https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" = "https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation)
10/21 13:13:04 <<<<<Job RSL (post-validation-eval)
&("directory" = "/home/yoichi" )("stdin" = "/dev/null" )("count" = "1" )("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = "https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" = "https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation-eval)
10/21 13:13:04 JMI: Getting RSL output value
10/21 13:13:04 JMI: Processing output positions
10/21 13:13:04 JMI: Getting RSL output value
10/21 13:13:04 JMI: Processing output positions
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
10/21 13:13:04 JM: Opening output destinations
10/21 13:13:04 JM: stdout goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184/stdout
10/21 13:13:04 JM: stderr goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184/stderr
10/21 13:13:04 JM: Opening https://grid1.ramscommunity.org:36421/dev/stdout
10/21 13:13:04 JM: Opened GASS handle 1.
10/21 13:13:04 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/21 13:13:04 JM: Opening https://grid1.ramscommunity.org:36421/dev/stderr
10/21 13:13:04 JM: Opened GASS handle 2.
10/21 13:13:04 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/21 13:13:04 stdout or stderr is being used, starting to poll
10/21 13:13:04 JM: Finished opening output destinations
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
10/21 13:13:04 JMI: testing job manager scripts for type condor exist and permissions are ok.
10/21 13:13:04 JMI: completed script validation: job manager type is condor.
10/21 13:13:04 JMI: cmd = cache_cleanup
Tue Oct 21 13:13:04 2008 JM_SCRIPT: New Perl JobManager created.
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: cache_cleanup(enter)
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Cleaning files in job dir /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Removed 3 files from /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: cache_cleanup(exit)
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
10/21 13:13:04 JM: before sending to client: rc=0 (Success)
10/21 13:13:04 Job Manager State Machine (exiting): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/21 13:13:04 JM: in globus_gram_job_manager_reporting_file_remove()
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/21 13:13:04 JM: in globus_gram_job_manager_reporting_file_remove()
10/21 13:13:04 JM: exiting globus_gram_job_manager.

--------------------------------------------------------------------------
Yoichi Takayama, PhD
Senior Research Fellow, RAMP Project
MELCOE (Macquarie E-Learning Centre of Excellence)
MACQUARIE UNIVERSITY
Phone: +61 (0)2 9850 9073
Fax: +61 (0)2 9850 6527
www.mq.edu.au
www.melcoe.mq.edu.au/projects/RAMP/
--------------------------------------------------------------------------
MACQUARIE UNIVERSITY: CRICOS Provider No 00002J
On 21/10/2008, at 2:18 AM, Charles Bacon wrote:
I don't know what's wrong. The error 155 in the gram log you show suggests that it was unable to transfer the output back to the client, but I don't know why it's showing up as an error 10 in the client instead of the error 155 I see in the logs on the server side. It seems possible that you've got a firewall that's preventing the jobmanager from contacting the client. You could test that theory by using globus-job-submit instead of globus-job-run, then running globus-job-status to see the results. That method shouldn't involve callbacks. If that works, you could then try globus-job-get-output to retrieve the results.

Charles
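Charles's callback-free test can be sketched as follows (a sketch only: it assumes the pre-WS GRAM client tools are on PATH on grid1, and the contact string is whatever globus-job-submit prints):

```shell
# Sketch of the suggested test. globus-job-submit prints a job contact URL
# and returns; unlike globus-job-run it does not stream output back, so no
# GASS callback to the client is needed until the output is fetched:
#
#   CONTACT=$(globus-job-submit grid2.ramscommunity.org/jobmanager-condor /bin/hostname)
#   globus-job-status "$CONTACT"        # poll until DONE (or FAILED)
#   globus-job-get-output "$CONTACT"    # fetch stdout once DONE
#
# In the report above, globus-job-submit itself fails with error 10 before
# printing a contact, so the poll/fetch steps never become possible.
```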