Correction.

The new GridFTP ports were opened on grid4, not grid1 (where they had already been open).

Thanks,
Yoichi

--------------------------------------------------------------------------
Yoichi Takayama, PhD
Senior Research Fellow
RAMP Project
MELCOE (Macquarie E-Learning Centre of Excellence)
MACQUARIE UNIVERSITY

Phone: +61 (0)2 9850 9073
Fax: +61 (0)2 9850 6527
www.mq.edu.au
www.melcoe.mq.edu.au/projects/RAMP/
--------------------------------------------------------------------------
MACQUARIE UNIVERSITY: CRICOS Provider No 00002J

This message is intended for the addressee named and may contain confidential information. If you are not the intended recipient, please delete it and notify the sender. Views expressed in this message are those of the individual sender, and are not necessarily the views of Macquarie E-Learning Centre Of Excellence (MELCOE) or Macquarie University.

Begin forwarded message:

From: Yoichi Takayama <[EMAIL PROTECTED]>
Date: 21 October 2008 2:10:37 PM
To: Charles Bacon <[EMAIL PROTECTED]>
Cc: gt-user@globus.org
Subject: [gt-user] GRAM2 test (3)

After opening GridFTP ports on grid1, I still get the same error message with either globus-job-submit or globus-job-run.

grid1 submit -> grid2 GRAM server -> grid4 execute


$ globus-job-submit grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)


$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)



I don't get a job ID back, so I cannot trace the job with globus-job-status, although I can see that it polls until the execute node becomes available and then deletes all files.

Although the error message sounds as though the request never reached the server or the execute node, the job does go to grid2, because I can see it in the gatekeeper and jobmanager logs. As the jobmanager log shows, it was sent to grid4 for execution, but nothing came back. Finally, I can see in the Condor StartLog that the job executed successfully on grid4.

On grid1, I see a gram log for this request for a very short time before it is cleaned up. It leaves no file in /home/yoichi/.globus/job/grid2.ramscommunity.org, either, so I can't tell what's going on.

If I run globus-job-run, the gram log remains, so the attached gram log is from it, not from globus-job-submit. It may be showing the same error that globus-job-submit hits, or a different one, since their behaviours seem to differ.

Any suggestions?

The last option I can try is to install the Globus Toolkit on the Condor execute node, which I had understood to be unnecessary.

Thanks,
Yoichi



------------------------------------------------------------------------------------------------------------------------
DETAILS
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
Firewall

Firewall on grid1:

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7512 (myproxy-server) - should not be open
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:pcsync-https (8443)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:pcsync-https (8443)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:40000:41000
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:webcache (8080)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:webcache (8080)

Firewall on grid2

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:myproxy-server (7512)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:pcsync-https (8443)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:pcsync-https (8443)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:40000:41000
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsigatekeeper (2119)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsigatekeeper (2119)


Firewall on grid4: (ports newly opened)

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:40000:41000
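As a quick sanity check of these rules, one can probe whether the GridFTP control port is actually reachable from the submit host before involving GRAM at all. A minimal sketch (the helper name and the example hostname are mine, not part of any Globus tool):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# e.g. from grid1, probe the GridFTP control port on grid4:
# port_open("grid4.ramscommunity.org", 2811)
```

Note this only verifies the fixed control port (2811); the 40000:41000 data range is used for dynamically allocated data channels, so a single probe there proves little.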


------------------------------------------------------------------------------------------------------------------------
GridFTP xinetd configuration on grid4: (newly created)

$ vi /etc/xinetd.d/gridftp

service gsiftp
{
instances               = 100
socket_type             = stream
wait                    = no
user                    = root
env                     += GLOBUS_LOCATION=/usr/local/globus
env                     += LD_LIBRARY_PATH=/usr/local/globus/lib
env                     += GLOBUS_TCP_PORT_RANGE=40000,41000
server                  = /usr/local/globus/sbin/globus-gridftp-server
server_args             = -i
log_on_success          += DURATION
disable                 = no
}
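One easy mistake here is a GLOBUS_TCP_PORT_RANGE that disagrees with the range the firewall actually opens. A small parser sketch (purely illustrative; the config text is inlined rather than read from /etc/xinetd.d) can pull the env settings out and compare:

```python
# Sanity-check sketch: extract 'env += NAME=VALUE' settings from the
# xinetd stanza above and confirm the port range matches 40000:41000.
CONFIG = """
service gsiftp
{
env += GLOBUS_LOCATION=/usr/local/globus
env += LD_LIBRARY_PATH=/usr/local/globus/lib
env += GLOBUS_TCP_PORT_RANGE=40000,41000
server = /usr/local/globus/sbin/globus-gridftp-server
}
"""

def env_settings(config_text):
    """Collect NAME=VALUE pairs from 'env += NAME=VALUE' lines."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("env"):
            _, _, assignment = line.partition("+=")
            name, _, value = assignment.strip().partition("=")
            settings[name] = value
    return settings

env = env_settings(CONFIG)
low, high = map(int, env["GLOBUS_TCP_PORT_RANGE"].split(","))
assert (low, high) == (40000, 41000), "port range disagrees with firewall"
```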

------------------------------------------------------------------------------------------------------------------------
Gatekeeper log

On grid2:

$ vi /usr/local/globus/var/globus-gatekeeper.log
...
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 6: globus-gatekeeper pid=9827 starting at Tue Oct 21 13:13:04 2008

TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 6: Got connection 137.111.246.175 at Tue Oct 21 13:13:04 2008

TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Requested service: jobmanager-condor
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Authorized as local user: yoichi
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Authorized as local uid: 500
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5:           and local gid: 500
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: executing /usr/local/globus/libexec/globus-job-manager
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: Child 9828 started


------------------------------------------------------------------------------------------------------------------------
jobmanager log

on grid2 (this is actually from the earlier globus-job-submit; somehow globus-job-run does not make any entry):

$ vi /usr/local/globus/var/globus-condor.log

<c>
   <a n="MyType"><s>SubmitEvent</s></a>
   <a n="EventTypeNumber"><i>0</i></a>
   <a n="MyType"><s>SubmitEvent</s></a>
   <a n="EventTime"><s>2008-10-21T13:02:12</s></a>
   <a n="Cluster"><i>40</i></a>
   <a n="Proc"><i>0</i></a>
   <a n="Subproc"><i>0</i></a>
   <a n="SubmitHost"><s>&lt;137.111.246.176:9646&gt;</s></a>
</c>
<c>
   <a n="MyType"><s>ExecuteEvent</s></a>
   <a n="EventTypeNumber"><i>1</i></a>
   <a n="MyType"><s>ExecuteEvent</s></a>
   <a n="EventTime"><s>2008-10-21T13:02:15</s></a>
   <a n="Cluster"><i>40</i></a>
   <a n="Proc"><i>0</i></a>
   <a n="Subproc"><i>0</i></a>
   <a n="ExecuteHost"><s>&lt;137.111.246.250:9649&gt;</s></a>
</c>
<c>
   <a n="MyType"><s>JobTerminatedEvent</s></a>
   <a n="EventTypeNumber"><i>5</i></a>
   <a n="MyType"><s>JobTerminatedEvent</s></a>
   <a n="EventTime"><s>2008-10-21T13:02:15</s></a>
   <a n="Cluster"><i>40</i></a>
   <a n="Proc"><i>0</i></a>
   <a n="Subproc"><i>0</i></a>
   <a n="TerminatedNormally"><b v="t"/></a>
   <a n="ReturnValue"><i>0</i></a>
   <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
   <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
   <a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
   <a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
   <a n="SentBytes"><r>0.000000000000000E+00</r></a>
   <a n="ReceivedBytes"><r>0.000000000000000E+00</r></a>
   <a n="TotalSentBytes"><r>0.000000000000000E+00</r></a>
   <a n="TotalReceivedBytes"><r>0.000000000000000E+00</r></a>
</c>



------------------------------------------------------------------------------------------------------------------------
Condor log

on grid4 (the globus-job-submit job seems to have executed successfully, but globus-job-run does not appear):

$ cat StartLog
...
10/21 13:02:12 match_info called
10/21 13:02:12 Received match <137.111.246.250:9649>#1224479079#12#...
10/21 13:02:12 State change: match notification protocol successful
10/21 13:02:12 Changing state: Unclaimed -> Matched
10/21 13:02:12 Request accepted.
10/21 13:02:12 Remote owner is [EMAIL PROTECTED]
10/21 13:02:12 State change: claiming protocol successful
10/21 13:02:12 Changing state: Matched -> Claimed
10/21 13:02:15 Got activate_claim request from shadow (<137.111.246.176:9657>)
10/21 13:02:15 Remote job ID is 40.0
10/21 13:02:15 Got universe "VANILLA" (5) from request classad
10/21 13:02:15 State change: claim-activation protocol successful
10/21 13:02:15 Changing activity: Idle -> Busy
10/21 13:02:15 Called deactivate_claim_forcibly()
10/21 13:02:15 Starter pid 29441 exited with status 0
10/21 13:02:15 State change: starter exited
10/21 13:02:15 Changing activity: Busy -> Idle
10/21 13:02:15 State change: received RELEASE_CLAIM command
10/21 13:02:15 Changing state and activity: Claimed/Idle -> Preempting/Vacating
10/21 13:02:15 State change: No preempting claim, returning to owner
10/21 13:02:15 Changing state and activity: Preempting/Vacating -> Owner/Idle
10/21 13:02:15 State change: IS_OWNER is false
10/21 13:02:15 Changing state: Owner -> Unclaimed

10/21 13:15:20 State change: RunBenchmarks is TRUE
10/21 13:15:20 Changing activity: Idle -> Benchmarking
10/21 13:15:26 State change: benchmarks completed
10/21 13:15:26 Changing activity: Benchmarking -> Idle


The StarterLog also shows that the globus-job-submit job ran successfully. There is no sign of globus-job-run.

$ cat StarterLog
...
10/21 13:02:15 Using config source: /nfs/software/condor/7.0.4/etc/condor_config
10/21 13:02:15 Using local config sources:
10/21 13:02:15    /scratch/condor/condor_config.local
10/21 13:02:15 DaemonCore: Command Socket at <137.111.246.250:9622>
10/21 13:02:15 Done setting resource limits
10/21 13:02:15 Communicating with shadow <137.111.246.176:9645>
10/21 13:02:15 Submitting machine is "grid2.ramscommunity.org"
10/21 13:02:15 setting the orig job name in starter
10/21 13:02:15 setting the orig job iwd in starter
10/21 13:02:15 Job 40.0 set to execute immediately
10/21 13:02:15 Starting a VANILLA universe job with ID: 40.0
10/21 13:02:15 IWD: /home/yoichi
10/21 13:02:15 Output file: /home/yoichi/.globus/job/grid2.ramscommunity.org/9796.1224554532/stdout
10/21 13:02:15 Error file: /home/yoichi/.globus/job/grid2.ramscommunity.org/9796.1224554532/stderr
10/21 13:02:15 About to exec /bin/hostname
10/21 13:02:15 Create_Process succeeded, pid=29442
10/21 13:02:15 Process exited, pid=29442, status=0
10/21 13:02:15 Got SIGQUIT.  Performing fast shutdown.
10/21 13:02:15 ShutdownFast all jobs.



------------------------------------------------------------------------------------------------------------------------
GRAM log (this is from globus-job-run, not from globus-job-submit. globus-job-submit polls condor until it becomes available, but deletes the files immediately after condor executes the request, so it is impossible to see the gram log or output file for it)

On grid1: gram log

$ cat gram_job_mgr_9828.log
10/21 13:13:04 JM: TARGET_GLOBUS_LOCATION = /usr/local/globus
10/21 13:13:04 JM: Security context imported
10/21 13:13:04 JM: Adding new callback contact (url=https://grid1.ramscommunity.org:56799/ , mask=1048575)
10/21 13:13:04 JM: Added successfully
10/21 13:13:04 Pre-parsed RSL string: &("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
10/21 13:13:04
<<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL
10/21 13:13:04
<<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL (canonical)
10/21 13:13:04 JM: Evaluating RSL Value
10/21 13:13:04 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL
10/21 13:13:04 JM: Evaluating RSL Value
10/21 13:13:04 JM: Evaluated RSL Value to https://grid1.ramscommunity.org:36421
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
10/21 13:13:04
<<<<<Job RSL
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL
10/21 13:13:04
<<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = "https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" = "https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL (post-eval)
Adding default RSL of proxy_timeout = 60
Adding default RSL of dry_run = no
Adding default RSL of gram_my_job = collective
Adding default RSL of job_type = multiple
Adding default RSL of count = 1
Adding default RSL of stdin = /dev/null
Adding default RSL of directory = $(HOME)
10/21 13:13:04
<<<<<Job RSL (post-validation)
&("directory" = $("HOME") )("stdin" = "/dev/null" )("count" = "1" )("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = "https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" = "https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation)
10/21 13:13:04
<<<<<Job RSL (post-validation-eval)
&("directory" = "/home/yoichi" )("stdin" = "/dev/null" )("count" = "1" )("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr" = "https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" = "https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation-eval)
10/21 13:13:04 JMI: Getting RSL output value
10/21 13:13:04 JMI: Processing output positions
10/21 13:13:04 JMI: Getting RSL output value
10/21 13:13:04 JMI: Processing output positions
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
10/21 13:13:04 JM: Opening output destinations
10/21 13:13:04 JM: stdout goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184/stdout
10/21 13:13:04 JM: stderr goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184/stderr
10/21 13:13:04 JM: Opening https://grid1.ramscommunity.org:36421/dev/stdout
10/21 13:13:04 JM: Opened GASS handle 1.
10/21 13:13:04 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/21 13:13:04 JM: Opening https://grid1.ramscommunity.org:36421/dev/stderr
10/21 13:13:04 JM: Opened GASS handle 2.
10/21 13:13:04 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/21 13:13:04 stdout or stderr is being used, starting to poll
10/21 13:13:04 JM: Finished opening output destinations
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
10/21 13:13:04 JMI: testing job manager scripts for type condor exist and permissions are ok.
10/21 13:13:04 JMI: completed script validation: job manager type is condor.
10/21 13:13:04 JMI: cmd = cache_cleanup
Tue Oct 21 13:13:04 2008 JM_SCRIPT: New Perl JobManager created.
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: cache_cleanup(enter)
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Cleaning files in job dir /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Removed 3 files from /home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: cache_cleanup(exit)
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
10/21 13:13:04 JM: before sending to client: rc=0 (Success)
10/21 13:13:04 Job Manager State Machine (exiting): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/21 13:13:04 JM: in globus_gram_job_manager_reporting_file_remove()
10/21 13:13:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/21 13:13:04 JM: in globus_gram_job_manager_reporting_file_remove()
10/21 13:13:04 JM: exiting globus_gram_job_manager.



On 21/10/2008, at 2:18 AM, Charles Bacon wrote:

I don't know what's wrong. The error 155 in the gram log you show suggests that it was unable to transfer the output back to the client, but I don't know why it's showing up as an error 10 in the client instead of the error 155 I see in the logs on the server side. It seems possible that you've got a firewall that's preventing the jobmanager from contacting the client. You could test that theory by using globus-job-submit instead of globus-job-run, then running globus-job-status to see the results. That method shouldn't involve callbacks. If that works, you could then try globus-job-get-output to retrieve the results.


Charles

