Barring that, everything will work fine if you avoid notifications. Rather
than -monitor, one can use -status to check on the progress of a job. You
might also be able to "fix" the notification problem by using the -subject
argument to globusrun-ws to force it to expect a particular identity. This
would work in 4.2, where the globusrun-ws client automatically sets the
subject-name expectation to the endpoint it submitted to, rather than
deriving it on the fly from the source IP address of the notification.
That doesn't help you now, obviously.
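The difference between the two behaviors can be sketched roughly as follows. This is not Globus code; the function names and the IP-to-hostname mapping are hypothetical, chosen only to mirror the identities that show up in the errors below:

```python
# Sketch of two ways a client can choose which identity to expect from
# an incoming notification connection. Illustration only -- not actual
# globusrun-ws code.

def expected_subject_40x(source_ip, reverse_dns):
    # 4.0.x behavior: reverse-resolve the notification's source IP and
    # expect a host identity built from whatever name comes back.
    return "/CN=host/" + reverse_dns[source_ip]

def expected_subject_42(endpoint_host, service_subjects):
    # 4.2 behavior: expect the identity of the endpoint the job was
    # submitted to, regardless of where the notification comes from.
    return service_subjects[endpoint_host]

# Hypothetical lookup tables (the real mapping depends on your DNS and
# host certificates).
reverse_dns = {"131.225.167.18": "fnpc3x1.fnal.gov"}
service_subjects = {
    "fnpcosg1.fnal.gov":
        "/DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov",
}

print(expected_subject_40x("131.225.167.18", reverse_dns))
# -> /CN=host/fnpc3x1.fnal.gov  (mismatch: session fails to start)
print(expected_subject_42("fnpcosg1.fnal.gov", service_subjects))
# -> /DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov  (matches)
```

With -subject you are effectively hard-wiring the 4.2 answer into a 4.0.x client.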
So:
1) Using the globusrun-ws -submit -batch and globusrun-ws -monitor
combination, both with the -subject option, the job completes without
errors. With the same combination, but with streaming required as well,
we get this error:
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -subject /DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov -s -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:e6f6706e-43a3-11dd-acad-001422086c92
Termination time: 06/27/2008 17:18 GMT
Current job state: Pending
Current job state: Active
Current job state: CleanUp-Hold
uid=13160(fnalgrid) gid=9767(fnalgrid) groups=9767(fnalgrid)
Current job state: CleanUp
Current job state: Failed
Destroying job...Done.
Cleaning up any delegated credentials...Done.
globusrun-ws: Job failed: Staging error for RSL element fileCleanUp.
----------------------------
Any idea what might be causing the above error? Note
that we *do* get the standard output back somehow despite the error.
-------------------
Requiring delegation but not streaming, we do not get an error:
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -subject /DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov -J -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:64488c78-43a4-11dd-bf2f-001422086c92
Termination time: 06/27/2008 17:22 GMT
Current job state: Pending
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
--------------------------------------------------------
I note that there's the following error in container-real.log:
2008-06-26 12:28:26,294 ERROR impl.QueryAggregatorSource
[Timer-8,pollGetMultiple:149] Exception Getting Multiple Resource
Properties from
https://131.225.166.2:9443/wsrf/services/ReliableFileTransferFactoryService:
java.rmi.RemoteException: Failed to serialize resource property
[EMAIL PROTECTED];
nested exception is:
org.apache.commons.dbcp.DbcpException: java.sql.SQLException:
null, message from server: "Host 'fnpcosg1.fnal.gov' is not allowed to
connect to this MySQL server".
Is this rft-related?
2) Above you say that the globusrun-ws client will be modified in
globus 4.2 to expect the hostname to which it sent the original
query... is this part of a bigger change in 4.2 in the way that
globus deals with hostnames? Is there any chance it can be
back-ported into the 4.0.x branch--or is there anything that would
break if that happened?
Is there any way that globus could move to a model similar
to that which Condor already uses, namely to allow a server to bind
to a particular IP only and use that IP for all communication
inbound and outbound? Since it's likely that the 4.0.x set of
clients will be with us for a while, it would be good to come
up with a solution that's more generic than requiring a client update.
3) you agreed with me that moving my service ip to an alternate
network interface such as eth2 might help, but even so, it would
only help if that interface were the default route to the rest
of the world, wouldn't it?
Suppose I have system ip
131.225.81.94 / netmask 255.255.248.0 on eth0
and service ip
131.225.94.12 / netmask 255.255.254.0 on eth1
and the routing table looks like this:
[EMAIL PROTECTED] timm]# netstat -nNr
Kernel IP routing table
Destination     Gateway         Genmask         Flags MSS Window irtt Iface
131.225.94.0    0.0.0.0         255.255.254.0   U     0   0      0    eth1
131.225.80.0    0.0.0.0         255.255.248.0   U     0   0      0    eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0   0      0    eth1
0.0.0.0         131.225.87.200  0.0.0.0         UG    0   0      0    eth0
In such a case, even if GLOBUS_HOSTNAME is set to 131.225.94.12
(or its corresponding name), traffic from the container is still
going to take the default route on eth0 and use that interface's IP
as its source address, isn't that right?
If so, then it implies that, for the moment, the globus container
must for consistency be run with a GLOBUS_HOSTNAME that matches the
default IP by which it talks to the rest of the world. Is this
correct?
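For what it's worth, the kernel's source-address choice for a given destination can be checked directly with a connected UDP socket (no packets are actually sent; this is a generic trick, not anything Globus-specific):

```python
import socket

def source_ip_for(dest_ip, port=9):
    # Connecting a UDP socket sends no packets; it only asks the kernel
    # to run its routing decision, after which getsockname() reveals
    # the local address that outbound traffic to dest_ip would use.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, port))
        return s.getsockname()[0]
    finally:
        s.close()
```

On the host described above, one would expect source_ip_for() of any off-site destination to return the eth0 address 131.225.81.94, since the default route goes out eth0 -- regardless of what GLOBUS_HOSTNAME says.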
Steve Timm
Charles
On Jun 26, 2008, at 10:19 AM, Charles Bacon wrote:
Sorry, but while I'm trying to figure out what's going on - can you run the
monitor without the -F/-Ft? They should be redundant given the information
in the EPR, and I'd like to verify that it works in their absence.
What machine is the client on? Does it make any difference if you do the
job submission from a different host?
Last bit of info: Can you run the batch/monitor jobs with -debug, then run
a failed "-submit -c" (with no -J/-S/-s) with -debug and send the results?
It looks like the monitor part of the code must be getting different
information when the code runs straight through than when it comes in two
pieces, but looking at globusrun_ws.c I can't see how.
Thanks,
Charles
On Jun 26, 2008, at 9:58 AM, Steven Timm wrote:
I made the change using VDT's vdt-local-setup.sh,
which I know doesn't get modified, and now the EPR shows the right
IP in it, and the example you gave works.
But my initial example still doesn't.
bash-3.00$ globusrun-ws -submit -batch -o foo.epr -F fnpcosg1.fnal.gov:9443 -Ft Condor -c /usr/bin/id
Submitting job...Done.
Job ID: uuid:decb6502-438f-11dd-9611-001422086c92
Termination time: 06/27/2008 14:55 GMT
bash-3.00$ more foo.epr
<ns00:EndpointReferenceType xmlns:ns00="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <ns00:Address>https://131.225.166.2:9443/wsrf/services/ManagedExecutableJobService</ns00:Address>
  <ns00:ReferenceProperties>
    <ResourceID xmlns="http://www.globus.org/namespaces/2004/10/gram/job">df3b1b40-438f-11dd-88db-cf7a593808fb</ResourceID>
  </ns00:ReferenceProperties>
  <wsa:ReferenceParameters xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing"/>
</ns00:EndpointReferenceType>
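Incidentally, the address the container baked into an EPR can be pulled out programmatically as a quick sanity check. A minimal sketch (only the Address element matters for the client's follow-up calls; the EPR string here is abbreviated from the one above):

```python
import xml.etree.ElementTree as ET

WSA = "http://schemas.xmlsoap.org/ws/2004/03/addressing"

def epr_address(epr_xml):
    # The <Address> child of the EPR carries the URL the client will
    # use for all follow-up calls (monitor, status, destroy).
    root = ET.fromstring(epr_xml)
    return root.find("{%s}Address" % WSA).text

epr = (
    '<ns00:EndpointReferenceType xmlns:ns00="%s">'
    '<ns00:Address>https://131.225.166.2:9443/wsrf/services/'
    'ManagedExecutableJobService</ns00:Address>'
    '</ns00:EndpointReferenceType>' % WSA
)
print(epr_address(epr))
# -> https://131.225.166.2:9443/wsrf/services/ManagedExecutableJobService
```

If this prints an address other than the host you submitted to, the container is advertising the wrong interface.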
bash-3.00$ globusrun-ws -monitor -j foo.epr -F fnpcosg1.fnal.gov:9443 -Ft Condor
Current job state: Done
Requesting original job description...Done.
Destroying job...Done.
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -J -s -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:fa78cea2-438f-11dd-a905-001422086c92
Termination time: 06/27/2008 14:55 GMT
globusrun-ws:
globus_service_engine.c:globus_l_service_engine_session_started_callback:2744:
Session failed to start
globus_xio_gsi.c:globus_l_xio_gsi_read_token_cb:1335:
The peer authenticated as
/DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov. Expected the peer
to authenticate as /CN=host/fnpc3x1.fnal.gov
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -J -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:355cc8b6-4390-11dd-a249-001422086c92
Termination time: 06/27/2008 14:57 GMT
globusrun-ws:
globus_service_engine.c:globus_l_service_engine_session_started_callback:2744:
Session failed to start
globus_xio_gsi.c:globus_l_xio_gsi_read_token_cb:1335:
The peer authenticated as
/DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov. Expected the peer
to authenticate as /CN=host/fnpc3x1.fnal.gov
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -s -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3a4ee764-4390-11dd-bb28-001422086c92
Termination time: 06/27/2008 14:57 GMT
globusrun-ws:
globus_service_engine.c:globus_l_service_engine_session_started_callback:2744:
Session failed to start
globus_xio_gsi.c:globus_l_xio_gsi_read_token_cb:1335:
The peer authenticated as
/DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov. Expected the peer
to authenticate as /CN=host/fnpc3x1.fnal.gov
Any idea what else we might have to fix?
Steve Timm
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
[EMAIL PROTECTED] http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group
Leader.
On Thu, 26 Jun 2008, Charles Bacon wrote:
On Jun 26, 2008, at 9:09 AM, Steven Timm wrote:
On Thu, 26 Jun 2008, Charles Bacon wrote:
As an experiment, can you tell me what happens if you run the job in
two parts:
First, try -submit -batch -o foo.epr
Check what hostname/IP shows up in the EPR as the endpoint of the
service.
<ns00:EndpointReferenceType xmlns:ns00="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <ns00:Address>https://131.225.167.18:9443/wsrf/services/ManagedExecutableJobService</ns00:Address>
  <ns00:ReferenceProperties>
    <ResourceID xmlns="http://www.globus.org/namespaces/2004/10/gram/job">da7e0c90-4388-11dd-96e1-d1739b31397d</ResourceID>
  </ns00:ReferenceProperties>
  <wsa:ReferenceParameters xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing"/>
</ns00:EndpointReferenceType>
That's the wrong IP; it should be the other one.
Okay. So, that's going to be the difference between globus-job-run and
globusrun-ws. The globusrun-ws client is getting back an address from
the container that it will use to get further updates. The
(submit/batch) part of the job is using the address you hand-supplied on
the commandline, so it's working. The (monitor) part of the client is
failing because the service is returning a bad address.
The fix is to get the container to bind to the right address, which you
can do with GLOBUS_HOSTNAME.
as far as I can tell, GLOBUS_HOSTNAME is not set in the environment
of the container. What's the best way to set it in a VDT environment?
I did set GLOBUS_HOSTNAME before I installed the VDT, to fnpcosg1.
I am now running the container in full-out debug mode so if there
are any logs you need to see, let me know.
It's starting globus-start-container out of /etc/init.d/globus-ws. It
looks like it sources both setup.sh and vdt/etc/globus-options.sh.
globus-options.sh looks like it is intended to set up the JVM options used
by the container. If I were going to set GLOBUS_HOSTNAME, based on what
I've seen, I'd put it in the init.d script or the globus-options.sh file.
I'm not sure if those two are vulnerable to being overwritten during a
pacman update or by a vdt-control on/off.
The other place you can fix it that's not VDT-specific is under
$GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd. The options
are described at
http://www.globus.org/toolkit/docs/4.0/common/javawscore/admin-index.html#id2531913.
Basically, adding a <parameter name="logicalHost"
value="the.right.ip.address"/> to the globalConfiguration section is
equivalent to setting your GLOBUS_HOSTNAME to that IP address.
Charles