On Thu, 26 Jun 2008, Charles Bacon wrote:
Okay. Going over what we've got so far, the problem is the notifications
from the GRAM container to your client.
The detached monitor works because the first thing it does is a poll for
status. Because your status is "done", it never receives any notifications,
so everything works fine. You can verify my suspicion by submitting a longer
running sleep job and attaching the monitor. The first status check should
work because it's a poll, but it should blow up on the first notification it
receives.
Yes -- I was just writing an e-mail to you, sending the output from
a sleep job, and sure enough it blows up.
The basic problem is the source IP chosen by the container when it sends a
notification. The IP addresses make it look like you're running two
interfaces off of a single ethernet card,
That is correct.
so the source IP chosen by TCP/IP
when going from the machine to itself is likely always the default IP. There
are no options I know of that will force the container to use a non-standard
IP when sending out a message, just for controlling what it attaches to when
it listens. Possibly you could mess with your routing tables to accomplish
that, I don't know. Changing the source IP used by the container when it
sends TCP/IP messages is going to be the only way to really fix the problem,
but it would have to be done at a TCP/IP or JVM layer, because our code
doesn't have any options to change that.
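To see which source IP the kernel would pick for a given destination, one quick empirical check is a connected UDP socket (a Python sketch; no packets are actually sent, and the port number is an arbitrary placeholder):

```python
import socket

def source_ip_for(dest_ip, port=9443):
    """Ask the kernel which source IP it would use to reach dest_ip.

    connect() on a UDP socket performs route/source-address selection
    without sending anything; getsockname() then reveals the chosen
    local address.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, port))
        return s.getsockname()[0]
    finally:
        s.close()
```

Running this against the service IP from each box would confirm whether the kernel is always choosing the eth0 address.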
There are certainly other Linux applications that have a similar problem
when dealing with two IPs, both of which are on the same network interface
and whose default route is via eth0 (fnpc3x1.fnal.gov in this case)
rather than eth0:2 (fnpcosg1.fnal.gov in this case).
The reason we are trying to do this, of course, is in preparation
for high availability failover where all the state of the
globus container is living on a shared file system that is shared
by two machines, one active and one passive, and the service IP
moves from one of them to the other via the Heartbeat utility if
the main one goes down.
Barring that, everything will work fine if you avoid notifications. Rather
than -monitor, one can use -status to check on the progress of a job. You
might also be able to "fix" the notification problem by using the -subject
argument to globusrun-ws, to force it to expect a particular identity. This
would work in 4.2, where the globusrun-ws client will automatically set the
subject name expectation to the endpoint that was submitted to, rather than
figuring it out on the fly based on the IP address source of the
notification. That doesn't help you now, obviously.
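A polling loop around -status might look like the following sketch (the state-line format matches the "Current job state:" output shown later in this thread; the poll interval and terminal states are assumptions):

```python
import re
import subprocess
import time

def parse_state(output):
    """Extract the job state from globusrun-ws -status output,
    e.g. 'Current job state: Done' -> 'Done'."""
    m = re.search(r"Current job state:\s*(\S+)", output)
    return m.group(1) if m else None

def poll_until_finished(epr_file, interval=30):
    """Poll with -status instead of relying on -monitor notifications,
    sidestepping the source-IP problem entirely."""
    while True:
        result = subprocess.run(
            ["globusrun-ws", "-status", "-j", epr_file],
            capture_output=True, text=True)
        state = parse_state(result.stdout)
        if state in ("Done", "Failed"):
            return state
        time.sleep(interval)
```

Because -status is a poll initiated by the client, the container never has to choose a source address for an unsolicited notification.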
It would not be much good to put a web services site on the OSG
where people would have to use non-standard options to make it work.
Seeing that this machine is a Xen instance after all, we can
make as many ethX interfaces as we want, and could put the
service IP on a different subnet and give it its own default
route. Would that help us out?
Also, would changing the IP in the configuration as mentioned below
(adding the logicalHost parameter to
$GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd) do any good?
--and why does it work, at least as long as we are not delegating
or streaming, from within the same machine?
Steve Timm
Charles
On Jun 26, 2008, at 10:19 AM, Charles Bacon wrote:
Sorry, but while I'm trying to figure out what's going on - can you run the
monitor without the -F/-Ft? They should be redundant given the information
in the EPR, and I'd like to verify that it works in their absence.
What machine is the client on? Does it make any difference if you do the
job submission from a different host?
Last bit of info: Can you run the batch/monitor jobs with -debug, then run
a failed "-submit -c" (with no -J/-S/-s) with -debug and send the results?
It looks like the monitor part of the code must be getting different
information when the code runs straight through than when it comes in two
pieces, but looking at globusrun_ws.c I can't see how.
Thanks,
Charles
On Jun 26, 2008, at 9:58 AM, Steven Timm wrote:
I made the change using the VDT's vdt-local-setup.sh,
which I know doesn't get modified, and now the EPR shows the right
IP in it, and the example you gave works.
But my initial example still doesn't:
bash-3.00$ globusrun-ws -submit -batch -o foo.epr -F fnpcosg1.fnal.gov:9443 -FtCondor -c /usr/bin/id
Submitting job...Done.
Job ID: uuid:decb6502-438f-11dd-9611-001422086c92
Termination time: 06/27/2008 14:55 GMT
bash-3.00$ more foo.epr
<ns00:EndpointReferenceType xmlns:ns00="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <ns00:Address>https://131.225.166.2:9443/wsrf/services/ManagedExecutableJobService</ns00:Address>
  <ns00:ReferenceProperties>
    <ResourceID xmlns="http://www.globus.org/namespaces/2004/10/gram/job">df3b1b40-438f-11dd-88db-cf7a593808fb</ResourceID>
  </ns00:ReferenceProperties>
  <wsa:ReferenceParameters xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing"/>
</ns00:EndpointReferenceType>
bash-3.00$ globusrun-ws -monitor -j foo.epr -F fnpcosg1.fnal.gov:9443 -Ft Condor
Current job state: Done
Requesting original job description...Done.
Destroying job...Done.
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -J -s -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:fa78cea2-438f-11dd-a905-001422086c92
Termination time: 06/27/2008 14:55 GMT
globusrun-ws:
globus_service_engine.c:globus_l_service_engine_session_started_callback:2744:
Session failed to start
globus_xio_gsi.c:globus_l_xio_gsi_read_token_cb:1335:
The peer authenticated as /DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov.
Expected the peer to authenticate as /CN=host/fnpc3x1.fnal.gov
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -J -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:355cc8b6-4390-11dd-a249-001422086c92
Termination time: 06/27/2008 14:57 GMT
globusrun-ws:
globus_service_engine.c:globus_l_service_engine_session_started_callback:2744:
Session failed to start
globus_xio_gsi.c:globus_l_xio_gsi_read_token_cb:1335:
The peer authenticated as /DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov.
Expected the peer to authenticate as /CN=host/fnpc3x1.fnal.gov
bash-3.00$ globusrun-ws -submit -F fnpcosg1.fnal.gov:9443 -Ft Condor -s -c /usr/bin/id
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3a4ee764-4390-11dd-bb28-001422086c92
Termination time: 06/27/2008 14:57 GMT
globusrun-ws:
globus_service_engine.c:globus_l_service_engine_session_started_callback:2744:
Session failed to start
globus_xio_gsi.c:globus_l_xio_gsi_read_token_cb:1335:
The peer authenticated as /DC=org/DC=doegrids/OU=Services/CN=fnpcosg1.fnal.gov.
Expected the peer to authenticate as /CN=host/fnpc3x1.fnal.gov
Any idea what else we might have to fix?
Steve Timm
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
[EMAIL PROTECTED] http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group
Leader.
On Thu, 26 Jun 2008, Charles Bacon wrote:
On Jun 26, 2008, at 9:09 AM, Steven Timm wrote:
On Thu, 26 Jun 2008, Charles Bacon wrote:
As an experiment, can you tell me what happens if you run the job in
two parts:
First, try -submit -batch -o foo.epr
Check what hostname/IP shows up in the EPR as the endpoint of the
service.
<ns00:EndpointReferenceType xmlns:ns00="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <ns00:Address>https://131.225.167.18:9443/wsrf/services/ManagedExecutableJobService</ns00:Address>
  <ns00:ReferenceProperties>
    <ResourceID xmlns="http://www.globus.org/namespaces/2004/10/gram/job">da7e0c90-4388-11dd-96e1-d1739b31397d</ResourceID>
  </ns00:ReferenceProperties>
  <wsa:ReferenceParameters xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing"/>
</ns00:EndpointReferenceType>
that's the wrong IP, it should be the other one.
Okay. So, that's going to be the difference between globus-job-run and
globusrun-ws. The globusrun-ws client is getting back an address from
the container that it will use to get further updates. The
(submit/batch) part of the job is using the address you hand-supplied on
the commandline, so it's working. The (monitor) part of the client is
failing because the service is returning a bad address.
The fix is to get the container to bind to the right address, which you
can do with GLOBUS_HOSTNAME.
as far as I can tell, GLOBUS_HOSTNAME is not set in the environment
of the container. What's the best way to set it in a VDT environment?
I did set GLOBUS_HOSTNAME before I installed the VDT, to fnpcosg1.
I am now running the container in full-out debug mode so if there
are any logs you need to see, let me know.
It's starting globus-start-container out of /etc/init.d/globus-ws. It
looks like it sources both setup.sh and vdt/etc/globus-options.sh.
globus-options.sh looks like it is intended to set up the JVM options used
by the container. If I were going to set GLOBUS_HOSTNAME, based on what
I've seen I'd put it in the init.d script or the globus-options.sh file.
I'm not sure if those two are vulnerable to being overwritten during a
pacman update or by a vdt-control on/off.
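If globus-options.sh is the right place, the change would be a single line (a sketch; whether this file survives a pacman update is exactly the open question above):

```shell
# In vdt/etc/globus-options.sh (sourced by /etc/init.d/globus-ws per this thread)
export GLOBUS_HOSTNAME=fnpcosg1.fnal.gov
```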
The other place you can fix it that's not VDT-specific is under
$GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd. The options
are described at
http://www.globus.org/toolkit/docs/4.0/common/javawscore/admin-index.html#id2531913.
Basically, adding a "<parameter name="logicalHost"
value="the.right.ip.address"> to the globalConfiguration section is
equivalent to setting your GLOBUS_HOSTNAME to that IP address.
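Per the doc page above, the fragment would look something like this (sketch; the value is the placeholder from the text, and any existing parameters in the section stay as they are):

```xml
<globalConfiguration>
  <!-- existing parameters unchanged -->
  <parameter name="logicalHost" value="the.right.ip.address"/>
</globalConfiguration>
```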
Charles
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
[EMAIL PROTECTED] http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.