Hi,
Progress has been made in this long discussion, but it has gotten rather
complicated.
Please find attached a technical summary covering the observed problems,
their diagnosis, and remedies, along with further suggestions.
Let me address the question: is latency an important issue?
I contend that it is a matter of degree and of application. How big is
the latency? What latency is required by the user and application?
How small a latency could be expected from grid job submission?
In some cases the latency has been so bad that a user has time to fetch a
coffee while wondering whether something has gone wrong.
On the other hand, if an application really demands very fast interchange
of small messages, maybe the grid (and a wide-area network in general) is
not a good environment for it. Maybe a cluster is called for.
Currently, even in the best case, globusrun-ws imposes a delay on small
jobs roughly five times that incurred by gsissh. That still amounts to a
humanly palpable delay of several seconds.
Even code that has been streamlined tends to pick up baggage over time,
and code streamlined for one purpose can be unnecessarily slow for another.
In any case, it behooves us to strive for the best latency we can get with
reasonable effort.
Some suggestions are in the report.
Cheers!
| - - - - - - - - - - - - - - - - - - - - - - - - -
| Steve White +49(331)7499-202
| e-Science / AstroGrid-D Zi. 35 Bg. 20
| - - - - - - - - - - - - - - - - - - - - - - - - -
| Astrophysikalisches Institut Potsdam (AIP)
| An der Sternwarte 16, D-14482 Potsdam
|
| Vorstand: Prof. Dr. Matthias Steinmetz, Peter A. Stolz
|
| Stiftung privaten Rechts, Stiftungsverzeichnis Brandenburg: III/7-71-026
| - - - - - - - - - - - - - - - - - - - - - - - - -
Latency of simple jobs submitted by globusrun-ws vs gsissh
1) Observed large latency of simple jobs run by globusrun-ws vs gsissh
There is more than one effect.
A) 30-sec timeout when used from remote machines, caused by xinetd
configuration, sometimes worked around by iptables configuration.
B) With a simple job description file that doesn't specify I/O,
globusrun-ws runs faster than when the job is given as a command-line
argument.
C) Even with a job description file, globusrun-ws sends some five times
as many TCP/IP packets as the equivalent gsissh command.
To see this, run wireshark on the client, and filter for
communications with the grid resource.
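The packet comparison in C) can also be made from a saved capture rather than by eye. The following is a minimal sketch, assuming the capture has been exported as one "source<TAB>destination" line per packet, for example with `tshark -r capture.pcap -T fields -e ip.src -e ip.dst`; the function name `count_packets` and the address used below are illustrative only.

```python
from collections import Counter


def count_packets(tshark_lines, resource_ip):
    """Count packets sent to and received from resource_ip.

    Each input line is expected to hold "src<TAB>dst", one packet per line.
    Lines that do not have exactly two fields are ignored.
    """
    counts = Counter()
    for line in tshark_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # not a "src<TAB>dst" row
        src, dst = fields
        if dst == resource_ip:
            counts["sent"] += 1
        elif src == resource_ip:
            counts["received"] += 1
    return counts


# Hypothetical sample: 192.0.2.10 stands in for the grid resource.
sample = [
    "10.0.0.5\t192.0.2.10\n",
    "192.0.2.10\t10.0.0.5\n",
    "10.0.0.5\t10.0.0.9\n",
]
result = count_packets(sample, "192.0.2.10")
print(result["sent"], result["received"])
```

Running it once on a capture of a globusrun-ws submission and once on a capture of the equivalent gsissh command gives two totals that can be compared directly.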
2) Diagnosis
A) The 30-sec timeout was caused by the "USERID" option in these lines
of the xinetd configuration file /etc/xinetd.d/gsiftp:
log_on_success += DURATION USERID
log_on_failure += USERID
These lines came from older versions of the Globus 4.0 manuals.
With USERID set, xinetd queries the connecting client's ident service
(TCP port 113); if that query goes unanswered, the connection waits for
it to time out.
The lines were meant as an extra logging mechanism, but that function is
now judged unnecessary, and the problem severe enough that the lines have
been removed from the on-line copies of the manuals.
B) By default, globusrun-ws opens gsiftp channels to handle
standard I/O. It does this when the job is given on a simple command
line, but not when a job description file omits I/O.
C) globusrun-ws probably performs some unnecessary communication.
At least some of it is, for some purposes, unwanted by the user.
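To put numbers on effects B) and C), one could time repeated submissions of the same trivial job by each route. This is only a sketch: the helper names `time_command` and `median_latency` are made up here, and the globusrun-ws and gsissh command lines in the comments are assumptions (with a placeholder host) that should be checked against the local installation.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=5):
    """Run argv `repeats` times; return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def median_latency(argv, repeats=5):
    """Median wall-clock time of argv over `repeats` runs."""
    return statistics.median(time_command(argv, repeats))


# Stand-in command so the sketch runs anywhere. Assumed real candidates
# (grid.example.org and job.xml are placeholders):
#   ["gsissh", "grid.example.org", "/bin/true"]
#   ["globusrun-ws", "-submit", "-F", "grid.example.org", "-c", "/bin/true"]
#   ["globusrun-ws", "-submit", "-F", "grid.example.org", "-f", "job.xml"]
print(median_latency([sys.executable, "-c", "pass"], repeats=3))
```

The median is used rather than the mean so that a single stalled run, such as one hitting the 30-sec timeout, does not dominate the figure.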
3) What we can do
A) Remove the log_on* lines from /etc/xinetd.d/gsiftp,
and re-start the xinetd service.
Delete any iptables rule
OUTPUT -p tcp -m tcp --dport 113 --tcp-flags SYN,RST,ACK SYN -j
that may have been added as part of a Globus installation.
B) It would be good to explain and document these globusrun-ws latency
issues somewhere, along with measures to improve latency.
C) Recommendation: Globus personnel should review these communications,
determine the cause of each of them, and at least document them.
If possible, users should be given a way to choose which communications
they want.
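The edit in 3A) is easy to script where many hosts need the same fix. The following is a minimal sketch that only transforms the text of the configuration; the function name `strip_log_on_lines` and the sample stanza are illustrative, not copied from a real installation. Writing the result back to /etc/xinetd.d/gsiftp and re-starting xinetd still have to be done separately, as root.

```python
import re

# Matches the two directives named in 3A), for example
# "log_on_success += DURATION USERID".
LOG_ON_LINE = re.compile(r"^\s*log_on_(success|failure)\b")


def strip_log_on_lines(config_text):
    """Return config_text without log_on_success / log_on_failure lines."""
    kept = [line for line in config_text.splitlines(keepends=True)
            if not LOG_ON_LINE.match(line)]
    return "".join(kept)


# Hypothetical stanza for illustration; a real file will differ.
sample = (
    "service gsiftp\n"
    "{\n"
    "    socket_type = stream\n"
    "    log_on_success += DURATION USERID\n"
    "    log_on_failure += USERID\n"
    "    disable = no\n"
    "}\n"
)
print(strip_log_on_lines(sample))
```

All other lines, including indentation and line endings, pass through unchanged, so the output can be compared with the original file before anything is overwritten.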