I recommend changing the instances value to something between 10 and 50. If you assume that each transfer gets a 1/instances share of the endpoint's bandwidth, values as low as 10 make more sense. I am not sure it will solve your problem entirely, but this has been the cause of very similar problems that others have had in the past.
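
To make the 1/instances reasoning concrete, here is a rough sketch of the arithmetic. The 1000 Mbit/s link speed and the helper name per_transfer_share are assumptions chosen for illustration, not values from this thread, and an even split is an idealization.

```python
def per_transfer_share(link_mbps: float, instances: int) -> float:
    """Bandwidth each transfer gets if `instances` transfers split the link evenly."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return link_mbps / instances


# Hypothetical endpoint bandwidth in Mbit/s (an assumption, not from the thread).
LINK_MBPS = 1000.0

for n in (10, 50, 100):
    share = per_transfer_share(LINK_MBPS, n)
    print(f"instances={n:3d}: {share:.0f} Mbit/s per transfer")
```

With instances = 100, each transfer would see roughly a tenth of what it gets at instances = 10, which is why the lower end of the range can be the better choice.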

Adam Bazinet wrote:
Here is the configuration (identical on both machines)

service gsiftp
{
instances               = 100
socket_type             = stream
wait                    = no
user                    = root
env                     += GLOBUS_LOCATION=/export/work/globus-4.2.1
env                     += LD_LIBRARY_PATH=/export/work/globus-4.2.1/lib
env                     += GLOBUS_TCP_PORT_RANGE=40000,45000
server                  = /export/work/globus-4.2.1/sbin/globus-gridftp-server
server_args             = -i -control-idle-timeout 1200 -ipc-idle-timeout 1200 -control-preauth-timeout 120
log_on_success          += DURATION
nice                    = 10
disable                 = no
}

It hasn't changed much over the years, aside from the timeout settings you
see, which I've adjusted based on recommendations from others on the list
from time to time and just propagated along.  If you have any suggestions
for which settings to modify (on machine X and/or machine Y, in this
scenario), please let me know.

thanks,
Adam





On Mon, Feb 23, 2009 at 12:38 PM, John Bresnahan wrote:

Can I see the configuration for the GridFTP servers?  They should have
--max-connections X if started from the command line, or instances = X if
run from xinetd.  The value of X should be around 50.  When it is set too
high (or unlimited) and the server tries to service many simultaneous
connections, OS and network thrashing occurs, causing everything to slow
down.  You protect against this by limiting the number of connections the
servers will handle at once and allowing RFT to manage the backoff and
retries.
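
The "limit connections, then back off and retry" idea can be sketched as follows. This is only a generic exponential-backoff loop, not RFT's actual retry logic; the function name retry_with_backoff, the delay values, and the use of ConnectionError to stand for a refused connection are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each ConnectionError.

    Generic illustration only; this is not how RFT is implemented.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries, surface the error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait
    raise ValueError("attempts must be at least 1")
```

A client that is refused because the server is already at its connection limit waits and tries again later, instead of adding to the load on a server that is already saturated.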


Adam Bazinet wrote:

I'm running a 4.2.1 container on machines X and Y.  X submits jobs to Y,
and when those jobs finish, files are staged out (from Y -> X) --
approximately 800 files/job, a total of 30M.

GridFTP on both machines is configured in the default manner, as is RFT
(using the Derby database).

I have found, consistently, that machine Y gets overwhelmed staging files
back to X as jobs finish.  The container log (which I don't have handy at
the moment) shows things like transient RFT transfer errors, failures to
acquire a lock in time on the RFT database, and other general indications
of overload.  Eventually, the container seems to become unresponsive.  When
I restart it, I see things like 'start already called on this transfer', and
the load average on the machine shoots way up as seemingly it attempts to
restart all the transfers simultaneously, or at least a large number of
them.  It gets nowhere.

This problem isn't really one-sided; it also happens during job
submissions (stage in, ~200 files/job, 70M), but at least there introducing
a delay (of approximately 15 minutes between submissions) has allowed
GridFTP/RFT time to keep up.

I've tried increasing the number of container threads to improve
performance; that hasn't helped.  These problems have only manifested
themselves since we upgraded our system to 4.2.1.  Any suggestions of
general settings or things to try would be much appreciated.

thanks,
Adam



