Here is the configuration (identical on both machines):
service gsiftp
{
instances = 100
socket_type = stream
wait = no
user = root
env += GLOBUS_LOCATION=/export/work/globus-4.2.1
env += LD_LIBRARY_PATH=/export/work/globus-4.2.1/lib
env += GLOBUS_TCP_PORT_RANGE=40000,45000
server = /export/work/globus-4.2.1/sbin/globus-gridftp-server
server_args = -i -control-idle-timeout 1200 -ipc-idle-timeout 1200 -control-preauth-timeout 120
log_on_success += DURATION
nice = 10
disable = no
}
It hasn't changed much over the years, aside from the timeout settings you
see, which I've adjusted based on recommendations from others on the list
from time to time and just propagated along. If you have any suggestions
for which settings to modify (on machine X and/or machine Y, in this
scenario), please let me know.
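For reference, the suggestion quoted below (a limit of around 50) would only touch the instances line of the xinetd entry above. This is a minimal sketch; the value 50 is taken from that suggestion as a starting point, not a tuned number:

```
# xinetd entry for gsiftp: only this attribute changes, all other lines stay as above
# (50 is an assumed starting value from the quoted suggestion, down from 100)
instances = 50
```

xinetd would need to be reloaded before the new limit takes effect.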
thanks,
Adam
On Mon, Feb 23, 2009 at 12:38 PM, John Bresnahan <[email protected]> wrote:
> Can I see the configuration for the gridftp servers? They should have a
> --max-connections X if run from the command line, or instances = X if run
> from xinetd. The value of X should be around 50. Sometimes, when it is set
> too high (or unlimited) and the server is attempting to service many
> simultaneous connections, OS and network thrashing occurs, causing
> everything to slow down. You protect against this by limiting the number of
> connections the servers will handle at once and allowing RFT to manage the
> backoff and retries.
>
>
> Adam Bazinet wrote:
>
>> I'm running a 4.2.1 container on machines X and Y. X submits jobs to Y, and
>> when those jobs finish, files are staged out (from Y -> X) -- approximately
>> 800 files/job, a total of 30M.
>>
>> GridFTP on both machines is configured in the default manner, as is RFT
>> (using the Derby database).
>>
>> I have found, consistently, that machine Y gets overwhelmed staging files
>> back to X as jobs finish. The container log (which I don't have handy at
>> the moment) shows things like transient RFT transfer errors, failures to
>> acquire a lock in time on the RFT database, and other general indications
>> of overload. Eventually, the container seems to become unresponsive. When I
>> restart it, I see things like 'start already called on this transfer', and
>> the load average on the machine shoots way up as seemingly it attempts to
>> restart all the transfers simultaneously, or at least a large number of
>> them. It gets nowhere.
>>
>> This problem isn't really one-sided; it also happens during job submissions
>> (stage in, ~200 files/job, 70M), but at least there, introducing a delay (of
>> approximately 15 minutes between submissions) has allowed GridFTP/RFT time
>> to keep up.
>>
>> I've tried increasing the number of container threads to improve
>> performance; that hasn't helped. These problems have only manifested
>> themselves since we upgraded our system to 4.2.1. Any suggestions of
>> general settings or things to try would be much appreciated.
>>
>> thanks,
>> Adam
>>
>>
>