I'm running a 4.2.1 container on machines X and Y.  X submits jobs to Y, and
when those jobs finish, files are staged out (from Y -> X) -- approximately
800 files/job, about 30 MB in total.

GridFTP on both machines is configured in the default manner, as is RFT
(using the Derby database).

I have found, consistently, that machine Y gets overwhelmed staging files
back to X as jobs finish.  The container log (which I don't have handy at
the moment) shows things like transient RFT transfer errors, failures to
acquire a lock in time on the RFT database, and other general indications of
overload.  Eventually, the container seems to become unresponsive.  When I
restart it, I see things like 'start already called on this transfer', and
the load average on the machine shoots way up, seemingly because it attempts
to restart all of the transfers (or at least a large number of them)
simultaneously.  It gets nowhere.

This problem isn't one-sided; it also happens during job submission
(stage in, ~200 files/job, 70 MB), but there, at least, introducing a delay
of approximately 15 minutes between submissions has given GridFTP/RFT time
to keep up.
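
The delay workaround can be sketched roughly as follows.  This is only an
illustration: `submit_with_delay` and the `submit` callable are hypothetical
names, standing in for whatever actually submits a job from X to Y.

```python
import time


def submit_with_delay(jobs, submit, delay_seconds=15 * 60, sleep=time.sleep):
    """Submit jobs one at a time, pausing between consecutive submissions.

    `submit` is a placeholder callable (an assumption, not a real API) that
    submits a single job and returns whatever the caller wants to keep.
    """
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            # Pause so GridFTP/RFT can finish staging in the previous job's
            # files before the next submission adds more transfers.
            sleep(delay_seconds)
        results.append(submit(job))
    return results
```

The `sleep` parameter exists only so the pause can be swapped out when trying
the function without waiting 15 minutes per job.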

I've tried increasing the number of container threads to improve
performance; that hasn't helped.  These problems have only manifested
themselves since we upgraded our system to 4.2.1.  Any suggestions of
general settings or things to try would be much appreciated.

thanks,
Adam
