I'm running a 4.2.1 container on machines X and Y. X submits jobs to Y, and when those jobs finish, files are staged out (from Y -> X): approximately 800 files/job, a total of 30M.

GridFTP on both machines is configured in the default manner, as is RFT (using the Derby database). I have found, consistently, that machine Y gets overwhelmed staging files back to X as jobs finish. The container log (which I don't have handy at the moment) shows things like transient RFT transfer errors, failures to acquire a lock in time on the RFT database, and other general indications of overload. Eventually, the container seems to become unresponsive. When I restart it, I see messages like 'start already called on this transfer', and the load average on the machine shoots way up, seemingly because it attempts to restart all the transfers simultaneously, or at least a large number of them. It gets nowhere.

This problem isn't really one-sided; it also happens during job submissions (stage in, ~200 files/job, 70M), but there, at least, introducing a delay of approximately 15 minutes between submissions has given GridFTP/RFT time to keep up.

I've tried increasing the number of container threads to improve performance; that hasn't helped. These problems have only manifested themselves since we upgraded our system to 4.2.1.

Any suggestions of general settings or things to try would be much appreciated.

Thanks,
Adam
