I don't know what it might be, but I remember seeing it too in large-scale WS-GRAM tests. Having retries is in general a good idea. If you want to get to the root of the problem, though, I'd recommend sending GridFTP server logs in debug mode and a detailed description to the gridftp-user mailing list (https://lists.globus.org/mailman/listinfo/gridftp-user).

Maybe worth testing: does the same happen if you push the GridFTP servers with globus-url-copy commands at a comparable level of concurrency? That would tell you whether the problem is in the GridFTP servers themselves or in the RFT/WS-GRAM layer on top of them.
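Something along these lines would do for that test. It's only a rough sketch, not a drop-in script: the source/destination URLs, the concurrency count and the use of -dbg are placeholders you'd adjust for your setup.

#!/usr/bin/env python
# Rough sketch: start N globus-url-copy transfers in parallel and report
# any non-zero exit codes. URLs and the concurrency level are placeholders.
import subprocess
import threading

CONCURRENCY = 15
SRC = "gsiftp://source.example.org:2811/tmp/testfile"
DST = "gsiftp://dest.example.org:2811/tmp/testfile.copy.%d"

results = [None] * CONCURRENCY

def copy(i):
    # -dbg makes globus-url-copy print the control channel dialogue,
    # which should show where the connection dies if it does
    results[i] = subprocess.call(["globus-url-copy", "-dbg", SRC, DST % i])

threads = [threading.Thread(target=copy, args=(i,)) for i in range(CONCURRENCY)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for i, rc in enumerate(results):
    if rc != 0:
        print("transfer %d failed with exit code %d" % (i, rc))

If those transfers start failing around the same concurrency level, the problem is below RFT and the GridFTP server logs are the place to look; if they don't, I'd look more closely at the RFT/container side.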
Martin

Andre Charbonneau wrote:
> Hi,
>
> I was thinking more about this and I was wondering what could be the
> cause of the failed control channel connections we are seeing when there
> are >10 concurrent jobs? Maybe if I can track down the source of the
> connection failures and fix this, then my job throughput will be better
> since the file transfers would not need to be retried.
>
> Any thoughts about this?
>
> Thanks,
> Andre
>
>
> Martin Feller wrote:
>> Hi,
>>
>> RFT has a retry mechanism for failing transfers. If you didn't specify
>> a maxAttempts element in the staging elements of your job description,
>> you can try to add it and see if it helps.
>> maxAttempts specifies how often RFT will retry a transfer in case of
>> (transient) transfer errors. It defaults to "no retries".
>> You can add this element to fileStageIn, fileStageOut and fileCleanUp:
>>
>> ...
>> <fileStageIn>
>>   <maxAttempts>10</maxAttempts>
>>   <transfer>
>>     <sourceUrl>gsiftp://...</sourceUrl>
>>     <destinationUrl>gsiftp://...</destinationUrl>
>>   </transfer>
>> </fileStageIn>
>> ...
>>
>> -Martin
>>
>> Andre Charbonneau wrote:
>>
>>> Hello,
>>> Lately I've been running some benchmarks against a Globus resource
>>> (GT 4.0.8) here and we are noticing some RFT issues when multiple jobs
>>> are submitted concurrently.
>>>
>>> The jobs are simple /bin/hostname jobs, with a small stage-in and
>>> stage-out file in order to involve RFT. The jobs are submitted
>>> concurrently (to the Fork factory) by a small Python script that forks
>>> a thread per globusrun-ws command and then waits for all the threads
>>> to return.
>>> Everything looks OK when I submit the jobs one after the other, but
>>> when I submit a number of jobs concurrently (>10), I start seeing some
>>> of the globusrun-ws commands return with an exit code of 255 and the
>>> following error message on the client side:
>>>
>>> globusrun-ws: Job failed: Staging error for RSL element fileStageOut.
>>> Connection creation error [Caused by: java.io.EOFException]
>>> Connection creation error [Caused by: java.io.EOFException]
>>>
>>> I could not find anything in the server-side container.log.
>>>
>>> So I enabled debugging at the GridFTP level on the server side and I
>>> found the following:
>>>
>>> 2009-08-06 15:08:01,118 DEBUG vanilla.FTPControlChannel
>>> [Thread-47,createSocketDNSRR:153] opening control channel to
>>> xxxxxxxxxxxx/xxxxxxxxxxx : 2811
>>>
>>> (...)
>>>
>>> 2009-08-06 15:08:01,180 DEBUG vanilla.Reply [Thread-47,<init>:65] read
>>> 1st line
>>> 2009-08-06 15:08:01,807 DEBUG vanilla.Reply [Thread-47,<init>:68] 1st
>>> line: null
>>> 2009-08-06 15:08:01,809 DEBUG vanilla.FTPControlChannel
>>> [Thread-47,write:363] Control channel sending: QUIT
>>>
>>> 2009-08-06 15:08:01,810 DEBUG vanilla.FTPControlChannel
>>> [Thread-47,close:260] ftp socket closed
>>> 2009-08-06 15:08:01,812 DEBUG vanilla.FTPServerFacade
>>> [Thread-47,close:340] close data channels
>>> 2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
>>> [Thread-47,close:343] close server socket
>>> 2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
>>> [Thread-47,stopTaskThread:369] stop master thread
>>> 2009-08-06 15:08:01,814 ERROR cache.ConnectionManager
>>> [Thread-47,createNewConnection:345] Can't create connection:
>>> java.io.EOFException
>>> 2009-08-06 15:08:01,820 ERROR service.TransferWork [Thread-47,run:408]
>>> Transient transfer error
>>> Connection creation error [Caused by: java.io.EOFException]
>>> Connection creation error. Caused by java.io.EOFException
>>>
>>>
>>> I'm not 100% sure that these errors are related, but the "Connection
>>> creation error. Caused by java.io.EOFException" error string makes me
>>> think they are. From the GridFTP log above, it looks like the control
>>> channel connection (port 2811) back to the submit machine (probably
>>> for the stage-out step) fails.
>>>
>>>
>>>
>>> In order to debug this, we have tried making the GridFTP connection
>>> limit much higher in the /etc/inetd.d/gridftp script, but that didn't
>>> seem to help. We have a port range of 200, which I think should be
>>> enough to handle 10 or so concurrent jobs with one stage-in and two
>>> stage-out elements per job. We also experimented with that port range,
>>> but with no success.
>>>
>>> Is this something anyone has experienced before?
>>>
>>> Maybe there is some other configuration I can change that might fix
>>> this issue?
>>>
>>> Any help or feedback about this is much appreciated.
>>>
>>> Best regards,
>>> Andre
>>>
>>>
>>
>>
>
>
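PS regarding the connection limit you mentioned: if the server is started out of xinetd rather than plain inetd, the limit you raised is usually the "instances" attribute, but "per_source" and "cps" can also cap concurrent connections; when one of them is hit, the client may just see the connection closed before the 220 banner, i.e. an EOF like the one in your log. A typical entry looks roughly like the sketch below (the values, the GLOBUS_LOCATION-style paths and the exact file name are only placeholders for your installation, not a recommendation):

service gsiftp
{
    instances       = 100
    per_source      = 50
    cps             = 400 10
    socket_type     = stream
    wait            = no
    user            = root
    server          = /usr/local/globus/sbin/globus-gridftp-server
    server_args     = -i
    disable         = no
}

In my experience the xinetd log complains about a per-service limit when this is the cause, so that would be another place to check while the concurrent jobs are running.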
