I don't know what the cause might be, but I remember seeing the same thing
in large-scale WS-GRAM tests. Having retries is in general a good idea.
If you want to get to the root of the problem, though, I'd recommend
sending the GridFTP server logs in debug mode, along with a detailed
description, to [email protected]
(https://lists.globus.org/mailman/listinfo/gridftp-user).

Also maybe worth testing: does the same thing happen if you push the GridFTP
servers with plain globus-url-copy commands at a comparable level of
concurrency?
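
For example (just a rough sketch, with placeholder hosts, paths and a
placeholder concurrency level), something like the following launches
several globus-url-copy transfers at once and reports which ones fail:

# Rough sketch: launch several globus-url-copy transfers concurrently
# and report the exit status of each.  The source/destination URLs and
# the CONCURRENCY value are placeholders -- adjust them to your setup.
import subprocess
import sys

CONCURRENCY = 15
SRC = "gsiftp://source.example.org/tmp/stagein.txt"
DST = "gsiftp://dest.example.org/tmp/copy-%d.txt"

procs = [(i, subprocess.Popen(["globus-url-copy", SRC, DST % i]))
         for i in range(CONCURRENCY)]

failures = 0
for i, p in procs:
    if p.wait() != 0:
        failures += 1
        print("transfer %d failed with exit code %d" % (i, p.returncode))

print("%d of %d transfers failed" % (failures, CONCURRENCY))
sys.exit(1 if failures else 0)

If that reproduces the EOFException / control channel failures without GRAM
and RFT in the picture, it would point at the GridFTP server or its inetd
configuration rather than at RFT itself.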

Martin

Andre Charbonneau wrote:
> Hi,
> 
> I was thinking more about this and was wondering what could be the
> cause of the failed control channel connections we are seeing when there
> are more than 10 concurrent jobs.  Maybe if I can track down the source of
> the connection failures and fix it, my job throughput will improve,
> since the file transfers would no longer need to be retried.
> 
> Any thoughts about this?
> 
> Thanks,
>  Andre
> 
> 
> Martin Feller wrote:
>> Hi,
>>
>> RFT has a retry mechanism for failed transfers. If you didn't specify
>> a maxAttempts element in the staging elements of your job description,
>> you can try adding it and see if it helps.
>> maxAttempts specifies how many times RFT will retry a transfer in case of
>> (transient) transfer errors. It defaults to "no retries".
>> You can add this element to fileStageIn, fileStageOut and fileCleanUp:
>>
>> ...
>>     <fileStageIn>
>>         <maxAttempts>10</maxAttempts>
>>         <transfer>
>>           <sourceUrl>gsiftp://...</sourceUrl>
>>           <destinationUrl>gsiftp://...</destinationUrl>
>>         </transfer>
>>     </fileStageIn>
>> ...
>>
>> -Martin
>>
>> Andre Charbonneau wrote:
>>  
>>> Hello,
>>> Lately I've been running some benchmarks against a Globus resource (GT
>>> 4.0.8) here, and we are noticing some RFT issues when multiple jobs are
>>> submitted concurrently.
>>>
>>> The jobs are simple /bin/hostname jobs, with a small stage-in and
>>> stage-out file in order to involve RFT.  The jobs are submitted
>>> concurrently (to the Fork factory) by a small Python script that forks
>>> a thread per globusrun-ws command and then waits for all the threads to
>>> return.
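>>>
>>> (For reference, the submission harness is essentially the sketch below; the
>>> job count, factory type argument and job description file name are
>>> placeholders for what we actually pass to globusrun-ws.)
>>>
>>> # Sketch of the submission script: one thread per globusrun-ws call,
>>> # then wait for them all.  NUM_JOBS and the command arguments below
>>> # are placeholders.
>>> import subprocess
>>> import threading
>>>
>>> NUM_JOBS = 15
>>> CMD = ["globusrun-ws", "-submit", "-Ft", "Fork", "-f", "job.xml"]
>>>
>>> results = {}
>>>
>>> def submit(idx):
>>>     # Each thread runs one globusrun-ws command and records its exit code.
>>>     results[idx] = subprocess.call(CMD)
>>>
>>> threads = [threading.Thread(target=submit, args=(i,)) for i in range(NUM_JOBS)]
>>> for t in threads:
>>>     t.start()
>>> for t in threads:
>>>     t.join()
>>>
>>> for idx in sorted(results):
>>>     print("job %d exited with code %d" % (idx, results[idx]))
>>>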
>>> Everything looks OK when I submit the jobs one after the other, but when
>>> I submit a number of jobs concurrently (>10), I start seeing some of
>>> the globusrun-ws commands return with an exit code of 255 and the
>>> following error message on the client side:
>>>
>>> globusrun-ws: Job failed: Staging error for RSL element fileStageOut.
>>> Connection creation error [Caused by: java.io.EOFException]
>>> Connection creation error [Caused by: java.io.EOFException]
>>>
>>> I could not find anything in the server side container.log.
>>>
>>> So I enabled debugging at the gridftp level on the server side and I
>>> found the following:
>>>
>>> 2009-08-06 15:08:01,118 DEBUG vanilla.FTPControlChannel
>>> [Thread-47,createSocketDNSRR:153] opening control channel to
>>> xxxxxxxxxxxx/xxxxxxxxxxx : 2811
>>>
>>> (...)
>>>
>>> 2009-08-06 15:08:01,180 DEBUG vanilla.Reply [Thread-47,<init>:65] read
>>> 1st line
>>> 2009-08-06 15:08:01,807 DEBUG vanilla.Reply [Thread-47,<init>:68] 1st
>>> line: null
>>> 2009-08-06 15:08:01,809 DEBUG vanilla.FTPControlChannel
>>> [Thread-47,write:363] Control channel sending: QUIT
>>>
>>> 2009-08-06 15:08:01,810 DEBUG vanilla.FTPControlChannel
>>> [Thread-47,close:260] ftp socket closed
>>> 2009-08-06 15:08:01,812 DEBUG vanilla.FTPServerFacade
>>> [Thread-47,close:340] close data channels
>>> 2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
>>> [Thread-47,close:343] close server socket
>>> 2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
>>> [Thread-47,stopTaskThread:369] stop master thread
>>> 2009-08-06 15:08:01,814 ERROR cache.ConnectionManager
>>> [Thread-47,createNewConnection:345] Can't create connection:
>>> java.io.EOFException
>>> 2009-08-06 15:08:01,820 ERROR service.TransferWork [Thread-47,run:408]
>>> Transient transfer error
>>> Connection creation error [Caused by: java.io.EOFException]
>>> Connection creation error. Caused by java.io.EOFException
>>>
>>>
>>> I'm not 100% sure that these errors are related, but the "Connection
>>> creation error. Caused by java.io.EOFException" error string makes me
>>> think they are.  From the gridftp log above, it looks like the control
>>> channel connection (port 2811) back to the submit machine (probably for
>>> the stage-out step) is failing.
>>>
>>>
>>>
>>> In order to debug this, we tried raising the gridftp connection
>>> limit in the /etc/inetd.d/gridftp script considerably, but that didn't
>>> seem to help.  We have a port range of 200 ports, which I think should be
>>> enough to handle 10 or so concurrent jobs with one stage-in and two
>>> stage-out elements per job (roughly 30 transfers at a time).  We also
>>> experimented with that port range, but with no success.
>>>
>>> Is this something that anyone has experienced before?
>>>
>>> Maybe there is some other configuration setting I could change that would
>>> fix this issue?
>>>
>>> Any help or feedback about this is much appreciated.
>>>
>>> Best regards,
>>>    Andre
>>>
>>>     
>>
>>   
> 
> 
