Hi,

RFT has a retry mechanism for failing transfers. If you didn't specify
a maxAttempts element in the staging elements of your job description,
you can try adding it and see if it helps.
maxAttempts specifies how many times RFT will try a transfer in case of
(transient) transfer errors. It defaults to "no retries".
You can add this element to fileStageIn, fileStageOut and fileCleanUp:

...
    <fileStageIn>
        <maxAttempts>10</maxAttempts>
        <transfer>
          <sourceUrl>gsiftp://...</sourceUrl>
          <destinationUrl>gsiftp://...</destinationUrl>
        </transfer>
    </fileStageIn>
...
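
Since the error quoted below is reported for fileStageOut, the same
element would go into the stage-out block as well. The fragment below is
only a sketch: it assumes maxAttempts is placed first, as in the
fileStageIn example, and that fileCleanUp uses the usual deletion/file
layout. All URLs are left as placeholders and need to be filled in from
your own job description.

```xml
    <!-- Sketch only: URLs are placeholders, not real endpoints -->
    <fileStageOut>
        <!-- retry each stage-out transfer up to 10 times -->
        <maxAttempts>10</maxAttempts>
        <transfer>
          <sourceUrl>file:///...</sourceUrl>
          <destinationUrl>gsiftp://...</destinationUrl>
        </transfer>
    </fileStageOut>
    <fileCleanUp>
        <!-- assumed layout for cleanup requests; same retry setting -->
        <maxAttempts>10</maxAttempts>
        <deletion>
          <file>file:///...</file>
        </deletion>
    </fileCleanUp>
```

The value 10 is just the one used in the example above; pick whatever
suits your load.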

-Martin

Andre Charbonneau wrote:
> Hello,
> Lately I've been running some benchmarks against a globus resource (gt
> 4.0.8) here and we are noticing some rft issues when multiple jobs are
> submitted concurrently.
> 
> The jobs are simple /bin/hostname jobs, with a small stagein and
> stageout file in order to involve rft.  The jobs are submitted
> concurrently (to the Fork factory) by a small python script, that forks
> a thread per globusrun-ws command, and then waits for all the threads to
> return.
> Everything looks ok when I submit the jobs one after the other, but when
> I submit a number of jobs concurrently (>10), then I start seeing some of
> the globusrun-ws commands return with an exit code of 255 and the
> following error message at the client side:
> 
> globusrun-ws: Job failed: Staging error for RSL element fileStageOut.
> Connection creation error [Caused by: java.io.EOFException]
> Connection creation error [Caused by: java.io.EOFException]
> 
> I could not find anything in the server side container.log.
> 
> So I enabled debugging at the gridftp level on the server side and I
> found the following:
> 
> 2009-08-06 15:08:01,118 DEBUG vanilla.FTPControlChannel
> [Thread-47,createSocketDNSRR:153] opening control channel to
> xxxxxxxxxxxx/xxxxxxxxxxx : 2811
> 
> (...)
> 
> 2009-08-06 15:08:01,180 DEBUG vanilla.Reply [Thread-47,<init>:65] read
> 1st line
> 2009-08-06 15:08:01,807 DEBUG vanilla.Reply [Thread-47,<init>:68] 1st
> line: null
> 2009-08-06 15:08:01,809 DEBUG vanilla.FTPControlChannel
> [Thread-47,write:363] Control channel sending: QUIT
> 
> 2009-08-06 15:08:01,810 DEBUG vanilla.FTPControlChannel
> [Thread-47,close:260] ftp socket closed
> 2009-08-06 15:08:01,812 DEBUG vanilla.FTPServerFacade
> [Thread-47,close:340] close data channels
> 2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
> [Thread-47,close:343] close server socket
> 2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
> [Thread-47,stopTaskThread:369] stop master thread
> 2009-08-06 15:08:01,814 ERROR cache.ConnectionManager
> [Thread-47,createNewConnection:345] Can't create connection:
> java.io.EOFException
> 2009-08-06 15:08:01,820 ERROR service.TransferWork [Thread-47,run:408]
> Transient transfer error
> Connection creation error [Caused by: java.io.EOFException]
> Connection creation error. Caused by java.io.EOFException
> 
> 
> I'm not 100% sure that these errors are related, but the "Connection
> creation error. Caused by java.io.EOFException" error string makes me
> think they are.  From the gridftp log above, it looks like the control
> channel connection (port 2811) back to the submit machine (probably for
> stageout step) fails.
> 
> 
> 
> In order to debug this, we have tried making the gridftp connection
> limit much higher in the /etc/inetd.d/gridftp script but that didn't
> seem to help.  We have a port range of 200, which I think should be
> enough to handle 10 or so concurrent jobs with one stagein and 2 stageout
> elements per job.  We also experimented with that port range, but with
> no success.
> 
> Is this something that anyone experienced before?
> 
> Maybe there is some other configuration that I can change that might fix
> this issue?
> 
> Any help or feedback about this is much appreciated.
> 
> Best regards,
>    Andre
> 

