Hello,
Lately I've been running some benchmarks against a globus resource (gt 4.0.8) here and we are noticing some rft issues when multiple jobs are submitted concurrently.

The jobs are simple /bin/hostname jobs, with a small stagein and stageout file in order to involve rft. The jobs are submitted concurrently (to the Fork factory) by a small Python script that starts one thread per globusrun-ws command and then waits for all the threads to return. Everything looks OK when I submit the jobs one after the other, but when I submit a number of jobs concurrently (>10), I start seeing some of the globusrun-ws commands return with an exit code of 255 and the following error message on the client side:
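
The submission pattern described above can be sketched roughly as follows. This is only an illustration, not the actual script: the function names are made up, and the globusrun-ws arguments in the trailing comment (factory URL, job.xml) are placeholders that would need to match the real setup.

```python
import subprocess
import threading


def run_command(cmd, results, index):
    """Run one command and record its exit code and stderr text."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    results[index] = (proc.returncode, proc.stderr)


def submit_concurrently(cmd, count):
    """Start `count` threads, each running `cmd`, then wait for all of them."""
    results = [None] * count
    threads = [
        threading.Thread(target=run_command, args=(cmd, results, i))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical invocation (placeholder factory URL and job description file):
# cmd = ["globusrun-ws", "-submit",
#        "-F", "https://example.org:8443/wsrf/services/ManagedJobFactoryService",
#        "-Ft", "Fork", "-f", "job.xml"]
# for code, err in submit_concurrently(cmd, 12):
#     print(code, err.strip())
```

Collecting the exit code and stderr per thread makes it easy to count how many of the concurrent submissions came back with 255.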

globusrun-ws: Job failed: Staging error for RSL element fileStageOut.
Connection creation error [Caused by: java.io.EOFException]
Connection creation error [Caused by: java.io.EOFException]

I could not find anything in the server side container.log.

So I enabled debugging at the gridftp level on the server side and I found the following:

2009-08-06 15:08:01,118 DEBUG vanilla.FTPControlChannel [Thread-47,createSocketDNSRR:153] opening control channel to xxxxxxxxxxxx/xxxxxxxxxxx : 2811

(...)

2009-08-06 15:08:01,180 DEBUG vanilla.Reply [Thread-47,<init>:65] read 1st line
2009-08-06 15:08:01,807 DEBUG vanilla.Reply [Thread-47,<init>:68] 1st line: null
2009-08-06 15:08:01,809 DEBUG vanilla.FTPControlChannel [Thread-47,write:363] Control channel sending: QUIT

2009-08-06 15:08:01,810 DEBUG vanilla.FTPControlChannel [Thread-47,close:260] ftp socket closed
2009-08-06 15:08:01,812 DEBUG vanilla.FTPServerFacade [Thread-47,close:340] close data channels
2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade [Thread-47,close:343] close server socket
2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade [Thread-47,stopTaskThread:369] stop master thread
2009-08-06 15:08:01,814 ERROR cache.ConnectionManager [Thread-47,createNewConnection:345] Can't create connection: java.io.EOFException
2009-08-06 15:08:01,820 ERROR service.TransferWork [Thread-47,run:408] Transient transfer error
Connection creation error [Caused by: java.io.EOFException]
Connection creation error. Caused by java.io.EOFException


I'm not 100% sure that these errors are related, but the "Connection creation error. Caused by java.io.EOFException" error string makes me think they are. From the gridftp log above, it looks like the control channel connection (port 2811) back to the submit machine (probably for the stageout step) fails.
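
One way to check this independently of rft would be to open several control channel connections at the same time and see whether each one receives the FTP greeting (a reply starting with 220). In the log above the first reply line is read as null, which suggests the connection was closed before any greeting arrived. The following is a minimal sketch under that assumption; the function names are made up, and the host name in the trailing comment is a placeholder.

```python
import socket
import threading


def read_greeting(host, port, timeout, results, index):
    """Open one TCP connection and record the first line the server sends."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with sock.makefile("rb") as stream:
                line = stream.readline()
        # An empty read means the peer closed the connection without greeting.
        results[index] = line.decode("ascii", "replace").strip() or None
    except OSError as exc:
        # Connection refused, reset, or timed out.
        results[index] = exc


def probe_concurrently(host, port, count, timeout=10):
    """Open `count` connections at once and return what each one received."""
    results = [None] * count
    threads = [
        threading.Thread(target=read_greeting,
                         args=(host, port, timeout, results, i))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical usage against the gridftp control port on the submit machine:
# replies = probe_concurrently("submit.example.org", 2811, 20)
# ok = sum(1 for r in replies if isinstance(r, str) and r.startswith("220"))
# print(ok, "of", len(replies), "connections received a greeting")
```

If some of these plain connections also come back empty or with an error, the problem would be on the gridftp/inetd side rather than in rft itself.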



To debug this, we tried raising the gridftp connection limit considerably in the /etc/inetd.d/gridftp script, but that didn't seem to help. We have a port range of 200, which I think should be enough to handle 10 or so concurrent jobs with 1 stagein and 2 stageout elements per job. We also experimented with that port range, but with no success.

Has anyone experienced this before?

Maybe there is some other configuration setting that I can change that might fix this issue?

Any help or feedback about this is much appreciated.

Best regards,
   Andre

--
Andre Charbonneau
Research Computing Support, IMSB
National Research Council Canada
100 Sussex Drive, Rm 2025
Ottawa, ON, Canada K1A 0R6
613 993-3129
