Hello,
Lately I've been running some benchmarks against a globus resource (gt
4.0.8) here and we are noticing some rft issues when multiple jobs are
submitted concurrently.
The jobs are simple /bin/hostname jobs, with a small stagein and
stageout file in order to involve rft. The jobs are submitted
concurrently (to the Fork factory) by a small python script, that forks
a thread per globusrun-ws command, and then waits for all the threads to
return.
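
For reference, the submission pattern is roughly the following. This is a minimal sketch and not the actual script: the helper name run_concurrently and the placeholder command are illustrative assumptions, and the real globusrun-ws invocation (with its job description file) would be passed in as the command list.

```python
import subprocess
import sys
import threading


def run_concurrently(command, count):
    """Run `count` copies of `command` at once, one thread each.

    Returns the list of exit codes, indexed by thread number.
    """
    exit_codes = [None] * count

    def worker(index):
        # Each thread blocks on its own child process until it exits.
        completed = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        exit_codes[index] = completed.returncode

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    # Wait for every submission to return before reporting.
    for thread in threads:
        thread.join()
    return exit_codes


if __name__ == "__main__":
    # Placeholder command (an assumption); substitute the real
    # globusrun-ws command line here.
    command = [sys.executable, "-c", "pass"]
    codes = run_concurrently(command, 12)
    failed = [i for i, code in enumerate(codes) if code != 0]
    print("%d of %d commands failed: %s" % (len(failed), len(codes), failed))
```

Collecting the exit codes per thread makes it easy to count how many of the concurrent submissions came back with a non-zero status.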
Everything looks ok when I submit the jobs one after the other, but when
I submit a number of jobs concurrently (>10), then I start seeing some of
the globusrun-ws commands return with an exit code of 255 and the
following error message at the client side:
globusrun-ws: Job failed: Staging error for RSL element fileStageOut.
Connection creation error [Caused by: java.io.EOFException]
Connection creation error [Caused by: java.io.EOFException]
I could not find anything in the server side container.log.
So I enabled debugging at the gridftp level on the server side and I
found the following:
2009-08-06 15:08:01,118 DEBUG vanilla.FTPControlChannel
[Thread-47,createSocketDNSRR:153] opening control channel to
xxxxxxxxxxxx/xxxxxxxxxxx : 2811
(...)
2009-08-06 15:08:01,180 DEBUG vanilla.Reply [Thread-47,<init>:65] read
1st line
2009-08-06 15:08:01,807 DEBUG vanilla.Reply [Thread-47,<init>:68] 1st
line: null
2009-08-06 15:08:01,809 DEBUG vanilla.FTPControlChannel
[Thread-47,write:363] Control channel sending: QUIT
2009-08-06 15:08:01,810 DEBUG vanilla.FTPControlChannel
[Thread-47,close:260] ftp socket closed
2009-08-06 15:08:01,812 DEBUG vanilla.FTPServerFacade
[Thread-47,close:340] close data channels
2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
[Thread-47,close:343] close server socket
2009-08-06 15:08:01,813 DEBUG vanilla.FTPServerFacade
[Thread-47,stopTaskThread:369] stop master thread
2009-08-06 15:08:01,814 ERROR cache.ConnectionManager
[Thread-47,createNewConnection:345] Can't create connection:
java.io.EOFException
2009-08-06 15:08:01,820 ERROR service.TransferWork [Thread-47,run:408]
Transient transfer error
Connection creation error [Caused by: java.io.EOFException]
Connection creation error. Caused by java.io.EOFException
I am not 100% sure that these errors are related, but the "Connection
creation error. Caused by java.io.EOFException" error string makes me
think they are. From the gridftp log above, it looks like the control
channel connection (port 2811) back to the submit machine (probably for
the stageout step) fails.
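
One way to check this independently of rft would be to open several control channel connections at once and see whether any of them are closed before a greeting arrives, which is what the "1st line: null" entry suggests. The following is only a sketch under assumptions: the function names are hypothetical, and it assumes the server normally greets each new connection with a reply line (220 ...) as FTP servers do.

```python
import socket
import threading


def probe_banner(host, port, timeout=10.0):
    """Connect and return the first line received, or None if nothing arrived."""
    buf = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            while b"\n" not in buf:
                chunk = sock.recv(1024)
                if not chunk:
                    # Peer closed the connection (EOF).
                    break
                buf += chunk
    except OSError:
        # Refused, reset, or timed out: treat as "no greeting".
        return None
    line = buf.split(b"\n", 1)[0].decode("ascii", "replace").strip()
    return line or None


def probe_concurrently(host, port, count):
    """Open `count` connections at once; return one greeting (or None) each."""
    results = [None] * count

    def worker(index):
        results[index] = probe_banner(host, port)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling probe_concurrently with the submit machine's host name, port 2811 and a count of 20 or so, then counting the None entries, would show whether the EOF can be reproduced without going through the job submission at all.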
In order to debug this, we have tried making the gridftp connection
limit much higher in the /etc/inetd.d/gridftp script but that didn't
seem to help. We have a port range of 200, which I think should be
enough to handle 10 or so concurrent jobs with one stagein and 2 stageout
elements per job. We also experimented with that port range, but with
no success.
Has anyone experienced this before?
Maybe there is some other configuration that I can change that might fix
this issue?
Any help or feedback about this is much appreciated.
Best regards,
Andre
--
Andre Charbonneau
Research Computing Support, IMSB
National Research Council Canada
100 Sussex Drive, Rm 2025
Ottawa, ON, Canada K1A 0R6
613 993-3129