Hi Adrian,

After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

This is especially easy to reproduce by running 16 processes on a single machine with only the tcp and self BTLs, using the 'hello_c' program from the examples directory.

Tim


Adrian Knoth wrote:
Hi!

As of r18169, I've changed the acceptance rules for incoming BTL-TCP
connections.

The old code would have denied a connection in case of non-matching
addresses (comparison between source address and expected source
address).

Unfortunately, you cannot always say which source address an incoming
packet will have (it's the sender's kernel that decides), so rejecting a
connection due to a "wrong" source address caused a complete hang.

I had several cases, mostly multi-cluster setups, where this happened
all the time. (Typical scenario: you're expecting the headnode's
internal address, but since you're talking to another cluster,
the kernel uses the headnode's external address.)
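
To illustrate the difference between the two rules, here is a minimal sketch in C. It is not the actual BTL-TCP code: the names make_addr, accept_strict and accept_relaxed are hypothetical, and it assumes the peer can be identified by some other means than its source address (the peer_known flag stands in for that).

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: parse a dotted-quad IPv4 address. */
static struct in_addr make_addr(const char *text)
{
    struct in_addr a;
    memset(&a, 0, sizeof(a));
    int rc = inet_pton(AF_INET, text, &a);
    assert(rc == 1);   /* the text must be a valid IPv4 address */
    (void)rc;
    return a;
}

/* Sketch of the old rule: deny the connection unless the source
 * address equals the address we expected the peer to use. */
static bool accept_strict(struct in_addr source, struct in_addr expected)
{
    return source.s_addr == expected.s_addr;
}

/* Sketch of the relaxed rule: the source address alone is no reason
 * to deny. Assumption: the peer is identified some other way, which
 * is represented here by the peer_known flag. */
static bool accept_relaxed(bool peer_known,
                           struct in_addr source, struct in_addr expected)
{
    (void)source;
    (void)expected;
    return peer_known;
}
```

In the headnode scenario, accept_strict would deny a connection arriving from the external address when the internal one was expected, while accept_relaxed would let it through as long as the peer is otherwise known.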

Though I've tested it as much as possible, I don't know if it breaks
your setup, especially the multi-rail stuff. George?


Cheerio

