Hi Jeff,
Why do we want to set this value so low? Well, just to see if it crashes
:-)
More seriously, we're working on lowering the memory usage of the openib
BTL, which is lowest with only 1 send queue element (at very large scale,
send queues dominate memory usage).
This "extreme" configuration used to work with the 1.3/1.4 branches but
failed on 1.5.
Note that since recent IB cards have very high issue rates, I don't know if
we often wait for the send queue to be empty. More importantly, it often
prevents the remote receive queue from being filled too quickly (which
avoids RNR NAKs, threads refilling the SRQ, ...). We didn't notice any
major performance drop with this configuration.
Sylvain
On Tue, 22 Jun 2010, Jeff Squyres wrote:
I think your fix looks right.
But I'm getting my head warped trying to understand why you'd want
numbers so low (4, 2, 1) and exactly what our algorithm will re-post for
numbers that low, etc. Why do you want them so low?
On Jun 18, 2010, at 11:10 AM, nadia.derbey wrote:
Hi,
Reference is the v1.5 branch
If an SRQ has the following settings: S,<size>,4,2,1
1) setup_qps() sets the following:
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_num=4
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init=rd_num/4=1
2) create_srq() sets the following:
openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1 (rd_init value)
openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - (rd_curr_num >> 2)
= rd_curr_num = 1
3) if mca_btl_openib_post_srr() is called with rd_posted=1:
rd_posted > rd_low_local is false
num_post=rd_curr_num-rd_posted=0
the loop is not executed
wr is never initialized (remains NULL)
wr->next: address not mapped
==> SIGSEGV
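
The arithmetic in steps 1) to 3) can be checked in isolation. The sketch below is not Open MPI code: struct srq_state, srq_setup() and srq_num_post() are hypothetical names used only for illustration, and it assumes the low-watermark expression is rd_curr_num - (rd_curr_num >> 2). A return value of -1 stands for the early return taken when rd_posted > rd_low_local.

```c
#include <assert.h>

/* Hypothetical stand-in for the SRQ fields named in the report. */
struct srq_state {
    int rd_num;        /* receive buffers requested (3rd field of S,<size>,4,2,1) */
    int rd_init;       /* initial number of posted receives */
    int rd_curr_num;   /* current target number of posted receives */
    int rd_low_local;  /* low watermark that triggers a re-post */
};

/* Mirrors steps 1) and 2): rd_init = rd_num / 4, rd_curr_num = rd_init,
 * rd_low_local = rd_curr_num - (rd_curr_num >> 2) (assumed expression). */
static struct srq_state srq_setup(int rd_num)
{
    struct srq_state s;

    assert(rd_num > 0);
    s.rd_num = rd_num;
    s.rd_init = rd_num / 4;
    s.rd_curr_num = s.rd_init;
    s.rd_low_local = s.rd_curr_num - (s.rd_curr_num >> 2);
    return s;
}

/* Mirrors step 3): returns the number of receives that would be posted,
 * or -1 when rd_posted is above the low watermark (nothing to do). */
static int srq_num_post(const struct srq_state *s, int rd_posted)
{
    if (rd_posted > s->rd_low_local) {
        return -1;
    }
    return s->rd_curr_num - rd_posted;
}
```

With rd_num=4 and rd_posted=1, the watermark test does not return early, yet the count to post is 0, which is the case that leaves wr uninitialized.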
The attached patch solves the problem by ensuring that we only go on when
the loop will actually be entered, and return otherwise.
Could someone please have a look? The patch solves the problem with my
reproducer, but I'm not sure the fix covers all situations.
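
For readers without the attachment, here is a minimal sketch of that kind of guard. It is not the attached patch: struct recv_wr and chain_recv_wrs() are hypothetical, and the pool array stands in for the fragments taken from the free list. The only point is that the tail pointer is dereferenced solely when at least one element was chained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced work request: only the link used for chaining. */
struct recv_wr {
    struct recv_wr *next;
};

/* Chains the first num_post entries of pool into a NULL-terminated list.
 * Returns the list head, or NULL when there is nothing to post. */
static struct recv_wr *chain_recv_wrs(struct recv_wr *pool, int num_post)
{
    struct recv_wr *wr_list = NULL;
    struct recv_wr *wr = NULL;
    int i;

    assert(pool != NULL);

    /* Guard: with num_post <= 0 the loop below would never run and
     * wr would still be NULL at the final wr->next assignment. */
    if (num_post <= 0) {
        return NULL;
    }

    for (i = 0; i < num_post; i++) {
        if (NULL == wr) {
            wr_list = &pool[i];   /* first element becomes the head */
        } else {
            wr->next = &pool[i];  /* append to the current tail */
        }
        wr = &pool[i];
    }

    wr->next = NULL;              /* safe: wr points at the last element */
    return wr_list;
}
```

A caller would treat a NULL result as "nothing to post" and skip the post call entirely.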
Regards,
Nadia
<001_openib_low_rd_num.patch>
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/