[EMAIL PROTECTED] wrote on Mon, 15 Oct 2007 11:31 -0500:
> I am trying to run PVFS over IB on the lonestar cluster at TACC with
> BTIO: http://www.tacc.utexas.edu/services/userguides/lonestar/
>
> On the front end everything works perfectly. However, when launching PVFS2
> and the applications on the cluster, they fail.
>
> [D 10:35:59.457502] PVFS2 Server version 2.6.3 starting.
> [E 10:35:59.476341] Error: openib_ib_initialize: ibv_create_cq failed.
> ....
>
> [E 10:35:59.548287] [bt] ./bt.B.16.mpi_io_full(error+0xf4) [0x53355c]
> [E 10:35:59.548589] [bt] ./bt.B.16.mpi_io_full(openib_ib_initialize+0x4c3) [0x5365a0]
>
> Did anyone see this problem before?

Haven't seen exactly this, but my guess is that we're asking for
too many CQE slots. Try changing the value in this line
(pvfs2/src/io/bmi/bmi_ib/openib.c:85):

    static const unsigned int IBV_NUM_CQ_ENTRIES = 1024;

to 100. More is better, so fish around for the largest value that
works. You can also debug the client to see how many it is
asking for:

    PVFS2_DEBUGMASK=network ./bt.B.16

I'd like to see what these lines print out:
    debug(1, "%s: max %d completion queue entries", __func__, hca_cap.max_cq);
    cqe_num = IBV_NUM_CQ_ENTRIES;
    od->nic_max_sge = hca_cap.max_sge;
    od->nic_max_wr = hca_cap.max_qp_wr;
    if (hca_cap.max_cq < cqe_num) {
        cqe_num = hca_cap.max_cq;
        warning("%s: hardly enough completion queue entries %d, hoping for %d",
                __func__, hca_cap.max_cq, cqe_num);
    }

There is code there to ask the NIC how many CQEs it can support,
and it is careful not to ask for more than the reported limit.
However, the OpenFabrics API has a long-standing problem where
the reported limits cannot always be used as reported.
Would be interesting to know the details of your NIC. We might want
to add some work-arounds for it.
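
One possible shape for such a work-around, as a rough sketch only and not
what openib.c does today: start from the smaller of the wanted and reported
CQE counts, and halve the request until creation succeeds. The names
find_working_cqe, cq_try_fn and fake_nic_try are made up for illustration.
In real code the callback would wrap ibv_create_cq() and check for a NULL
return; here a fake NIC stands in so the logic can be tried without hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if a CQ with `cqe` entries could
 * be created.  A real version would call ibv_create_cq() and test the
 * result against NULL. */
typedef int (*cq_try_fn)(int cqe, void *arg);

/* Start at min(wanted, reported_max) and halve on each failure.
 * Gives up once the request drops below min_cqe.
 * Returns the CQE count that worked, or 0 if none did. */
static int find_working_cqe(int wanted, int reported_max, int min_cqe,
                            cq_try_fn try_create, void *arg)
{
    int cqe;

    assert(try_create != NULL);
    assert(min_cqe > 0);

    cqe = wanted;
    if (reported_max > 0 && reported_max < cqe)
        cqe = reported_max;   /* never ask for more than the NIC reports */

    while (cqe >= min_cqe) {
        if (try_create(cqe, arg))
            return cqe;
        cqe /= 2;             /* reported limit was not usable, back off */
    }
    return 0;
}

/* Stand-in for a NIC whose usable limit is lower than what it reports.
 * `arg` points to an int holding the real limit. */
static int fake_nic_try(int cqe, void *arg)
{
    int real_limit = *(const int *)arg;
    return cqe <= real_limit;
}
```

Calling find_working_cqe(1024, 65536, 16, fake_nic_try, &limit) with
limit set to 300 tries 1024, then 512, then settles on 256. Halving is
just the simplest search; it can undershoot the largest usable value.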
-- Pete
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users