Hi Pete,

I did the tracing you suggested, this time with 1 client and 1 PVFS2
server. Apparently there are enough completion queue entries; memory
registration seems to be the problem (although, as I said, it runs
fine on the front end):
[D 10:04:01.500768] PVFS2 Server version 2.6.3 starting.
[D 10:04:01.778135] BMI_ib_initialize: init.
[D 10:04:01.778252] openib_ib_initialize: init.
[D 10:04:01.779038] openib_ib_initialize: max 65408 completion queue entries.
[D 10:04:01.779380] BMI_ib_initialize: done.
[E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
[E 10:04:01.781763] [bt] ./bt.A.1.mpi_io_full(error+0xf4) [0x533738]
[E 10:04:01.781771] [bt] ./bt.A.1.mpi_io_full [0x53614a]
[E 10:04:01.781776] [bt] ./bt.A.1.mpi_io_full [0x534214]
[E 10:04:01.781780] [bt] ./bt.A.1.mpi_io_full [0x533166]
[E 10:04:01.781784] [bt] ./bt.A.1.mpi_io_full [0x50a644]
[E 10:04:01.781788] [bt] ./bt.A.1.mpi_io_full [0x504ac1]
[E 10:04:01.781792] [bt] ./bt.A.1.mpi_io_full [0x4ce576]
[E 10:04:01.781795] [bt] ./bt.A.1.mpi_io_full [0x4ce277]
[E 10:04:01.781799] [bt] ./bt.A.1.mpi_io_full [0x4ed598]
[E 10:04:01.781803] [bt] ./bt.A.1.mpi_io_full [0x4ed5d1]
[E 10:04:01.781807] [bt] ./bt.A.1.mpi_io_full [0x4ff1b5]
[D 10/19 10:04] PVFS2 Server: storage space created. Exiting.
[D 10:04:01.896168] PVFS2 Server version 2.6.3 starting.

Any suggestions?

Florin

On 10/16/07, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on Mon, 15 Oct 2007 11:31 -0500:
> > I am trying to run PVFS over IB on the lonestar cluster at TACC with
> > BTIO: http://www.tacc.utexas.edu/services/userguides/lonestar/
> >
> > On the front end everything works perfectly. However, when launching
> > the PVFS2 servers and the applications on the cluster, they fail.
> >
> > [D 10:35:59.457502] PVFS2 Server version 2.6.3 starting.
> > [E 10:35:59.476341] Error: openib_ib_initialize: ibv_create_cq failed.
> > ....
> >
> > [E 10:35:59.548287] [bt] ./bt.B.16.mpi_io_full(error+0xf4) [0x53355c]
> > [E 10:35:59.548589] [bt] ./bt.B.16.mpi_io_full(openib_ib_initialize+0x4c3) [0x5365a0]
> >
> > Has anyone seen this problem before?
>
> I haven't seen exactly this, but I'll guess that we're asking for too
> many CQE slots. Try changing the value in this line
> (pvfs2/src/io/bmi/bmi_ib/openib.c:85):
>
>     static const unsigned int IBV_NUM_CQ_ENTRIES = 1024;
>
> to 100. More is better, so you can fish around for the largest value
> that works. You can also debug the client to see how many it is
> asking for:
>
>     PVFS2_DEBUGMASK=network ./bt.B.16
>
> I'd like to see what these lines print out:
>
>     debug(1, "%s: max %d completion queue entries", __func__, hca_cap.max_cq);
>     cqe_num = IBV_NUM_CQ_ENTRIES;
>     od->nic_max_sge = hca_cap.max_sge;
>     od->nic_max_wr = hca_cap.max_qp_wr;
>
>     if (hca_cap.max_cq < cqe_num) {
>         cqe_num = hca_cap.max_cq;
>         warning("%s: hardly enough completion queue entries %d, hoping for %d",
>                 __func__, hca_cap.max_cq, cqe_num);
>     }
>
> There is code there to ask the NIC how many CQEs it can support, and
> it is careful not to ask for too many, given the reported limit.
> However, the OpenFabrics API has a long-standing problem where the
> reported limits cannot always be used as reported.
>
> It would be interesting to know the details of your NIC. We might
> want to add some work-arounds for it.
>
> -- Pete
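P.S. One more data point that might help narrow this down: a
standalone check that plain ibv_reg_mr() works on a compute node,
outside of PVFS2. The sketch below is only illustrative (it assumes
the first device returned by ibv_get_device_list() is the HCA in
question, and the 4 MB buffer size is arbitrary). It also prints
RLIMIT_MEMLOCK first, since registered memory is pinned and a low
memlock limit is a common reason for ibv_reg_mr to fail. Build with
something like "gcc regtest.c -o regtest -libverbs":

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct rlimit rl;
        struct ibv_device **devs;
        struct ibv_context *ctx;
        struct ibv_pd *pd;
        struct ibv_mr *mr;
        size_t len = 4 * 1024 * 1024;   /* arbitrary 4 MB test buffer */
        void *buf;

        /* Registered memory is pinned; a low memlock limit on the
           compute nodes (but not the front end) could explain the
           difference in behavior. */
        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
            printf("RLIMIT_MEMLOCK: cur %lu max %lu\n",
                   (unsigned long) rl.rlim_cur,
                   (unsigned long) rl.rlim_max);

        devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) {
            fprintf(stderr, "no IB devices found\n");
            return 1;
        }
        ctx = ibv_open_device(devs[0]);   /* assumes first device is the HCA */
        if (!ctx) {
            fprintf(stderr, "ibv_open_device failed\n");
            return 1;
        }
        pd = ibv_alloc_pd(ctx);
        buf = malloc(len);
        if (!pd || !buf) {
            fprintf(stderr, "ibv_alloc_pd or malloc failed\n");
            return 1;
        }
        mr = ibv_reg_mr(pd, buf, len,
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE
                        | IBV_ACCESS_REMOTE_READ);
        if (!mr) {
            perror("ibv_reg_mr");
            return 1;
        }
        printf("ibv_reg_mr ok, lkey 0x%x\n", mr->lkey);
        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

Comparing its output on the front end and on a compute node under the
batch system should show whether pinned-memory limits are the
difference.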

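And on the work-around side: since the reported limits cannot always
be used as reported, one possibility (a sketch only;
create_cq_with_fallback is a hypothetical helper, not code from
openib.c) is to start at the advertised max_cq and halve the request
until ibv_create_cq() succeeds:

    #include <infiniband/verbs.h>

    /* Hypothetical helper, not from openib.c: retry CQ creation with
       progressively smaller sizes, since some NICs cannot actually
       allocate the max_cq they advertise. */
    static struct ibv_cq *create_cq_with_fallback(struct ibv_context *ctx,
                                                  int want_cqe, int *got_cqe)
    {
        struct ibv_cq *cq = NULL;
        int n;

        for (n = want_cqe; n > 0; n /= 2) {
            cq = ibv_create_cq(ctx, n, NULL, NULL, 0);
            if (cq)
                break;
        }
        if (got_cqe)
            *got_cqe = cq ? n : 0;
        return cq;
    }

openib_ib_initialize could then warn with the size it actually got
rather than failing outright.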