Hi Pete,

I did the tracing you suggested, this time with 1 client and 1 PVFS2
server. Apparently the completion queue has enough entries; the memory
registration seems to be the problem (although, as I said, everything
runs fine on the front end):

[D 10:04:01.500768] PVFS2 Server version 2.6.3 starting.
[D 10:04:01.778135] BMI_ib_initialize: init.
[D 10:04:01.778252] openib_ib_initialize: init.
[D 10:04:01.779038] openib_ib_initialize: max 65408 completion queue entries.
[D 10:04:01.779380] BMI_ib_initialize: done.
[E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
[E 10:04:01.781763]     [bt] ./bt.A.1.mpi_io_full(error+0xf4) [0x533738]
[E 10:04:01.781771]     [bt] ./bt.A.1.mpi_io_full [0x53614a]
[E 10:04:01.781776]     [bt] ./bt.A.1.mpi_io_full [0x534214]
[E 10:04:01.781780]     [bt] ./bt.A.1.mpi_io_full [0x533166]
[E 10:04:01.781784]     [bt] ./bt.A.1.mpi_io_full [0x50a644]
[E 10:04:01.781788]     [bt] ./bt.A.1.mpi_io_full [0x504ac1]
[E 10:04:01.781792]     [bt] ./bt.A.1.mpi_io_full [0x4ce576]
[E 10:04:01.781795]     [bt] ./bt.A.1.mpi_io_full [0x4ce277]
[E 10:04:01.781799]     [bt] ./bt.A.1.mpi_io_full [0x4ed598]
[E 10:04:01.781803]     [bt] ./bt.A.1.mpi_io_full [0x4ed5d1]
[E 10:04:01.781807]     [bt] ./bt.A.1.mpi_io_full [0x4ff1b5]
[D 10/19 10:04] PVFS2 Server: storage space created. Exiting.
[D 10:04:01.896168] PVFS2 Server version 2.6.3 starting.
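
In case it is useful, here is a minimal standalone program I can try on
a compute node to check whether ibv_reg_mr works there at all, outside
PVFS2. This is just a sketch; the first HCA, the 4 MB region size, and
the IBV_ACCESS_LOCAL_WRITE flag are my assumptions:

    /* reg_mr_probe.c: check that ibv_reg_mr works on a compute node,
     * outside PVFS2.  Sketch: first HCA, 4 MB region, local-write
     * access are assumptions, not what PVFS2 necessarily uses.
     * build: gcc reg_mr_probe.c -libverbs -o reg_mr_probe */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **list = ibv_get_device_list(NULL);
        struct ibv_context *ctx;
        struct ibv_pd *pd;
        struct ibv_mr *mr;
        size_t len = 4 * 1024 * 1024;
        void *buf;

        if (!list || !list[0]) {
            fprintf(stderr, "no IB devices found\n");
            return 1;
        }
        ctx = ibv_open_device(list[0]);
        pd = ctx ? ibv_alloc_pd(ctx) : NULL;
        buf = malloc(len);
        if (!pd || !buf) {
            fprintf(stderr, "device/pd/malloc setup failed\n");
            return 1;
        }
        /* this is the call that fails inside openib_mem_register */
        mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr) {
            perror("ibv_reg_mr");
            return 1;
        }
        printf("registered %zu bytes ok\n", len);
        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(list);
        return 0;
    }

If this fails too, it is probably not PVFS2 itself; my first guess would
be the locked-memory limit (ulimit -l) on the compute nodes, since
registering memory pins the pages.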

Any suggestions?
Florin


On 10/16/07, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on Mon, 15 Oct 2007 11:31 -0500:
> > I am trying to run PVFS over IB on the lonestar cluster at TACC with
> > BTIO: http://www.tacc.utexas.edu/services/userguides/lonestar/
> >
> > On the front end everything works perfectly. However, when I launch
> > the PVFS2 servers and the applications on the cluster, they fail.
> >
> > [D 10:35:59.457502] PVFS2 Server version 2.6.3 starting.
> > [E 10:35:59.476341] Error: openib_ib_initialize: ibv_create_cq failed.
> > ....
> >
> > [E 10:35:59.548287]     [bt] ./bt.B.16.mpi_io_full(error+0xf4) [0x53355c]
> > [E 10:35:59.548589]     [bt]
> > ./bt.B.16.mpi_io_full(openib_ib_initialize+0x4c3) [0x5365a0]
> >
> > Did anyone see this problem before?
>
> Haven't seen exactly this, but I'll guess that we're asking for
> too many CQE slots.  Try changing the value in this line
> (pvfs2/src/io/bmi/bmi_ib/openib.c:85):
>
>     static const unsigned int IBV_NUM_CQ_ENTRIES = 1024;
>
> to 100.  More is better, so you can fish around for the largest
> value that works.  You can also run the client with network
> debugging to see how many entries it is asking for:
>
>     PVFS2_DEBUGMASK=network ./bt.B.16
>
> I'd like to see what these lines print out:
>
>     debug(1, "%s: max %d completion queue entries", __func__, hca_cap.max_cq);
>     cqe_num = IBV_NUM_CQ_ENTRIES;
>     od->nic_max_sge = hca_cap.max_sge;
>     od->nic_max_wr = hca_cap.max_qp_wr;
>
>     if (hca_cap.max_cq < cqe_num) {
>         cqe_num = hca_cap.max_cq;
>         warning("%s: hardly enough completion queue entries %d, hoping for %d",
>                 __func__, hca_cap.max_cq, cqe_num);
>     }
>
> There is code there to ask the NIC how many CQEs it can support,
> and it is careful not to ask for more than the reported limit.
> However, the OpenFabrics API has a long-standing problem where the
> reported limits cannot always be used as reported.
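>
> If it helps, a tiny standalone probe like the one below (a sketch,
> not part of PVFS2) will print the raw limits the card reports.  Note
> that in struct ibv_device_attr, max_cq is the number of CQs while
> max_cqe is the number of entries per CQ, which are easy to conflate:
>
>     /* limits_probe.c: print reported device limits (first HCA assumed)
>      * build: gcc limits_probe.c -libverbs -o limits_probe */
>     #include <stdio.h>
>     #include <infiniband/verbs.h>
>
>     int main(void)
>     {
>         struct ibv_device **list = ibv_get_device_list(NULL);
>         struct ibv_context *ctx =
>             (list && list[0]) ? ibv_open_device(list[0]) : NULL;
>         struct ibv_device_attr attr;
>
>         if (!ctx || ibv_query_device(ctx, &attr)) {
>             fprintf(stderr, "could not open or query device\n");
>             return 1;
>         }
>         printf("max_cq  (number of CQs)  = %d\n", attr.max_cq);
>         printf("max_cqe (entries per CQ) = %d\n", attr.max_cqe);
>         printf("max_mr  (registrations)  = %d\n", attr.max_mr);
>         printf("max_mr_size              = %llu\n",
>                (unsigned long long) attr.max_mr_size);
>         return 0;
>     }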
>
> It would be interesting to know the details of your NIC.  We might
> want to add some workarounds for it.
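>
> Assuming the OFED userspace tools are installed on the nodes, this
> should print everything we need about the card:
>
>     ibv_devinfo -v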
>
>                 -- Pete
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users