Hi,

I am coming back to a problem I still have with PVFS 2.6.3 over IB.

I run it on Lonestar, a 64-bit dual-core Intel Xeon cluster at TACC:
http://www.tacc.utexas.edu/services/userguides/lonestar/

As a reminder: PVFS over IB works on the front end, but fails when I try
to start it on the compute nodes.

As Pete suggested, I set the debug level to network.

I found that on each run one of two types of errors shows up:

1) This is from the previous message I sent to the list:
> > [E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.

2) This one I just got (the full message is at the end of this mail):
[E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.

As Pete suggested, I looked in /etc/security/limits.conf: soft and hard
memlock are both set to unlimited.
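
Since the limits.conf setting may not carry over into a batch job, I could
try a tiny standalone test on a compute node that does the same two verbs
calls the errors point at (ibv_create_cq and the memory registration via
ibv_reg_mr) and also prints the locked-memory limit the process actually
sees. This is just an untested sketch against libibverbs; the file name,
the CQ depth of 1024 and the 1 MB buffer are arbitrary:

/* cq_mr_test.c -- untested sketch, build with: gcc cq_mr_test.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* What RLIMIT_MEMLOCK does the process really get inside the job? */
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("RLIMIT_MEMLOCK: soft %llu hard %llu\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);

    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }

    /* Same kind of call as openib_ib_initialize; 1024 entries is arbitrary. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 1024, NULL, NULL, 0);
    if (!cq)
        fprintf(stderr, "ibv_create_cq failed: %s\n", strerror(errno));
    else
        printf("ibv_create_cq ok\n");

    /* Same kind of call as the BMI memory registration. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    void *buf = malloc(1 << 20);
    struct ibv_mr *mr = (pd && buf) ?
        ibv_reg_mr(pd, buf, 1 << 20,
                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                   IBV_ACCESS_REMOTE_READ) : NULL;
    if (!mr)
        fprintf(stderr, "ibv_reg_mr failed: %s\n", strerror(errno));
    else
        printf("ibv_reg_mr ok\n");

    return 0;
}

If ibv_create_cq or ibv_reg_mr also fails there, at least that would show
whether the limit inside the batch job is the problem rather than anything
in PVFS itself.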

I do not have control over the nodes and cannot install anything; I am
just a user :)

Pete, how can I find out what type of InfiniBand fabric is installed?
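
(In case it is easier for me to check myself: I could run something like
the little sketch below on a node; it just walks ibv_get_device_list() and
prints the device name plus the vendor/part IDs and limits reported by
ibv_query_device(). Untested here, and if ibv_devinfo happens to be
installed on the nodes it probably prints the same information.)

/* whatib.c -- untested sketch, build with: gcc whatib.c -libverbs
 * Print name, vendor/part IDs and a few limits of each IB device,
 * to figure out what kind of HCA the compute nodes actually have. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0, i;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs)
        return 1;

    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr attr;

        if (!ctx)
            continue;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%s: vendor 0x%x part %u fw %s max_cqe %d max_mr %d\n",
                   ibv_get_device_name(devs[i]),
                   attr.vendor_id, attr.vendor_part_id, attr.fw_ver,
                   attr.max_cqe, attr.max_mr);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}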

The configuration file /etc/infiniband/openib.conf:

# Start HCA driver upon boot
ONBOOT=yes
# Load UCM module
UCM_LOAD=no
# Load RDMA_CM module
RDMA_CM_LOAD=yes
# Load RDMA_UCM module
RDMA_UCM_LOAD=yes
# Increase ib_mad thread priority
RENICE_IB_MAD=no
# Load MTHCA
MTHCA_LOAD=yes
# Load IPATH
IPATH_LOAD=yes
# Load IPoIB
IPOIB_LOAD=yes


Here is the full error message:

[D 12:05:07.675267] BMI_ib_initialize: init.
[D 12:05:07.675423] openib_ib_initialize: init.
[D 12:05:07.676266] openib_ib_initialize: max 65408 completion queue entries.
[E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.
[E 12:05:07.712529]     [bt] ./bt.S.1.mpi_io_full(error+0xf4) [0x598700]
[E 12:05:07.712545]     [bt] ./bt.S.1.mpi_io_full(openib_ib_initialize+0x4c3) [0x59b744]
[E 12:05:07.712550]     [bt] ./bt.S.1.mpi_io_full [0x5982eb]
[E 12:05:07.712555]     [bt] ./bt.S.1.mpi_io_full [0x570e86]
[E 12:05:07.712558]     [bt] ./bt.S.1.mpi_io_full [0x570122]
[E 12:05:07.712562]     [bt] ./bt.S.1.mpi_io_full [0x55233c]
[E 12:05:07.712566]     [bt] ./bt.S.1.mpi_io_full [0x552599]
[E 12:05:07.712570]     [bt] ./bt.S.1.mpi_io_full [0x56417d]
[E 12:05:07.712574]     [bt] ./bt.S.1.mpi_io_full [0x4fdef4]
[E 12:05:07.712577]     [bt] ./bt.S.1.mpi_io_full [0x4fdcd2]
[E 12:05:07.712581]     [bt] ./bt.S.1.mpi_io_full [0x4a5a73]

Thanks
Florin

On Oct 20, 2007 8:17 AM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on Fri, 19 Oct 2007 10:11 -0500:
> > I did the tracing that you are suggesting, this time with 1 client and
> > 1 PVFS2 server. Apparently the queue has enough completion queue
> > entries. The memory registration seems to be the problem (however as I
> > said, on the front-end runs):
> >
> > [D 10:04:01.500768] PVFS2 Server version 2.6.3 starting.
> > [D 10:04:01.778135] BMI_ib_initialize: init.
> > [D 10:04:01.778252] openib_ib_initialize: init.
> > [D 10:04:01.779038] openib_ib_initialize: max 65408 completion queue entries.
> > [D 10:04:01.779380] BMI_ib_initialize: done.
> > [E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
> > [E 10:04:01.781763]     [bt] ./bt.A.1.mpi_io_full(error+0xf4) [0x533738]
> > [E 10:04:01.781771]     [bt] ./bt.A.1.mpi_io_full [0x53614a]
> > [E 10:04:01.781776]     [bt] ./bt.A.1.mpi_io_full [0x534214]
> > [E 10:04:01.781780]     [bt] ./bt.A.1.mpi_io_full [0x533166]
> > [E 10:04:01.781784]     [bt] ./bt.A.1.mpi_io_full [0x50a644]
> > [E 10:04:01.781788]     [bt] ./bt.A.1.mpi_io_full [0x504ac1]
> > [E 10:04:01.781792]     [bt] ./bt.A.1.mpi_io_full [0x4ce576]
> > [E 10:04:01.781795]     [bt] ./bt.A.1.mpi_io_full [0x4ce277]
> > [E 10:04:01.781799]     [bt] ./bt.A.1.mpi_io_full [0x4ed598]
> > [E 10:04:01.781803]     [bt] ./bt.A.1.mpi_io_full [0x4ed5d1]
> > [E 10:04:01.781807]     [bt] ./bt.A.1.mpi_io_full [0x4ff1b5]
> > [D 10/19 10:04] PVFS2 Server: storage space created. Exiting.
> > [D 10:04:01.896168] PVFS2 Server version 2.6.3 starting.
>
> Then the CQ allocation fail did not happen this time around?  How
> did that get fixed?  65408 seems way too big.  I still wonder what
> type of silicon you have.
>
> This MR issue might be due to process locked memory limits.  Look
> around in the IB world for "ulimit -l" or /etc/security/limits.conf
> and set it to lots, or unlimited.
>
>                 -- Pete
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
