Hi, I am coming back to a problem I still have with PVFS 2.6.3 over IB.
I run it on Lonestar, the dual-core 64-bit Intel Xeon cluster at TACC: http://www.tacc.utexas.edu/services/userguides/lonestar/

As a reminder: PVFS over IB works on the front end, but fails when I try to start it on the compute nodes. As Pete suggested, I set the debug level to "network" and found that each run produces one of two types of errors:

1) this is from the previous message I sent to the list:

[E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.

2) this I just got (the full message is at the end of this mail):

[E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.

As Pete suggested, I looked in /etc/security/limits.conf: soft and hard memlock are both set to unlimited. I do not have control over the nodes and cannot install anything; I am just a user :)

Pete, how can I find out what type of InfiniBand fabric is installed? The configuration file /etc/infiniband/openib.conf contains:

# Start HCA driver upon boot
ONBOOT=yes
# Load UCM module
UCM_LOAD=no
# Load RDMA_CM module
RDMA_CM_LOAD=yes
# Load RDMA_UCM module
RDMA_UCM_LOAD=yes
# Increase ib_mad thread priority
RENICE_IB_MAD=no
# Load MTHCA
MTHCA_LOAD=yes
# Load IPATH
IPATH_LOAD=yes
# Load IPoIB
IPOIB_LOAD=yes

Here is the full error message:

[D 12:05:07.675267] BMI_ib_initialize: init.
[D 12:05:07.675423] openib_ib_initialize: init.
[D 12:05:07.676266] openib_ib_initialize: max 65408 completion queue entries.
[E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.
[E 12:05:07.712529] [bt] ./bt.S.1.mpi_io_full(error+0xf4) [0x598700]
[E 12:05:07.712545] [bt] ./bt.S.1.mpi_io_full(openib_ib_initialize+0x4c3) [0x59b744]
[E 12:05:07.712550] [bt] ./bt.S.1.mpi_io_full [0x5982eb]
[E 12:05:07.712555] [bt] ./bt.S.1.mpi_io_full [0x570e86]
[E 12:05:07.712558] [bt] ./bt.S.1.mpi_io_full [0x570122]
[E 12:05:07.712562] [bt] ./bt.S.1.mpi_io_full [0x55233c]
[E 12:05:07.712566] [bt] ./bt.S.1.mpi_io_full [0x552599]
[E 12:05:07.712570] [bt] ./bt.S.1.mpi_io_full [0x56417d]
[E 12:05:07.712574] [bt] ./bt.S.1.mpi_io_full [0x4fdef4]
[E 12:05:07.712577] [bt] ./bt.S.1.mpi_io_full [0x4fdcd2]
[E 12:05:07.712581] [bt] ./bt.S.1.mpi_io_full [0x4a5a73]

Thanks,
Florin

On Oct 20, 2007 8:17 AM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on Fri, 19 Oct 2007 10:11 -0500:
> > I did the tracing that you suggested, this time with 1 client and
> > 1 PVFS2 server. Apparently the queue has enough completion queue
> > entries. The memory registration seems to be the problem (however,
> > as I said, it runs on the front end):
> >
> > [D 10:04:01.500768] PVFS2 Server version 2.6.3 starting.
> > [D 10:04:01.778135] BMI_ib_initialize: init.
> > [D 10:04:01.778252] openib_ib_initialize: init.
> > [D 10:04:01.779038] openib_ib_initialize: max 65408 completion queue entries.
> > [D 10:04:01.779380] BMI_ib_initialize: done.
> > [E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
> > [E 10:04:01.781763] [bt] ./bt.A.1.mpi_io_full(error+0xf4) [0x533738]
> > [E 10:04:01.781771] [bt] ./bt.A.1.mpi_io_full [0x53614a]
> > [E 10:04:01.781776] [bt] ./bt.A.1.mpi_io_full [0x534214]
> > [E 10:04:01.781780] [bt] ./bt.A.1.mpi_io_full [0x533166]
> > [E 10:04:01.781784] [bt] ./bt.A.1.mpi_io_full [0x50a644]
> > [E 10:04:01.781788] [bt] ./bt.A.1.mpi_io_full [0x504ac1]
> > [E 10:04:01.781792] [bt] ./bt.A.1.mpi_io_full [0x4ce576]
> > [E 10:04:01.781795] [bt] ./bt.A.1.mpi_io_full [0x4ce277]
> > [E 10:04:01.781799] [bt] ./bt.A.1.mpi_io_full [0x4ed598]
> > [E 10:04:01.781803] [bt] ./bt.A.1.mpi_io_full [0x4ed5d1]
> > [E 10:04:01.781807] [bt] ./bt.A.1.mpi_io_full [0x4ff1b5]
> > [D 10/19 10:04] PVFS2 Server: storage space created. Exiting.
> > [D 10:04:01.896168] PVFS2 Server version 2.6.3 starting.
>
> Then the CQ allocation failure did not happen this time around? How
> did that get fixed? 65408 seems way too big. I still wonder what
> type of silicon you have.
>
> This MR issue might be due to process locked-memory limits. Look
> around in the IB world for "ulimit -l" or /etc/security/limits.conf
> and set it to lots, or unlimited.
>
> 		-- Pete

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
