Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Mon, 10 Mar 2008 13:51 -0500:
I am trying to hack together a test case to implement what we had
talked about in the previous emails with a wr_credit...
I'm trying to keep track of it in the openib_device (od) structure
inside openib.c and would like to keep the necessary changes inside
openib.c if at all possible. The problem I'm running into is that
I'm going to need to call check_cq() from inside the send_rdma write
function, which lives in openib.c, not ib.c. openib.c has a
function for this, but it's really intended to work *with* ib.c's
check_cq() functionality...
To get around this I needed to make ib_check_cq() visible to
openib.c (got rid of the static and added a declaration to ib.h),
but now I'm getting an odd error at link time.
Any ideas how to get around this?
lib/libpvfs2-server.a(bmi-server.o):(.rodata+0x780): undefined
reference to `bmi_ib_ops'
collect2: ld returned 1 exit status
make: *** [src/server/pvfs2-server] Error 1
(I've attached a very rudimentary patch that sort of gets at what I'm
trying to do, not sure if it's correct yet, still trying to compile)
Just hack up anything you like to get it to work. If it fixes the
situation, we'll go back and clean up the code later.
It is optimistic, what you're trying to do, but I'm not sure if it
will be sufficient. If there are no credits to get back from
checking the CQ, you'll just deadlock. I'm also nervous about
locking implications, as you're checking the CQ in the thread that
is trying to do the send. Not sure if we have done this before.
A simpler way would be to just fail whatever operation got us
into this RDMA, by abandoning it, with another state that says we're
waiting on credits. An easier first step is just to add lots of
printfs to track the credits and see if you can correlate a credit
overflow with the rdma failures. If that works, a check at the top
of "post rdma" can say whether we should even bother and we won't
need your fixup step of looking at the CQ from the send.
-- Pete
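The check "at the top of post rdma" suggested above could look roughly like the following. This is only a sketch under stated assumptions: struct wr_credit and its helper functions are hypothetical names made up for illustration, not code from openib.c, and the real openib_device structure holds much more than these two fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the credit fields discussed in the thread. */
struct wr_credit {
    int outstanding;  /* work requests posted but not yet completed */
    int max_wr;       /* assumed depth of the send queue */
};

/* The check to run before posting: is there room for n more requests? */
static bool wr_credit_available(const struct wr_credit *c, int n)
{
    return c->outstanding + n <= c->max_wr;
}

/* Account for n requests that were just posted successfully. */
static void wr_credit_take(struct wr_credit *c, int n)
{
    assert(wr_credit_available(c, n));
    c->outstanding += n;
}

/* Account for n requests whose completions were reaped from the CQ. */
static void wr_credit_return(struct wr_credit *c, int n)
{
    assert(n <= c->outstanding);
    c->outstanding -= n;
}
```

If wr_credit_available() returns false, the caller would park the operation in a "waiting on credits" state instead of posting, so the send path never has to look at the CQ itself.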
I added debug code that incremented od->nic_wr_credit for every
ibv_post_send, and decremented it for every RDMA completion in
openib_poll_cq. This is what I got:
We ended up posting about 70 RDMAs with ibv_post_send, and 3 with
signals, and then ran out of resources. I suspect that if I just had a
loop that called 'ib_block_for_activity', and then 'ib_poll_cq', and then
retried the post_send, it would work fine. I'll probably try that next.
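The retry idea described above can be sketched as a bounded loop. Everything here is an assumption for illustration: the retry_ops callbacks stand in for the post, block-for-activity, and poll steps, and toy_queue is a made-up in-memory queue, not the verbs API or anything in openib.c. The loop gives up after max_tries so that a queue which never drains returns an error instead of spinning forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for the real post/wait/poll steps. */
struct retry_ops {
    int  (*post)(void *ctx);           /* 0 on success, nonzero if queue is full */
    void (*wait_and_poll)(void *ctx);  /* block for activity, then reap completions */
};

/* Try to post; on failure wait for completions and try again.
 * Returns 0 on success, -1 if max_tries attempts all failed. */
static int post_with_retry(const struct retry_ops *ops, void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (ops->post(ctx) == 0)
            return 0;
        ops->wait_and_poll(ctx);
    }
    return -1;
}

/* Toy queue used only to exercise the loop. */
struct toy_queue {
    int in_flight;  /* requests currently occupying the queue */
    int depth;      /* maximum number of requests the queue can hold */
};

static int toy_post(void *ctx)
{
    struct toy_queue *q = ctx;
    if (q->in_flight >= q->depth)
        return -1;          /* queue full */
    q->in_flight++;
    return 0;
}

static void toy_wait_and_poll(void *ctx)
{
    struct toy_queue *q = ctx;
    if (q->in_flight > 0)
        q->in_flight--;     /* pretend one completion arrived */
}
```

With a full toy queue, the first post fails, one completion is reaped, and the second attempt succeeds. With a queue of depth 0 nothing ever frees up, and the call returns -1 after max_tries attempts.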
[D 12:32:35.013945] openib_post_sr: 10.1.4.240:46814 bh 17 len 32 wr 6/70.
[D 12:32:35.014139] BMI_post_send_list: addr: 6663808, count: 1,
total_size: 24, tag: 4294
[D 12:32:35.014203] element 0: offset: 0x6155b0, size: 24
[D 12:32:35.014225] openib_post_sr: 10.1.4.240:46814 bh 18 len 40 wr 7/70.
[D 12:32:35.014293] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f43dc000 rkey 3a0004b nic_wr_credit 19011.
[D 12:32:35.014321] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f43fc000 rkey 3a0004b nic_wr_credit 19012.
[D 12:32:35.014340] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f441c000 rkey 3a0004b nic_wr_credit 19013.
[D 12:32:35.014362] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f443c000 rkey 3a0004b nic_wr_credit 19014.
[D 12:32:35.014381] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f445c000 rkey 3a0004b nic_wr_credit 19015.
[D 12:32:35.014399] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f447c000 rkey 3a0004b nic_wr_credit 19016.
[D 12:32:35.014418] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f449c000 rkey 3a0004b nic_wr_credit 19017.
[D 12:32:35.014436] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f44bc000 rkey 3a0004b nic_wr_credit 19018.
[D 12:32:35.014455] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f44dc000 rkey 3a0004b nic_wr_credit 19019.
[D 12:32:35.014474] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f44fc000 rkey 3a0004b nic_wr_credit 19020.
[D 12:32:35.014493] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f451c000 rkey 3a0004b nic_wr_credit 19021.
[D 12:32:35.014512] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f453c000 rkey 3a0004b nic_wr_credit 19022.
[D 12:32:35.014540] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f455c000 rkey 3a0004b nic_wr_credit 19023.
[D 12:32:35.014558] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f457c000 rkey 3a0004b nic_wr_credit 19024.
[D 12:32:35.014577] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f459c000 rkey 3a0004b nic_wr_credit 19025.
[D 12:32:35.014596] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f45bc000 rkey 3a0004b nic_wr_credit 19026.
[D 12:32:35.014615] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f45dc000 rkey 3a0004b nic_wr_credit 19027.
[D 12:32:35.014631] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f45fc000 rkey 3a0004b nic_wr_credit 19028.
[D 12:32:35.014649] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f461c000 rkey 3a0004b nic_wr_credit 19029.
[D 12:32:35.014668] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f463c000 rkey 3a0004b nic_wr_credit 19030.
[D 12:32:35.014688] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f465c000 rkey 3a0004b nic_wr_credit 19031.
[D 12:32:35.014707] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f467c000 rkey 3a0004b nic_wr_credit 19032.
[D 12:32:35.014726] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f469c000 rkey 3a0004b nic_wr_credit 19033.
[D 12:32:35.014745] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f46bc000 rkey 3a0004b nic_wr_credit 19034.
[D 12:32:35.014764] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f46dc000 rkey 3a0004b nic_wr_credit 19035.
[D 12:32:35.014783] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f46fc000 rkey 3a0004b nic_wr_credit 19036.
[D 12:32:35.014802] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f471c000 rkey 3a0004b nic_wr_credit 19037.
[D 12:32:35.014821] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f473c000 rkey 3a0004b nic_wr_credit 19038.
[D 12:32:35.014841] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f475c000 rkey 3a0004b nic_wr_credit 19039.
[D 12:32:35.014856] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f477c000 rkey 3a0004b nic_wr_credit 19040.
[D 12:32:35.014870] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f479c000 rkey 3a0004b nic_wr_credit 19041.
[D 12:32:35.014883] openib_post_sr_rdmaw: ibv_post_send wr_id 693390 to
10.1.4.240:46814 remote addr f47bc000 rkey 3a0004b nic_wr_credit 19042.
[D 12:32:35.014899] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f47ec000 rkey 3a0004c nic_wr_credit 19043.
[D 12:32:35.014913] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f480c000 rkey 3a0004c nic_wr_credit 19044.
[D 12:32:35.014927] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f482c000 rkey 3a0004c nic_wr_credit 19045.
[D 12:32:35.014941] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f484c000 rkey 3a0004c nic_wr_credit 19046.
[D 12:32:35.014955] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f486c000 rkey 3a0004c nic_wr_credit 19047.
[D 12:32:35.014969] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f488c000 rkey 3a0004c nic_wr_credit 19048.
[D 12:32:35.014983] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f48ac000 rkey 3a0004c nic_wr_credit 19049.
[D 12:32:35.014996] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f48cc000 rkey 3a0004c nic_wr_credit 19050.
[D 12:32:35.015010] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f48ec000 rkey 3a0004c nic_wr_credit 19051.
[D 12:32:35.015024] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f490c000 rkey 3a0004c nic_wr_credit 19052.
[D 12:32:35.015038] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f492c000 rkey 3a0004c nic_wr_credit 19053.
[D 12:32:35.015052] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f494c000 rkey 3a0004c nic_wr_credit 19054.
[D 12:32:35.015066] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f496c000 rkey 3a0004c nic_wr_credit 19055.
[D 12:32:35.015079] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f498c000 rkey 3a0004c nic_wr_credit 19056.
[D 12:32:35.015093] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f49ac000 rkey 3a0004c nic_wr_credit 19057.
[D 12:32:35.015107] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f49cc000 rkey 3a0004c nic_wr_credit 19058.
[D 12:32:35.015121] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f49ec000 rkey 3a0004c nic_wr_credit 19059.
[D 12:32:35.015135] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4a0c000 rkey 3a0004c nic_wr_credit 19060.
[D 12:32:35.015148] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4a2c000 rkey 3a0004c nic_wr_credit 19061.
[D 12:32:35.015162] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4a4c000 rkey 3a0004c nic_wr_credit 19062.
[D 12:32:35.015176] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4a6c000 rkey 3a0004c nic_wr_credit 19063.
[D 12:32:35.015190] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4a8c000 rkey 3a0004c nic_wr_credit 19064.
[D 12:32:35.015204] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4aac000 rkey 3a0004c nic_wr_credit 19065.
[D 12:32:35.015218] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4acc000 rkey 3a0004c nic_wr_credit 19066.
[D 12:32:35.015232] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4aec000 rkey 3a0004c nic_wr_credit 19067.
[D 12:32:35.015245] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4b0c000 rkey 3a0004c nic_wr_credit 19068.
[D 12:32:35.015261] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4b2c000 rkey 3a0004c nic_wr_credit 19069.
[D 12:32:35.015274] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4b4c000 rkey 3a0004c nic_wr_credit 19070.
[D 12:32:35.015288] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4b6c000 rkey 3a0004c nic_wr_credit 19071.
[D 12:32:35.015302] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4b8c000 rkey 3a0004c nic_wr_credit 19072.
[D 12:32:35.015316] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4bac000 rkey 3a0004c nic_wr_credit 19073.
[D 12:32:35.015329] openib_post_sr_rdmaw: ibv_post_send wr_id 64eab0 to
10.1.4.240:46814 remote addr f4bcc000 rkey 3a0004c nic_wr_credit 19074.
[D 12:32:35.015345] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4bec000 rkey 3a0004f nic_wr_credit 19075.
[D 12:32:35.015359] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4c0c000 rkey 3a0004f nic_wr_credit 19076.
[D 12:32:35.015373] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4c2c000 rkey 3a0004f nic_wr_credit 19077.
[D 12:32:35.015387] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4c4c000 rkey 3a0004f nic_wr_credit 19078.
[D 12:32:35.015401] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4c6c000 rkey 3a0004f nic_wr_credit 19079.
[D 12:32:35.015414] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to
10.1.4.240:46814 remote addr f4c8c000 rkey 3a0004f nic_wr_credit 19080.
[E 12:32:35.015429] openib_post_sr_rdmaw: ibv_post_send failed ret:
-1001 errno: 0
[E 12:32:35.015442] wr_id: 0x0 next: (nil) sg_list 0x65a970 num_sge 1
[E 12:32:35.015456] opcode: 0x0 send_flags: 0x0 imm_data: 0x0
[E 12:32:35.015468] sr.wr.rdma.remote_addr: 0xf4c8c000 rkey 0x3a0004f
[E 12:32:35.015480] od->nic_wr_credit 19081 od->nic_max_wr 65535
[E 12:32:35.015913] openib_post_sr_rdmaw: QP_request sge: 1
[E 12:32:35.015961] Error: openib_post_sr_rdmaw: QP_sge: 28
: Unknown error 18446744073709550615.
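A note on the last line of the log: the large number in "Unknown error 18446744073709550615" appears to be the same -1001 return value printed earlier, reinterpreted as an unsigned 64-bit integer (2^64 - 1001). The small sketch below shows the conversion; as_u64 is a hypothetical helper written for this illustration and is not part of the PVFS code.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a signed return code to the value it takes when viewed as an
 * unsigned 64-bit integer (sign-extend first, then reinterpret). */
static uint64_t as_u64(int ret)
{
    return (uint64_t)(int64_t)ret;
}
```

Calling as_u64(-1001) gives 18446744073709550615, which matches the log, so the "Unknown error" text is most likely the negative return code passed to an error-string routine that expects an unsigned or positive value.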
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers