Hi.
How much time does it take to reproduce this failure?
Thanks,
Dotan
Bharath Ramesh wrote:
* Dotan Barak ([EMAIL PROTECTED]) wrote:
Hi.
I need some more info.
Which IB HW do you use?
(you can get this info from ibv_devinfo)
The IB HW used is Mellanox Cougar cards.
output of ibv_devinfo:
hca_id: mthca0
fw_ver: 3.5.0
node_guid: 0002:c901:08fe:76a0
sys_image_guid: 0002:c901:08fe:76a3
vendor_id: 0x02c9
vendor_part_id: 23108
hw_ver: 0xA1
board_id: MT_0000000001
phys_port_cnt: 2
Which IB SW do you use?
(you can get this info from ofed_info)
The IB SW I am using is OFED 1.2. The Linux kernel is
2.6.21.1-xserve.
I am not sure if this helps. Every time I send a message I wait for an
ack to be received, blocking on pthread_cond_wait. Since the message
gets dropped, my thread is blocked on pthread_cond_wait forever. The
other thread, which occasionally sends messages, is still able to
send and receive messages over the QP: it blocks for its ack and
receives it, while this thread never receives its ack because of the
dropped message. To verify that messages were being dropped, I printed
every single message sent and received on both ends. The dropped
message is sent, but the receiver never receives it.
Thanks,
Bharath
Dotan
Bharath Ramesh wrote:
* Dotan Barak ([EMAIL PROTECTED]) wrote:
Hi.
Bharath Ramesh wrote:
I have a multi-threaded application. My application has its own message
exchange protocol and uses IB as the communication layer. I send a lot
of messages, normally on the order of a few tens of thousands. After
some time it seems that one message from one of the nodes is lost. I am
using the RC QP type. This causes the thread to deadlock. The other
threads are still able to exchange messages without any problem over
the same QP. Both ends are using SRQs and there are sufficient buffers
posted, so I don't run out of buffers. I even tried doubling the
buffers posted and I see the same problem again: one message is lost.
ibv_post_send doesn't report any error. I am trying to get this done
for a conference deadline early next week. I would really appreciate
any suggestions about what might cause the message to be dropped
without any error being returned.
If you don't have any bugs in your code, the described scenario should
work.
I need some more info in order to try to help you:
Do you use the same QP from several threads (and post send from all of
them)?
Yes, I use the same QP from three threads. The application has close
to 5 threads. The receives are handled by a single thread. Most of the
sends are posted by a single thread. Occasionally a third thread posts a
few sends to the QP. The same QP is also used for RDMA Writes. The
majority of the RDMA Writes are also performed by the same thread that
posts the majority of the send messages.
How do you poll the CQ (several threads/one)?
I have two CQs, one for receive and the other for send. The receive CQ
is polled only by the receive thread. The send CQ is polled by the three
threads: occasionally by the receive thread, to clear out any send CQEs,
because I use IBV_SEND_SIGNALED only once for every 16 IBV_SEND_INLINE
sends. Otherwise the send CQ is polled by the single thread that does
the majority of the sends. Occasionally the third thread, when doing a
send, might poll the send CQ as well for the completion CQE of an RDMA
Write.
Which HW/SW do you use?
I am using Yellow Dog Linux 5.0 on Apple Xserves.
Thanks,
Bharath
---
Bharath Ramesh <[EMAIL PROTECTED]>
http://people.cs.vt.edu/~bramesh