Hi.

The timeout value that you sent me means that the QP will wait for ~0.25 seconds before any retry, so it might take some time for the QPs to start doing the APM (depending on the retry_cnt value).
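
To make the arithmetic concrete, here is a minimal sketch in C. It uses the InfiniBand definition of the local ACK timeout, 4.096 us * 2^timeout, where a timeout field of 0 means the timer is disabled. The function names are hypothetical, and the "timeout times (retry_cnt + 1)" estimate is only a rough lower bound assumed for illustration, not something the HCA guarantees.

```c
#include <stdint.h>

/* Local ACK timeout in seconds: 4.096 us * 2^timeout.
 * A timeout field of 0 disables the timer (wait forever); this
 * sketch reports that case as 0.0. */
double ack_timeout_seconds(uint8_t timeout)
{
    if (timeout == 0)
        return 0.0;
    return 4.096e-6 * (double)(1ULL << timeout);
}

/* Rough lower bound (an assumption, not a guarantee) on the time a QP
 * spends on the primary path before it gives up and migrates:
 * the first attempt plus retry_cnt retries, one timeout each. */
double min_time_before_migration(uint8_t timeout, uint8_t retry_cnt)
{
    return ack_timeout_seconds(timeout) * (double)(retry_cnt + 1);
}
```

With timeout = 16 this gives about 0.27 seconds per attempt, and with retry_cnt = 7 a bit over 2 seconds before the migration can start.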

From your description, I believe that not all of the QPs have started to move to the alternate path. Did you make sure that the APM state machine (in every QP) is ARMED before bringing the port down? (Data packets must be sent between the local and remote QPs in order to bring the QP into this state.)
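
One way to run such a check is sketched below, as a standalone model rather than real driver code. The enum is a hypothetical local mirror of the kernel's enum ib_mig_state; in real code the per-QP state would presumably be read with ib_query_qp() using the IB_QP_PATH_MIG_STATE mask, which is not done here.

```c
#include <stddef.h>

/* Hypothetical local mirror of the kernel's enum ib_mig_state, so this
 * sketch builds without kernel headers. */
enum mig_state { MIG_MIGRATED, MIG_REARM, MIG_ARMED };

/* Returns the index of the first QP that is not ARMED, or -1 when every
 * QP in the array is ARMED and the port can be taken down. */
long first_unarmed(const enum mig_state *states, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (states[i] != MIG_ARMED)
            return (long)i;
    }
    return -1;
}
```

If first_unarmed() returns anything other than -1, the test should probably wait (or push some traffic through that QP) before bringing the port down.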


(If you can send me the code that handles this scenario, I will try to reproduce it here in our lab.)


thanks
Dotan

lbt wrote:
Thanks for your reply Dotan!

The timeout is set to 16.

Here is some more info. Please let me know if there is any other info I can provide.
Setup:
- 2 Nodes, each has a dual-port HCA (board_id: MT_0150000001, InfiniHost III firmware 25218, v. 5.2.0) - this is the latest Mellanox firmware I believe - port 1 of each node is connected to one IB switch, and likewise for port 2 --> thus have 2 separate IB subnets, providing 2 possible paths between the 2 nodes
- IB switch is InfiniScale MT43132 **
- Using OFED 1.2 driver stack

Our software creates RCQPs between 2 nodes, with primary and alternate path specified.
Test does the following: Using 10 RCQPs
1. Hardware triggered migration by bringing down the port of the primary path (haven't ever seen a problem with the hardware triggered migrations)
2. Restore the port --> reloads alternate path
    - Local QPs send LAP
    - Remote QPs reply with APR
3. Redistributes RCQP's across both ports for load balancing using software triggered migrations for the RCQPs selected for migration.
    a. Local QPs: use ib_modify_qp to trigger migration --> get IB_EVENT_PATH_MIG on local QPs
    b. Remote QPs: IB_EVENT_PATH_MIG
    c. Local QPs: after software-triggered migration completes, reloads alternate path by sending LAP
    d. Remote QPs: reply with APR
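
The cycle in step 3 can be sketched as a small state machine. This is a simplified model with hypothetical names, assuming only the three APM states (Migrated, Rearm, Armed) and the three transitions the test relies on; real hardware and the verbs layer may allow or reject other transitions.

```c
#include <stdbool.h>

/* Hypothetical names mirroring the three APM states. */
enum apm_state { APM_MIGRATED, APM_REARM, APM_ARMED };

/* Events in this simplified model (names are illustrative only). */
enum apm_event {
    EV_ALT_PATH_LOADED,     /* software loads a new alternate path and asks for rearm */
    EV_BOTH_SIDES_REARMED,  /* traffic has been exchanged with both ends in REARM */
    EV_MIGRATE              /* port failure or software-requested migration */
};

/* Applies one event. Returns false and leaves the state unchanged when
 * the transition is not part of this model. */
bool apm_step(enum apm_state *state, enum apm_event ev)
{
    switch (*state) {
    case APM_MIGRATED:
        if (ev == EV_ALT_PATH_LOADED) { *state = APM_REARM; return true; }
        break;
    case APM_REARM:
        if (ev == EV_BOTH_SIDES_REARMED) { *state = APM_ARMED; return true; }
        break;
    case APM_ARMED:
        if (ev == EV_MIGRATE) { *state = APM_MIGRATED; return true; }
        break;
    }
    return false;
}
```

In this model a migration request on a QP that is still in REARM is rejected, which is one place where a test loop that moves quickly from 3d back to 3a could go wrong.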

Keep doing this in a loop. The issue is that in 3b, not all the remote QPs report an IB_EVENT for the path migration triggered in 3a. I noticed that when this happens it's usually in the first and/or second cycle (subsequent cycles don't manifest this issue), and it occurs on the last RCQPs that were migrated in 3a.
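
A small helper along these lines could pin down exactly which QPs lost their event in each cycle. It is only a sketch with hypothetical names, and it assumes the QP numbers logged on the remote side can be matched directly against the ones migrated locally (as they appear to be in the console output quoted further down).

```c
#include <stddef.h>
#include <stdint.h>

/* Writes into `missing` every QP number from `expected` that does not
 * appear in `seen`, and returns how many were written. `missing` must
 * have room for n_expected entries. */
size_t missing_events(const uint32_t *expected, size_t n_expected,
                      const uint32_t *seen, size_t n_seen,
                      uint32_t *missing)
{
    size_t count = 0;

    for (size_t i = 0; i < n_expected; i++) {
        int found = 0;

        for (size_t j = 0; j < n_seen; j++) {
            if (seen[j] == expected[i]) {
                found = 1;
                break;
            }
        }
        if (!found)
            missing[count++] = expected[i];
    }
    return count;
}
```

Fed with the QPs migrated locally (1043, 1040, 1033) and the QPs that reported IB_EVENT_PATH_MIG remotely (1040, 1043), it returns a single missing entry, QP 1033.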

BTW: Do you know if there is a way I can determine/dump which events are in the Event Queue?

Thanks again!
Lan

On 10/15/07, Dotan Barak wrote:

    Hi.

    lbt wrote:
    > Hi,
    >
    > I'm trying out APM with OFED 1.2 , using Mellanox dual-port HCA
    > (ib_mthca driver).  When I have several RCQP's that I am trying to
    > migrate (software triggered migration using ib_modify_qp), I've
    > noticed that sometimes 1 or 2 of the remote QP's never generate an
    > IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ... it seems that
    > it just gets lost. I looked through some of the ib_mthca patches in
    > git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
    > <http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git>, and
    > incorporated the mmiowb patch for ib_mthca commands
    > (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd).
    > But still seeing same issue. I have a test case that repeats
    > software-triggered migrations + rearming in a loop, and this problem
    > usually occurs in the first few cycles, but is not too frequent. If
    > anyone has any ideas on what might be wrong, or tips on  where I can
    > look/do to debug this, that would be very much appreciated!
    >
    > For example, this is the console output I will see (printed out by our
    > rcqp event handler):
    > On the local end - initiates software triggered migration, using
    > ib_modify_qp:
    > Event IB_EVENT_PATH_MIG occurred on QP#1043
    > Event IB_EVENT_PATH_MIG occurred on QP#1040
    > Event IB_EVENT_PATH_MIG occurred on QP#1033
    >
    > On the remote end:
    > Event IB_EVENT_PATH_MIG occurred on QP#1040
    > Event IB_EVENT_PATH_MIG occurred on QP#1043
    Is the timeout value (in the QP attributes) 0?
    If the answer is no, can you please supply some more details on this?


    thanks
    Dotan



_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
