Hi.
The timeout value that you sent me means that the QP will wait
for ~0.25 seconds before any retry,
so it might take some time for the QPs to start doing the APM (depending
on the retry_cnt value).
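For reference, the arithmetic behind that figure can be sketched as below. This assumes the usual InfiniBand encoding of the QP timeout field (4.096 us * 2^timeout, with 0 meaning the timer is disabled) and a rough model of one initial attempt plus retry_cnt retries. The function names are made up for illustration, and real HCAs may wait longer than the nominal value.

```c
#include <assert.h>

/* Nominal local ACK timeout in seconds for a QP timeout field (0..31).
 * Assumption: the common InfiniBand encoding 4.096 us * 2^timeout.
 * A field value of 0 disables the timer; this sketch returns 0.0 for it. */
static double ack_timeout_seconds(unsigned timeout)
{
    assert(timeout < 32); /* the field is 5 bits wide */
    if (timeout == 0)
        return 0.0;
    return 4.096e-6 * (double)(1ULL << timeout);
}

/* Rough estimate of the time spent on the primary path before the QP
 * gives up on it: one initial attempt plus retry_cnt retries, each
 * waiting one nominal timeout. This is an approximation, not a spec value. */
static double worst_case_seconds(unsigned timeout, unsigned retry_cnt)
{
    return ack_timeout_seconds(timeout) * (double)(retry_cnt + 1);
}
```

With timeout = 16 this gives roughly 0.27 seconds per attempt. With a hypothetical retry_cnt of 7 (the thread does not state the value) the estimate is a bit over 2 seconds before migration would start.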
From your description I believe that not all of the QPs have started to
move to the alternate path.
Did you make sure that the APM state machine (in every QP) is ARMED
before taking the port down?
(Data packets must be sent between the local/remote QPs in order to
bring the QP into this state.)
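To make the ARMED requirement concrete, here is a deliberately simplified model of the per-QP path migration states. It is only a sketch: the enum and function names are invented for illustration and are not the OFED API, and treating a migration request outside ARMED as a no-op is an assumption of this model, so real hardware may behave differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-QP path migration states. */
enum apm_state { APM_MIGRATED, APM_REARM, APM_ARMED };

/* Events in this toy model (hypothetical names):
 *  - SW_REARM:       software loaded an alternate path and requested rearm
 *  - PEER_HANDSHAKE: packets were exchanged with a peer that is also rearming
 *  - MIGRATE:        software request or hardware failover */
enum apm_event { APM_EV_SW_REARM, APM_EV_PEER_HANDSHAKE, APM_EV_MIGRATE };

/* Returns the next state; transitions that do not apply leave the
 * state unchanged (a modelling assumption, see the note above). */
static enum apm_state apm_next(enum apm_state s, enum apm_event ev)
{
    switch (ev) {
    case APM_EV_SW_REARM:
        return s == APM_MIGRATED ? APM_REARM : s;
    case APM_EV_PEER_HANDSHAKE:
        return s == APM_REARM ? APM_ARMED : s;
    case APM_EV_MIGRATE:
        return s == APM_ARMED ? APM_MIGRATED : s;
    }
    assert(!"unknown APM event");
    return s;
}

/* A QP can only follow a migration once it has reached ARMED. */
static bool apm_ready_to_migrate(enum apm_state s)
{
    return s == APM_ARMED;
}
```

In this model a QP that was rearmed but has not yet exchanged traffic with its peer stays in REARM, so a migration request at that point changes nothing. That is the situation the question above is trying to rule out.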
(If you can send me the code that handles this scenario I will try to
reproduce it here, in our lab.)
thanks
Dotan
lbt wrote:
Thanks for your reply Dotan!
The timeout is set to 16.
Here is some more info. Please let me know if there is any other info
I can provide.
Setup:
- 2 Nodes, each has a dual-port HCA (board_id: MT_0150000001,
InfiniHost III firmware 25218, v. 5.2.0) - this is the latest Mellanox
firmware I believe
- port 1 of each node is connected to one IB switch, and likewise for
port 2 --> thus have 2 separate IB subnets, providing 2 possible paths
between the 2 nodes
- IB switch is InfiniScale MT43132 **
- Using OFED 1.2 driver stack
Our software creates RCQPs between 2 nodes, with primary and alternate
path specified.
Test does the following, using 10 RCQPs:
1. Hardware-triggered migration by bringing down the port of the
   primary path (haven't ever seen a problem with the hardware-triggered
   migrations)
2. Restore the port --> reloads alternate path
   - Local QPs send LAP
   - Remote QPs reply with APR
3. Redistribute RCQPs across both ports for load balancing using
   software-triggered migrations for the RCQPs selected for migration.
   a. Local QPs: use ib_modify_qp to trigger migration --> get
      IB_EVENT_PATH_MIG on local QPs
   b. Remote QPs: IB_EVENT_PATH_MIG
   c. Local QPs: after software-triggered migration completes, reload
      alternate path by sending LAP
   d. Remote QPs: reply with APR
Keep doing this in a loop. The issue is that in 3b, not all the remote
QPs report an IB_EVENT_PATH_MIG for the path migration triggered in 3a.
I noticed that when this happens it's usually in the first and/or second
cycle (subsequent cycles don't manifest this issue), and it occurs on
the last RCQPs that were migrated in 3a.
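One low-tech way to pin down which remote QPs never reported the event is to record the QP numbers an IB_EVENT_PATH_MIG is expected for and tick them off from the event handler. The sketch below is only an illustration: the mig_tracker type and the tracker_* functions are hypothetical names, not part of OFED, and the code has no locking, so a real event handler would need to protect it appropriately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED_QPS 64

/* Bookkeeping for one migration cycle: which QPs we expect an event
 * for, and which of them have actually reported it. */
struct mig_tracker {
    uint32_t qpn[MAX_TRACKED_QPS];
    bool     seen[MAX_TRACKED_QPS];
    size_t   count;
};

static void tracker_init(struct mig_tracker *t)
{
    assert(t != NULL);
    t->count = 0;
}

/* Register a QP number we expect a path migration event for.
 * Returns false if the table is full. */
static bool tracker_expect(struct mig_tracker *t, uint32_t qpn)
{
    if (t->count >= MAX_TRACKED_QPS)
        return false;
    t->qpn[t->count] = qpn;
    t->seen[t->count] = false;
    t->count++;
    return true;
}

/* Call when the event arrives. Returns false for an unexpected QP. */
static bool tracker_mark_seen(struct mig_tracker *t, uint32_t qpn)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->qpn[i] == qpn) {
            t->seen[i] = true;
            return true;
        }
    }
    return false;
}

/* Writes up to out_len QP numbers that never reported the event into
 * out and returns the total number of missing QPs. */
static size_t tracker_missing(const struct mig_tracker *t,
                              uint32_t *out, size_t out_len)
{
    size_t missing = 0;
    for (size_t i = 0; i < t->count; i++) {
        if (!t->seen[i]) {
            if (missing < out_len)
                out[missing] = t->qpn[i];
            missing++;
        }
    }
    return missing;
}
```

Fed with the QP numbers from the console output quoted further down (1043, 1040 and 1033 expected, events seen for 1040 and 1043), the tracker would report 1033 as the one QP still outstanding after whatever grace period the test chooses.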
BTW: Do you know if there is a way I can determine/dump which
events are in the Event Queue?
Thanks again!
Lan
On 10/15/07, *Dotan Barak* <[EMAIL PROTECTED]> wrote:
Hi.
lbt wrote:
> Hi,
>
> I'm trying out APM with OFED 1.2, using a Mellanox dual-port HCA
> (ib_mthca driver). When I have several RCQPs that I am trying to
> migrate (software-triggered migration using ib_modify_qp), I've
> noticed that sometimes 1 or 2 of the remote QPs never generate an
> IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ... it seems that
> it just gets lost. I looked through some of the ib_mthca patches in
> http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git, and
> incorporated the mmiowb patch for ib_mthca commands
> (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd).
> But I'm still seeing the same issue. I have a test case that repeats
> software-triggered migrations + rearming in a loop, and this problem
> usually occurs in the first few cycles, but is not too frequent. If
> anyone has any ideas on what might be wrong, or tips on where I can
> look or what I can do to debug this, that would be very much
> appreciated!
>
> For example, this is the console output I will see (printed out by
> our rcqp event handler):
> On the local end - initiates software triggered migration, using
> ib_modify_qp:
> Event IB_EVENT_PATH_MIG occurred on QP#1043
> Event IB_EVENT_PATH_MIG occurred on QP#1040
> Event IB_EVENT_PATH_MIG occurred on QP#1033
>
> On the remote end:
> Event IB_EVENT_PATH_MIG occurred on QP#1040
> Event IB_EVENT_PATH_MIG occurred on QP#1043
Is the timeout value (in the QP attributes) 0?
If the answer is no, can you please supply some more details on this?
thanks
Dotan