Hi,

I am new to this list, and if my question is misplaced
please suggest a better forum on or off-list.

We are using InfiniBand (core & mlx4 from OFED 1.4.1 + OFED kernel patches)
in a lightweight kernel named Kitten, partially derived from Linux.
http://code.google.com/p/kitten/

We see one or two unhandled interrupts when doing RDMA_READ
data transfers with mlx4 cards.  (SEND and RDMA_WRITE work well.)
It appears only with larger messages, 1-4 MB.
Write-combining is turned off.

Below is the output of a ping-pong test, 1000 iterations per message size,
e.g.:
<8>(init_task)       Size     Average      Stddev         Min      Median         Max
...
<8>(init_task)     524288      271.79        7.09      138.96      271.51      429.24
<4>irq_dispatch: Unhandled interrupt 74 (4a) [Owner]
<8>(init_task)    1048576      569.99      981.73      272.01      537.56    31581.67
<8>(init_task)    2097152     1070.57       28.95      537.88     1069.66     1779.97
<8>(init_task)    4194304     2135.99       52.86     1070.10     2134.70     3124.28

This error is random and appears in about one of three runs. Note the high max
value for the 1 MB message size; I guess that is the connection recovering.

When investigating the error, it seems to stem from next_eqe_sw in
drivers/net/mlx4/eq.c, which is called by the interrupt handler.
What happens is that (eqe->owner & 0x80) is true, causing the routine to return
NULL, which results in an unhandled interrupt (i.e. the interrupt routine
returns 0).
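
To make the failing path concrete, here is a minimal, self-contained sketch
of the ownership check as I understand it. The struct layouts are simplified
stand-ins, not the real mlx4 definitions, and eq_int is reduced to a bare
loop. As far as I can tell, the real check also compares the owner bit with a
"pass" bit taken from the consumer index, so which value means
"hardware-owned" flips every time the ring wraps. I have not verified that
against every OFED version.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an event queue entry (hypothetical layout). */
struct eqe {
    uint8_t type;
    uint8_t owner;          /* bit 7 (0x80) is the ownership bit */
};

/* Simplified stand-in for the event queue state (hypothetical layout). */
struct eq {
    struct eqe *entries;
    uint32_t nent;          /* number of entries, assumed a power of two */
    uint32_t cons_index;    /* free-running consumer index */
};

static struct eqe *get_eqe(struct eq *eq, uint32_t index)
{
    return &eq->entries[index & (eq->nent - 1)];
}

/* Return the next entry if software owns it, otherwise NULL. */
static struct eqe *next_eqe_sw(struct eq *eq)
{
    struct eqe *eqe = get_eqe(eq, eq->cons_index);
    int owner_bit = !!(eqe->owner & 0x80);
    int pass_bit = !!(eq->cons_index & eq->nent);

    /* A mismatch means the entry has not been handed to software yet. */
    return (owner_bit ^ pass_bit) ? NULL : eqe;
}

/* Drain the queue; returns 0 if no entry was ready (the "unhandled" case). */
static int eq_int(struct eq *eq)
{
    int handled = 0;

    while (next_eqe_sw(eq) != NULL) {
        /* The real handler would dispatch on the event type here. */
        eq->cons_index++;
        handled = 1;
    }
    return handled;
}
```

In this sketch, if the interrupt arrives while the entry at cons_index still
has its owner bit set on the first pass, next_eqe_sw returns NULL, eq_int
returns 0, and the dispatcher reports the interrupt as unhandled.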

My understanding is that by the time the interrupt is raised, the card should
already have handed the eqe (event queue entry?) over to software, but it could
very well be more complex.

The same message can be seen when starting the driver, but there it does not
cause any problems:
<6>mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
<4>irq_dispatch: Unhandled interrupt 74 (4a) [Owner]
 .... x 16

So far this problem could not be reproduced under Linux.
The Kitten interrupt handler is simple and just forwards the interrupt to the
driver.

What does owner in the eqe struct mean? Does hardware or software own the entry?
Has this bug been seen in Linux, even though we were not able to reproduce it?
Can I get more debug information from the card?
Any tips on what could go wrong in this context? Are we missing some setup?


Sincerely,

Fredrik Unger