Hi Alex,

Thanks for the reply. To clarify,

1) The driver we have is not out-of-tree driver, it is the variant of 3.7.21.
2) We are not running with 82599.

If we are able to reproduce this problem, we will do what you have suggested to 
dump the ring, descriptors, and buffer info.
Will calling ixgbe_do_reset() be sufficient when this odd state occurs?

Do you have another suggestion for working around this issue? For example, 
would defining IXGBE_NO_LRO be another way of working around this problem?

Thanks,
--Steven



-----Original Message-----
From: Alexander Duyck [mailto:[email protected]] 
Sent: Friday, May 18, 2012 1:22 PM
To: Steven La
Cc: [email protected]; Arthur Kepner
Subject: Re: [E1000-devel] Bug found in ixgbe_clean_rx_irq_ps()

On 05/18/2012 11:25 AM, Steven La wrote:
> Hi,
>
> This is to report a bug we found in the Intel 10 Gigabit PCI Express Linux 
> driver (ixgbe).
> Our driver has patched up to version 3.7.21.
When you say "patched up to" do you mean that the driver is our 3.7.21 
out-of-tree driver or that you have applied patches that you believe will make 
it equivalent to 3.7.21?
>
> Recently, we hit a crash with the following dump.
>
> BUG: unable to handle kernel NULL pointer dereference at 
> 000000000000006c
> IP: [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1720 [ixgbe] PGD 3fc6cc067 
> PUD 3fc223067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/virtual/bypass/10-11/ping_watchdog
> CPU 9
> Pid: 3268, comm: sport Tainted: P           ----------------   2.6.32-perf #1
> To Be Filled By O.E.M.
> RIP: 0010:[<ffffffffa005afef>]  [<ffffffffa005afef>] 
> ixgbe_poll+0x9df/0x1720 [ixgbe]
> RSP: 0018:ffff8804075dd8b0  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880410e90680 RCX: 0000000000000000
> RDX: 0000000000000020 RSI: ffffc90004ca6000 RDI: ffff8804121ca580
> RBP: ffff8804075dd970 R08: 0000000000000100 R09: 0000000000000000
> R10: 0000000000000100 R11: 0000000000000000 R12: 0000000000000000
> R13: ffffc9000479b190 R14: ffff8804059a40a0 R15: 000000000000000a
> FS:  00007fe8a63a6700(0000) GS:ffff8804364c0000(0000) 
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000000000006c CR3: 0000000414564000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
> Process sport (pid: 3268, threadinfo ffff8804075dc000, task 
> ffff880414bbe1c0)
> Stack:
> 0000000000000000 ffff8804075c3d40 000000000000000e ffff880414d36780 
> <0> ffff8804075c3d98 ffff8804075dd918 ffffffff815671ac 
> 000000010003503a <0> ffff880410538700 0000000100000040 
> ffff8804121ca580 0000004012f9d89c
>
> After comparing the disassembled code, we found that location 
> ixgbe_poll+0x9df points to this line marked in red below as showed in the 
> following subroutine.
>
> static bool ixgbe_clean_rx_irq_ps(struct ixgbe_q_vector *q_vector,
>                                   struct ixgbe_ring *rx_ring,
>                                   int budget) {
>         union ixgbe_adv_rx_desc *rx_desc;
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>         const int current_node = numa_node_id(); #ifdef IXGBE_FCOE
>         struct ixgbe_adapter *adapter = q_vector->adapter;
>        int ddp_bytes = 0;
> #endif /* IXGBE_FCOE */
>         u16 i = rx_ring->next_to_clean;
>         u16 cleaned_count = ixgbe_desc_unused(rx_ring);
>
>         rx_desc = IXGBE_RX_DESC(rx_ring, i);
>
>         while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
>                 union ixgbe_adv_rx_desc *next_rxd;
>                 struct ixgbe_rx_buffer *rx_buffer_info;
>                 struct sk_buff *skb;
>                 bool pkt_is_rsc, force_alloc = false;
>
>                 rx_buffer_info = &rx_ring->rx_buffer_info[i];
>                 skb = rx_buffer_info->skb;
>                 rx_buffer_info->skb = NULL;
>
>                 /*
>                  * This memory barrier is needed to keep us from reading
>                  * any other fields out of the rx_desc until we know the
>                  * RXD_STAT_DD bit is set
>                  */
>                 rmb();
>
>                 pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);
>
>                 prefetch(skb->data);
>
>                 /* pull the header of the skb in if no data is already 
> present */
>                 if (!skb_is_nonlinear(skb)) {                                 
>                       <-- This where the crash occurred.
>                         __skb_put(skb, ixgbe_get_hlen(rx_ring, 
> rx_desc));
>
>
> The C source code shows that skb was de-referenced before the
>
> skb_is_nonlinear() is call and question was raised to why it did not crash on 
> the prefetch line below.
>
>
>
>                 prefetch(skb->data);         <--- line 2025
>
>
>
>                 /* pull the header of the skb in if no data is already 
> present
>
> \
>
> */
>
>                 if (!skb_is_nonlinear(skb)) {         <--- line 2028
>
>                         __skb_put(skb, ixgbe_get_hlen(rx_ring, 
> rx_desc));
>
>
>
> Looking at the assembly code, the compiler seems to generate code that 
> executes
>
> line 2028 before line 2025.
>
>
>
> ixgbe_clean_rx_irq_ps():
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2028
>
>     705f:       41 8b 74 24 6c          mov    0x6c(%r12),%esi
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2025
>
>     7064:       49 8b 8c 24 b8 01 00    mov    0x1b8(%r12),%rcx
>
>     706b:       00
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2028
>
>     706c:       85 f6                   test   %esi,%esi
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2025
>
> As can be seen from the following in the same routine, rx_buffer_info->skb is 
> assigned by the following lines.
>
> #ifndef IXGBE_NO_LRO
>                 if (ixgbe_can_lro(rx_ring, rx_desc, skb)) {
>                         ixgbe_lro_receive(q_vector, skb);
>                         rx_buffer_info->skb = 
> ixgbe_lro_recycled_skb(q_vector);
>
>                 } else {
> #endif
>
> Note that ixgbe_lro_recycled_skb() can return NULL if 
> qvector->lrolist->recycled is empty.
>
>
> The crash is not very reproducible we can only see it a few times right after 
> the machines were rebooted. If we were to run the machine in a warm state, we 
> don't likely see this problem. I wonder if you have heard the similar report 
> like this?
>
>
>
> There are two ways we are thinking of fixing this bug.
Actually there are two spots where rx_buffer_info->skb is populated. 
One is in the LRO recycling, the other is in ixgbe_alloc_rx_buffers_ps. 
If you look in there that is the function that should be allocating an skb, 
mapping it and giving it to the ring, and then updating the tail. 
Without the ixgbe_alloc_rx_buffers_ps call you will not have handed any buffers 
to the device to be received.

The path you have gotten into should not actually be reachable via the code and 
hardware, unless you are in some sort of odd state code wise due to patch 
ordering or some other data corruption.  Part of the updates I pushed a while 
back was an update to the ixgbe_alloc_rx_buffers code that made it so only it 
and the hardware will actually write to the Rx descriptors.  You might want to 
double check your ring layout to make certain that you always have at least one 
descriptor that has had it's status zeroed out.  If that has not happened one 
thing you could see is a situation where head passes over tail which would 
result in a null pointer reference as you have described.


>
>
>
> First approach
>
> ---------------
>
> Please note that ixgbe_clean_rx_irq_bb() has a different way to deal with 
> this problem. It checks if skb is NULL, if it is, it would try to the skb 
> from the recycle q, if fails, it will call netdev_alloc_skb_ip_align() to get 
> a skb. Would you suggest the following code (NULL_POINTER_FIX) is a proper 
> way to fix this problem.
>
>
>
>         while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
>                 union ixgbe_adv_rx_desc *next_rxd;
>                 struct ixgbe_rx_buffer *rx_buffer_info;
>                 struct sk_buff *skb;
>                 bool pkt_is_rsc;
>
>                 rx_buffer_info = &rx_ring->rx_buffer_info[i];
>                 skb = rx_buffer_info->skb;
>                 rx_buffer_info->skb = NULL;
>
>                 /*
>                  * This memory barrier is needed to keep us from reading
>                  * any other fields out of the rx_desc until we know the
>                  * RXD_STAT_DD bit is set
>                  */
>                 rmb();
>
>                 pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);
>
> #define NULL_POINTER_FIX
> #ifdef NULL_POINTER_FIX
>
>                 if (!skb) {
>                         printk("%s:%d no skb\n", __FUNCTION__, 
> __LINE__); #ifndef IXGBE_NO_LRO
>                         /* retreive any recycled skbs first before allocating 
> */
>                         skb = ixgbe_lro_recycled_skb(q_vector);
>                         if (!skb)
> #endif
>                                 skb = 
> netdev_alloc_skb_ip_align(netdev_ring(rx_ring),
>                                                                 
> IXGBE_RX_HDR_SIZE);
>                         if (!skb) {
>                                 printk("%s:%d cannot alloc skb\n",
>                                        __FUNCTION__, __LINE__);
>                                 rx_ring->rx_stats.alloc_rx_buff_failed++;
>                                 break;
>                         }
>
>                         /*
>                          * Delay unmapping of the first packet. It carries the
>                          * header information, HW may still access the header
>                          * after the writeback.  Only unmap it when EOP is
>                          * reached
>                          */
>                         IXGBE_CB(skb)->dma = rx_buffer_info->page_dma;
>
>                         /* initialize skb for ring */
>                         skb_record_rx_queue(skb, 
> ring_queue_index(rx_ring));
>
>                 }
> #endif // NULL_POINTER_FIX
The page based receive approach isn't really compatible with the legacy packet 
split receive path.  The problem is that in packet split the head is expected 
to be written to the skb->data and without that you will receive a malformed 
frame if you hit the error you are reporting.  Since you are trying to copy the 
bounce buffer path does that mean that the bounce buffer path does not have 
this issue?

Also what hardware are you running this driver on?  The 82599 has HW errata #45 
that says you should not be using packet split because it can result in 
unpredictable behaviour.  That was one of the reasons for implementing the 
bounce buffer path in the first place.

> Second Approach
> ---------------------
> Another approach we have in mind is to force ixgbe_alloc_rx_buffer_ps() to be 
> called whenever the recycle queue is empty.
>
>        while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) { ...
>                 bool pkt_is_rsc, force_alloc = false;         // added 
> force_alloc
> ...
>
> #ifndef IXGBE_NO_LRO
>                 if (ixgbe_can_lro(rx_ring, rx_desc, skb)) {    // force 
> allocation whenever the recycle q is empty
>                         ixgbe_lro_receive(q_vector, skb);
>                         rx_buffer_info->skb = 
> ixgbe_lro_recycled_skb(q_vector);
>                         if (rx_buffer_info->skb == NULL) {
>                                 rx_buffer_info->dma = 0; /* XXX */
>                                 printk(KERN_CRIT "%s: forcing allocation\n",
>                                        __func__);
>                                 force_alloc = true;
>                         }
> ...
>                 /* return some buffers to hardware, one at a time is too slow 
> */   // when force_alloc is true, we would get more skb
>                 if (force_alloc || cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
>                         ixgbe_alloc_rx_buffers_ps(rx_ring, cleaned_count);
>                         cleaned_count = 0;
>                 }
>
> ...
>       }
>
>
> This is crash is tough to reproduce so it is very hard to test again either 
> one of the proposed fix above.
> Would you please kindly make any suggestion to the fix of this problem?
This doesn't do anything because the actual allocation should be happening in 
ixgbe_alloc_rx_buffers_ps. 

It seems like there may be some sort of data corruption going on since the 
descriptor is reporting that the DD bit is set, but there is no sk_buff pointer 
in the buffer_info.  For now your best workaround would be to simply stop 
cleaning the ring if you get to the case where the DD bit is set and the skb 
pointer is NULL.  You will likely also not want to allocate any new buffers on 
the ring.  You could then add some code to dump the ring, descriptors, and 
buffer info.  This should provide us with significantly more information on how 
the driver could have gotten into such a state.  You could then reset the 
adapter after such an event since that is a fairly serious sign of some sort of 
data corruption.

Thanks,

Alex

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit 
http://communities.intel.com/community/wired

Reply via email to