On 05/18/2012 11:25 AM, Steven La wrote:
> Hi,
>
> This is to report a bug we found in the Intel 10 Gigabit PCI Express Linux
> driver (ixgbe).
> Our driver has patched up to version 3.7.21.
When you say "patched up to" do you mean that the driver is our 3.7.21
out-of-tree driver or that you have applied patches that you believe
will make it equivalent to 3.7.21?
>
> Recently, we hit a crash with the following dump.
>
> BUG: unable to handle kernel NULL pointer dereference at 000000000000006c
> IP: [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1720 [ixgbe]
> PGD 3fc6cc067 PUD 3fc223067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/virtual/bypass/10-11/ping_watchdog
> CPU 9
> Pid: 3268, comm: sport Tainted: P ---------------- 2.6.32-perf #1
> To Be Filled By O.E.M.
> RIP: 0010:[<ffffffffa005afef>] [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1720
> [ixgbe]
> RSP: 0018:ffff8804075dd8b0 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880410e90680 RCX: 0000000000000000
> RDX: 0000000000000020 RSI: ffffc90004ca6000 RDI: ffff8804121ca580
> RBP: ffff8804075dd970 R08: 0000000000000100 R09: 0000000000000000
> R10: 0000000000000100 R11: 0000000000000000 R12: 0000000000000000
> R13: ffffc9000479b190 R14: ffff8804059a40a0 R15: 000000000000000a
> FS: 00007fe8a63a6700(0000) GS:ffff8804364c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000000000006c CR3: 0000000414564000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process sport (pid: 3268, threadinfo ffff8804075dc000, task ffff880414bbe1c0)
> Stack:
> 0000000000000000 ffff8804075c3d40 000000000000000e ffff880414d36780
> <0> ffff8804075c3d98 ffff8804075dd918 ffffffff815671ac 000000010003503a
> <0> ffff880410538700 0000000100000040 ffff8804121ca580 0000004012f9d89c
>
> After comparing the disassembled code, we found that location
> ixgbe_poll+0x9df points to this line marked in red below as showed in the
> following subroutine.
>
> static bool ixgbe_clean_rx_irq_ps(struct ixgbe_q_vector *q_vector,
> struct ixgbe_ring *rx_ring,
> int budget)
> {
> union ixgbe_adv_rx_desc *rx_desc;
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> const int current_node = numa_node_id();
> #ifdef IXGBE_FCOE
> struct ixgbe_adapter *adapter = q_vector->adapter;
> int ddp_bytes = 0;
> #endif /* IXGBE_FCOE */
> u16 i = rx_ring->next_to_clean;
> u16 cleaned_count = ixgbe_desc_unused(rx_ring);
>
> rx_desc = IXGBE_RX_DESC(rx_ring, i);
>
> while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
> union ixgbe_adv_rx_desc *next_rxd;
> struct ixgbe_rx_buffer *rx_buffer_info;
> struct sk_buff *skb;
> bool pkt_is_rsc, force_alloc = false;
>
> rx_buffer_info = &rx_ring->rx_buffer_info[i];
> skb = rx_buffer_info->skb;
> rx_buffer_info->skb = NULL;
>
> /*
> * This memory barrier is needed to keep us from reading
> * any other fields out of the rx_desc until we know the
> * RXD_STAT_DD bit is set
> */
> rmb();
>
> pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);
>
> prefetch(skb->data);
>
> /* pull the header of the skb in if no data is already
> present */
> if (!skb_is_nonlinear(skb)) {
> <-- This where the crash occurred.
> __skb_put(skb, ixgbe_get_hlen(rx_ring, rx_desc));
>
>
> The C source code shows that skb was de-referenced before the
>
> skb_is_nonlinear() is call and question was raised to why it did not crash on
> the prefetch line below.
>
>
>
> prefetch(skb->data); <--- line 2025
>
>
>
> /* pull the header of the skb in if no data is already present
>
> \
>
> */
>
> if (!skb_is_nonlinear(skb)) { <--- line 2028
>
> __skb_put(skb, ixgbe_get_hlen(rx_ring, rx_desc));
>
>
>
> Looking at the assembly code, the compiler seems to generate code that
> executes
>
> line 2028 before line 2025.
>
>
>
> ixgbe_clean_rx_irq_ps():
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2028
>
> 705f: 41 8b 74 24 6c mov 0x6c(%r12),%esi
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2025
>
> 7064: 49 8b 8c 24 b8 01 00 mov 0x1b8(%r12),%rcx
>
> 706b: 00
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2028
>
> 706c: 85 f6 test %esi,%esi
>
> /work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2025
>
> As can be seen from the following in the same routine, rx_buffer_info->skb is
> assigned by the following lines.
>
> #ifndef IXGBE_NO_LRO
> if (ixgbe_can_lro(rx_ring, rx_desc, skb)) {
> ixgbe_lro_receive(q_vector, skb);
> rx_buffer_info->skb =
> ixgbe_lro_recycled_skb(q_vector);
>
> } else {
> #endif
>
> Note that ixgbe_lro_recycled_skb() can return NULL if
> qvector->lrolist->recycled is empty.
>
>
> The crash is not very reproducible we can only see it a few times right after
> the machines were rebooted. If we were to run the machine in a warm state, we
> don't likely see this problem. I wonder if you have heard the similar report
> like this?
>
>
>
> There are two ways we are thinking of fixing this bug.
Actually there are two spots where rx_buffer_info->skb is populated.
One is in the LRO recycling, the other is in ixgbe_alloc_rx_buffers_ps.
If you look in there that is the function that should be allocating an
skb, mapping it and giving it to the ring, and then updating the tail.
Without the ixgbe_alloc_rx_buffers_ps call you will not have handed any
buffers to the device to be received.
The path you have gotten into should not actually be reachable via the
code and hardware, unless you are in some sort of odd state code wise
due to patch ordering or some other data corruption. Part of the
updates I pushed a while back was an update to the
ixgbe_alloc_rx_buffers code that made it so only it and the hardware
will actually write to the Rx descriptors. You might want to double
check your ring layout to make certain that you always have at least one
descriptor that has had it's status zeroed out. If that has not
happened one thing you could see is a situation where head passes over
tail which would result in a null pointer reference as you have described.
>
>
>
> First approach
>
> ---------------
>
> Please note that ixgbe_clean_rx_irq_bb() has a different way to deal with
> this problem. It checks if skb is NULL, if it is, it would try to the skb
> from the recycle q, if fails, it will call netdev_alloc_skb_ip_align() to get
> a skb. Would you suggest the following code (NULL_POINTER_FIX) is a proper
> way to fix this problem.
>
>
>
> while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
> union ixgbe_adv_rx_desc *next_rxd;
> struct ixgbe_rx_buffer *rx_buffer_info;
> struct sk_buff *skb;
> bool pkt_is_rsc;
>
> rx_buffer_info = &rx_ring->rx_buffer_info[i];
> skb = rx_buffer_info->skb;
> rx_buffer_info->skb = NULL;
>
> /*
> * This memory barrier is needed to keep us from reading
> * any other fields out of the rx_desc until we know the
> * RXD_STAT_DD bit is set
> */
> rmb();
>
> pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);
>
> #define NULL_POINTER_FIX
> #ifdef NULL_POINTER_FIX
>
> if (!skb) {
> printk("%s:%d no skb\n", __FUNCTION__, __LINE__);
> #ifndef IXGBE_NO_LRO
> /* retreive any recycled skbs first before allocating
> */
> skb = ixgbe_lro_recycled_skb(q_vector);
> if (!skb)
> #endif
> skb =
> netdev_alloc_skb_ip_align(netdev_ring(rx_ring),
>
> IXGBE_RX_HDR_SIZE);
> if (!skb) {
> printk("%s:%d cannot alloc skb\n",
> __FUNCTION__, __LINE__);
> rx_ring->rx_stats.alloc_rx_buff_failed++;
> break;
> }
>
> /*
> * Delay unmapping of the first packet. It carries the
> * header information, HW may still access the header
> * after the writeback. Only unmap it when EOP is
> * reached
> */
> IXGBE_CB(skb)->dma = rx_buffer_info->page_dma;
>
> /* initialize skb for ring */
> skb_record_rx_queue(skb, ring_queue_index(rx_ring));
>
> }
> #endif // NULL_POINTER_FIX
The page based receive approach isn't really compatible with the legacy
packet split receive path. The problem is that in packet split the head
is expected to be written to the skb->data and without that you will
receive a malformed frame if you hit the error you are reporting. Since
you are trying to copy the bounce buffer path does that mean that the
bounce buffer path does not have this issue?
Also what hardware are you running this driver on? The 82599 has HW
errata #45 that says you should not be using packet split because it can
result in unpredictable behaviour. That was one of the reasons for
implementing the bounce buffer path in the first place.
> Second Approach
> ---------------------
> Another approach we have in mind is to force ixgbe_alloc_rx_buffer_ps() to be
> called whenever the recycle queue is empty.
>
> while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
> ...
> bool pkt_is_rsc, force_alloc = false; // added
> force_alloc
> ...
>
> #ifndef IXGBE_NO_LRO
> if (ixgbe_can_lro(rx_ring, rx_desc, skb)) { // force
> allocation whenever the recycle q is empty
> ixgbe_lro_receive(q_vector, skb);
> rx_buffer_info->skb =
> ixgbe_lro_recycled_skb(q_vector);
> if (rx_buffer_info->skb == NULL) {
> rx_buffer_info->dma = 0; /* XXX */
> printk(KERN_CRIT "%s: forcing allocation\n",
> __func__);
> force_alloc = true;
> }
> ...
> /* return some buffers to hardware, one at a time is too slow
> */ // when force_alloc is true, we would get more skb
> if (force_alloc || cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
> ixgbe_alloc_rx_buffers_ps(rx_ring, cleaned_count);
> cleaned_count = 0;
> }
>
> ...
> }
>
>
> This is crash is tough to reproduce so it is very hard to test again either
> one of the proposed fix above.
> Would you please kindly make any suggestion to the fix of this problem?
This doesn't do anything because the actual allocation should be
happening in ixgbe_alloc_rx_buffers_ps.
It seems like there may be some sort of data corruption going on since
the descriptor is reporting that the DD bit is set, but there is no
sk_buff pointer in the buffer_info. For now your best workaround would
be to simply stop cleaning the ring if you get to the case where the DD
bit is set and the skb pointer is NULL. You will likely also not want
to allocate any new buffers on the ring. You could then add some code
to dump the ring, descriptors, and buffer info. This should provide us
with significantly more information on how the driver could have gotten
into such a state. You could then reset the adapter after such an event
since that is a fairly serious sign of some sort of data corruption.
Thanks,
Alex
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired