Hi,
This is to report a bug we found in the Intel 10 Gigabit PCI Express Linux
driver (ixgbe).
Our driver has been patched up to version 3.7.21.
Recently, we hit a crash with the following dump.
BUG: unable to handle kernel NULL pointer dereference at 000000000000006c
IP: [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1720 [ixgbe]
PGD 3fc6cc067 PUD 3fc223067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/bypass/10-11/ping_watchdog
CPU 9
Pid: 3268, comm: sport Tainted: P ---------------- 2.6.32-perf #1
To Be Filled By O.E.M.
RIP: 0010:[<ffffffffa005afef>] [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1720
[ixgbe]
RSP: 0018:ffff8804075dd8b0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880410e90680 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffffc90004ca6000 RDI: ffff8804121ca580
RBP: ffff8804075dd970 R08: 0000000000000100 R09: 0000000000000000
R10: 0000000000000100 R11: 0000000000000000 R12: 0000000000000000
R13: ffffc9000479b190 R14: ffff8804059a40a0 R15: 000000000000000a
FS: 00007fe8a63a6700(0000) GS:ffff8804364c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000006c CR3: 0000000414564000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sport (pid: 3268, threadinfo ffff8804075dc000, task ffff880414bbe1c0)
Stack:
0000000000000000 ffff8804075c3d40 000000000000000e ffff880414d36780
<0> ffff8804075c3d98 ffff8804075dd918 ffffffff815671ac 000000010003503a
<0> ffff880410538700 0000000100000040 ffff8804121ca580 0000004012f9d89c
After comparing against the disassembled code, we found that ixgbe_poll+0x9df
points to the line marked below in the following subroutine.
static bool ixgbe_clean_rx_irq_ps(struct ixgbe_q_vector *q_vector,
                                  struct ixgbe_ring *rx_ring,
                                  int budget)
{
        union ixgbe_adv_rx_desc *rx_desc;
        unsigned int total_rx_bytes = 0, total_rx_packets = 0;
        const int current_node = numa_node_id();
#ifdef IXGBE_FCOE
        struct ixgbe_adapter *adapter = q_vector->adapter;
        int ddp_bytes = 0;
#endif /* IXGBE_FCOE */
        u16 i = rx_ring->next_to_clean;
        u16 cleaned_count = ixgbe_desc_unused(rx_ring);

        rx_desc = IXGBE_RX_DESC(rx_ring, i);

        while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
                union ixgbe_adv_rx_desc *next_rxd;
                struct ixgbe_rx_buffer *rx_buffer_info;
                struct sk_buff *skb;
                bool pkt_is_rsc, force_alloc = false;

                rx_buffer_info = &rx_ring->rx_buffer_info[i];
                skb = rx_buffer_info->skb;
                rx_buffer_info->skb = NULL;

                /*
                 * This memory barrier is needed to keep us from reading
                 * any other fields out of the rx_desc until we know the
                 * RXD_STAT_DD bit is set
                 */
                rmb();

                pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);

                prefetch(skb->data);

                /* pull the header of the skb in if no data is already present */
                if (!skb_is_nonlinear(skb)) {        <-- This is where the crash occurred.
                        __skb_put(skb, ixgbe_get_hlen(rx_ring, rx_desc));
The C source code shows that skb is dereferenced (via prefetch(skb->data)) before
skb_is_nonlinear() is called, so the question was raised as to why the crash did
not occur on the prefetch line instead.

                prefetch(skb->data);                          <--- line 2025

                /* pull the header of the skb in if no data is already present */
                if (!skb_is_nonlinear(skb)) {                 <--- line 2028
                        __skb_put(skb, ixgbe_get_hlen(rx_ring, rx_desc));
Looking at the assembly code, it appears the compiler generated code that executes
the load for line 2028 before the one for line 2025.
ixgbe_clean_rx_irq_ps():
/work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2028
705f: 41 8b 74 24 6c mov 0x6c(%r12),%esi
/work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2025
7064: 49 8b 8c 24 b8 01 00 mov 0x1b8(%r12),%rcx
706b: 00
/work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2028
706c: 85 f6 test %esi,%esi
/work/sla/kernel_trunk/drivers/net/ixgbe/ixgbe_main.c:2025
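For reference, here is a minimal sketch (not the exact kernel source; the _sketch
helpers below are ours) of why the fault lands on the skb_is_nonlinear() line rather
than on the prefetch. skb_is_nonlinear() reads skb->data_len, which in this build
apparently sits at offset 0x6c of struct sk_buff; with skb == NULL that load faults,
matching CR2 = 000000000000006c and the mov 0x6c(%r12),%esi above. prefetch(), on the
other hand, only emits a prefetch hint, which the CPU silently drops for a bad address
and which therefore never faults. Since the two accesses are independent, the compiler
is free to schedule the data_len load first, as seen in the disassembly.

static inline int skb_is_nonlinear_sketch(const struct sk_buff *skb)
{
        /* With skb == NULL this load faults at the data_len offset (0x6c here). */
        return skb->data_len;
}

static inline void prefetch_sketch(const void *x)
{
        /* A prefetch is only a hint; it does not raise a page fault for a bad address. */
        __builtin_prefetch(x);
}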
Later in the same routine, rx_buffer_info->skb is assigned by the following lines.
#ifndef IXGBE_NO_LRO
                if (ixgbe_can_lro(rx_ring, rx_desc, skb)) {
                        ixgbe_lro_receive(q_vector, skb);
                        rx_buffer_info->skb = ixgbe_lro_recycled_skb(q_vector);
                } else {
#endif
Note that ixgbe_lro_recycled_skb() can return NULL if
q_vector->lrolist->recycled is empty.
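Just to spell out the scenario we suspect, here is a rough sketch of that behaviour
(the real function lives in the out-of-tree LRO code; the sk_buff_head type and queue
helpers below are our assumption, not copied from the driver source):

static struct sk_buff *lro_recycled_skb_sketch(struct ixgbe_q_vector *q_vector)
{
        /* Pop one skb from the recycled list, if there is one. */
        if (!skb_queue_empty(&q_vector->lrolist->recycled))
                return __skb_dequeue(&q_vector->lrolist->recycled);

        /* Empty recycled list: the caller stores NULL into rx_buffer_info->skb. */
        return NULL;
}

If that ring entry is later cleaned again before the buffer is refilled, skb =
rx_buffer_info->skb would pick up NULL and the skb_is_nonlinear(skb) check would
fault as shown above, which is consistent with what we observed.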
The crash is not very reproducible; we have only seen it a few times, right after
the machines were rebooted. Once a machine has been running in a warm state, we are
unlikely to see this problem. Have you heard of any similar reports?
There are two ways we are thinking of fixing this bug.
First approach
---------------
Please note that ixgbe_clean_rx_irq_bb() deals with this problem differently. It
checks whether skb is NULL; if it is, it first tries to get an skb from the recycle
queue, and if that fails it calls netdev_alloc_skb_ip_align() to allocate one.
Would you say the following code (NULL_POINTER_FIX) is a proper way to fix
this problem?
        while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
                union ixgbe_adv_rx_desc *next_rxd;
                struct ixgbe_rx_buffer *rx_buffer_info;
                struct sk_buff *skb;
                bool pkt_is_rsc;

                rx_buffer_info = &rx_ring->rx_buffer_info[i];
                skb = rx_buffer_info->skb;
                rx_buffer_info->skb = NULL;

                /*
                 * This memory barrier is needed to keep us from reading
                 * any other fields out of the rx_desc until we know the
                 * RXD_STAT_DD bit is set
                 */
                rmb();

                pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);

#define NULL_POINTER_FIX
#ifdef NULL_POINTER_FIX
                if (!skb) {
                        printk("%s:%d no skb\n", __FUNCTION__, __LINE__);
#ifndef IXGBE_NO_LRO
                        /* retrieve any recycled skbs first before allocating */
                        skb = ixgbe_lro_recycled_skb(q_vector);
                        if (!skb)
#endif
                                skb = netdev_alloc_skb_ip_align(netdev_ring(rx_ring),
                                                                IXGBE_RX_HDR_SIZE);
                        if (!skb) {
                                printk("%s:%d cannot alloc skb\n",
                                       __FUNCTION__, __LINE__);
                                rx_ring->rx_stats.alloc_rx_buff_failed++;
                                break;
                        }
                        /*
                         * Delay unmapping of the first packet. It carries the
                         * header information, HW may still access the header
                         * after the writeback. Only unmap it when EOP is
                         * reached
                         */
                        IXGBE_CB(skb)->dma = rx_buffer_info->page_dma;
                        /* initialize skb for ring */
                        skb_record_rx_queue(skb, ring_queue_index(rx_ring));
                }
#endif // NULL_POINTER_FIX
Second Approach
---------------------
Another approach we have in mind is to force ixgbe_alloc_rx_buffers_ps() to be
called whenever the recycle queue is empty.
        while (ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) {
                ...
                bool pkt_is_rsc, force_alloc = false;   // added force_alloc
                ...
#ifndef IXGBE_NO_LRO
                // force allocation whenever the recycle q is empty
                if (ixgbe_can_lro(rx_ring, rx_desc, skb)) {
                        ixgbe_lro_receive(q_vector, skb);
                        rx_buffer_info->skb = ixgbe_lro_recycled_skb(q_vector);
                        if (rx_buffer_info->skb == NULL) {
                                rx_buffer_info->dma = 0; /* XXX */
                                printk(KERN_CRIT "%s: forcing allocation\n",
                                       __func__);
                                force_alloc = true;
                        }
                ...
                /* return some buffers to hardware, one at a time is too slow */
                // when force_alloc is true, we would get more skbs
                if (force_alloc || cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
                        ixgbe_alloc_rx_buffers_ps(rx_ring, cleaned_count);
                        cleaned_count = 0;
                }
                ...
        }
This crash is tough to reproduce, so it is very hard to test either of the proposed
fixes above.
Could you please suggest how best to fix this problem?
Thank you very much.
--Steven