As reported a few days ago, my DVEVM was producing a kernel panic with
kernel version 2.6.25-davinci1, 2.6.26-rc5 and 2.6.27-davinci1. It was
consistently in net_rx_action. It would happen after anything from a
few seconds up to about 4 hours of operation. A copy of the console log
is attached at the end of this email.
This happened only when the DVEVM was attached to a corporate network
with thousands of packets per second. On a quiet, development network,
this problem did not happen.
I have discovered a few issues in drivers/net/davinci_emac.c. These
problems have not been fixed in kernel 2.6.28 or 2.6.29. In the
descriptions below, line numbers refer to the head of Davinci GIT.
This email is a description of the problem and some solutions. Patches
will come shortly.
Stephen Irons
Senior Engineer
Tait Electronics Ltd
Function emac_poll(), around line 2189:
Original code:
if (txpending || rxpending) {
if (likely(netif_rx_schedule_prep(&priv->napi))) {
emac_int_disable(priv);
__netif_rx_schedule(&priv->napi);
}
} else {
netif_rx_complete(napi);
emac_int_enable(priv);
}
This code tries to decide whether to perform another NAPI poll (if
there is still transmit or receive work to do), or if it should stop
NAPI polling and re-enable the EMAC interrupt. If there is no more work
to do (both txpending and rxpending are false), it removes the NAPI
poll from the poll list, and re-enables the EMAC interrupt.
Now, in the rare situation that the driver performs EXACTLY the
budgetted amount of work and there is now no more to do, then it
violates the comment in net/core/dev.c, function net_rx_action(),
/* Drivers must not modify the NAPI state if they
* consume the entire weight. In such cases this code
* still "owns" the NAPI instance and therefore can
* move the instance around on the list at-will.
*/
if (unlikely(work == weight)) {
if (unlikely(napi_disable_pending(n)))
__napi_complete(n);
else
list_move_tail(&n->poll_list, list);
}
This code executes list_move_tail() to move the NAPI poll to the end of
the list. However, in the rare situation described above, the NAPI poll
has already been moved, and list_move_tail() tries to access a
list-item that has been deleted. The ->next and ->prev pointers
are set to the LIST_POISON value (0x00100100) and when these are
followed, the corresponding addresses are not mapped into virtual
memory and the kernel panic described above occurs.
Solution
In function emac_poll(), around line 2189, only disable the NAPI poll
if there is no more work to do AND you have not consumed exactly your
budget:
if (txpending || rxpending) {
if (likely(netif_rx_schedule_prep(&priv->napi))) {
emac_int_disable(priv);
__netif_rx_schedule(&priv->napi);
}
} else {
if (num_pkts < budget) {
netif_rx_complete(napi);
emac_int_enable(priv);
}
}
This has the effect that there will be one more poll even though there
is no more work to do. However, the next time around it will process no
packets (because there are no more) and will disable the NAPI poll.
Testing
This problem usually occurs on a busy network. However, it is possible
to cause the problem to happen on a quiet network by reducing the
Davinci EMAC driver poll weight from 64 down to 4 or 8. If set to 4, my
DVEVM would not even get to the login: prompt before crashing (booting
from NFS-mounted root FS). If set to 8, it would often get to the login
prompt, but would crash when flood-pinging the board.
In drivers/net/davinci_emac.c:
#define EMAC_POLL_WEIGHT (4) /* Default NAPI poll weight */
/* Buffer descriptor parameters */
#define EMAC_DEF_TX_MAX_SERVICE (4) /* TX max service BD's */
#define EMAC_DEF_RX_MAX_SERVICE (4) /* should = netdev->weight */
This works by making the emac_poll() function consumes its entire
budget more often. When set to the default of 64, it very seldom
consumed its budget, except on a busy network.
It does also load up the processor significantly: it spends 50--60% of
its time in doing soft irq activity.
Function emac_poll(), around line 2137:
The change above highlighted another issue: the transmit function can,
and often will, do MORE than the budgetted amount of work in one poll.
if (status & mask) {
num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
EMAC_DEF_TX_MAX_SERVICE,
&txpending);
} /* TX processing */
EMAC_DEF_TX_MAX_SERVICE is set to 32. If there is lots of transmit work
to do, it will do all of it, irrespective of the budget. If there is no
receive work to do, emac_poll() will return the number of transmit
packets processed. This could be a value larger than its budget. This
is a bad thing to do:
- it will earn the driver a warning in dev.c, function
net_rx_action(), when it executes WARN_ON_ONCE(work > weight);
- it will cause untold chaos in netpoll.c, function poll_one_napi()
and poll_napi(), because
- poll_one_napi() will return a negative number (budget - work)
- poll_napi() will keep on looping when it should have stopped
if (!budget)
break;
Solution
In drivers/net/davinci_emac.c, function emac_poll(), ensure that
transmit processing also honours the budget.
if (status & mask) {
num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
budget,
&txpending);
} /* TX processing */
In net/core/netpoll.c, function poll_napi(), change the test for end of
loop to if (budget <= 0). This is defensive coding to ensure that
buggy drivers do not make poll_napi() loop too many times.
if (budget <= 0)
break;
Additional problem
Consider the situation where there is both transmit and receive work to
do. It will process up to the budgetted number of transmit AND receive
packets. However, num_pkts is set to the number of receive packets
processed, so it lies about the amount of work it has done. This is not
a serious problem. No-one else in the kernel actually knows or cares
how much time we spend doing stuff.
Ideally, the emac_poll() function would:
- Find out how much transmit and receive work is queued.
- Portion this out between transmit and receive.
- Tell the caller how much work it actually did.
In practice, it might be acceptable to:
- Do up to the budgetted amount of receive work
- Do the remainder of the budget in transmit work
- Tell the caller how much work it actually did.
mask = EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC;
if (priv->version == EMAC_VERSION_2)
mask = EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC;
if (status & mask) {
num_rx_pkts = emac_rx_bdproc(priv, EMAC_DEF_RX_CH,
budget, &rxpending);
} /* RX processing */
tx_budget = budget - num_rx_pkts;
mask = EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC;
if (priv->version == EMAC_VERSION_2)
mask = EMAC_DM646X_MAC_IN_VECTOR_TX_INT_VEC;
if (status & mask) {
num_tx_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
tx_budget,
&txpending);
} /* TX processing */
I have not tested the above. It might just make the Davinci EMAC driver
a bit more friendly.
Console log
20090306_153918 < r...@davinci-dcb:~# uname -r. .a
20090306_153918 < Linux davinci-dcb 2.6.27-davinci1-svn17242 #3
PREEMPT Fri Mar 6 09:34:36 NZDT 2009 armv5tejl unknown
20090306_153918 < r...@davinci-dcb:~#
20090306_161241 < r...@davinci-dcb:~# Unable to handle kernel paging
request at virtual address 00100104
20090306_161241 < pgd = c0004000
20090306_161241 < [00100104] *pgd=00000000
20090306_161241 < Internal error: Oops: 817 [#1] PREEMPT
20090306_161241 < Modules linked in:
20090306_161241 < CPU: 0 Not tainted (2.6.27-davinci1-svn17242
#3)
20090306_161241 < PC is at net_rx_action+0x158/0x260
20090306_161241 < LR is at emac_poll+0x5bc/0x868
20090306_161241 < pc : [<c01845ec>] lr :
[<c015f3ac>] psr: 60000093
20090306_161241 < sp : c0289e98 ip : c0289e38 fp : c0289ecc
20090306_161241 < r10: 0000012c r9 : c02c2464 r8 : 00000040
20090306_161241 < r7 : c0288000 r6 : 00000000 r5 : 00000040 r4 :
c30a238c
20090306_161241 < r3 : 00200200 r2 : 00100100 r1 : 00000100 r0 :
00000040
20090306_161241 < Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA
ARM Segment kernel
20090306_161241 < Control: 0005317f Table: 8318c000 DAC: 00000017
20090306_161241 < Process swapper (pid: 0, stack limit = 0xc0288268)
20090306_161241 < Stack: (0xc0289e98 to 0xc028a000)
20090306_161241 <
9e80: c0289ebc
000320ce
20090306_161241 < 9ea0: c0180e48 c02a57c0 00000001 0000000a c02a5780
00000001 c02b47ec 00000000
20090306_161241 < 9ec0: c0289efc c0289ed0 c003df44 c01844a4 c3144e20
0000000d c028f5ec 00000000
20090306_161241 < 9ee0: 00000002 00000001 c0288000 8001fd9c c0289f14
c0289f00 c003e338 c003deec
20090306_161241 < 9f00: 00000002 0000000d c0289f34 c0289f18 c002404c
c003e2fc c0063140 ffffffff
20090306_161241 < 9f20: fec48000 c028bdb8 c0289f8c c0289f38 c0024868
c0024010 00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8
c0025d8c c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0
c0025ddc 60000013 ffffffff
20090306_161241 < 9f80: c0289fb4 c0289f90 c0025c58 c0025d9c c029eb68
c02b4a8c c029eb68 c0021f14
20090306_161241 < 9fa0: c028bc50 8001fdd0 c0289fc4 c0289fb8 c01fa980
c0025c10 c0289ff4 c0289fc8
20090306_161241 < 9fc0: c0008a28 c01fa928 c0008394 00000000 00000000
c0021f14 00000000 00053175
20090306_161241 < 9fe0: c029f02c c0022318 00000000 c0289ff8 80008034
c00087cc 00000000 00000000
20090306_161241 < Backtrace:
20090306_161241 < [<c0184494>] (net_rx_action+0x0/0x260) from
[<c003df44>] (__do_softirq+0x68/0xd4)
20090306_161241 < [<c003dedc>] (__do_softirq+0x0/0xd4) from
[<c003e338>] (irq_exit+0x4c/0x9c)
20090306_161241 < [<c003e2ec>] (irq_exit+0x0/0x9c) from
[<c002404c>] (__exception_text_start+0x4c/0x64)
20090306_161241 < r4:0000000d
20090306_161241 < [<c0024000>]
(__exception_text_start+0x0/0x64) from [<c0024868>]
(__irq_svc+0x48/0x88)
20090306_161241 < Exception stack(0xc0289f38 to 0xc0289f80)
20090306_161241 <
9f20: 00000000
0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8
c0025d8c c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0
c0025ddc 60000013 ffffffff
20090306_161241 < r6:c028bdb8 r5:fec48000 r4:ffffffff
20090306_161241 < [<c0025d8c>] (default_idle+0x0/0x58) from
[<c0025c58>] (cpu_idle+0x58/0x98)
20090306_161241 < [<c0025c00>] (cpu_idle+0x0/0x98) from
[<c01fa980>] (rest_init+0x68/0x7c)
20090306_161241 < r8:8001fdd0 r7:c028bc50 r6:c0021f14 r5:c029eb68
r4:c02b4a8c
20090306_161241 < [<c01fa918>] (rest_init+0x0/0x7c) from
[<c0008a28>] (start_kernel+0x26c/0x2dc)
20090306_161241 < [<c00087bc>] (start_kernel+0x0/0x2dc) from
[<80008034>] (0x80008034)
20090306_161241 < r6:c0022318 r5:c029f02c r4:00053175
20090306_161241 < Code: e5843008 e121f002 ea000008 e894000c
(e5823004)
20090306_161241 < Kernel panic - not syncing: Fatal exception in
interrupt
This email, including any attachments, is only for the intended addressee. It is subject to copyright, confidential and may be the subject of legal or other privilege, none of which is waived or lost by reason of this transmission. If the receiver is not the intended addressee, please accept our apologies, notify us by return, delete all copies and perform no other act on the email. Unfortunately, we cannot warrant that the email has not been altered or corrupted during transmission.
|