From: Stephen Irons
Sent: Tuesday, March 24, 2009 4:18 AM
As reported a few days ago, my DVEVM was producing a kernel panic with kernel
version 2.6.25-davinci1, 2.6.26-rc5 and 2.6.27-davinci1. It was consistently in
net_rx_action. It would happen after anything from a few seconds up to about 4
hours of operation. A copy of the console log is attached at the end of this
email.
This happened only when the DVEVM was attached to a corporate network with
thousands of packets per second. On a quiet, development network, this problem
did
not happen.
I have discovered a few issues in drivers/net/davinci_emac.c. These problems
have
not been fixed in kernel 2.6.28 or 2.6.29. In the descriptions below, line
numbers
refer to the head of Davinci GIT.
This email is a description of the problem and some solutions. Patches will come
shortly.
Stephen Irons
Senior Engineer
Tait Electronics Ltd
Function emac_poll(), around line 2189:
Original code:
if (txpending || rxpending) {
if (likely(netif_rx_schedule_prep(&priv->napi))) {
emac_int_disable(priv);
__netif_rx_schedule(&priv->napi);
}
} else {
netif_rx_complete(napi);
emac_int_enable(priv);
}
This code tries to decide whether to perform another NAPI poll (if there is
still
transmit or receive work to do), or if it should stop NAPI polling and re-enable
the EMAC interrupt. If there is no more work to do (both txpending and rxpending
are false), it removes the NAPI poll from the poll list, and re-enables the EMAC
interrupt.
Now, in the rare situation that the driver performs EXACTLY the budgetted
amount of
work and there is now no more to do, then it violates the comment in
net/core/dev.c, function net_rx_action(),
/* Drivers must not modify the NAPI state if they
* consume the entire weight. In such cases this code
* still "owns" the NAPI instance and therefore can
* move the instance around on the list at-will.
*/
if (unlikely(work == weight)) {
if (unlikely(napi_disable_pending(n)))
__napi_complete(n);
else
list_move_tail(&n->poll_list, list);
}
This code executes list_move_tail() to move the NAPI poll to the end of the
list.
However, in the rare situation described above, the NAPI poll has already been
moved, and list_move_tail() tries to access a list-item that has been deleted.
The
->next and ->prev pointers are set to the LIST_POISON value (0x00100100) and
when
these are followed, the corresponding addresses are not mapped into virtual
memory
and the kernel panic described above occurs.
Hmm, good catch!
Solution
In function emac_poll(), around line 2189, only disable the NAPI poll if there
is
no more work to do AND you have not consumed exactly your budget:
if (txpending || rxpending) {
if (likely(netif_rx_schedule_prep(&priv->napi))) {
emac_int_disable(priv);
__netif_rx_schedule(&priv->napi);
}
} else {
if (num_pkts < budget) {
netif_rx_complete(napi);
emac_int_enable(priv);
}
}
Right, seems like this is what is being done in other drivers.
Drivers/net/e100.c seems a good example to follow. Looking at that driver, the
call to __netif_rx_schedule above could be superfluous as well. The poll
function will anyway keep getting called until it is knocked off the poll list
by driver poll function or by net_rx_action.
This has the effect that there will be one more poll even though there is no
more
work to do. However, the next time around it will process no packets (because
there
are no more) and will disable the NAPI poll.
Testing
This problem usually occurs on a busy network. However, it is possible to cause
the
problem to happen on a quiet network by reducing the Davinci EMAC driver poll
weight from 64 down to 4 or 8. If set to 4, my DVEVM would not even get to the
login: prompt before crashing (booting from NFS-mounted root FS). If set to 8,
it
would often get to the login prompt, but would crash when flood-pinging the
board.
In drivers/net/davinci_emac.c:
#define EMAC_POLL_WEIGHT (4) /* Default NAPI poll weight */
/* Buffer descriptor parameters */
#define EMAC_DEF_TX_MAX_SERVICE (4) /* TX max service BD's */
#define EMAC_DEF_RX_MAX_SERVICE (4) /* should = netdev->weight */
This works by making the emac_poll() function consumes its entire budget more
often. When set to the default of 64, it very seldom consumed its budget,
except on
a busy network.
It does also load up the processor significantly: it spends 50--60% of its time
in
doing soft irq activity.
Function emac_poll(), around line 2137:
The change above highlighted another issue: the transmit function can, and often
will, do MORE than the budgetted amount of work in one poll.
if (status & mask) {
num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
EMAC_DEF_TX_MAX_SERVICE,
&txpending);
} /* TX processing */
EMAC_DEF_TX_MAX_SERVICE is set to 32. If there is lots of transmit work to do,
it
will do all of it, irrespective of the budget. If there is no receive work to
do,
emac_poll() will return the number of transmit packets processed. This could be
a
value larger than its budget. This is a bad thing to do:
* it will earn the driver a warning in dev.c, function net_rx_action(), when it
executes WARN_ON_ONCE(work > weight);
* it will cause untold chaos in netpoll.c, function poll_one_napi() and
poll_napi(), because
o poll_one_napi() will return a negative number (budget - work)
o poll_napi() will keep on looping when it should have stopped
if (!budget)
break;
Solution
In drivers/net/davinci_emac.c, function emac_poll(), ensure that transmit
processing also honours the budget.
if (status & mask) {
num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
budget,
&txpending);
} /* TX processing */
I think NAPI was meant for RX side only. Tx is supposed to throttle itself
since that traffic originates from the system itself.
Again e100 looks a good example to follow. It allows Tx to piggyback the poll
function, but Tx does not influence the work done. So in our case, not setting
num_pkts for Tx case should be good?
Thanks,
Sekhar
In net/core/netpoll.c, function poll_napi(), change the test for end of loop to
if
(budget <= 0). This is defensive coding to ensure that buggy drivers do not make
poll_napi() loop too many times.
if (budget <= 0)
break;
Additional problem
Consider the situation where there is both transmit and receive work to do. It
will
process up to the budgetted number of transmit AND receive packets. However,
num_pkts is set to the number of receive packets processed, so it lies about the
amount of work it has done. This is not a serious problem. No-one else in the
kernel actually knows or cares how much time we spend doing stuff.
Ideally, the emac_poll() function would:
* Find out how much transmit and receive work is queued.
* Portion this out between transmit and receive.
* Tell the caller how much work it actually did.
In practice, it might be acceptable to:
* Do up to the budgetted amount of receive work
* Do the remainder of the budget in transmit work
* Tell the caller how much work it actually did.
mask = EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC;
if (priv->version == EMAC_VERSION_2)
mask = EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC;
if (status & mask) {
num_rx_pkts = emac_rx_bdproc(priv, EMAC_DEF_RX_CH,
budget, &rxpending);
} /* RX processing */
tx_budget = budget - num_rx_pkts;
mask = EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC;
if (priv->version == EMAC_VERSION_2)
mask = EMAC_DM646X_MAC_IN_VECTOR_TX_INT_VEC;
if (status & mask) {
num_tx_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
tx_budget,
&txpending);
} /* TX processing */
I have not tested the above. It might just make the Davinci EMAC driver a bit
more
friendly.
Console log
20090306_153918 < r...@davinci-dcb:~# uname -r. .a
20090306_153918 < Linux davinci-dcb 2.6.27-davinci1-svn17242 #3 PREEMPT Fri Mar
6
09:34:36 NZDT 2009 armv5tejl unknown
20090306_153918 < r...@davinci-dcb:~#
20090306_161241 < r...@davinci-dcb:~# Unable to handle kernel paging request at
virtual address 00100104
20090306_161241 < pgd = c0004000
20090306_161241 < [00100104] *pgd=00000000
20090306_161241 < Internal error: Oops: 817 [#1] PREEMPT
20090306_161241 < Modules linked in:
20090306_161241 < CPU: 0 Not tainted (2.6.27-davinci1-svn17242 #3)
20090306_161241 < PC is at net_rx_action+0x158/0x260
20090306_161241 < LR is at emac_poll+0x5bc/0x868
20090306_161241 < pc : [<c01845ec>] lr : [<c015f3ac>] psr: 60000093
20090306_161241 < sp : c0289e98 ip : c0289e38 fp : c0289ecc
20090306_161241 < r10: 0000012c r9 : c02c2464 r8 : 00000040
20090306_161241 < r7 : c0288000 r6 : 00000000 r5 : 00000040 r4 : c30a238c
20090306_161241 < r3 : 00200200 r2 : 00100100 r1 : 00000100 r0 : 00000040
20090306_161241 < Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment
kernel
20090306_161241 < Control: 0005317f Table: 8318c000 DAC: 00000017
20090306_161241 < Process swapper (pid: 0, stack limit = 0xc0288268)
20090306_161241 < Stack: (0xc0289e98 to 0xc028a000)
20090306_161241 < 9e80:
c0289ebc 000320ce
20090306_161241 < 9ea0: c0180e48 c02a57c0 00000001 0000000a c02a5780 00000001
c02b47ec 00000000
20090306_161241 < 9ec0: c0289efc c0289ed0 c003df44 c01844a4 c3144e20 0000000d
c028f5ec 00000000
20090306_161241 < 9ee0: 00000002 00000001 c0288000 8001fd9c c0289f14 c0289f00
c003e338 c003deec
20090306_161241 < 9f00: 00000002 0000000d c0289f34 c0289f18 c002404c c003e2fc
c0063140 ffffffff
20090306_161241 < 9f20: fec48000 c028bdb8 c0289f8c c0289f38 c0024868 c0024010
00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c
c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc
60000013 ffffffff
20090306_161241 < 9f80: c0289fb4 c0289f90 c0025c58 c0025d9c c029eb68 c02b4a8c
c029eb68 c0021f14
20090306_161241 < 9fa0: c028bc50 8001fdd0 c0289fc4 c0289fb8 c01fa980 c0025c10
c0289ff4 c0289fc8
20090306_161241 < 9fc0: c0008a28 c01fa928 c0008394 00000000 00000000 c0021f14
00000000 00053175
20090306_161241 < 9fe0: c029f02c c0022318 00000000 c0289ff8 80008034 c00087cc
00000000 00000000
20090306_161241 < Backtrace:
20090306_161241 < [<c0184494>] (net_rx_action+0x0/0x260) from [<c003df44>]
(__do_softirq+0x68/0xd4)
20090306_161241 < [<c003dedc>] (__do_softirq+0x0/0xd4) from [<c003e338>]
(irq_exit+0x4c/0x9c)
20090306_161241 < [<c003e2ec>] (irq_exit+0x0/0x9c) from [<c002404c>]
(__exception_text_start+0x4c/0x64)
20090306_161241 < r4:0000000d
20090306_161241 < [<c0024000>] (__exception_text_start+0x0/0x64) from
[<c0024868>]
(__irq_svc+0x48/0x88)
20090306_161241 < Exception stack(0xc0289f38 to 0xc0289f80)
20090306_161241 < 9f20:
00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c
c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc
60000013 ffffffff
20090306_161241 < r6:c028bdb8 r5:fec48000 r4:ffffffff
20090306_161241 < [<c0025d8c>] (default_idle+0x0/0x58) from [<c0025c58>]
(cpu_idle+0x58/0x98)
20090306_161241 < [<c0025c00>] (cpu_idle+0x0/0x98) from [<c01fa980>]
(rest_init+0x68/0x7c)
20090306_161241 < r8:8001fdd0 r7:c028bc50 r6:c0021f14 r5:c029eb68 r4:c02b4a8c
20090306_161241 < [<c01fa918>] (rest_init+0x0/0x7c) from [<c0008a28>]
(start_kernel+0x26c/0x2dc)
20090306_161241 < [<c00087bc>] (start_kernel+0x0/0x2dc) from [<80008034>]
(0x80008034)
20090306_161241 < r6:c0022318 r5:c029f02c r4:00053175
20090306_161241 < Code: e5843008 e121f002 ea000008 e894000c (e5823004)
20090306_161241 < Kernel panic - not syncing: Fatal exception in interrupt
________________________________________
This email, including any attachments, is only for the intended addressee. It is
subject to copyright, confidential and may be the subject of legal or other
privilege, none of which is waived or lost by reason of this transmission. If
the
receiver is not the intended addressee, please accept our apologies, notify us
by
return, delete all copies and perform no other act on the email. Unfortunately,
we
cannot warrant that the email has not been altered or corrupted during
transmission.
________________________________________