Kernel panic in net_rx_action + solutions (patches coming)

Stephen Irons Mon, 23 Mar 2009 15:48:00 -0700

As reported a few days ago, my DVEVM was producing a kernel panic with kernel version 2.6.25-davinci1, 2.6.26-rc5 and 2.6.27-davinci1. It was consistently in net_rx_action. It would happen after anything from a few seconds up to about 4 hours of operation. A copy of the console log is attached at the end of this email.

This happened only when the DVEVM was attached to a corporate network with thousands of packets per second. On a quiet, development network, this problem did not happen.

I have discovered a few issues in drivers/net/davinci_emac.c. These problems have not been fixed in kernel 2.6.28 or 2.6.29. In the descriptions below, line numbers refer to the head of Davinci GIT.

This email is a description of the problem and some solutions. Patches will come shortly.

Stephen Irons
Senior Engineer
Tait Electronics Ltd

Function emac_poll(), around line 2189:

Original code:

	if (txpending || rxpending) {
		if (likely(netif_rx_schedule_prep(&priv->napi))) {
			emac_int_disable(priv);
			__netif_rx_schedule(&priv->napi);
		}
	} else {
		netif_rx_complete(napi);
		emac_int_enable(priv);
	}

This code tries to decide whether to perform another NAPI poll (if there is still transmit or receive work to do), or if it should stop NAPI polling and re-enable the EMAC interrupt. If there is no more work to do (both txpending and rxpending are false), it removes the NAPI poll from the poll list, and re-enables the EMAC interrupt.

Now, in the rare situation that the driver performs EXACTLY the budgetted amount of work and there is now no more to do, then it violates the comment in net/core/dev.c, function net_rx_action(),

		/* Drivers must not modify the NAPI state if they
		 * consume the entire weight.  In such cases this code
		 * still "owns" the NAPI instance and therefore can
		 * move the instance around on the list at-will.
		 */
		if (unlikely(work == weight)) {
			if (unlikely(napi_disable_pending(n)))
				__napi_complete(n);
			else
				list_move_tail(&n->poll_list, list);
		}

This code executes list_move_tail() to move the NAPI poll to the end of the list. However, in the rare situation described above, the NAPI poll has already been moved, and list_move_tail() tries to access a list-item that has been deleted. The ->next and ->prev pointers are set to the LIST_POISON value (0x00100100) and when these are followed, the corresponding addresses are not mapped into virtual memory and the kernel panic described above occurs.

Solution

In function emac_poll(), around line 2189, only disable the NAPI poll if there is no more work to do AND you have not consumed exactly your budget:

	if (txpending || rxpending) {
		if (likely(netif_rx_schedule_prep(&priv->napi))) {
			emac_int_disable(priv);
			__netif_rx_schedule(&priv->napi);
		}
	} else {
		if (num_pkts < budget) {
			netif_rx_complete(napi);
			emac_int_enable(priv);
		}
	}

This has the effect that there will be one more poll even though there is no more work to do. However, the next time around it will process no packets (because there are no more) and will disable the NAPI poll.

Testing

This problem usually occurs on a busy network. However, it is possible to cause the problem to happen on a quiet network by reducing the Davinci EMAC driver poll weight from 64 down to 4 or 8. If set to 4, my DVEVM would not even get to the login: prompt before crashing (booting from NFS-mounted root FS). If set to 8, it would often get to the login prompt, but would crash when flood-pinging the board.

In drivers/net/davinci_emac.c:

#define EMAC_POLL_WEIGHT        (4) /* Default NAPI poll weight */


/* Buffer descriptor parameters */
#define EMAC_DEF_TX_MAX_SERVICE     (4) /* TX max service BD's */
#define EMAC_DEF_RX_MAX_SERVICE     (4) /* should = netdev->weight */

This works by making the emac_poll() function consumes its entire budget more often. When set to the default of 64, it very seldom consumed its budget, except on a busy network.

It does also load up the processor significantly: it spends 50--60% of its time in doing soft irq activity.

Function emac_poll(), around line 2137:

The change above highlighted another issue: the transmit function can, and often will, do MORE than the budgetted amount of work in one poll.

	if (status & mask) {
		num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
					  EMAC_DEF_TX_MAX_SERVICE,
					  &txpending);
	} /* TX processing */

EMAC_DEF_TX_MAX_SERVICE is set to 32. If there is lots of transmit work to do, it will do all of it, irrespective of the budget. If there is no receive work to do, emac_poll() will return the number of transmit packets processed. This could be a value larger than its budget. This is a bad thing to do:

it will earn the driver a warning in dev.c, function net_rx_action(), when it executes WARN_ON_ONCE(work > weight);
it will cause untold chaos in netpoll.c, function poll_one_napi() and poll_napi(), because

poll_one_napi() will return a negative number (budget - work)
poll_napi() will keep on looping when it should have stopped

            if (!budget)
                break;

Solution

In drivers/net/davinci_emac.c, function emac_poll(), ensure that transmit processing also honours the budget.

	if (status & mask) {
		num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
					  budget,
					  &txpending);
	} /* TX processing */

In net/core/netpoll.c, function poll_napi(), change the test for end of loop to if (budget <= 0). This is defensive coding to ensure that buggy drivers do not make poll_napi() loop too many times.

            if (budget <= 0)
                break;

Additional problem

Consider the situation where there is both transmit and receive work to do. It will process up to the budgetted number of transmit AND receive packets. However, num_pkts is set to the number of receive packets processed, so it lies about the amount of work it has done. This is not a serious problem. No-one else in the kernel actually knows or cares how much time we spend doing stuff.

Ideally, the emac_poll() function would:

Find out how much transmit and receive work is queued.
Portion this out between transmit and receive.
Tell the caller how much work it actually did.

In practice, it might be acceptable to:

Do up to the budgetted amount of receive work
Do the remainder of the budget in transmit work
Tell the caller how much work it actually did.

	mask = EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC;

	if (priv->version == EMAC_VERSION_2)
		mask = EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC;

	if (status & mask) {
		num_rx_pkts = emac_rx_bdproc(priv, EMAC_DEF_RX_CH,
					  budget, &rxpending);
	} /* RX processing */

	tx_budget = budget - num_rx_pkts;

	mask = EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC;

	if (priv->version == EMAC_VERSION_2)
		mask = EMAC_DM646X_MAC_IN_VECTOR_TX_INT_VEC;

	if (status & mask) {
		num_tx_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
					  tx_budget,
					  &txpending);
	} /* TX processing */

I have not tested the above. It might just make the Davinci EMAC driver a bit more friendly.

Console log

20090306_153918 < r...@davinci-dcb:~# uname -r. .a
20090306_153918 < Linux davinci-dcb 2.6.27-davinci1-svn17242 #3 PREEMPT Fri Mar 6 09:34:36 NZDT 2009 armv5tejl unknown
20090306_153918 < r...@davinci-dcb:~#
20090306_161241 < r...@davinci-dcb:~# Unable to handle kernel paging request at virtual address 00100104
20090306_161241 < pgd = c0004000
20090306_161241 < [00100104] *pgd=00000000
20090306_161241 < Internal error: Oops: 817 [#1] PREEMPT
20090306_161241 < Modules linked in:
20090306_161241 < CPU: 0    Not tainted (2.6.27-davinci1-svn17242 #3)
20090306_161241 < PC is at net_rx_action+0x158/0x260
20090306_161241 < LR is at emac_poll+0x5bc/0x868
20090306_161241 < pc : [<c01845ec>]    lr : [<c015f3ac>]    psr: 60000093
20090306_161241 < sp : c0289e98 ip : c0289e38 fp : c0289ecc
20090306_161241 < r10: 0000012c r9 : c02c2464 r8 : 00000040
20090306_161241 < r7 : c0288000 r6 : 00000000 r5 : 00000040 r4 : c30a238c
20090306_161241 < r3 : 00200200 r2 : 00100100 r1 : 00000100 r0 : 00000040
20090306_161241 < Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel
20090306_161241 < Control: 0005317f Table: 8318c000 DAC: 00000017
20090306_161241 < Process swapper (pid: 0, stack limit = 0xc0288268)
20090306_161241 < Stack: (0xc0289e98 to 0xc028a000)
20090306_161241 < 9e80:                                                       c0289ebc 000320ce
20090306_161241 < 9ea0: c0180e48 c02a57c0 00000001 0000000a c02a5780 00000001 c02b47ec 00000000
20090306_161241 < 9ec0: c0289efc c0289ed0 c003df44 c01844a4 c3144e20 0000000d c028f5ec 00000000
20090306_161241 < 9ee0: 00000002 00000001 c0288000 8001fd9c c0289f14 c0289f00 c003e338 c003deec
20090306_161241 < 9f00: 00000002 0000000d c0289f34 c0289f18 c002404c c003e2fc c0063140 ffffffff
20090306_161241 < 9f20: fec48000 c028bdb8 c0289f8c c0289f38 c0024868 c0024010 00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc 60000013 ffffffff
20090306_161241 < 9f80: c0289fb4 c0289f90 c0025c58 c0025d9c c029eb68 c02b4a8c c029eb68 c0021f14
20090306_161241 < 9fa0: c028bc50 8001fdd0 c0289fc4 c0289fb8 c01fa980 c0025c10 c0289ff4 c0289fc8
20090306_161241 < 9fc0: c0008a28 c01fa928 c0008394 00000000 00000000 c0021f14 00000000 00053175
20090306_161241 < 9fe0: c029f02c c0022318 00000000 c0289ff8 80008034 c00087cc 00000000 00000000
20090306_161241 < Backtrace:
20090306_161241 < [<c0184494>] (net_rx_action+0x0/0x260) from [<c003df44>] (__do_softirq+0x68/0xd4)
20090306_161241 < [<c003dedc>] (__do_softirq+0x0/0xd4) from [<c003e338>] (irq_exit+0x4c/0x9c)
20090306_161241 < [<c003e2ec>] (irq_exit+0x0/0x9c) from [<c002404c>] (__exception_text_start+0x4c/0x64)
20090306_161241 < r4:0000000d
20090306_161241 < [<c0024000>] (__exception_text_start+0x0/0x64) from [<c0024868>] (__irq_svc+0x48/0x88)
20090306_161241 < Exception stack(0xc0289f38 to 0xc0289f80)
20090306_161241 < 9f20:                                                       00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc 60000013 ffffffff
20090306_161241 < r6:c028bdb8 r5:fec48000 r4:ffffffff
20090306_161241 < [<c0025d8c>] (default_idle+0x0/0x58) from [<c0025c58>] (cpu_idle+0x58/0x98)
20090306_161241 < [<c0025c00>] (cpu_idle+0x0/0x98) from [<c01fa980>] (rest_init+0x68/0x7c)
20090306_161241 < r8:8001fdd0 r7:c028bc50 r6:c0021f14 r5:c029eb68 r4:c02b4a8c
20090306_161241 < [<c01fa918>] (rest_init+0x0/0x7c) from [<c0008a28>] (start_kernel+0x26c/0x2dc)
20090306_161241 < [<c00087bc>] (start_kernel+0x0/0x2dc) from [<80008034>] (0x80008034)
20090306_161241 < r6:c0022318 r5:c029f02c r4:00053175
20090306_161241 < Code: e5843008 e121f002 ea000008 e894000c (e5823004)
20090306_161241 < Kernel panic - not syncing: Fatal exception in interrupt

This email, including any attachments, is only for the intended addressee. It is subject to copyright, confidential and may be the subject of legal or other privilege, none of which is waived or lost by reason of this transmission. If the receiver is not the intended addressee, please accept our apologies, notify us by return, delete all copies and perform no other act on the email. Unfortunately, we cannot warrant that the email has not been altered or corrupted during transmission.

_______________________________________________
Davinci-linux-open-source mailing list
[email protected]
http://linux.davincidsp.com/mailman/listinfo/davinci-linux-open-source

Kernel panic in net_rx_action + solutions (patches coming)

Function emac_poll(), around line 2189:

Solution

Testing

Function emac_poll(), around line 2137:

Solution

Additional problem

Console log

Reply via email to