RE: Kernel panic in net_rx_action + solutions (patches coming)

Nori, Sekhar Tue, 24 Mar 2009 11:00:37 -0700

From: Stephen Irons
Sent: Tuesday, March 24, 2009 4:18 AM
>
> As reported a few days ago, my DVEVM was producing a kernel panic with kernel
> version 2.6.25-davinci1, 2.6.26-rc5 and 2.6.27-davinci1. It was consistently 
> in
> net_rx_action. It would happen after anything from a few seconds up to about 4
> hours of operation. A copy of the console log is attached at the end of this 
> email.
>
> This happened only when the DVEVM was attached to a corporate network with
> thousands of packets per second. On a quiet, development network, this 
> problem did
> not happen.
>
> I have discovered a few issues in drivers/net/davinci_emac.c. These problems 
> have
> not been fixed in kernel 2.6.28 or 2.6.29. In the descriptions below, line 
> numbers
> refer to the head of Davinci GIT.
>
> This email is a description of the problem and some solutions. Patches will 
> come
> shortly.
>
>
> Stephen Irons
> Senior Engineer
> Tait Electronics Ltd
>
> Function emac_poll(), around line 2189:
> Original code:
>       if (txpending || rxpending) {
>               if (likely(netif_rx_schedule_prep(&priv->napi))) {
>                       emac_int_disable(priv);
>                       __netif_rx_schedule(&priv->napi);
>               }
>       } else {
>               netif_rx_complete(napi);
>               emac_int_enable(priv);
>       }
> This code tries to decide whether to perform another NAPI poll (if there is 
> still
> transmit or receive work to do), or if it should stop NAPI polling and 
> re-enable
> the EMAC interrupt. If there is no more work to do (both txpending and 
> rxpending
> are false), it removes the NAPI poll from the poll list, and re-enables the 
> EMAC
> interrupt.
>
> Now, in the rare situation that the driver performs EXACTLY the budgetted 
> amount of
> work and there is now no more to do, then it violates the comment in
> net/core/dev.c, function net_rx_action(),
>               /* Drivers must not modify the NAPI state if they
>                * consume the entire weight.  In such cases this code
>                * still "owns" the NAPI instance and therefore can
>                * move the instance around on the list at-will.
>                */
>               if (unlikely(work == weight)) {
>                       if (unlikely(napi_disable_pending(n)))
>                               __napi_complete(n);
>                       else
>                               list_move_tail(&n->poll_list, list);
>               }
> This code executes list_move_tail() to move the NAPI poll to the end of the 
> list.
> However, in the rare situation described above, the NAPI poll has already been
> moved, and list_move_tail() tries to access a list-item that has been 
> deleted. The
> ->next and ->prev pointers are set to the LIST_POISON value (0x00100100) and 
> when
> these are followed, the corresponding addresses are not mapped into virtual 
> memory
> and the kernel panic described above occurs.


Hmm, good catch!

> Solution
> In function emac_poll(), around line 2189, only disable the NAPI poll if 
> there is
> no more work to do AND you have not consumed exactly your budget:
>       if (txpending || rxpending) {
>               if (likely(netif_rx_schedule_prep(&priv->napi))) {
>                       emac_int_disable(priv);
>                       __netif_rx_schedule(&priv->napi);
>               }
>       } else {
>               if (num_pkts < budget) {
>                       netif_rx_complete(napi);
>                       emac_int_enable(priv);
>               }
>       }

Right, seems like this is what is being done in other drivers. 
Drivers/net/e100.c seems a good example to follow. Looking at that driver, the 
call to __netif_rx_schedule above could be superfluous as well. The poll 
function will anyway keep getting called until it is knocked off the poll list 
by driver poll function or by net_rx_action.

> This has the effect that there will be one more poll even though there is no 
> more
> work to do. However, the next time around it will process no packets (because 
> there
> are no more) and will disable the NAPI poll.
> Testing
> This problem usually occurs on a busy network. However, it is possible to 
> cause the
> problem to happen on a quiet network by reducing the Davinci EMAC driver poll
> weight from 64 down to 4 or 8. If set to 4, my DVEVM would not even get to the
> login: prompt before crashing (booting from NFS-mounted root FS). If set to 
> 8, it
> would often get to the login prompt, but would crash when flood-pinging the 
> board.
>
> In drivers/net/davinci_emac.c:
> #define EMAC_POLL_WEIGHT        (4) /* Default NAPI poll weight */
>
> /* Buffer descriptor parameters */
> #define EMAC_DEF_TX_MAX_SERVICE     (4) /* TX max service BD's */
> #define EMAC_DEF_RX_MAX_SERVICE     (4) /* should = netdev->weight */
> This works by making the emac_poll() function consumes its entire budget more
> often. When set to the default of 64, it very seldom consumed its budget, 
> except on
> a busy network.
>
> It does also load up the processor significantly: it spends 50--60% of its 
> time in
> doing soft irq activity.
> Function emac_poll(), around line 2137:
> The change above highlighted another issue: the transmit function can, and 
> often
> will, do MORE than the budgetted amount of work in one poll.
>       if (status & mask) {
>               num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
>                                         EMAC_DEF_TX_MAX_SERVICE,
>                                         &txpending);
>       } /* TX processing */
> EMAC_DEF_TX_MAX_SERVICE is set to 32. If there is lots of transmit work to 
> do, it
> will do all of it, irrespective of the budget. If there is no receive work to 
> do,
> emac_poll() will return the number of transmit packets processed. This could 
> be a
> value larger than its budget. This is a bad thing to do:
> * it will earn the driver a warning in dev.c, function net_rx_action(), when 
> it
> executes WARN_ON_ONCE(work > weight);
> * it will cause untold chaos in netpoll.c, function poll_one_napi() and
> poll_napi(), because
> o poll_one_napi() will return a negative number (budget - work)
> o poll_napi() will keep on looping when it should have stopped
>             if (!budget)
>                 break;
> Solution
> In drivers/net/davinci_emac.c, function emac_poll(), ensure that transmit
> processing also honours the budget.
>       if (status & mask) {
>               num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
>                                         budget,
>                                         &txpending);
>       } /* TX processing */

I think NAPI was meant for RX side only. Tx is supposed to throttle itself 
since that traffic originates from the system itself.

Again e100 looks a good example to follow. It allows Tx to piggyback the poll 
function, but Tx does not influence the work done. So in our case, not setting 
num_pkts for Tx case should be good?

Thanks,
Sekhar

> In net/core/netpoll.c, function poll_napi(), change the test for end of loop 
> to if
> (budget <= 0). This is defensive coding to ensure that buggy drivers do not 
> make
> poll_napi() loop too many times.
>             if (budget <= 0)
>                 break;
> Additional problem
> Consider the situation where there is both transmit and receive work to do. 
> It will
> process up to the budgetted number of transmit AND receive packets. However,
> num_pkts is set to the number of receive packets processed, so it lies about 
> the
> amount of work it has done. This is not a serious problem. No-one else in the
> kernel actually knows or cares how much time we spend doing stuff.
>
> Ideally, the emac_poll() function would:
> * Find out how much transmit and receive work is queued.
> * Portion this out between transmit and receive.
> * Tell the caller how much work it actually did.
> In practice, it might be acceptable to:
> * Do up to the budgetted amount of receive work
> * Do the remainder of the budget in transmit work
> * Tell the caller how much work it actually did.
>       mask = EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC;
>
>       if (priv->version == EMAC_VERSION_2)
>               mask = EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC;
>
>       if (status & mask) {
>               num_rx_pkts = emac_rx_bdproc(priv, EMAC_DEF_RX_CH,
>                                         budget, &rxpending);
>       } /* RX processing */
>
>       tx_budget = budget - num_rx_pkts;
>       mask = EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC;
>
>       if (priv->version == EMAC_VERSION_2)
>               mask = EMAC_DM646X_MAC_IN_VECTOR_TX_INT_VEC;
>
>       if (status & mask) {
>               num_tx_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
>                                         tx_budget,
>                                         &txpending);
>       } /* TX processing */
>
> I have not tested the above. It might just make the Davinci EMAC driver a bit 
> more
> friendly.
> Console log
> 20090306_153918 < r...@davinci-dcb:~# uname -r. .a
> 20090306_153918 < Linux davinci-dcb 2.6.27-davinci1-svn17242 #3 PREEMPT Fri 
> Mar 6
> 09:34:36 NZDT 2009 armv5tejl unknown
> 20090306_153918 < r...@davinci-dcb:~#
> 20090306_161241 < r...@davinci-dcb:~# Unable to handle kernel paging request 
> at
> virtual address 00100104
> 20090306_161241 < pgd = c0004000
> 20090306_161241 < [00100104] *pgd=00000000
> 20090306_161241 < Internal error: Oops: 817 [#1] PREEMPT
> 20090306_161241 < Modules linked in:
> 20090306_161241 < CPU: 0    Not tainted  (2.6.27-davinci1-svn17242 #3)
> 20090306_161241 < PC is at net_rx_action+0x158/0x260
> 20090306_161241 < LR is at emac_poll+0x5bc/0x868
> 20090306_161241 < pc : [<c01845ec>]    lr : [<c015f3ac>]    psr: 60000093
> 20090306_161241 < sp : c0289e98  ip : c0289e38  fp : c0289ecc
> 20090306_161241 < r10: 0000012c  r9 : c02c2464  r8 : 00000040
> 20090306_161241 < r7 : c0288000  r6 : 00000000  r5 : 00000040  r4 : c30a238c
> 20090306_161241 < r3 : 00200200  r2 : 00100100  r1 : 00000100  r0 : 00000040
> 20090306_161241 < Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  
> Segment
> kernel
> 20090306_161241 < Control: 0005317f  Table: 8318c000  DAC: 00000017
> 20090306_161241 < Process swapper (pid: 0, stack limit = 0xc0288268)
> 20090306_161241 < Stack: (0xc0289e98 to 0xc028a000)
> 20090306_161241 < 9e80:
> c0289ebc 000320ce
> 20090306_161241 < 9ea0: c0180e48 c02a57c0 00000001 0000000a c02a5780 00000001
> c02b47ec 00000000
> 20090306_161241 < 9ec0: c0289efc c0289ed0 c003df44 c01844a4 c3144e20 0000000d
> c028f5ec 00000000
> 20090306_161241 < 9ee0: 00000002 00000001 c0288000 8001fd9c c0289f14 c0289f00
> c003e338 c003deec
> 20090306_161241 < 9f00: 00000002 0000000d c0289f34 c0289f18 c002404c c003e2fc
> c0063140 ffffffff
> 20090306_161241 < 9f20: fec48000 c028bdb8 c0289f8c c0289f38 c0024868 c0024010
> 00000000 0005317f
> 20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c
> c029efc8 41069265
> 20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc
> 60000013 ffffffff
> 20090306_161241 < 9f80: c0289fb4 c0289f90 c0025c58 c0025d9c c029eb68 c02b4a8c
> c029eb68 c0021f14
> 20090306_161241 < 9fa0: c028bc50 8001fdd0 c0289fc4 c0289fb8 c01fa980 c0025c10
> c0289ff4 c0289fc8
> 20090306_161241 < 9fc0: c0008a28 c01fa928 c0008394 00000000 00000000 c0021f14
> 00000000 00053175
> 20090306_161241 < 9fe0: c029f02c c0022318 00000000 c0289ff8 80008034 c00087cc
> 00000000 00000000
> 20090306_161241 < Backtrace:
> 20090306_161241 < [<c0184494>] (net_rx_action+0x0/0x260) from [<c003df44>]
> (__do_softirq+0x68/0xd4)
> 20090306_161241 < [<c003dedc>] (__do_softirq+0x0/0xd4) from [<c003e338>]
> (irq_exit+0x4c/0x9c)
> 20090306_161241 < [<c003e2ec>] (irq_exit+0x0/0x9c) from [<c002404c>]
> (__exception_text_start+0x4c/0x64)
> 20090306_161241 <  r4:0000000d
> 20090306_161241 < [<c0024000>] (__exception_text_start+0x0/0x64) from 
> [<c0024868>]
> (__irq_svc+0x48/0x88)
> 20090306_161241 < Exception stack(0xc0289f38 to 0xc0289f80)
> 20090306_161241 < 9f20:
> 00000000 0005317f
> 20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c
> c029efc8 41069265
> 20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc
> 60000013 ffffffff
> 20090306_161241 <  r6:c028bdb8 r5:fec48000 r4:ffffffff
> 20090306_161241 < [<c0025d8c>] (default_idle+0x0/0x58) from [<c0025c58>]
> (cpu_idle+0x58/0x98)
> 20090306_161241 < [<c0025c00>] (cpu_idle+0x0/0x98) from [<c01fa980>]
> (rest_init+0x68/0x7c)
> 20090306_161241 <  r8:8001fdd0 r7:c028bc50 r6:c0021f14 r5:c029eb68 r4:c02b4a8c
> 20090306_161241 < [<c01fa918>] (rest_init+0x0/0x7c) from [<c0008a28>]
> (start_kernel+0x26c/0x2dc)
> 20090306_161241 < [<c00087bc>] (start_kernel+0x0/0x2dc) from [<80008034>]
> (0x80008034)
> 20090306_161241 <  r6:c0022318 r5:c029f02c r4:00053175
> 20090306_161241 < Code: e5843008 e121f002 ea000008 e894000c (e5823004)
> 20090306_161241 < Kernel panic - not syncing: Fatal exception in interrupt
> ________________________________________
>
> This email, including any attachments, is only for the intended addressee. It 
> is
> subject to copyright, confidential and may be the subject of legal or other
> privilege, none of which is waived or lost by reason of this transmission. If 
> the
> receiver is not the intended addressee, please accept our apologies, notify 
> us by
> return, delete all copies and perform no other act on the email. 
> Unfortunately, we
> cannot warrant that the email has not been altered or corrupted during
> transmission.
> ________________________________________
_______________________________________________
Davinci-linux-open-source mailing list
[email protected]
http://linux.davincidsp.com/mailman/listinfo/davinci-linux-open-source

RE: Kernel panic in net_rx_action + solutions (patches coming)

Reply via email to