Nori, Sekhar wrote:
From: Stephen Irons
Sent: Tuesday, March 24, 2009 4:18 AM
As reported a few days ago, my DVEVM was producing a kernel panic with kernel
version 2.6.25-davinci1, 2.6.26-rc5 and 2.6.27-davinci1. It was consistently in
net_rx_action. It would happen after anything from a few seconds up to about 4
hours of operation. A copy of the console log is attached at the end of this 
email.

This happened only when the DVEVM was attached to a corporate network with
thousands of packets per second. On a quiet, development network, this problem 
did
not happen.

I have discovered a few issues in drivers/net/davinci_emac.c. These problems 
have
not been fixed in kernel 2.6.28 or 2.6.29. In the descriptions below, line 
numbers
refer to the head of Davinci GIT.

This email is a description of the problem and some solutions. Patches will come
shortly.


Stephen Irons
Senior Engineer
Tait Electronics Ltd

Function emac_poll(), around line 2189:
Original code:
      if (txpending || rxpending) {
              if (likely(netif_rx_schedule_prep(&priv->napi))) {
                      emac_int_disable(priv);
                      __netif_rx_schedule(&priv->napi);
              }
      } else {
              netif_rx_complete(napi);
              emac_int_enable(priv);
      }
This code tries to decide whether to perform another NAPI poll (if there is 
still
transmit or receive work to do), or if it should stop NAPI polling and re-enable
the EMAC interrupt. If there is no more work to do (both txpending and rxpending
are false), it removes the NAPI poll from the poll list, and re-enables the EMAC
interrupt.

Now, in the rare situation that the driver performs EXACTLY the budgetted 
amount of
work and there is now no more to do, then it violates the comment in
net/core/dev.c, function net_rx_action(),
              /* Drivers must not modify the NAPI state if they
               * consume the entire weight.  In such cases this code
               * still "owns" the NAPI instance and therefore can
               * move the instance around on the list at-will.
               */
              if (unlikely(work == weight)) {
                      if (unlikely(napi_disable_pending(n)))
                              __napi_complete(n);
                      else
                              list_move_tail(&n->poll_list, list);
              }
This code executes list_move_tail() to move the NAPI poll to the end of the 
list.
However, in the rare situation described above, the NAPI poll has already been
moved, and list_move_tail() tries to access a list-item that has been deleted. 
The
->next and ->prev pointers are set to the LIST_POISON value (0x00100100) and 
when
these are followed, the corresponding addresses are not mapped into virtual 
memory
and the kernel panic described above occurs.

Hmm, good catch!

Solution
In function emac_poll(), around line 2189, only disable the NAPI poll if there 
is
no more work to do AND you have not consumed exactly your budget:
      if (txpending || rxpending) {
              if (likely(netif_rx_schedule_prep(&priv->napi))) {
                      emac_int_disable(priv);
                      __netif_rx_schedule(&priv->napi);
              }
      } else {
              if (num_pkts < budget) {
                      netif_rx_complete(napi);
                      emac_int_enable(priv);
              }
      }

Right, seems like this is what is being done in other drivers. 
Drivers/net/e100.c seems a good example to follow. Looking at that driver, the 
call to __netif_rx_schedule above could be superfluous as well. The poll 
function will anyway keep getting called until it is knocked off the poll list 
by driver poll function or by net_rx_action.

This has the effect that there will be one more poll even though there is no 
more
work to do. However, the next time around it will process no packets (because 
there
are no more) and will disable the NAPI poll.
Testing
This problem usually occurs on a busy network. However, it is possible to cause 
the
problem to happen on a quiet network by reducing the Davinci EMAC driver poll
weight from 64 down to 4 or 8. If set to 4, my DVEVM would not even get to the
login: prompt before crashing (booting from NFS-mounted root FS). If set to 8, 
it
would often get to the login prompt, but would crash when flood-pinging the 
board.

In drivers/net/davinci_emac.c:
#define EMAC_POLL_WEIGHT        (4) /* Default NAPI poll weight */

/* Buffer descriptor parameters */
#define EMAC_DEF_TX_MAX_SERVICE     (4) /* TX max service BD's */
#define EMAC_DEF_RX_MAX_SERVICE     (4) /* should = netdev->weight */
This works by making the emac_poll() function consumes its entire budget more
often. When set to the default of 64, it very seldom consumed its budget, 
except on
a busy network.

It does also load up the processor significantly: it spends 50--60% of its time 
in
doing soft irq activity.
Function emac_poll(), around line 2137:
The change above highlighted another issue: the transmit function can, and often
will, do MORE than the budgetted amount of work in one poll.
      if (status & mask) {
              num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
                                        EMAC_DEF_TX_MAX_SERVICE,
                                        &txpending);
      } /* TX processing */
EMAC_DEF_TX_MAX_SERVICE is set to 32. If there is lots of transmit work to do, 
it
will do all of it, irrespective of the budget. If there is no receive work to 
do,
emac_poll() will return the number of transmit packets processed. This could be 
a
value larger than its budget. This is a bad thing to do:
* it will earn the driver a warning in dev.c, function net_rx_action(), when it
executes WARN_ON_ONCE(work > weight);
* it will cause untold chaos in netpoll.c, function poll_one_napi() and
poll_napi(), because
o poll_one_napi() will return a negative number (budget - work)
o poll_napi() will keep on looping when it should have stopped
            if (!budget)
                break;
Solution
In drivers/net/davinci_emac.c, function emac_poll(), ensure that transmit
processing also honours the budget.
      if (status & mask) {
              num_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
                                        budget,
                                        &txpending);
      } /* TX processing */

I think NAPI was meant for RX side only. Tx is supposed to throttle itself 
since that traffic originates from the system itself.

Again e100 looks a good example to follow. It allows Tx to piggyback the poll 
function, but Tx does not influence the work done. So in our case, not setting 
num_pkts for Tx case should be good?

Thank you for your quick response.

Can I assume that some of these suggestion will somehow get fed to the Davinci GIT tree? I am working on 2.6.27 at present, but will probably update to 2.6.29 when a -davinci1 version becomes available. A patch will not apply exactly to both versions...

http://www.linuxfoundation.org/en/Net:NAPI suggests that it is permissible to use NAPI for other purposes as well as network reception.

"NAPI usage does not have to be limited only to receiving packets. With many devices the /poll()/ routine can also be used to manage transmit completion or PHY <http://www.linuxfoundation.org/index.php?title=Net:PHY&action=edit&redlink=1> interface state changes. By moving this processing out of the hardware interrrupt service routine, there may be less latency and better performance. "

It seems to me that emac_poll() can do the transmission side as well as reception.

Your suggestion is good: transmission work should not affect the receive budget. Ignore the number of packets actually transmitted, and always return the number of receive packets processed, or zero if no receive packets were processed.

Alternatively, do all the reception (up to the budget). Then, if there is any amount left in the budget, transmit up to that many packets. This means that receive packets get priority, and we do not send more packets to the network if there are still receive packets to process. It feels as though we are helping the network to empty...

Stephen Irons


Thanks,
Sekhar

In net/core/netpoll.c, function poll_napi(), change the test for end of loop to 
if
(budget <= 0). This is defensive coding to ensure that buggy drivers do not make
poll_napi() loop too many times.
            if (budget <= 0)
                break;
Additional problem
Consider the situation where there is both transmit and receive work to do. It 
will
process up to the budgetted number of transmit AND receive packets. However,
num_pkts is set to the number of receive packets processed, so it lies about the
amount of work it has done. This is not a serious problem. No-one else in the
kernel actually knows or cares how much time we spend doing stuff.

Ideally, the emac_poll() function would:
* Find out how much transmit and receive work is queued.
* Portion this out between transmit and receive.
* Tell the caller how much work it actually did.
In practice, it might be acceptable to:
* Do up to the budgetted amount of receive work
* Do the remainder of the budget in transmit work
* Tell the caller how much work it actually did.
      mask = EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC;

      if (priv->version == EMAC_VERSION_2)
              mask = EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC;

      if (status & mask) {
              num_rx_pkts = emac_rx_bdproc(priv, EMAC_DEF_RX_CH,
                                        budget, &rxpending);
      } /* RX processing */

      tx_budget = budget - num_rx_pkts;
      mask = EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC;

      if (priv->version == EMAC_VERSION_2)
              mask = EMAC_DM646X_MAC_IN_VECTOR_TX_INT_VEC;

      if (status & mask) {
              num_tx_pkts = emac_tx_bdproc(priv, EMAC_DEF_TX_CH,
                                        tx_budget,
                                        &txpending);
      } /* TX processing */

I have not tested the above. It might just make the Davinci EMAC driver a bit 
more
friendly.
Console log
20090306_153918 < r...@davinci-dcb:~# uname -r. .a
20090306_153918 < Linux davinci-dcb 2.6.27-davinci1-svn17242 #3 PREEMPT Fri Mar 
6
09:34:36 NZDT 2009 armv5tejl unknown
20090306_153918 < r...@davinci-dcb:~#
20090306_161241 < r...@davinci-dcb:~# Unable to handle kernel paging request at
virtual address 00100104
20090306_161241 < pgd = c0004000
20090306_161241 < [00100104] *pgd=00000000
20090306_161241 < Internal error: Oops: 817 [#1] PREEMPT
20090306_161241 < Modules linked in:
20090306_161241 < CPU: 0    Not tainted  (2.6.27-davinci1-svn17242 #3)
20090306_161241 < PC is at net_rx_action+0x158/0x260
20090306_161241 < LR is at emac_poll+0x5bc/0x868
20090306_161241 < pc : [<c01845ec>]    lr : [<c015f3ac>]    psr: 60000093
20090306_161241 < sp : c0289e98  ip : c0289e38  fp : c0289ecc
20090306_161241 < r10: 0000012c  r9 : c02c2464  r8 : 00000040
20090306_161241 < r7 : c0288000  r6 : 00000000  r5 : 00000040  r4 : c30a238c
20090306_161241 < r3 : 00200200  r2 : 00100100  r1 : 00000100  r0 : 00000040
20090306_161241 < Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment
kernel
20090306_161241 < Control: 0005317f  Table: 8318c000  DAC: 00000017
20090306_161241 < Process swapper (pid: 0, stack limit = 0xc0288268)
20090306_161241 < Stack: (0xc0289e98 to 0xc028a000)
20090306_161241 < 9e80:
c0289ebc 000320ce
20090306_161241 < 9ea0: c0180e48 c02a57c0 00000001 0000000a c02a5780 00000001
c02b47ec 00000000
20090306_161241 < 9ec0: c0289efc c0289ed0 c003df44 c01844a4 c3144e20 0000000d
c028f5ec 00000000
20090306_161241 < 9ee0: 00000002 00000001 c0288000 8001fd9c c0289f14 c0289f00
c003e338 c003deec
20090306_161241 < 9f00: 00000002 0000000d c0289f34 c0289f18 c002404c c003e2fc
c0063140 ffffffff
20090306_161241 < 9f20: fec48000 c028bdb8 c0289f8c c0289f38 c0024868 c0024010
00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c
c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc
60000013 ffffffff
20090306_161241 < 9f80: c0289fb4 c0289f90 c0025c58 c0025d9c c029eb68 c02b4a8c
c029eb68 c0021f14
20090306_161241 < 9fa0: c028bc50 8001fdd0 c0289fc4 c0289fb8 c01fa980 c0025c10
c0289ff4 c0289fc8
20090306_161241 < 9fc0: c0008a28 c01fa928 c0008394 00000000 00000000 c0021f14
00000000 00053175
20090306_161241 < 9fe0: c029f02c c0022318 00000000 c0289ff8 80008034 c00087cc
00000000 00000000
20090306_161241 < Backtrace:
20090306_161241 < [<c0184494>] (net_rx_action+0x0/0x260) from [<c003df44>]
(__do_softirq+0x68/0xd4)
20090306_161241 < [<c003dedc>] (__do_softirq+0x0/0xd4) from [<c003e338>]
(irq_exit+0x4c/0x9c)
20090306_161241 < [<c003e2ec>] (irq_exit+0x0/0x9c) from [<c002404c>]
(__exception_text_start+0x4c/0x64)
20090306_161241 <  r4:0000000d
20090306_161241 < [<c0024000>] (__exception_text_start+0x0/0x64) from 
[<c0024868>]
(__irq_svc+0x48/0x88)
20090306_161241 < Exception stack(0xc0289f38 to 0xc0289f80)
20090306_161241 < 9f20:
00000000 0005317f
20090306_161241 < 9f40: 0005217f 60000013 c0025d8c c0288000 c028bdb8 c0025d8c
c029efc8 41069265
20090306_161241 < 9f60: 8001fd9c c0289f8c 600000d3 c0289f80 c0025dd0 c0025ddc
60000013 ffffffff
20090306_161241 <  r6:c028bdb8 r5:fec48000 r4:ffffffff
20090306_161241 < [<c0025d8c>] (default_idle+0x0/0x58) from [<c0025c58>]
(cpu_idle+0x58/0x98)
20090306_161241 < [<c0025c00>] (cpu_idle+0x0/0x98) from [<c01fa980>]
(rest_init+0x68/0x7c)
20090306_161241 <  r8:8001fdd0 r7:c028bc50 r6:c0021f14 r5:c029eb68 r4:c02b4a8c
20090306_161241 < [<c01fa918>] (rest_init+0x0/0x7c) from [<c0008a28>]
(start_kernel+0x26c/0x2dc)
20090306_161241 < [<c00087bc>] (start_kernel+0x0/0x2dc) from [<80008034>]
(0x80008034)
20090306_161241 <  r6:c0022318 r5:c029f02c r4:00053175
20090306_161241 < Code: e5843008 e121f002 ea000008 e894000c (e5823004)
20090306_161241 < Kernel panic - not syncing: Fatal exception in interrupt
________________________________________

This email, including any attachments, is only for the intended addressee. It is
subject to copyright, confidential and may be the subject of legal or other
privilege, none of which is waived or lost by reason of this transmission. If 
the
receiver is not the intended addressee, please accept our apologies, notify us 
by
return, delete all copies and perform no other act on the email. Unfortunately, 
we
cannot warrant that the email has not been altered or corrupted during
transmission.
________________________________________


=======================================================================
This email, including any attachments, is only for the intended
addressee.  It is subject to copyright, is confidential and may be
the subject of legal or other privilege, none of which is waived or
lost by reason of this transmission.
If the receiver is not the intended addressee, please accept our
apologies, notify us by return, delete all copies and perform no
other act on the email.
Unfortunately, we cannot warrant that the email has not been
altered or corrupted during transmission.
=======================================================================


_______________________________________________
Davinci-linux-open-source mailing list
[email protected]
http://linux.davincidsp.com/mailman/listinfo/davinci-linux-open-source

Reply via email to