% uptime
 1:47AM  up 8 days,  4:59, 11 users, load averages: 0.34, 0.88, 1.13

% sysctl kern.netlivelocks
kern.netlivelocks=40259

% systat -b mb


   10 users    Load 2.53 1.24 0.68                     Sun May  3 01:31:43 2015

IFACE             LIVELOCKS  SIZE ALIVE   LWM   HWM   CWM                       
System                    0   256   168          40                             
                             2048    17          49                             
                             4096    64           8                             
lo0                                                                             
em0                          2048     4     2   256     4                       
wpi0                                                                            
enc0                                                                            
tun1                                                                            
pflog0              


I noticed that kern.netlivelocks stays relatively stable when the CPU is
idle, increasing by maybe 5-10 a minute, but that it increased by about 300
between 1:30 and 1:45 while cron was running /etc/daily.  Usually, if I'm
running anything more than a mostly idle IRC client, the network will wedge
when cron does its nightly run, but sometimes it will wedge even on a mostly
idle box.
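
In case it helps correlate those jumps with cron activity, here's a rough
sketch of a loop I could leave running to log the counter once a minute
(the log path is just an arbitrary example):

    #!/bin/sh
    # Append a timestamped kern.netlivelocks reading every minute so
    # jumps can be matched against cron activity afterwards.
    while true; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$(sysctl -n kern.netlivelocks)" >> /tmp/netlivelocks.log
        sleep 60
    done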

Also of note, since you mentioned that some em(4) chips might not like this:
I noticed that the OP and I are running identical chips.  The dmesg lines
from our two machines are:

em0 at pci2 dev 0 function 0 "Intel 82573L" rev 0x00: msi, address 00:16:41:52:7e:81
em0 at pci2 dev 0 function 0 "Intel 82573L" rev 0x00: msi, address 00:15:58:7c:c0:6c

Building a kernel, running glxgears, and playing with google maps in firefox
while rsyncing a new snapshot got it to wedge, and I saw something
interesting.  Below is what systat showed while the network was wedged:

   13 users    Load 6.93 4.58 2.62                     Sun May  3 02:15:24 2015

IFACE             LIVELOCKS  SIZE ALIVE   LWM   HWM   CWM                       
System                    0   256   336          40                            
                             2048    25          49                            
                             4096    64           8                            
lo0                                                                            
em0                          2048     2     2   256     2                      
wpi0                                                                           
enc0                                                                           
tun1                                                                           
pflog0          

After an "ifconfig em0 down up", the following showed:

   13 users    Load 6.81 4.67 2.68                     Sun May  3 02:15:39 2015

IFACE             LIVELOCKS  SIZE ALIVE   LWM   HWM   CWM                      
System                    0   256   314          40                            
                             2048    25          49                            
                             4096    64           8                            
lo0                                                                            
em0                          2048     4     2   256     4                      
wpi0                                                                           
enc0                                                                           
tun1                                                                           
pflog0

The key difference is in the following two lines; the first is from the
wedged state, the second from after the interface was brought back up:

em0                          2048     2     2   256     2
em0                          2048     4     2   256     4

When the network is working, the em0 line always seems to show the latter
values, so I'm hoping this indicates something useful.
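
For what it's worth, here's a rough sketch of a slightly less blunt version
of the workaround loop from my earlier mail (quoted below): it only bounces
em0 when a ping to the default gateway fails.  The gateway address is just a
placeholder for my setup, and it needs to run as root:

    #!/bin/sh
    # Bounce em0 only when the default gateway stops answering pings.
    # 192.168.1.1 is a placeholder; substitute the real gateway address.
    GW=192.168.1.1
    while true; do
        if ! ping -c 1 -w 2 "$GW" > /dev/null 2>&1; then
            ifconfig em0 down
            ifconfig em0 up
            echo "$(date): bounced em0"
        fi
        sleep 10
    done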

Yet through all of that, kern.netlivelocks only went from 40274 to 40361,
a much smaller increase than from running /etc/daily.

Also, I've attached the patches I was sent, which help a lot in mitigating
the network wedging, in case they're relevant here.
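
In case anyone wants to try them: the diff headers use bare file names, so
I'm assuming the attachment gets split into one diff per file and each piece
applied from the directory its file lives in, roughly like this (the paths
to the split diffs are placeholders), followed by a normal GENERIC.MP
rebuild:

    # Apply each diff from its source directory (amd64 assumed).
    cd /usr/src/sys/dev && patch < /path/to/softraid_crypto.diff
    cd /usr/src/sys/uvm && patch < /path/to/uvm_pdaemon.diff

    # Rebuild and install the kernel.
    cd /usr/src/sys/arch/amd64/conf && config GENERIC.MP
    cd /usr/src/sys/arch/amd64/compile/GENERIC.MP
    make clean && make && make install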

If I can provide any more information or test any patches, please let me know.  
I'd really like to have this bug fixed if at all possible, but I'm well aware 
of OpenBSD culture with regard to whining about problems and begging others
to fix them.  :)

Hopefully the information I've provided here is helpful.

One final question: is there a manpage somewhere that would help me
interpret the output of "systat mb"?

-- 
Bryan

On 2015-05-03 01:25:59, Stuart Henderson <[email protected]> wrote:
> Not sure if it will help, but it might be useful to show 'systat mb' and
> 'sysctl kern.netlivelocks'. You mention updating packages; I've definitely
> had systems which have been pretty much flattened with
> netlivelocks/mitigation while doing this. Perhaps some em(4) don't react
> very well to this...?
> 
> On 3 May 2015 01:13:19 BST, Bryan Linton <[email protected]> wrote:
> >On 2015-05-02 00:16:43, Christian Schulte <[email protected]> wrote:
> >> >Synopsis: After some time (minutes or seconds) the em0 interface stops working
> >> >Category: system
> >> >Environment:
> >>    System      : OpenBSD 5.7
> >>    Details     : OpenBSD 5.7-stable (GENERIC.MP) #0: Fri May  1 23:59:46 CEST 2015
> >>                      [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >> 
> >>    Architecture: OpenBSD.amd64
> >>    Machine     : amd64
> >> >Description:
> >>    Following is the contents of the /etc/hostname.em0 file:
> >> 
> >>    inet 192.168.10.50 255.255.255.0 192.168.10.255
> >> 
> >>    The em0 interface works as expected. After some time it stops
> >>         working. Processes currently transmitting data will show a
> >>         (Broken pipe) error. After doing ifconfig em0 down && sh
> >>         /etc/netstart, the interface starts working again for some
> >>         time and then hangs again.
> >> 
> >> >How-To-Repeat:
> >>    The issue is reproducible by simply using the em0 interface.
> >> >Fix:
> >>    ifconfig em0 down && sh /etc/netstart
> >> 
> >> [...]
> >>
> >
> >I wonder if this is the same thing that I mentioned here: 
> >     http://marc.info/?l=openbsd-misc&m=141007061612003&w=2
> >Though I don't get a "(Broken pipe)" error, em0 just hangs and won't
> >transmit packets.
> >
> >In a nutshell, CPU activity causes em0 to hang on both my T60 and
> >X61t laptops.
> >
> >Provided I don't do CPU-intensive activities, it can run fine for days,
> >or potentially weeks at a time.  However, running bogofilter (which I've
> >stopped using since it wedged the network reliably every time I'd fetch
> >my email), compiling a large port (smaller ones don't tend to wedge it,
> >only larger ones that take at least several minutes to compile seem to),
> >or just trying to fetch and upgrade packages, or going to a CPU-intensive
> >webpage like google maps, will invariably cause the network to wedge.
> >
> >When this happens, neither machine will send or receive packets,
> >including responding to ping requests.
> >
> >If I need to update packages or compile something, I'll usually just
> >run a script like the following so I don't need to babysit the computer:
> >
> >     while true; do sleep 25 && ifconfig em0 down up && sleep 1 && ifconfig em0 down up && echo ping; done
> >
> >I was contacted privately by a developer and sent a few patches to the
> >UVM and softraid systems which add KERNEL_UNLOCK() and KERNEL_LOCK()
> >calls in a few key places.  This has helped a lot, but the network will
> >still lock up reliably with heavy CPU activity.
> >
> >All I can say is that I first noticed this when I upgraded from a
> >mid-July 2014 snap to an early-September 2014 snap, which I know is a
> >very large window.  I've unfortunately been too busy to try to go back
> >and figure out what change caused this, and have been getting by with
> >the above script when necessary.
> >
> >I've seen a lot of work done on the networking code and in UVM over
> >the last year or so, so I've been upgrading to newer snaps as they've
> >been released hoping that they'd fix it, but it seems like the problem
> >may lie somewhere else or be obscure or specific to my setup.
> >
> >I've just assumed that, since no one else has reported this problem,
> >there was something unique about my system or setup causing it, which
> >would tend to lower the severity of this bug since, as far as I knew
> >until now, I was the only one affected by it.
> >
> >I realize this is a rather poor bug report, but hopefully mentioning
> >that a few key KERNEL_LOCK()/KERNEL_UNLOCK() calls sprinkled around
> >some UVM and softraid code reduce the occurrence of this bug at least
> >gives someone a somewhat smaller target to look at.
> >
> >I can provide more information if necessary.  A dmesg from the T60
> >(with the above-mentioned patches applied) follows:
> 
> -- 
> Sent from a phone, please excuse the formatting.
Index: softraid_crypto.c
===================================================================
RCS file: /cvs/src/sys/dev/softraid_crypto.c,v
retrieving revision 1.117
diff -u -r1.117 softraid_crypto.c
--- softraid_crypto.c   14 Mar 2015 03:38:46 -0000      1.117
+++ softraid_crypto.c   3 May 2015 08:55:58 -0000
@@ -368,8 +368,10 @@
        case SR_CRYPTOM_AES_ECB_256:
                if (rijndael_set_key_enc_only(&ctx, key, 256) != 0)
                        goto out;
+               KERNEL_UNLOCK();
                for (i = 0; i < size; i += RIJNDAEL128_BLOCK_LEN)
                        rijndael_encrypt(&ctx, &p[i], &c[i]);
+               KERNEL_LOCK();
                rv = 0;
                break;
        default:
@@ -394,8 +396,10 @@
        case SR_CRYPTOM_AES_ECB_256:
                if (rijndael_set_key(&ctx, key, 256) != 0)
                        goto out;
+               KERNEL_UNLOCK();
                for (i = 0; i < size; i += RIJNDAEL128_BLOCK_LEN)
                        rijndael_decrypt(&ctx, &c[i], &p[i]);
+               KERNEL_LOCK();
                rv = 0;
                break;
        default:
Index: uvm_pdaemon.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.75
diff -u -r1.75 uvm_pdaemon.c
--- uvm_pdaemon.c       17 Dec 2014 19:42:15 -0000      1.75
+++ uvm_pdaemon.c       3 May 2015 08:56:24 -0000
@@ -198,6 +198,8 @@
        uvmpd_tune();
        uvm_unlock_pageq();
 
+       KERNEL_UNLOCK();
+
        for (;;) {
                long size;
                work_done = 0; /* No work done this iteration. */
@@ -218,6 +220,8 @@
 
                uvm_unlock_fpageq();
 
+               KERNEL_LOCK();
+
                /* now lock page queues and recompute inactive count */
                uvm_lock_pageq();
                if (npages != uvmexp.npages) {  /* check for new pages? */
@@ -273,6 +277,8 @@
                /* scan done. unlock page queues (only lock we are holding) */
                uvm_unlock_pageq();
 
+               KERNEL_UNLOCK();
+
                sched_pause();
        }
        /*NOTREACHED*/
@@ -290,6 +296,8 @@
 
        uvm.aiodoned_proc = curproc;
 
+       KERNEL_UNLOCK();
+
        for (;;) {
                /*
                 * Check for done aio structures. If we've got structures to
@@ -303,6 +311,8 @@
                TAILQ_INIT(&uvm.aio_done);
                mtx_leave(&uvm.aiodoned_lock);
 
+               KERNEL_LOCK();
+
                /* process each i/o that's done. */
                free = uvmexp.free;
                while (bp != NULL) {
@@ -321,6 +331,8 @@
                wakeup(free <= uvmexp.reserve_kernel ? &uvm.pagedaemon :
                    &uvmexp.free);
                uvm_unlock_fpageq();
+
+               KERNEL_UNLOCK();
        }
 }
 
