On Mon, May 15, 2017 at 11:45 -0400, Dan Cross wrote:
> On Mon, May 15, 2017 at 11:28 AM, Mike Belopuhov <m...@belopuhov.com> wrote:
> 
> > On Mon, May 15, 2017 at 11:18 -0400, Dan Cross wrote:
> > > On Mon, May 15, 2017 at 11:01 AM, Mike Belopuhov <m...@belopuhov.com> wrote:
> > > >
> > > > Thanks for reporting this, however there's not enough info to follow
> > > > up on this right now.  What is clear is that your provider is using
> > > > an ancient version of Xen that doesn't even support the callback
> > > > vector interrupt delivery (the emulated xspd0 device is delivering
> > > > all interrupts).  We have developed code for Xen 4.5+ platforms and
> > > > there was only some testing done by users on 3.x.  So, in a way, you
> > > > can consider Xen 3.x to not be officially supported at this point.
> > >
> > > That's unfortunate. Sadly, this is common across two different providers
> > > (Panix and rootbsd.net). The latter, I'm sure, would at least be interested
> > > in coordinating with you guys to get a fix. I'll open a trouble ticket with
> > > them.
> > >
> > > Having said that, I've got a few questions:
> > > >
> > > >  - Do you see other write failures as well?
> > >
> > > Yes. E.g., syslogd had a similar write failure before the panic.
> >
> > Can you reproduce any of these write failures at will?
> >
> 
> I'm not sure what you mean. If I induce the load conditions, then the VM
> will panic fairly reliably.
>

I was wondering if you have seen any other write errors apart
from those that cause the panic.

> > What happens when you just send a signal to dump the core?
> > You can test this by running "sleep 100", and then call
> > "pkill -ABRT -lf sleep".
> 
> 
> I'm not sure what this shows, but sure I can do that:
>

There are quite a number of different I/O codepaths in the
kernel and some are wonkier than others.

> : jaan; /bin/sleep 100&
> [1] 20701
> : jaan; pkill -ABRT -lf sleep
> 20701 sleep
> : jaan;
> [1]  + abort (core dumped)  /bin/sleep 100
> : jaan; ls -l sleep.core
> -rw-------  1 cross  staff  4208416 May 15 15:42 sleep.core
> : jaan;
> 
> The panic-inducing condition seems to be that, for whatever reason, the
> kernel gets into a funny state where processes like init(8) die due to
> having part of their VM image corrupted; the kernel then panics because
> `init` dies.
> 
> >  - Do you have swap enabled? (pstat -s)
> > >
> > >
> > > Yes; a gig:
> > >
> > > : jaan; pstat -s
> > > Device      1K-blocks     Used    Avail Capacity  Priority
> > > /dev/sd0b     1048249        0  1048249     0%    0
> > > : jaan;
> > >
> >
> > Do you see swap being used under your load?
> 
> 
> I'm not sure. I can try to crash a machine again and poke at a kernel
> var from ddb to see; anything in particular you want me to look at?
>

Indeed.  You can run a "show uvmexp" DDB command.
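
If it is easier to watch from userland while the load is running, a
rough sketch along these lines could log swap usage before the box goes
down.  It assumes the "pstat -s" column layout shown earlier in this
thread (Used is the third field); the function name swap_used_kb is
made up for illustration.

```shell
# Sum the "Used" column (3rd field) of `pstat -s` output read on stdin.
# Only per-device lines (starting with /dev/) are counted, so a "Total"
# summary line, if present, is not added twice.
swap_used_kb() {
    awk '$1 ~ /^\/dev\// { used += $3 } END { print used + 0 }'
}

# Hypothetical usage: sample once a second while inducing the load.
# while :; do pstat -s | swap_used_kb; sleep 1; done
```

A non-zero number right before the failure would suggest the swap path
is involved; zero throughout would point elsewhere.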

Please try running with the diff below.  It will log all polled
and bounced transfers as well as some additional info.
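
Once that kernel is running, the interesting lines are the completions
that come back with a non-zero status.  A minimal sketch for picking
them out of the message buffer, assuming the "completing desc ... with
error N" format of the DPRINTF in xbf_complete_cmd() in the diff; the
function name xbf_failed is made up for illustration.

```shell
# Print only xbf completion lines whose trailing status is non-zero.
# Reads kernel messages on stdin; the last field of a matching line is
# the rsp_status value printed by the "%d" at the end of the format.
xbf_failed() {
    awk '/completing desc/ && /with error/ && $NF != 0'
}

# Hypothetical usage:
# dmesg | xbf_failed
```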



diff --git sys/dev/pv/xbf.c sys/dev/pv/xbf.c
index d5c44770acb..29e7615d0fc 100644
--- sys/dev/pv/xbf.c
+++ sys/dev/pv/xbf.c
@@ -36,11 +36,11 @@
 #include <scsi/scsi_all.h>
 #include <scsi/cd.h>
 #include <scsi/scsi_disk.h>
 #include <scsi/scsiconf.h>
 
-/* #define XBF_DEBUG */
+#define XBF_DEBUG
 
 #ifdef XBF_DEBUG
 #define DPRINTF(x...)          printf(x)
 #else
 #define DPRINTF(x...)
@@ -478,10 +478,11 @@ xbf_load_xs(struct scsi_xfer *xs, int desc)
                sge->sge_first = i > 0 ? 0 :
                    ((vaddr_t)xs->data & PAGE_MASK) >> XBF_SEC_SHIFT;
                sge->sge_last = sge->sge_first +
                    (map->dm_segs[i].ds_len >> XBF_SEC_SHIFT) - 1;
 
+               if (ISSET(xs->flags, SCSI_POLL))
                DPRINTF("%s:   seg %d/%d ref %lu len %lu first %u last %u\n",
                    sc->sc_dev.dv_xname, i + 1, map->dm_nsegs,
                    map->dm_segs[i].ds_addr, map->dm_segs[i].ds_len,
                    sge->sge_first, sge->sge_last);
 
@@ -640,10 +641,11 @@ xbf_submit_cmd(struct scsi_xfer *xs)
        xrd->xrd_req.req_op = operation;
        xrd->xrd_req.req_unit = (uint16_t)sc->sc_unit;
        xrd->xrd_req.req_sector = lba;
 
        if (operation == XBF_OP_READ || operation == XBF_OP_WRITE) {
+               if (ISSET(xs->flags, SCSI_POLL))
                DPRINTF("%s: desc %d %s%s lba %llu nsec %u len %d\n",
                    sc->sc_dev.dv_xname, desc, operation == XBF_OP_READ ?
                    "read" : "write", ISSET(xs->flags, SCSI_POLL) ? "-poll" :
                    "", lba, nblk, xs->datalen);
 
@@ -718,10 +720,11 @@ xbf_complete_cmd(struct scsi_xfer *xs, int desc)
            BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
        bus_dmamap_unload(sc->sc_dmat, map);
 
        sc->sc_xs[desc] = NULL;
 
+       if (ISSET(xs->flags, SCSI_POLL))
        DPRINTF("%s: completing desc %d(%llu) op %u with error %d\n",
            sc->sc_dev.dv_xname, desc, xrd->xrd_rsp.rsp_id,
            xrd->xrd_rsp.rsp_op, xrd->xrd_rsp.rsp_status);
 
        id = xrd->xrd_rsp.rsp_id;
