Re: Not-so stable if you take a CAM error....

Karl Denninger Mon, 11 Jul 2016 06:47:34 -0700

On 7/11/2016 08:32, Ian Lepore wrote:
> On Mon, 2016-07-11 at 06:30 -0500, Karl Denninger wrote:
>> On 7/11/2016 02:57, Ronald Klop wrote:
>>> On Mon, 11 Jul 2016 02:54:38 +0200, Karl Denninger
>>> <k...@denninger.net> wrote:
>>>
>>>> Got a (nasty) surprise this afternoon on my sandbox machine.
>>>>
>>>> I was updating some Raspberry Pi2 machines which involved taking
>>>> the SD card out, sticking it in an adapter and plugging it into
>>>> the sandbox, then mounting the partition and using rsync.
>>>>
>>>> Unfortunately one of the cards was, unknown to me, bad and
>>>> returned a write error during the update.
>>>>
>>>> The machine panic'd immediately after the CAM write error popped
>>>> up.
>>>>
>>>> I was quite surprised by this, since (1) the SD card was (of
>>>> course) mounted as a UFS filesystem; it shows up as a CAM device,
>>>> (2) the machine itself is running off a ZFS root on a normal
>>>> host-adapter and thus there is no commingling of the buffer cache
>>>> and (3) there were no images being run from it (can't, wrong
>>>> architecture!) nor any system I/O (e.g. pagefile) going to the SD
>>>> card.
>>>>
>>>> I certainly understand that under some circumstances (maybe even
>>>> most circumstances) taking a hard I/O error to a system device is
>>>> going to hose you and a panic() is arguably "least astonishment"
>>>> when the price of being wrong might be a corrupted system file or
>>>> worse (e.g. corrupted paged-out RSS, etc.).  But I didn't expect a
>>>> panic out of a failed write to a device that is mounted and being
>>>> used purely for data.
>>>>
>>>> I don't have a crash dump but can almost-certainly reproduce this
>>>> if it's something that shouldn't happen and thus merits
>>>> investigation.
>>>>
>>> Hi,
>>>
>>> I understand you are surprised by this. I don't think it is the way
>>> it should work.
>>> Is there _any_ debugging information for people to use and try to
>>> help you? Like which FreeBSD version are you running? Which FreeBSD
>>> version was used to create the UFS fs? Does it use softupdates (SU)
>>> or also journaling (SU+J)?
>>> Maybe some output of dmesg? Or type of SD-card and reader. Other
>>> people might have similar problems with similar hardware.
>>>
>>> Regards,
>>> Ronald.
>>>
>> FreeBSD 11.0-BETA1 #0 r302489: Sat Jul  9 10:15:24 CDT 2016    
>> k...@newfs.denninger.net:/usr/obj/usr/src/sys/KSD-SMP
>>
>> and
>>
>> FreeBSD 11.0-BETA1 #0 r302526: Sun Jul 10 10:39:31 CDT 2016
>> k...@newfs.denninger.net:/pics/CrossBuild/obj/arm.armv6/pics/CrossBuild/src/sys/RPI2
>>
>> Both blew up in the same way when stimulated with the same I/O error.
>>
>> The filesystem in question does have softupdates enabled (the RPI
>> images have it turned on by default) but no journaling.  It's not
>> card/reader dependent nor architecture dependent; when it occurred
>> the first time I stuck the card and reader into one of my Pis and
>> attempted to update it there (thinking that perhaps my sandbox
>> machine's USB port was wonky) and it blew up the Pi2 in the exact
>> same way.
>>
>> This isn't (obviously, given both Intel-style and ARM machines being
>> involved) architecture dependent.
>>
>> It's been a good long while since I took an actual hard I/O error
>> that was 'visible' at the OS level (I've had plenty of disks die on
>> ZFS over the last few years but no "double failures" on a mirror or
>> similar, and on my servers I haven't had a UFS-based system for a
>> while).  This definitely looks like some sort of regression in the
>> code; I've run FreeBSD for a hell of a long time and have had plenty
>> of instances where disks have failed without having the machine go
>> out from under me.
>>
> Unfortunately, this is "just the way it works".  A hard IO error while
> writing to a ufs filesystem with softupdates enabled will cause a
> panic, because the softupdates code doesn't handle that sort of
> failure, and the failure means that filesystem integrity is lost.  The
> code has no idea how important the data is to the functioning of the
> system, no basis on which to decide whether to panic or not.
>
> -- Ian
>


Here's the backtrace.  It sounds like this is expected behavior, which
is not so good for a situation like this.  I guess the strategy is to
turn off softupdates on the card's filesystem before attempting such an
update, so that a problem with the card doesn't crash the host machine.

root@Dbms2:/var/crash # kgdb /boot/kernel/kernel vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: initiate_write_inodeblock_ufs2: already started
cpuid = 14
KDB: stack backtrace:
#0 0xffffffff80b1f357 at kdb_backtrace+0x67
#1 0xffffffff80ad6ec2 at vpanic+0x182
#2 0xffffffff80ad6d33 at panic+0x43
#3 0xffffffff80dc16ad at softdep_disk_io_initiation+0x159d
#4 0xffffffff80de61eb at ffs_geom_strategy+0x13b
#5 0xffffffff80b872f7 at bufwrite+0x267
#6 0xffffffff80b8ac6a at vfs_bio_awrite+0x3ca
#7 0xffffffff80b96b77 at vop_stdfsync+0x277
#8 0xffffffff80983766 at devfs_fsync+0x26
#9 0xffffffff81101f7d at VOP_FSYNC_APV+0x8d
#10 0xffffffff80baf1ae at sched_sync+0x3be
#11 0xffffffff80a8dcb5 at fork_exit+0x85
#12 0xffffffff80f7f85e at fork_trampoline+0xe
Uptime: 27m9s


(kgdb) where
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
#1  0xffffffff80ad6949 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ad6efb in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80ad6d33 in panic (fmt=0x0)
    at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80dc16ad in softdep_disk_io_initiation (bp=<value optimized out>)
    at /usr/src/sys/ufs/ffs/ffs_softdep.c:10301
#5  0xffffffff80de61eb in ffs_geom_strategy (bo=<value optimized out>,
    bp=<value optimized out>) at buf.h:412
#6  0xffffffff80b872f7 in bufwrite (bp=0xfffffe02e8629b30) at buf.h:405
#7  0xffffffff80b8ac6a in vfs_bio_awrite (bp=<value optimized out>)
    at buf.h:393
#8  0xffffffff80b96b77 in vop_stdfsync (ap=0xfffffe034f481b68)
    at /usr/src/sys/kern/vfs_default.c:692
#9  0xffffffff80983766 in devfs_fsync (ap=0xfffffe034f481b68)
    at /usr/src/sys/fs/devfs/devfs_vnops.c:702
#10 0xffffffff81101f7d in VOP_FSYNC_APV (vop=<value optimized out>,
    a=<value optimized out>) at vnode_if.c:1331
#11 0xffffffff80baf1ae in sched_sync () at vnode_if.h:549
#12 0xffffffff80a8dcb5 in fork_exit (callout=0xffffffff80baedf0 <sched_sync>,
    arg=0x0, frame=0xfffffe034f481c00) at /usr/src/sys/kern/kern_fork.c:1038
#13 0xffffffff80f7f85e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:611
#14 0x0000000000000000 in ?? ()
(kgdb)

FreeBSD 11.0-BETA1 #0 r302439: Fri Jul  8 14:37:27 CDT 2016    
k...@dbms2.denninger.net:/usr/obj/usr/src/sys/GENERIC

The offending code line:

static void
initiate_write_inodeblock_ufs2(inodedep, bp)
        struct inodedep *inodedep;
        struct buf *bp;                 /* The inode block */
{
        struct allocdirect *adp, *lastadp;
        struct ufs2_dinode *dp;
        struct ufs2_dinode *sip;
        struct inoref *inoref;
        struct ufsmount *ump;
        struct fs *fs;
        ufs_lbn_t i;
#ifdef INVARIANTS
        ufs_lbn_t prevlbn = 0;
#endif
        int deplist;

        if (inodedep->id_state & IOSTARTED)
                panic("initiate_write_inodeblock_ufs2: already started");
        inodedep->id_state |= IOSTARTED;
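
To make the check above a bit more concrete, here is a minimal userland
sketch in C of how a retried write could trip it.  This is not the kernel
code: toy_inodedep, toy_initiate_write, toy_write_done and
toy_retry_after_error are hypothetical names, the IOSTARTED value is only
illustrative, and the sketch assumes (the thread does not confirm this)
that a failed write leaves IOSTARTED set, so that the syncer's retry
(sched_sync in the backtrace) finds the flag still set.

```c
/* Illustrative flag value only; the real one is defined in the kernel. */
#define IOSTARTED 0x0200

/* Toy stand-in for struct inodedep; only the state word is modelled. */
struct toy_inodedep {
        unsigned int id_state;
};

/*
 * Mirrors the shape of the check in the excerpt above.
 * Returns 0 if the write may start, -1 where the kernel calls panic().
 */
int
toy_initiate_write(struct toy_inodedep *dep)
{
        if (dep->id_state & IOSTARTED)
                return (-1);
        dep->id_state |= IOSTARTED;
        return (0);
}

/*
 * Toy completion handler.  ASSUMPTION: the flag is cleared only when the
 * I/O succeeded, so an error leaves IOSTARTED set.
 */
void
toy_write_done(struct toy_inodedep *dep, int io_error)
{
        if (io_error == 0)
                dep->id_state &= ~IOSTARTED;
}

/* First write fails with an error, then the same block is written again. */
int
toy_retry_after_error(void)
{
        struct toy_inodedep dep = { 0 };

        if (toy_initiate_write(&dep) != 0)
                return (1);             /* not expected on a fresh dep */
        toy_write_done(&dep, 5);        /* any non-zero value: I/O error */
        return (toy_initiate_write(&dep));
}
```

Under those assumptions toy_retry_after_error() returns -1, i.e. the
second attempt lands on the "already started" branch, whereas a write
that completes without error can be started again normally.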


-- 
Karl Denninger
k...@denninger.net <mailto:k...@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/

