Hi Greg,
Thank you very much for your insight.
Let me give you a little bit more background on the issue.

We (ok, one of our customers) observed a reproducible crash on an attempt
to import a pool.
Upon analyzing the cores from several import attempts, we discovered
the following:

1. ZFS panics trying to import the pool:

> ::status
debugging crash dump vmcore.0 (64-bit) from ConfigD-Bot
operating system: 5.11 NexentaOS_134f (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff01e9d07780 addr=38 occurred in
module
 "zfs" due to a NULL pointer dereference

> $C
ffffff01e9d07880 vdev_is_dead+0xc(0)
ffffff01e9d078a0 vdev_readable+0x16(0)
ffffff01e9d078e0 vdev_mirror_child_select+0x61(ffffff434f4cb010)
ffffff01e9d07920 vdev_mirror_io_done+0x1d7(ffffff434f4cb010)
ffffff01e9d07960 zio_vdev_io_done+0x13f(ffffff434f4cb010)
ffffff01e9d07990 zio_execute+0x8d(ffffff434f4cb010)
ffffff01e9d079f0 zio_notify_parent+0xa6(ffffff434f4cb010, ffffff437d80b1d0,
1)
ffffff01e9d07a60 zio_done+0x3d6(ffffff437d80b1d0)
ffffff01e9d07a90 zio_execute+0x8d(ffffff437d80b1d0)
ffffff01e9d07af0 zio_notify_parent+0xa6(ffffff437d80b1d0, ffffff437c429570,
1)
ffffff01e9d07b60 zio_done+0x3d6(ffffff437c429570)
ffffff01e9d07b90 zio_execute+0x8d(ffffff437c429570)
ffffff01e9d07c40 taskq_thread+0x285(ffffff43754a56f8)
ffffff01e9d07c50 thread_start+8()

> ffffff437c429570::print -ta zio_t io_spa
ffffff437c4295b8 spa_t *io_spa = 0xffffff4375de3500

> 0xffffff4375de3500::spa -v
ADDR                 STATE NAME
ffffff4375de3500    ACTIVE $import

    ADDR             STATE     AUX          DESCRIPTION
    ffffff4313277000 HEALTHY   -            root
    ffffff4348698800 HEALTHY   -              raidz
    ffffff434869e000 HEALTHY   -
/dev/dsk/c0t5000C50040DF9523d0s0
    ffffff434869e800 HEALTHY   -
/dev/dsk/c0t5000C50056D4065Fd0s0
    ffffff43486a1000 HEALTHY   -
/dev/dsk/c0t5000C50056CC41EBd0s0
    ffffff43486a1800 HEALTHY   -
/dev/dsk/c0t5000C50056CC7B43d0s0
    ffffff43131ae000 HEALTHY   -
/dev/dsk/c0t5000C50056D4113Fd0s0
    ffffff434869b000 HEALTHY   -
/dev/dsk/c0t5000C50056C8AB03d0s0
    ffffff43486a2000 HEALTHY   -              raidz
    ffffff43486a8800 HEALTHY   -
/dev/dsk/c0t5000C50056CC2FABd0s0
    ffffff434868e000 HEALTHY   -
/dev/dsk/c0t5000C50056CC37CBd0s0
    ffffff4312fef000 HEALTHY   -
/dev/dsk/c0t5000C50056D42257d0s0
    ffffff43486ac800 HEALTHY   -
/dev/dsk/c0t5000C50056CC74CBd0s0
    ffffff43139a6000 HEALTHY   -
/dev/dsk/c0t5000C50056CA3F4Bd0s0
    ffffff434867d800 HEALTHY   -
/dev/dsk/c0t5000C50056D46907d0s0
    ffffff4348680000 HEALTHY   -              raidz
    ffffff434866e000 HEALTHY   -
/dev/dsk/c0t5000C50056D42557d0s0
    ffffff434866e800 HEALTHY   -
/dev/dsk/c0t5000C50056CC129Bd0s0
    ffffff434866f000 HEALTHY   -
/dev/dsk/c0t5000C50056D47573d0s0
    ffffff434866f800 HEALTHY   -
/dev/dsk/c0t5000C50056D468E7d0s0
    ffffff4348671000 HEALTHY   -
/dev/dsk/c0t5000C50056CC591Fd0s0
    ffffff4348671800 HEALTHY   -
/dev/dsk/c0t5000C50056CC5953d0s0
    ffffff4348676000 HEALTHY   -              raidz
    ffffff4348676800 HEALTHY   -
/dev/dsk/c0t5000C50056CA514Fd0s0
    ffffff4348677000 HEALTHY   -
/dev/dsk/c0t5000C50056CA3C87d0s0
    ffffff4348677800 HEALTHY   -
/dev/dsk/c0t5000C50056CC1D37d0s0
    ffffff434867a000 HEALTHY   -
/dev/dsk/c0t5000C50056D403EFd0s0
    ffffff434867a800 HEALTHY   -
/dev/dsk/c0t5000C50056C84E7Fd0s0
    ffffff431375e800 HEALTHY   -
/dev/dsk/c0t5000C50056CC303Bd0s0

2. The panic happens during an attempt to read the root block (io_bp matches
spa_ubsync.ub_rootbp, as shown by the output below):

> ffffff434f4cb010::print -ta zio_t io_bp
ffffff434f4cb060 blkptr_t *io_bp = 0xffffff434f4cb070

> 0xffffff434f4cb070::print -ta blkptr_t
ffffff434f4cb070 blkptr_t {
    ffffff434f4cb070 dva_t [3] blk_dva = [
        ffffff434f4cb070 dva_t {
            ffffff434f4cb070 uint64_t [2] dva_word = [ 0xb00000001, 0xda ]
        },
        ffffff434f4cb080 dva_t {
            ffffff434f4cb080 uint64_t [2] dva_word = [ 0x1, 0xd9 ]
        },
        ffffff434f4cb090 dva_t {
            ffffff434f4cb090 uint64_t [2] dva_word = [ 0x100000001, 0xa7 ]
        },
    ]
    ffffff434f4cb0a0 uint64_t blk_prop = 0x800b070300000003
    ffffff434f4cb0a8 uint64_t [2] blk_pad = [ 0, 0 ]
    ffffff434f4cb0b8 uint64_t blk_phys_birth = 0
    ffffff434f4cb0c0 uint64_t blk_birth = 0x1dc
    ffffff434f4cb0c8 uint64_t blk_fill = 0x2d
    ffffff434f4cb0d0 zio_cksum_t blk_cksum = {
        ffffff434f4cb0d0 uint64_t [4] zc_word = [ 0x12a3a2c350,
0x66e904f03b5, 0x129109f48cb2e, 0x254df3b606d6c8 ]
    }
}

> 0xffffff4375de3500::print -ta spa_t spa_ubsync
ffffff4375de3948 uberblock_t spa_ubsync = {
    ffffff4375de3948 uint64_t spa_ubsync.ub_magic = 0xbab10c
    ffffff4375de3950 uint64_t spa_ubsync.ub_version = 0x1c
    ffffff4375de3958 uint64_t spa_ubsync.ub_txg = 0x1dc
    ffffff4375de3960 uint64_t spa_ubsync.ub_guid_sum = 0xd8cae927954137fa
    ffffff4375de3968 uint64_t spa_ubsync.ub_timestamp = 0x528da678
    ffffff4375de3970 blkptr_t spa_ubsync.ub_rootbp = {
        ffffff4375de3970 dva_t [3] blk_dva = [
            ffffff4375de3970 dva_t {
                ffffff4375de3970 uint64_t [2] dva_word = [ 0xb00000001,
0xda ]
            },
            ffffff4375de3980 dva_t {
                ffffff4375de3980 uint64_t [2] dva_word = [ 0x1, 0xd9 ]
            },
            ffffff4375de3990 dva_t {
                ffffff4375de3990 uint64_t [2] dva_word = [ 0x100000001,
0xa7 ]
            },
        ]
        ffffff4375de39a0 uint64_t blk_prop = 0x800b070300000003
        ffffff4375de39a8 uint64_t [2] blk_pad = [ 0, 0 ]
        ffffff4375de39b8 uint64_t blk_phys_birth = 0
        ffffff4375de39c0 uint64_t blk_birth = 0x1dc
        ffffff4375de39c8 uint64_t blk_fill = 0x2d
        ffffff4375de39d0 zio_cksum_t blk_cksum = {
            ffffff4375de39d0 uint64_t [4] zc_word = [ 0x12a3a2c350,
0x66e904f03b5, 0x129109f48cb2e,
0x254df3b606d6c8 ]
        }
    }
    ffffff4375de39f0 uint64_t spa_ubsync.ub_software_version = 0x1c
}

3. The root block has three copies on three vdevs.
Looking at blk_dva[0], it is apparent that its DVA is corrupt: it points at
vdev 0xb (i.e. 11), while the pool only has 4 top-level vdevs. My
speculation is that the original vdev id was 3 and one bit got flipped
(0b0011 -> 0b1011).

4. The out-of-range DVA causes vdev_lookup_top() to return NULL for the
first device in the mirror map.
The subsequent attempt to read from the NULL device causes the crash, as is
apparent from the stack above.

The fix we implemented is pretty much trivial: we made
vdev_readable() and vdev_writeable() return FALSE if the vdev is NULL.

This, however, only covers a subset of the cases where vdev_lookup_top() may
return NULL due to an out-of-range DVA.
Hence my original question about the other places in which the code
(including the metaslab allocator) does not check the return value of
vdev_lookup_top() before dereferencing it.

I hope this provides enough background on the issue.
I can upload one or two of the cores we have somewhere if they are of any
interest.

Thanks,
Ilya.



On Fri, Dec 27, 2013 at 9:37 AM, George Wilson <[email protected]> wrote:

>  Ilya,
>
> I think there is a bigger problem here that your vdev_mirror() change
> would cover up. The problem is that we should never have a bp with a dva
> that is out of range. So we need to understand how that is occurring.
>
> As for your question about metaslab_allocate_dva(), I'm assuming you're
> referring to this code path:
>
>         } else if (d != 0) {
>                 vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
>                 mg = vd->vdev_mg->mg_next;
>
> This vdev can't be NULL since the configuration should not have changed
> while we're in the middle of doing allocations.
>
> Even for gang blocks we should never have a NULL vdev. I'm assuming you're
> referring to this logic:
>
>                 if (vd != NULL) {
>                         mg = vd->vdev_mg;
>
>                         if (flags & METASLAB_HINTBP_AVOID &&
>                             mg->mg_next != NULL)
>                                 mg = mg->mg_next;
>                 } else {
>                         mg = mc->mc_rotor;
>                 }
>
> This is meant to handle log devices, which can be removed, since the ZIL may
> have a reference to an old log block on a device that no longer exists.
>
> I think we need to find the root cause of where this out of range dva is
> coming from. If you have old core files that you could share please point
> me at them.
>
> Thanks,
> George
>
>
> On 12/18/13 1:33 PM, Ilya Usvyatsky wrote:
>
> I am looking at a rare but nasty case of corruption in which a block
> pointer has one of its ditto dva's out of range.
> This causes vdev_lookup_top() to return NULL for a vdev pointer.
> Going through the callers of vdev_lookup_top(), I noticed that in some
> cases a NULL vdev is handled properly, while in others it is not.
> In particular, in the case of an import, ditto blocks are handled by a
> mirror vdev code path that does not have proper handling for NULL and
> would panic the system.
>
> My question, though, is not about the mirror (for which I have a fix that
> I will upstream eventually), but about the metaslab allocator.
> In particular, I am looking at metaslab_allocate() and
> metaslab_allocate_dva().
> In the latter function a NULL vdev is properly handled in the case of a
> gang block (since a hint may be stale and the device may be gone).
> However, that very same function does not handle a NULL vdev in the case
> of a regular block pointer with ditto blocks.
> The dilemma I am facing here is that I can either just use the rotor (i.e.,
> mimic the gang block behavior) or return an error immediately.
> In the latter case, the caller, metaslab_allocate(), would handle it
> properly.
> I am inclined to go with the second option, but I would greatly
> appreciate insight on this from someone who is familiar with the
> internal logic and the theory behind the metaslab allocator.
>
>  Thank you very much,
>  Ilya.
>
>
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
