Hi Greg,
Thank you very much for your insight.
Let me give you a little bit more background on the issue.
We (well, one of our customers) observed a reproducible crash on an attempt
to import a pool.
Upon analysis of the cores from several import attempts we have discovered
the following:
1. ZFS panics trying to import the pool:
> ::status
debugging crash dump vmcore.0 (64-bit) from ConfigD-Bot
operating system: 5.11 NexentaOS_134f (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff01e9d07780 addr=38 occurred in
module
"zfs" due to a NULL pointer dereference
> $C
ffffff01e9d07880 vdev_is_dead+0xc(0)
ffffff01e9d078a0 vdev_readable+0x16(0)
ffffff01e9d078e0 vdev_mirror_child_select+0x61(ffffff434f4cb010)
ffffff01e9d07920 vdev_mirror_io_done+0x1d7(ffffff434f4cb010)
ffffff01e9d07960 zio_vdev_io_done+0x13f(ffffff434f4cb010)
ffffff01e9d07990 zio_execute+0x8d(ffffff434f4cb010)
ffffff01e9d079f0 zio_notify_parent+0xa6(ffffff434f4cb010, ffffff437d80b1d0,
1)
ffffff01e9d07a60 zio_done+0x3d6(ffffff437d80b1d0)
ffffff01e9d07a90 zio_execute+0x8d(ffffff437d80b1d0)
ffffff01e9d07af0 zio_notify_parent+0xa6(ffffff437d80b1d0, ffffff437c429570,
1)
ffffff01e9d07b60 zio_done+0x3d6(ffffff437c429570)
ffffff01e9d07b90 zio_execute+0x8d(ffffff437c429570)
ffffff01e9d07c40 taskq_thread+0x285(ffffff43754a56f8)
ffffff01e9d07c50 thread_start+8()
> ffffff437c429570::print -ta zio_t io_spa
ffffff437c4295b8 spa_t *io_spa = 0xffffff4375de3500
> 0xffffff4375de3500::spa -v
ADDR STATE NAME
ffffff4375de3500 ACTIVE $import
ADDR STATE AUX DESCRIPTION
ffffff4313277000 HEALTHY - root
ffffff4348698800 HEALTHY - raidz
ffffff434869e000 HEALTHY -
/dev/dsk/c0t5000C50040DF9523d0s0
ffffff434869e800 HEALTHY -
/dev/dsk/c0t5000C50056D4065Fd0s0
ffffff43486a1000 HEALTHY -
/dev/dsk/c0t5000C50056CC41EBd0s0
ffffff43486a1800 HEALTHY -
/dev/dsk/c0t5000C50056CC7B43d0s0
ffffff43131ae000 HEALTHY -
/dev/dsk/c0t5000C50056D4113Fd0s0
ffffff434869b000 HEALTHY -
/dev/dsk/c0t5000C50056C8AB03d0s0
ffffff43486a2000 HEALTHY - raidz
ffffff43486a8800 HEALTHY -
/dev/dsk/c0t5000C50056CC2FABd0s0
ffffff434868e000 HEALTHY -
/dev/dsk/c0t5000C50056CC37CBd0s0
ffffff4312fef000 HEALTHY -
/dev/dsk/c0t5000C50056D42257d0s0
ffffff43486ac800 HEALTHY -
/dev/dsk/c0t5000C50056CC74CBd0s0
ffffff43139a6000 HEALTHY -
/dev/dsk/c0t5000C50056CA3F4Bd0s0
ffffff434867d800 HEALTHY -
/dev/dsk/c0t5000C50056D46907d0s0
ffffff4348680000 HEALTHY - raidz
ffffff434866e000 HEALTHY -
/dev/dsk/c0t5000C50056D42557d0s0
ffffff434866e800 HEALTHY -
/dev/dsk/c0t5000C50056CC129Bd0s0
ffffff434866f000 HEALTHY -
/dev/dsk/c0t5000C50056D47573d0s0
ffffff434866f800 HEALTHY -
/dev/dsk/c0t5000C50056D468E7d0s0
ffffff4348671000 HEALTHY -
/dev/dsk/c0t5000C50056CC591Fd0s0
ffffff4348671800 HEALTHY -
/dev/dsk/c0t5000C50056CC5953d0s0
ffffff4348676000 HEALTHY - raidz
ffffff4348676800 HEALTHY -
/dev/dsk/c0t5000C50056CA514Fd0s0
ffffff4348677000 HEALTHY -
/dev/dsk/c0t5000C50056CA3C87d0s0
ffffff4348677800 HEALTHY -
/dev/dsk/c0t5000C50056CC1D37d0s0
ffffff434867a000 HEALTHY -
/dev/dsk/c0t5000C50056D403EFd0s0
ffffff434867a800 HEALTHY -
/dev/dsk/c0t5000C50056C84E7Fd0s0
ffffff431375e800 HEALTHY -
/dev/dsk/c0t5000C50056CC303Bd0s0
2. The panic happens during an attempt to read the root block:
> ffffff434f4cb010::print -ta zio_t io_bp
ffffff434f4cb060 blkptr_t *io_bp = 0xffffff434f4cb070
> 0xffffff434f4cb070::print -ta blkptr_t
ffffff434f4cb070 blkptr_t {
ffffff434f4cb070 dva_t [3] blk_dva = [
ffffff434f4cb070 dva_t {
ffffff434f4cb070 uint64_t [2] dva_word = [ 0xb00000001, 0xda ]
},
ffffff434f4cb080 dva_t {
ffffff434f4cb080 uint64_t [2] dva_word = [ 0x1, 0xd9 ]
},
ffffff434f4cb090 dva_t {
ffffff434f4cb090 uint64_t [2] dva_word = [ 0x100000001, 0xa7 ]
},
]
ffffff434f4cb0a0 uint64_t blk_prop = 0x800b070300000003
ffffff434f4cb0a8 uint64_t [2] blk_pad = [ 0, 0 ]
ffffff434f4cb0b8 uint64_t blk_phys_birth = 0
ffffff434f4cb0c0 uint64_t blk_birth = 0x1dc
ffffff434f4cb0c8 uint64_t blk_fill = 0x2d
ffffff434f4cb0d0 zio_cksum_t blk_cksum = {
ffffff434f4cb0d0 uint64_t [4] zc_word = [ 0x12a3a2c350,
0x66e904f03b5, 0x129109f48cb2e, 0x254df3b606d6c8 ]
}
}
> 0xffffff4375de3500::print -ta spa_t spa_ubsync
ffffff4375de3948 uberblock_t spa_ubsync = {
ffffff4375de3948 uint64_t spa_ubsync.ub_magic = 0xbab10c
ffffff4375de3950 uint64_t spa_ubsync.ub_version = 0x1c
ffffff4375de3958 uint64_t spa_ubsync.ub_txg = 0x1dc
ffffff4375de3960 uint64_t spa_ubsync.ub_guid_sum = 0xd8cae927954137fa
ffffff4375de3968 uint64_t spa_ubsync.ub_timestamp = 0x528da678
ffffff4375de3970 blkptr_t spa_ubsync.ub_rootbp = {
ffffff4375de3970 dva_t [3] blk_dva = [
ffffff4375de3970 dva_t {
ffffff4375de3970 uint64_t [2] dva_word = [ 0xb00000001,
0xda ]
},
ffffff4375de3980 dva_t {
ffffff4375de3980 uint64_t [2] dva_word = [ 0x1, 0xd9 ]
},
ffffff4375de3990 dva_t {
ffffff4375de3990 uint64_t [2] dva_word = [ 0x100000001,
0xa7 ]
},
]
ffffff4375de39a0 uint64_t blk_prop = 0x800b070300000003
ffffff4375de39a8 uint64_t [2] blk_pad = [ 0, 0 ]
ffffff4375de39b8 uint64_t blk_phys_birth = 0
ffffff4375de39c0 uint64_t blk_birth = 0x1dc
ffffff4375de39c8 uint64_t blk_fill = 0x2d
ffffff4375de39d0 zio_cksum_t blk_cksum = {
ffffff4375de39d0 uint64_t [4] zc_word = [ 0x12a3a2c350,
0x66e904f03b5, 0x129109f48cb2e,
0x254df3b606d6c8 ]
}
}
ffffff4375de39f0 uint64_t spa_ubsync.ub_software_version = 0x1c
}
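(As a side note, blk_prop = 0x800b070300000003 decodes consistently with a root objset block: type 11 (DMU_OT_OBJSET), level 0, checksum 7 (fletcher4), compression 3 (lzjb), lsize 2K / psize 512. The sketch below is an illustrative re-implementation of the relevant field extraction, with offsets as I read them from the v28-era blkptr layout in spa.h; it is not the real BP_GET_* macros.)

```c
#include <stdint.h>

/* Illustrative decode of blk_prop (field offsets per the v28-era
 * on-disk blkptr layout in spa.h; simplified stand-ins for the
 * real BP_GET_* macros, shifted sizes assume 512-byte sectors). */
static uint64_t
bf64_get(uint64_t v, int low, int len)
{
	return ((v >> low) & ((1ULL << len) - 1));
}

#define BP_LSIZE(p)	((bf64_get((p), 0, 16) + 1) << 9)	/* stored as sectors-1 */
#define BP_PSIZE(p)	((bf64_get((p), 16, 16) + 1) << 9)
#define BP_COMP(p)	bf64_get((p), 32, 8)	/* 3 == lzjb */
#define BP_CKSUM(p)	bf64_get((p), 40, 8)	/* 7 == fletcher4 */
#define BP_TYPE(p)	bf64_get((p), 48, 8)	/* 11 == DMU_OT_OBJSET */
#define BP_LEVEL(p)	bf64_get((p), 56, 5)
```

Every field decodes to a sensible value for an uberblock's root bp, which suggests the damage is confined to the dva.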
3. The root block has three copies on three vdevs.
Looking at blk_dva[0], it is apparent that the dva is corrupt: it points at
vdev 0xb (i.e. 11), while the pool has only 4 top-level vdevs. My
speculation here is that the original vdev id was 3 and a single bit got
flipped (0b0011 -> 0b1011).
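The arithmetic checks out: the vdev id sits in the upper 32 bits of dva_word[0] (cf. DVA_GET_VDEV in spa.h), so bit 3 of the id is bit 35 of the word. A simplified stand-in for the macro, just to show the numbers:

```c
#include <stdint.h>

/* Simplified stand-in for DVA_GET_VDEV: the vdev id occupies the
 * upper 32 bits of dva_word[0] (cf. spa.h).  Not the real macro. */
static uint64_t
dva_get_vdev(uint64_t dva_word0)
{
	return (dva_word0 >> 32);
}
```

With the word from the dump, dva_get_vdev(0xb00000001) yields 11, and flipping bit 35 of 0x300000001 produces exactly 0xb00000001.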
4. The out-of-range dva causes vdev_lookup_top() to return NULL for the
first device in the mirror map.
The subsequent attempt to read from that NULL device crashes the system, as
is apparent from the stack above.
The fix we implemented is pretty much trivial: we made vdev_readable() and
vdev_writeable() return FALSE if the vdev is NULL.
This, however, covers only a subset of the cases in which vdev_lookup_top()
may return NULL due to an out-of-range dva.
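For concreteness, the guard amounts to roughly the following (an illustrative sketch, not the actual vdev.c diff; the vdev_t definition and the state values here are stand-ins for the real ones):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real vdev_t and state enum from the ZFS headers. */
typedef struct vdev { int vdev_state; } vdev_t;
enum { VDEV_STATE_DEGRADED = 6, VDEV_STATE_HEALTHY = 7 };

static bool
vdev_is_dead(const vdev_t *vd)
{
	/* The added guard: an out-of-range dva makes vdev_lookup_top()
	 * return NULL, so a NULL vdev must count as dead rather than
	 * being dereferenced. */
	if (vd == NULL)
		return (true);
	return (vd->vdev_state < VDEV_STATE_DEGRADED);
}

static bool
vdev_readable(const vdev_t *vd)
{
	return (!vdev_is_dead(vd));	/* now safe when vd == NULL */
}
```

The change to vdev_writeable() is analogous.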
Hence my original question about the other places in the code (including
the metaslab allocator) that do not check the return value of
vdev_lookup_top() before dereferencing it.
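The more general pattern would be to make every caller treat a NULL result as an error at lookup time. A toy model of that idea (all names are hypothetical, and vdev_lookup_top() here is a simplified stand-in for the real function, which walks spa_root_vdev):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the vdev tree: an array of top-level vdevs plus a
 * count, loosely mirroring spa_root_vdev->{vdev_child,vdev_children}. */
typedef struct vdev { uint64_t vdev_id; } vdev_t;

typedef struct spa {
	vdev_t	*spa_child[4];	/* 4 top-level raidz vdevs, as in the pool above */
	uint64_t spa_children;
} spa_t;

/* Stand-in for vdev_lookup_top(): NULL for an out-of-range vdev id. */
static vdev_t *
vdev_lookup_top(spa_t *spa, uint64_t vdev)
{
	if (vdev >= spa->spa_children)
		return (NULL);
	return (spa->spa_child[vdev]);
}

/* Hypothetical defensive wrapper: turn a corrupt dva into ENXIO at
 * the call site instead of a NULL dereference deep in the zio stack. */
static int
vdev_lookup_top_checked(spa_t *spa, uint64_t vdev, vdev_t **vdp)
{
	vdev_t *vd = vdev_lookup_top(spa, vdev);
	if (vd == NULL)
		return (ENXIO);
	*vdp = vd;
	return (0);
}
```

With such a wrapper, the import path would fail the read with an error instead of panicking.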
Hope this provides enough background on the issue.
I can upload one or two of the cores we have somewhere, if they are of any
interest.
Thanks,
Ilya.
On Fri, Dec 27, 2013 at 9:37 AM, George Wilson <[email protected]> wrote:
> Ilya,
>
> I think there is a bigger problem here that your vdev_mirror() change
> would cover up. The problem is that we should never have a bp with a dva
> that is out of range. So we need to understand how that is occurring.
>
> As for your question about metaslab_allocate_dva(), I'm assuming you're
> referring to this code path:
>
> } else if (d != 0) {
> vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
> mg = vd->vdev_mg->mg_next;
>
> This vdev can't be NULL since the configuration should not have changed
> while we're in the middle of doing allocations.
>
> Even for gang blocks we should never have a NULL vdev. I'm assuming you're
> referring to this logic:
>
> if (vd != NULL) {
> mg = vd->vdev_mg;
>
> if (flags & METASLAB_HINTBP_AVOID &&
> mg->mg_next != NULL)
> mg = mg->mg_next;
> } else {
> mg = mc->mc_rotor;
> }
>
> This is meant to handle log devices, which can be removed, since the ZIL may
> have a reference to an old log block on a device that no longer exists.
>
> I think we need to find the root cause of where this out-of-range dva is
> coming from. If you have old core files that you could share, please point
> me at them.
>
> Thanks,
> George
>
>
> On 12/18/13 1:33 PM, Ilya Usvyatsky wrote:
>
> I am looking at a rare but nasty case of corruption in which a block
> pointer has one of its ditto dvas out of range.
> This causes vdev_lookup_top() to return NULL for the vdev pointer.
> Going through the callers of vdev_lookup_top(), I noticed that in some
> cases a NULL vdev is handled properly, while in others it is not.
> In particular, in the case of an import, ditto blocks are handled by a mirror
> vdev code path that does not handle NULL properly and would panic
> the system.
>
> My question, though, is not about the mirror (for which I have a fix that
> I will upstream eventually), but about the metaslab allocator.
> In particular, I am looking at metaslab_allocate() and
> metaslab_allocate_dva().
> In the latter function a NULL vdev is properly handled in the case of a
> gang block (since a hint may be stale and the device may be gone).
> However, that very same function does not handle a NULL vdev in the case
> of a regular block pointer with ditto blocks.
> The dilemma I am facing is that I can either just use the rotor (i.e.,
> mimic the gang block behavior) or return an error immediately.
> In the latter case the caller, metaslab_allocate(), would handle it
> properly.
> I am inclined to go with the second option, but I would greatly
> appreciate insight on this from someone who is familiar with the
> internal logic and the theory behind the metaslab allocator.
>
> Thank you very much,
> Ilya.
>
>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer