On 17/10/2017 23:21, Matthew Ahrens wrote:
> I think this is because zei_histogram_set/cleared are arrays of uint16_t.
> These are counts of (64-bit) words that have the bit set/cleared when it
> shouldn't be (the index into the histogram is the bit number in the word).
> See update_histogram() for details.
> 
> So I think this assertion is valid since it ensures the histogram bucket does
> not overflow (exceed UINT16_MAX).

Thank you for the detailed explanation.
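
To make the overflow concrete: a 512KB block is 524288 / 8 = 65536 (0x10000)
64-bit words, so if the same bit position differs in every word of the block,
a single uint16_t bucket is incremented 65536 times and wraps around.  Here is
a minimal standalone sketch of the counting logic as I understand it from
update_histogram() (simplified, not the actual zfs_fm.c code):

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch of the per-bit-position histogram built by
 * annotate_ecksum(): hist[i] counts how many 64-bit words of the
 * block have bit i set (or cleared) when it should not be.  With
 * uint16_t buckets, a block of more than UINT16_MAX words can
 * overflow a bucket, which is what the assertion guards against.
 */
static void
update_histogram_sketch(uint64_t bad_bits, uint16_t *hist)
{
        for (int i = 0; i < 64; i++) {
                if (bad_bits & (1ULL << i))
                        hist[i]++;      /* wraps silently past 65535 */
        }
}

int
main(void)
{
        uint16_t hist[64] = { 0 };
        size_t size = 524288;                           /* 512KB block */
        size_t nui64s = size / sizeof (uint64_t);       /* 65536 words */

        /* Worst case: the same bit is wrong in every word of the block. */
        for (size_t w = 0; w < nui64s; w++)
                update_histogram_sketch(1ULL << 0, hist);

        /* 65536 increments of a uint16_t wrap back to 0. */
        printf("hist[0] = %u after %zu increments\n", hist[0], nui64s);
        return (0);
}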

> I see 2 options for fixing this:
> 
> 1. Allow the bucket to overflow "gracefully", e.g. capping it at UINT16_MAX
> rather than having it wrap around.

This is a super safe option, of course, as it completely isolates the change.
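
If we go that route, the fix could be as small as a saturating increment in
update_histogram().  A sketch of the idea, not a tested patch:

#include <stdint.h>

/*
 * Option 1 sketch: saturate each bucket at UINT16_MAX instead of
 * letting it wrap.  A saturated bucket then means "at least 65535"
 * for pathological blocks, and the payload format stays unchanged.
 */
static void
update_histogram_capped(uint64_t bad_bits, uint16_t *hist)
{
        for (int i = 0; i < 64; i++) {
                if ((bad_bits & (1ULL << i)) && hist[i] < UINT16_MAX)
                        hist[i]++;
        }
}

The ASSERT3U(nui64s, <=, UINT16_MAX) could then be removed along with it.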


> 2. Change the zei_histogram_set/cleared to uint32_t's.  I'm not sure the
> impact this would have to the fm_payload or consumers of it.

I tried to analyze the potential impact.
Each array currently has a size of 128 bytes (64 bit positions * 2 bytes).
They are posted as nvlist elements of type DATA_TYPE_UINT16_ARRAY.
So, changing the type to uint32_t would double each array to 256 bytes.
That value does not seem to cross any threshold; I think that the nvlist and
fm code should be able to deal with the larger size without any issues.
As to the consumers, I could not find any in-tree consumer that specifically
looks for those two fields, bad_set_histogram and bad_cleared_histogram.
Any code that processes nvlists in a generic way should be able to deal with
the type change.  And any code, if there is any at all, that expects those
elements to be DATA_TYPE_UINT16_ARRAY would simply fail to extract the arrays.
So, at the very least, the change should not cause any silent
misinterpretation of the data.
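
To illustrate with a hypothetical out-of-tree consumer (written against the
libnvpair API; the field name is the real payload name, the rest is made up
for the example), a typed lookup either matches the element's type or fails
cleanly:

#include <libnvpair.h>
#include <stdio.h>

/*
 * Hypothetical checksum-ereport consumer.  nvlist lookups are typed,
 * so if the producer switches the payload to DATA_TYPE_UINT32_ARRAY,
 * the uint16 lookup below returns an error instead of misreading the
 * bytes.  A consumer can support both encodings by trying each type.
 */
static void
dump_set_histogram(nvlist_t *ereport)
{
        uint16_t *h16;
        uint32_t *h32;
        uint_t n, i;

        if (nvlist_lookup_uint16_array(ereport,
            "bad_set_histogram", &h16, &n) == 0) {
                for (i = 0; i < n; i++)         /* old uint16_t payload */
                        (void) printf("bit %u: %u\n", i, h16[i]);
        } else if (nvlist_lookup_uint32_array(ereport,
            "bad_set_histogram", &h32, &n) == 0) {
                for (i = 0; i < n; i++)         /* new uint32_t payload */
                        (void) printf("bit %u: %u\n", i, h32[i]);
        }
}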

> On Tue, Oct 17, 2017 at 6:11 AM, Andriy Gapon <[email protected]> wrote:
> 
> 
>     Does anyone know a reason for this assertion in annotate_ecksum() ?
>             ASSERT3U(nui64s, <=, UINT16_MAX);
>     where nui64s = size / sizeof (uint64_t) with size being the size of the
>     corrupted block.  I looked at the code but could not find any obvious
>     reason for the limit.
> 
>     I think that with large block sizes (512KB and greater) that we support
>     now, the assertion can be violated.
> 
>     In fact, I have actually run into this on a raid-z pool where some
>     datasets have recordsize=1m:
> 
>     panic: solaris assert: nui64s <= 0xffff (0x10000 <= 0xffff), file:
>     /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c, line: 555
> 
>     (kgdb) bt
>     #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:318
>     #1  0xffffffff80660c75 in kern_reboot (howto=260) at
>     /usr/src/sys/kern/kern_shutdown.c:386
>     #2  0xffffffff80661473 in vpanic (fmt=<optimized out>,
>     ap=0xfffffe051a4907a0) at /usr/src/sys/kern/kern_shutdown.c:782
>     #3  0xffffffff806614d3 in panic (fmt=<unavailable>) at
>     /usr/src/sys/kern/kern_shutdown.c:715
>     #4  0xffffffff802e4bec in assfail3 (a=<unavailable>, lv=<unavailable>,
>     op=<unavailable>, rv=<unavailable>, f=<unavailable>, l=<optimized out>) at
>     /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:91
>     #5  0xffffffff80396282 in annotate_ecksum (ereport=0xfffff800192692e0,
>     info=0xfffffe051a4908e0, goodbuf=0xfffffe000b5ad000 ...,
>     badbuf=0xfffffe000d12d000 ..., size=524288, drop_if_identical=0)
>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c:555
>     #6  0xffffffff80396733 in zfs_ereport_post_checksum (spa=<optimized out>,
>     vd=<optimized out>, zio=<optimized out>, offset=<optimized out>,
>     length=524288, good_data=0xfffffe000b5ad000, bad_data=<unavailable>,
>     zbc=<unavailable>)
>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c:795
>     #7  0xffffffff80382f4b in raidz_checksum_error (zio=0xfffff80019265000,
>     rc=0xfffff8001b180660, bad_data=0xfffffe000d12d000) at
>     /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:2045
>     #8  0xffffffff803826aa in raidz_parity_verify (zio=<optimized out>,
>     rm=0xfffff8001b180600) at
>     /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:2104
>     #9  0xffffffff803812b1 in vdev_raidz_io_done (zio=0xfffff80019265000) at
>     /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:2402
>     #10 0xffffffff803c3828 in zio_vdev_io_done (zio=0xfffff80019265000) at
>     /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3286
>     #11 0xffffffff803bf7c8 in zio_execute (zio=0xfffff80019265000) at
>     /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1710
>     #12 0xffffffff802e588e in taskq_run_ent (arg=<unavailable>,
>     pending=<unavailable>) at
>     /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c:155
>     #13 0xffffffff806ace21 in taskqueue_run_locked (queue=0xfffff80019e75e00)
>     at /usr/src/sys/kern/subr_taskqueue.c:478
>     #14 0xffffffff806ada48 in taskqueue_thread_loop (arg=<optimized out>) at
>     /usr/src/sys/kern/subr_taskqueue.c:787


-- 
Andriy Gapon
