9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

Reviewed by: George Wilson <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>

This assertion (VERIFY) failure was reported when reading a block. It turns out that if we get an i/o error (ECKSUM in this case) while there are multiple concurrent ARC reads of the same block (from different clones), the ARC will put multiple bufs on the same ANON hdr, which isn't supposed to happen, and we then panic when we try to arc_buf_destroy() one of the bufs.
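
For reference, the check that fires is the one-buf invariant on an anonymous hdr in arc_buf_destroy(); the following is paraphrased and abbreviated from usr/src/uts/common/fs/zfs/arc.c, so the exact lines may differ:

```
/* Paraphrased from arc.c; abbreviated, not the verbatim source. */
void
arc_buf_destroy(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	if (hdr->b_l1hdr.b_state == arc_anon) {
		/* An anonymous hdr must carry exactly one buf... */
		ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
		ASSERT(!HDR_IO_IN_PROGRESS(hdr));
		/*
		 * ...so dropping our reference must leave zero.  With a
		 * second buf erroneously attached, remove_reference()
		 * returns nonzero and this VERIFY panics (arc.c:3125
		 * in the stack below).
		 */
		VERIFY0(remove_reference(hdr, NULL, tag));
		arc_hdr_destroy(hdr);
		return;
	}
	/* ... hashed-state path elided ... */
}
```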

```
Jan 12 18:39:29 TFGELSVMLXDEL01 ^Mpanic[cpu12]/thread=fffffe4229738c40:
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 603766 kern.notice] assertion failed: remove_reference(hdr, 0L, tag) == 0 (0x2 == 0x0), file: ../../common/fs/zfs/arc.c, line: 3125
Jan 12 18:39:29 TFGELSVMLXDEL01 unix: [ID 100000 kern.notice]
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe42297388b0 genunix:strlog+0 ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe42297388f0 zfs:arc_buf_destroy+d7 ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe4229738940 zfs:dbuf_read_done+93 ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe42297389a0 zfs:arc_read_done+198 ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe4229738a40 zfs:zio_done+3f2 ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe4229738a70 zfs:zio_execute+7f ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe4229738b30 genunix:taskq_thread+2d0 ()
Jan 12 18:39:29 TFGELSVMLXDEL01 genunix: [ID 655072 kern.notice] fffffe4229738b40 unix:thread_start+8 ()

fffffe422a5b2c40::findstack -v

stack pointer for thread fffffe422a5b2c40: fffffe422a5b24f0
[ fffffe422a5b24f0 _resume_from_idle+0xf4() ]
fffffe422a5b2520 swtch+0x141()
fffffe422a5b2560 cv_wait+0x70(fffffea8a7bc13e0, fffffea8a7bc13d8)
fffffe422a5b25a0 zio_wait+0x6b(fffffea8a7bc1068)
fffffe422a5b2650 dmu_buf_hold_array_by_dnode+0x163(fffffeaac781a780, 853d5dc0000, 2000, 1, fffffffff7ac7650, fffffe422a5b269c, fffffe422a5b2690, 0)
fffffe422a5b26f0 dmu_read_uio_dnode+0x5a(fffffeaac781a780, fffffe422a5b2940, 2000)
fffffe422a5b2750 dmu_read_uio+0x5b(fffffea4cd1f0ac0, 1, fffffe422a5b2940, 2000)
fffffe422a5b27c0 zvol_read+0x10a(1130000016e, fffffe422a5b2940, fffffea323cacdb0)
fffffe422a5b27f0 cdev_read+0x2d(1130000016e, fffffe422a5b2940, fffffea323cacdb0)
fffffe422a5b2890 spec_read+0x2b9(fffffea4d53a6580, fffffe422a5b2940, 0, fffffea323cacdb0, 0)
fffffe422a5b2910 fop_read+0x5b(fffffea4d53a6580, fffffe422a5b2940, 0, fffffea323cacdb0, 0)
fffffe422a5b29f0 vn_rdwr+0x10a(0, fffffea4d53a6580, fffffea4cccb6000, 2000, 853d5dc0000, 1, 0, fffffffffffffffd, fffffea323cacdb0, fffffe422a5b2a48)
fffffe422a5b2ab0 sbd_data_read+0x153(fffffeab1ea9d998, fffffea3ca9fc800, 853d5dc0000, 2000, fffffea4cccb6000)
fffffe422a5b2b20 sbd_handle_read+0x1ce(fffffea3ca9fc800, 0)
fffffe422a5b2b80 sbd_new_task+0x60b(fffffea3ca9fc800, 0)
fffffe422a5b2c20 stmf_worker_task+0x2fd(fffffea43f5d80f0)
fffffe422a5b2c30 thread_start+8()

fffffe422ce63c40::findstack -v

stack pointer for thread fffffe422ce63c40: fffffe422ce634f0
[ fffffe422ce634f0 _resume_from_idle+0xf4() ]
fffffe422ce63520 swtch+0x141()
fffffe422ce63560 cv_wait+0x70(fffffea8ad269f00, fffffea8ad269ef8)
fffffe422ce635a0 zio_wait+0x6b(fffffea8ad269b88)
fffffe422ce63650 dmu_buf_hold_array_by_dnode+0x163(fffffeaac666ac50, 853d5dc0000, 2000, 1, fffffffff7ac7650, fffffe422ce6369c, fffffe422ce63690, 0)
fffffe422ce636f0 dmu_read_uio_dnode+0x5a(fffffeaac666ac50, fffffe422ce63940, 2000)
fffffe422ce63750 dmu_read_uio+0x5b(fffffea4cd07ae00, 1, fffffe422ce63940, 2000)
fffffe422ce637c0 zvol_read+0x10a(11300000162, fffffe422ce63940, fffffea323cacdb0)
fffffe422ce637f0 cdev_read+0x2d(11300000162, fffffe422ce63940, fffffea323cacdb0)
fffffe422ce63890 spec_read+0x2b9(fffffea3616f9e80, fffffe422ce63940, 0, fffffea323cacdb0, 0)
fffffe422ce63910 fop_read+0x5b(fffffea3616f9e80, fffffe422ce63940, 0, fffffea323cacdb0, 0)
fffffe422ce639f0 vn_rdwr+0x10a(0, fffffea3616f9e80, fffffea4c9c9d000, 2000, 853d5dc0000, 1, 0, fffffffffffffffd, fffffea323cacdb0, fffffe422ce63a48)
fffffe422ce63ab0 sbd_data_read+0x153(fffffea4c1af5998, fffffea52401f000, 853d5dc0000, 2000, fffffea4c9c9d000)
fffffe422ce63b20 sbd_handle_read+0x1ce(fffffea52401f000, 0)
fffffe422ce63b80 sbd_new_task+0x60b(fffffea52401f000, 0)
fffffe422ce63c20 stmf_worker_task+0x2fd(fffffea43f5d8230)
fffffe422ce63c30 thread_start+8()
```

Note that both threads are reading the same offset (0x853d5dc0000) of different dnodes (both object #1 of different ZVOLs). Because the ZVOLs are clones, both reads resolve to the same BP and therefore to the same ARC hdr.

The typical code path is that the 2nd arc_read() of this BP will go through:
```
if (hdr != NULL && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_pabd != NULL) {
...
        if (HDR_IO_IN_PROGRESS(hdr)) {
...
            if (done) {
... add new acb to existing hdr
                hdr->b_l1hdr.b_acb = acb;
```
and then, when the i/o completes, arc_read_done() will do:
```
for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
...
        int error = arc_buf_alloc_impl(hdr, acb->acb_private,
            acb->acb_compressed, no_zio_error, &acb->acb_buf);
...
    if (no_zio_error) {
...
    } else {
...
            arc_change_state(arc_anon, hdr, hash_lock);
...
    while ((acb = callback_list) != NULL) {
        if (acb->acb_done)
            acb->acb_done(zio, acb->acb_buf, acb->acb_private);
```
The acb_done() is dbuf_read_done(), which calls arc_buf_destroy() if there was an i/o error.

There is also a similar code path in arc_read() that should be able to hit this, but it is a narrower race. In arc_read_done(), we do:
```
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
...
    cv_broadcast(&hdr->b_l1hdr.b_cv);
if (hash_lock != NULL) {
        mutex_exit(hash_lock);
```
so a concurrent arc_read() can see the hdr as not IO_IN_PROGRESS while the hdr is still ANON, and can do:
```
if (hdr != NULL && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_pabd != NULL) {
...
        if (HDR_IO_IN_PROGRESS(hdr)) {	/* FALSE */
...
        if (done) {
...
            VERIFY0(arc_buf_alloc_impl(hdr, private,
                compressed_read, B_TRUE, &buf));
```
arc_read_done() then calls acb_done(), i.e. dbuf_read_done(), which calls arc_buf_destroy() if there was an i/o error; but now there are two bufs on the hdr, and arc_buf_destroy() panics as shown above.
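
To make the interleaving concrete, here is a small stand-alone C sketch of the invariant being violated (not ZFS code; toy_hdr_t, attach_buf(), and destroy_buf() are invented for illustration): two readers each attach a buf to one anonymous header, and the error-path destroy then trips the one-buf check, just like the VERIFY in arc_buf_destroy():

```
/* Stand-alone toy model of the bug; compile with: cc -o race race.c */
#include <assert.h>

typedef struct toy_hdr {
	int bufcnt;		/* number of bufs sharing this hdr */
} toy_hdr_t;

/* Analogue of arc_buf_alloc_impl(): attach one more buf to the hdr. */
static void
attach_buf(toy_hdr_t *hdr)
{
	hdr->bufcnt++;
}

/* Analogue of arc_buf_destroy() on an ANON hdr: exactly one buf allowed. */
static void
destroy_buf(toy_hdr_t *hdr)
{
	assert(hdr->bufcnt == 1);	/* the check that fires */
	hdr->bufcnt--;
}

int
main(void)
{
	toy_hdr_t hdr = { 0 };

	attach_buf(&hdr);	/* 1st reader's callback gets a buf */
	attach_buf(&hdr);	/* 2nd concurrent reader: bufcnt == 2 */
	destroy_buf(&hdr);	/* error-path destroy: assert fires */
	return (0);
}
```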

Possible ways to address this:

1. pass another flag to `arc_buf_alloc_impl` to indicate whether there was an error or not

2. have arc_read() / arc_read_done() allocate a whole new anon hdr if there was an error

3. change the interface to not pass a buf to the callback (to the DMU) when there's an i/o error (a sketch follows this list)
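
For concreteness, here is a minimal sketch of option 3 against the arc_read_done() loop quoted above. This is hypothetical, not the committed diff: allocate a buf for a callback only when the zio succeeded, and hand the callback a NULL buf on error, so dbuf_read_done() never has an extra buf to destroy:

```
/* Hypothetical sketch of option 3; not the committed change. */
for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
	if (no_zio_error) {
		int error = arc_buf_alloc_impl(hdr, acb->acb_private,
		    acb->acb_compressed, B_TRUE, &acb->acb_buf);
		if (error != 0)
			zio->io_error = error;
	} else {
		/* On i/o error, don't materialize a buf at all. */
		acb->acb_buf = NULL;
	}
}
...
while ((acb = callback_list) != NULL) {
	/* Callbacks (e.g. dbuf_read_done()) must tolerate a NULL buf. */
	if (acb->acb_done)
		acb->acb_done(zio, acb->acb_buf, acb->acb_private);
	...
}
```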

Upstream bug: DLPX-56684 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error
You can view, comment on, or merge this pull request online at:

  https://github.com/openzfs/openzfs/pull/601

-- Commit Summary --

  * 9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

-- File Changes --

    M usr/src/uts/common/fs/zfs/arc.c (53)
    M usr/src/uts/common/fs/zfs/dbuf.c (31)
    M usr/src/uts/common/fs/zfs/zio_compress.c (18)

-- Patch Links --

https://github.com/openzfs/openzfs/pull/601.patch
https://github.com/openzfs/openzfs/pull/601.diff
