That was based on the analysis of the crash dumps from Gernot Straßer;
however, I was able to recreate this on my own test box yesterday
(finally).  I created a script (which I can provide if anyone is
interested) to simulate the workload he was running (essentially,
periodic snapshots on a source system, batched and sent to a backup
system).  I had to make a few tweaks to account for only having a single
system to test with at the moment (I sent the snapshots to a file,
cleared the ARC, then applied the streams, so that send and recv wouldn't
be running at the same time on the box).  After a few variations, the
following steps caused the panic (I'm working on narrowing this down to
something simpler):

create a 40 GB source zvol
populate it (using /dev/urandom)
create the destination volume using zfs send -w | zfs recv dest

then in a loop:
  write some new data to a portion of the source zvol
  snapshot
  after 1-5 snapshots, do either (picked randomly):
    zfs send -cI @last_sent_snap source@most_recent_snap > file
      or
    zfs send -wI @last_sent_snap source@most_recent_snap > file
  (if a raw send) zfs unload-key dest
  zinject -a   # flush the ARC to hopefully simulate things aging out on the destination
  cat file | zfs recv dest
  (if a raw send) zfs load-key dest
  delete the sent snapshots on source and dest
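For the curious, the setup and loop above can be sketched roughly as a
script like the following.  To be clear, this is a hypothetical
reconstruction and not the actual script I ran: the pool/dataset names
(tank/src, tank/dest), the stream path, the write sizes/offsets, and the
alternating raw/compressed choice (instead of a random one) are all
placeholder assumptions.  It defaults to just printing the commands; set
DRYRUN= to actually run them (on a scratch pool only):

```shell
#!/bin/sh
# Hypothetical sketch of the reproduction steps -- names, sizes, and
# paths are placeholders, not the exact script used.
# DRYRUN=1 (the default) prints the commands instead of running them.
DRYRUN=${DRYRUN:-1}
LOG=""

run() {
    LOG="$LOG $* ;"
    echo "+ $*"
    [ -z "$DRYRUN" ] && eval "$*"
    return 0
}

SRC=tank/src            # encrypted source zvol (assumed name)
DST=tank/dest           # destination dataset (assumed name)
STREAM=/var/tmp/stream  # file the send stream is written to

# Setup: create and populate a 40 GB encrypted source zvol, then seed
# the destination with a full raw send.
run "zfs create -V 40G -o encryption=on -o keyformat=passphrase $SRC"
run "dd if=/dev/urandom of=/dev/zvol/rdsk/$SRC bs=1M"
run "zfs snapshot $SRC@snap0"
run "zfs send -w $SRC@snap0 | zfs recv $DST"

last=snap0
for i in 1 2 3 4 5; do
    # write some new data to a portion of the source zvol
    run "dd if=/dev/urandom of=/dev/zvol/rdsk/$SRC bs=1M count=64 seek=$((i * 64)) conv=notrunc"
    run "zfs snapshot $SRC@snap$i"

    # alternate raw (-w) and compressed (-c) incremental sends; the
    # real runs picked randomly and only sent after 1-5 snapshots
    if [ $((i % 2)) -eq 0 ]; then
        run "zfs send -wI @$last $SRC@snap$i > $STREAM"
        run "zfs unload-key $DST"
    else
        run "zfs send -cI @$last $SRC@snap$i > $STREAM"
    fi

    # flush the ARC to simulate cached data aging out on the destination
    run "zinject -a"

    run "zfs recv $DST < $STREAM"
    [ $((i % 2)) -eq 0 ] && run "zfs load-key $DST"

    # clean up the snapshots that have already been sent
    run "zfs destroy $SRC@$last"
    run "zfs destroy $DST@$last"
    last=snap$i
done
```

The key is unloaded on the destination before a raw receive and reloaded
afterwards, matching the steps above.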

I had tried doing nothing but zfs send -cI, then nothing but zfs send
-DI, then nothing but zfs send -wI (starting with a new zvol for each
variant and leaving the destination key loaded the whole time for the
non-raw variants), but none of those seemed to trigger the panic.  It
seems to take a combination of raw and non-raw incremental sends (and
possibly the key loading/unloading) to do it.

Also, the streams that cause a panic when applied on the host don't
report any errors when piped through zstreamdump.  I'll keep digging, but
suggestions/ideas are certainly welcome.

From: Matthew Ahrens <mahr...@delphix.com>
Reply: openzfs-developer <developer@lists.open-zfs.org>
Date: November 28, 2018 at 1:04:28 PM
To: openzfs-developer <developer@lists.open-zfs.org>, Jorgen Lundman
<lund...@lundman.net>, Thomas Caputi <tcap...@datto.com>
Subject: Re: [developer] Panic when receiving from encrypted zpool

On Fri, Nov 16, 2018 at 12:11 PM Jason King <jason.k...@joyent.com> wrote:

> For a small amount of background, I've been trying to help nail down some
> of the problems reported by users trying out the zfs crypto PR on illumos
> distros.  I have one instance here I've been digging into, and I have some
> questions I'm hoping those who've spent far more time in this code than I
> have (so far) might be able to shed some light on.
>
> Specifically, so far it appears there is something surrounding zfs
> send/recv that is causing the checksums to become corrupt.  I'm still
> digging into this, but the short version is: multiple zfs scrubs on the
> source show no issues, yet sending an incremental stream from an
> encrypted zvol to another system fairly easily causes a panic during
> reception:
>

That's awesome that you are able to reproduce a problem!


>
> panic[cpu1]/thread=fffffe000fe5cc20:
> assertion failed: (db = dbuf_hold(dn, blkid, FTAG)) != NULL, file:
> ../../common/fs/zfs/dmu.c, line: 1706
>
> fffffe000fe5caa0 genunix:process_type+16c14c ()
> fffffe000fe5cb10 zfs:dmu_assign_arcbuf_dnode+198 ()
> fffffe000fe5cb80 zfs:receive_write+140 ()
> fffffe000fe5cbc0 zfs:receive_process_record+123 ()
> fffffe000fe5cc00 zfs:receive_writer_thread+88 ()
> fffffe000fe5cc10 unix:thread_start+8 ()
>
> While not obvious from that panic message (it took a bit of dtrace and a
> few laps), that panic is because of the following scenario:
>
> dmu_assign_arcbuf_dnode
>   dbuf_hold (returns NULL, as seen in the VERIFY violation)
>     dbuf_hold_level (returns EIO)
>       dbuf_hold_impl (returns EIO)
>         dbuf_hold_impl (returns EIO)
>           dbuf_hold_impl (returns EIO)
>             dbuf_hold_impl (returns EIO)
>               dbuf_findbp (returns EIO)
>                 dbuf_read (returns EIO)
>                   dbuf_read_verify_dnode_crypt (returns EIO)
>                     arc_untransform (returns EIO)
>                       arc_buf_fill (returns ECKSUM)
>                         arc_fill_hdr_crypt (returns ECKSUM)
>
> Looking at the ZoL code, it appears it would also panic if it tried to
> receive the same stream (the call stack on ZoL would be slightly
> different, since the dbuf_hold args are heap-allocated there instead of
> on the stack, but it's the same general sequence).  It seems like zfs
> recv should be more resilient here: the receive should just error out
> instead of leaving a corrupted zvol on the destination.  A zfs scrub on
> the _destination_ shows errors that it cannot fix, and removing the
> last sent snapshot clears the errors (both systems have ECC RAM, so no
> errors there).  Is this already a known issue?  I didn't see anything
> obvious that looked like this, but then I might not be looking in the
> correct places.
>

I'm not aware of this issue (though hopefully it turns out to be related
to the one that Gernot has been hitting).  I agree that zfs recv should
be more resilient here, and there's the additional question of why we are
getting this error at all, right?  arc_fill_hdr_crypt() is probably
failing because its MAC doesn't match, but that shouldn't happen unless
someone is intentionally trying to attack us with a malicious send
stream.


>
> I’m still digging more into the actual corruption itself, but I wanted to
> at least mention the above issue in the meantime.  Of course if the zfs
> send/recv issue is ringing bells with anyone, I’d love to know that as
> well.   Interestingly enough, it’s always the same blkid in the panic.
>

--matt
*openzfs <https://openzfs.topicbox.com/latest>* / openzfs-developer / see
discussions <https://openzfs.topicbox.com/groups/developer> + participants
<https://openzfs.topicbox.com/groups/developer/members> + delivery options
<https://openzfs.topicbox.com/groups/developer/subscription> Permalink
<https://openzfs.topicbox.com/groups/developer/Ta7b88998bf99fe05-M6eb1dd5111e84360978115e8>
