That was based on the analysis of the crash dumps from Gernot Straßer; however, I was able to recreate this on my own test box yesterday (finally). I created a script (which I can provide if anyone is interested) to try to simulate the same workload he was running (essentially periodic snapshots on a source system, batched and sent to a backup system). I had to make a few tweaks to account for only having a single system to test with at the moment (I sent the snapshots to a file, cleared the ARC cache, then applied the streams, so that send and recv wouldn't both be running at the same time on the box). After a few variations, the following steps caused the panic (I'm working on trying to narrow this down to something simpler):
Create a 40Gb source zvol and populate it (using /dev/urandom), create the destination volume using zfs send -w | zfs recv dest, then in a loop:

1. Write some new data to a portion of the source zvol.
2. Snapshot.
3. After 1-5 snapshots, do either (picked randomly):
       zfs send -cI @last_sent_snap source@most_recent_snap > file
   or
       zfs send -wI @last_sent_snap source@most_recent_snap > file
4. (If a raw send) zfs unload-key dest
5. zinject -a    # flush the ARC to hopefully simulate things aging out on the destination
6. cat file | zfs recv dest
7. (If a raw send) zfs load-key dest
8. Delete the sent snapshots on source and dest.

I had tried just doing nothing but zfs send -cI, then nothing but zfs send -DI, then nothing but zfs send -wI (starting with a new zvol for each variant and leaving the destination key loaded the whole time for the non-raw send variants), but that didn't seem to trigger the panic. It seems to take a combination of raw and non-raw incremental sends (and possibly the key loading/unloading) to do it. Also, the streams that cause a panic when applied on the host don't report any errors when piped through zstreamdump.

I'll continue to keep digging, but suggestions/ideas are certainly welcome.

From: Matthew Ahrens <mahr...@delphix.com>
Reply: openzfs-developer <developer@lists.open-zfs.org>
Date: November 28, 2018 at 1:04:28 PM
To: openzfs-developer <developer@lists.open-zfs.org>, Jorgen Lundman <lund...@lundman.net>, Thomas Caputi <tcap...@datto.com>
Subject: Re: [developer] Panic when receiving from encrypted zpool

On Fri, Nov 16, 2018 at 12:11 PM Jason King <jason.k...@joyent.com> wrote:

> For a small amount of background, I've been trying to help nail down some of the reported problems from users trying out the zfs crypto PR on illumos distros.
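In script form, the loop above amounts to roughly the following sketch (pool/dataset names, snapshot naming, and the DRY_RUN guard are placeholders of mine, not the exact script; by default it only echoes the commands, and it does one send per snapshot rather than batching 1-5):

```shell
#!/bin/sh
# Sketch of the repro loop described above. Names are placeholders.
DRY_RUN=${DRY_RUN:-1}   # set to 0 only on a disposable test box

zrun() {
    # Echo the command instead of running it when dry-running.
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

SRC=tank/src    # 40Gb source zvol, pre-populated from /dev/urandom
DST=tank/dst    # created initially via: zfs send -w ... | zfs recv tank/dst
last=snap0
n=1

while [ "$n" -le 5 ]; do
    # Write some new data into a portion of the source zvol.
    zrun dd if=/dev/urandom of=/dev/zvol/rdsk/$SRC bs=1m count=64 conv=notrunc
    zrun zfs snapshot $SRC@snap$n

    # Alternate raw (-w) and compressed (-c) incremental sends here for
    # determinism; the real runs picked randomly.
    if [ $((n % 2)) -eq 0 ]; then
        flags=-wI; raw=1
    else
        flags=-cI; raw=0
    fi
    zrun sh -c "zfs send $flags @$last $SRC@snap$n > /var/tmp/stream"

    [ "$raw" -eq 1 ] && zrun zfs unload-key $DST
    zrun zinject -a    # flush the ARC to simulate data aging out
    zrun sh -c "cat /var/tmp/stream | zfs recv $DST"
    [ "$raw" -eq 1 ] && zrun zfs load-key $DST

    # Delete the snapshots that have been sent.
    zrun zfs destroy $SRC@$last
    zrun zfs destroy $DST@$last
    last=snap$n
    n=$((n + 1))
done
```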
> I have one instance here I've been digging into, and have some questions I'm hoping those who've spent far more time in this code than I have (so far) might be able to shed some light on.
>
> Specifically, so far it appears there is something surrounding zfs send/recv that is causing the checksums to become corrupt. I'm still digging into this further, but the short version is: multiple zfs scrubs on the source show no issues, but sending an incremental stream from an encrypted zvol to another system fairly easily causes a panic during reception:

That's awesome that you are able to reproduce a problem!

> panic[cpu1]/thread=fffffe000fe5cc20:
> assertion failed: (db = dbuf_hold(dn, blkid, FTAG)) != NULL, file: ../../common/fs/zfs/dmu.c, line: 1706
>
> fffffe000fe5caa0 genunix:process_type+16c14c ()
> fffffe000fe5cb10 zfs:dmu_assign_arcbuf_dnode+198 ()
> fffffe000fe5cb80 zfs:receive_write+140 ()
> fffffe000fe5cbc0 zfs:receive_process_record+123 ()
> fffffe000fe5cc00 zfs:receive_writer_thread+88 ()
> fffffe000fe5cc10 unix:thread_start+8 ()
>
> While not obvious from that panic message (it took a bit of dtrace and a few laps), the panic is caused by the following scenario:
>
> dmu_assign_arcbuf_dnode
>   dbuf_hold (returns NULL, as seen in the VERIFY violation)
>     dbuf_hold_level (returns EIO)
>       dbuf_hold_impl (returns EIO)
>         dbuf_hold_impl (returns EIO)
>           dbuf_hold_impl (returns EIO)
>             dbuf_hold_impl (returns EIO)
>               dbuf_findbp (returns EIO)
>                 dbuf_read (returns EIO)
>                   dbuf_read_verify_dnode_crypt (returns EIO)
>                     arc_untransform (returns EIO)
>                       arc_buf_fill (returns ECKSUM)
>                         arc_fill_hdr_crypt (returns ECKSUM)
>
> Looking at the ZoL code, it would appear it would also panic if it tried to receive the same stream (the call stack for ZoL would be slightly different, since the dbuf_hold args are heap-allocated there instead of on the stack, but it's the same general sequence).
> It seems like zfs recv should be more resilient here: the receive should just error out instead of leaving a corrupted zvol on the destination. A zfs scrub on the _destination_ will show errors that it cannot fix; removing the last sent snapshot clears the errors (both systems have ECC RAM, so no errors there). Is this already a known issue? I didn't see anything obvious that looked like this, but then I might not be looking in the correct places.

I'm not aware of this issue (though hopefully it turns out to be related to the one that Gernot has been hitting). I agree that the zfs recv should be more resilient here, and there's an additional problem of why we are getting this error at all, right? arc_fill_hdr_crypt() is probably failing because its MAC doesn't match, but I think that shouldn't happen unless someone is intentionally trying to attack us with a malicious send stream.

> I'm still digging more into the actual corruption itself, but I wanted to at least mention the above issue in the meantime. Of course, if the zfs send/recv issue is ringing bells with anyone, I'd love to know that as well.

Interestingly enough, it's always the same blkid in the panic.

--matt
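(For anyone who wants to repeat the zstreamdump sanity check mentioned earlier in the thread, a rough sketch is below; the helper name and the guard for machines without the tool installed are my own additions. Note that the panic-inducing streams passed this check cleanly, so it rules out only gross stream corruption.)

```shell
#!/bin/sh
# Sketch: sanity-check a saved send stream before applying it.
# zstreamdump parses the stream records and exits nonzero on a
# malformed stream.
check_stream() {
    if command -v zstreamdump >/dev/null 2>&1; then
        zstreamdump -v < "$1" > /dev/null || return 1
    else
        # Not on a system with the ZFS tools; just note the skip.
        echo "zstreamdump not available; skipping check of $1"
    fi
    return 0
}

check_stream /var/tmp/stream && echo "stream check passed (or skipped)"
```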