Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-09-03 Thread Darren J Moffat
On 26/08/2010 15:42, David Magda wrote: Does a scrub go through the slog and/or L2ARC devices, or only the primary storage components? A scrub traverses datasets, including the ZIL, so the scrub will read (and, if needed, resilver) a slog device too.

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-27 Thread Bob Friesenhahn
On Thu, 26 Aug 2010, George Wilson wrote: David Magda wrote: On Wed, August 25, 2010 23:00, Neil Perrin wrote: Does a scrub go through the slog and/or L2ARC devices, or only the primary storage components? A scrub will go through slogs and primary storage devices. The L2ARC device is

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Hello, actually this is bad news. I always assumed that the mirror redundancy of the ZIL can also be used to handle bad blocks on the ZIL device (just as the main pool self-healing does for data blocks). I actually don't know how SSDs die, because of the wear out characteristics I can think of

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
From: Neil Perrin [mailto:neil.per...@oracle.com] Hmm, I need to check, but if we get a checksum mismatch then I don't think we try other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. (It will be 

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of StorageConcepts So I would say there are 2 bugs / missing features in this: 1) ZIL needs to report truncated transactions on ZIL corruption 2) ZIL should need mirrored counterpart to recover

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock
On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote: * After introduction of ldr, before this bug fix is available, it is pointless to mirror log devices. That's a bit of an overstatement. Mirrored logs protect against a wide variety of failure modes. Neil just isn't sure if it does the

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock
On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote: 1) ZIL needs to report truncated transactions on ZIL corruption As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the last ZIL block without incurring additional writes for every

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Markus Keil
Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corruption area is lost, because the checksum of the first corrupted block doesn't match? Regards, Markus Neil Perrin neil.per...@oracle.com wrote on 23 August 2010 at 19:44

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov
If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum is detected, it is taken to be the end of the log, but this kind of defeats the checksum's original purpose,

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Actually - I can't read ZFS code, so the next assumptions are more or less based on brainware - excuse me in advance :) How does ZFS detect up-to-date ZILs? - with the txg check of the uberblock - right? In our corruption case, we had 2 valid uberblocks at the end and ZFS used those to

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Darren J Moffat
On 26/08/2010 15:08, Saso Kiselkov wrote: If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum It is NOT circular since that implies limited number of entries that get overwritten. is detected, it is

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread David Magda
On Wed, August 25, 2010 23:00, Neil Perrin wrote: On 08/25/10 20:33, Edward Ned Harvey wrote: It's commonly stated that, even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and only detect that the device is failed

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov
I see, thank you for the clarification. So it is possible to have something equivalent to main storage self-healing on ZIL, with ZIL-scrub to activate it. Or is that already implemented also? (Sorry for asking these obvious questions, but I'm not

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson
David Magda wrote: On Wed, August 25, 2010 23:00, Neil Perrin wrote: Does a scrub go through the slog and/or L2ARC devices, or only the primary storage components? A scrub will go through slogs and primary storage devices. The L2ARC device is considered volatile and data loss is not possible

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson
Edward Ned Harvey wrote: Add to that: During scrubs, perform some reads on log devices (even if there's nothing to read). We do read from log devices if there is data stored on them. In fact, during scrubs, perform some reads on every device (even if it's actually empty.) Reading from the

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Edward Ned Harvey
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Neil Perrin
On 08/25/10 20:33, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together.

[zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread StorageConcepts
Hello, we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases. The headline above all our tests is: do we still need to mirror the ZIL with all current fixes in ZFS (ZFS can recover from ZIL failure, as long as you don't export the pool, with latest

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin
This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin
On 08/23/10 13:12, Markus Keil wrote: Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corruption area is lost, because the checksum of the first corrupted block doesn't match? - Yes, but you wouldn't want to replay the
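
For readers following the thread, the behaviour Neil describes (blocks chained together, each verified against a checksum embedded in the same block, with the first failure treated as the end of the log) can be sketched roughly as below. This is only an illustration under loud assumptions: it is not the ZFS on-disk format or the real ZIL code, SHA-256 stands in for whatever checksum ZFS actually uses, and every name here (LogBlock, make_block, read_chain, build_disk) is hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogBlock:
    payload: bytes    # the logged record (hypothetical layout)
    next_addr: int    # address where the next block will be written
    checksum: bytes   # checksum embedded in the block itself


def compute_checksum(payload: bytes, next_addr: int) -> bytes:
    # Assumption: SHA-256 over payload plus next pointer, purely for illustration.
    h = hashlib.sha256()
    h.update(payload)
    h.update(str(next_addr).encode())
    return h.digest()


def make_block(payload: bytes, next_addr: int) -> LogBlock:
    return LogBlock(payload, next_addr, compute_checksum(payload, next_addr))


def read_chain(disk: Dict[int, LogBlock], head: int) -> List[bytes]:
    """Walk the chain from head and return the payloads that verify.

    The walk stops at the first block that is missing or fails its
    embedded checksum. The reader cannot tell "end of log" apart from
    "corrupted block"; both look the same from here.
    """
    records: List[bytes] = []
    addr = head
    while True:
        block = disk.get(addr)
        if block is None:
            break  # nothing was ever written at this address
        if compute_checksum(block.payload, block.next_addr) != block.checksum:
            break  # mismatch is taken to be the end of the log
        records.append(block.payload)
        addr = block.next_addr
    return records


def build_disk() -> Dict[int, LogBlock]:
    # Three chained blocks; the last one points at address 3, which is unwritten.
    return {
        0: make_block(b"write A", 1),
        1: make_block(b"write B", 2),
        2: make_block(b"write C", 3),
    }
```

In this sketch, corrupting the payload of the block at address 1 makes read_chain return only the first record, and corrupting the block at address 0 makes it return nothing, which is the "everything after the corruption is lost" case Markus asks about. Whether a mirrored slog copy is consulted on a mismatch is exactly the open question in the thread, so it is deliberately not modelled here.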