Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-09-03 Thread Darren J Moffat

On 26/08/2010 15:42, David Magda wrote:

Does a scrub go through the slog and/or L2ARC devices, or only the
primary storage components?


A scrub traverses all datasets, including the ZIL, so the scrub will read 
(and, if needed, resilver) blocks on a slog device too.


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_traverse.c

A scrub does not traverse an L2ARC device because we hold in-memory 
checksums (in the ARC header) for everything on the cache devices; if we 
get a checksum failure on read we remove the L2ARC cached entry and read 
from the main pool again.   The L2ARC cache devices are purely caches: 
there is NEVER data on them that isn't already on the main pool devices.
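
To make that fallback concrete, the following is a minimal, self-contained C sketch. It is illustrative only, not the actual arc.c code, and every name in it (arc_hdr, the l2 fields, the checksum stand-in, the pretend devices) is invented; the point is simply that an in-memory checksum lets a bad L2ARC read be quietly discarded and satisfied from the main pool.

/* Illustrative sketch only -- not the actual ZFS arc.c code.  It models
 * the behaviour described above: the ARC header keeps an in-memory
 * checksum for every block cached on an L2ARC device, so a checksum
 * failure on an L2ARC read simply evicts the cache entry and re-reads
 * from the main pool.  All names here are invented. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLKSZ 16

struct arc_hdr {
    uint64_t l2_cksum;   /* checksum captured when the block was cached */
    int      l2_cached;  /* nonzero if a copy exists on the cache device */
};

static uint64_t blk_sum(const uint8_t *buf, size_t len)  /* fletcher stand-in */
{
    uint64_t c = 0;
    while (len--)
        c = c * 31 + *buf++;
    return c;
}

/* pretend devices: the pool copy is always authoritative */
static uint8_t pool_copy[BLKSZ]  = "main pool data";
static uint8_t cache_copy[BLKSZ] = "main pool data";

static void arc_read(struct arc_hdr *h, uint8_t *buf)
{
    if (h->l2_cached) {
        memcpy(buf, cache_copy, BLKSZ);          /* read the cache device */
        if (blk_sum(buf, BLKSZ) == h->l2_cksum)
            return;                              /* good L2ARC hit */
        h->l2_cached = 0;                        /* evict the bad entry */
    }
    memcpy(buf, pool_copy, BLKSZ);               /* fall back to the pool */
}

int main(void)
{
    struct arc_hdr h = { blk_sum(cache_copy, BLKSZ), 1 };
    uint8_t buf[BLKSZ];

    cache_copy[0] ^= 0xff;      /* simulate silent cache-device corruption */
    arc_read(&h, buf);
    printf("read back: %s (l2_cached=%d)\n", buf, h.l2_cached);
    return 0;
}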


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-27 Thread Bob Friesenhahn

On Thu, 26 Aug 2010, George Wilson wrote:


David Magda wrote:

On Wed, August 25, 2010 23:00, Neil Perrin wrote:

Does a scrub go through the slog and/or L2ARC devices, or only the
primary storage components?


A scrub will go through slogs and primary storage devices. The L2ARC device 
is considered volatile and data loss is not possible should it fail.


What gets scrubbed in the slog?  The slog contains transient 
data which exists for only seconds at a time.  The slog is quite 
likely to be empty at any given point in time.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Hello, 
actually this is bad news. 

I always assumed that the mirror redundancy of the ZIL could also be used to handle 
bad blocks on the ZIL device (just as the main pool's self-healing does for data 
blocks).

I actually don't know how SSDs die; because of the wear-out characteristics 
I can imagine an increased number of bad blocks / bit errors at the EOL of such 
a device - probably undiscovered.

Because the ZIL is write-only, you only find out whether it worked when you need it - 
which is bad. So my suggestion was always to run with one ZIL device during 
pre-production, and add the ZIL mirror two weeks later when production starts. 
This way they don't age exactly the same and zil2 has two more weeks of expected 
lifetime (or even more, assuming the usual heavier writes during stress 
testing). 

I would call this pre-aging. However, if the second ZIL device is not used to recover 
from bad blocks, this does not make a lot of sense.

So I would say there are 2 bugs / missing features here: 

1) the ZIL needs to report truncated transactions on ZIL corruption
2) the ZIL should use its mirrored counterpart to recover from bad block checksums 

Now with OpenSolaris being closed by Oracle and Illumos being just started, I 
don't know how to handle bug openings :) - is bugs.opensolaris.org still 
maintained?

Regards, 
Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
 From: Neil Perrin [mailto:neil.per...@oracle.com]
 
 Hmm, I need to check, but if we get a checksum mismatch then I don't
 think we try other
 mirror(s). This is automatic for the 'main pool', but of course the ZIL
 code is different
 by necessity. This problem can of course be fixed. (It will be  a week
 and a bit before I can
 report back on this, as I'm on vacation).

Thanks...

If indeed that is the behavior, then I would conclude:  
* Call it a bug.  It needs a bug fix.
* Prior to log device removal (zpool version 19) it is critical to mirror the log
device.
* After the introduction of log device removal, and before this bug fix is
available, it is pointless to mirror log devices.
* After this bug fix is introduced, it is again recommended to mirror slogs.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of StorageConcepts
 
 So would say there are 2 bugs / missing features in this:
 
 1) zil needs to report truncated transactions on zilcorruption
 2) zil should need mirrored counterpart to recover bad block checksums

Add to that:

During scrubs, perform some reads on log devices (even if there's nothing to
read).
In fact, during scrubs, perform some reads on every device (even if it's
actually empty).

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock

On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote:
 * After introduction of ldr, before this bug fix is available, it is
 pointless to mirror log devices.

That's a bit of an overstatement.  Mirrored logs protect against a wide variety 
of failure modes.  Neil just isn't sure whether the ZIL code does the right thing for 
checksum errors, which are a very small subset of possible device failure modes.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock

On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:
 
 1) zil needs to report truncated transactions on zilcorruption

As Neil outlined, this isn't possible while preserving current ZIL performance. 
 There is no way to distinguish the last ZIL block without incurring 
additional writes for every block.  Even if it were possible to implement this 
paranoid ZIL tunable, are you willing to take a 2-5x performance hit to be 
able to detect this failure mode?

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Markus Keil
Does that mean that when the beginning of the intent log chain gets corrupted, all
other intent log data after the corrupted area is lost, because the checksum of
the first corrupted block doesn't match? 
 
Regards,
Markus

Neil Perrin neil.per...@oracle.com hat am 23. August 2010 um 19:44
geschrieben:

 This is a consequence of the design for performance of the ZIL code.
 Intent log blocks are dynamically allocated and chained together.
 When reading the intent log we read each block and checksum it
 with the embedded checksum within the same block. If we can't read
 a block due to an IO error then that is reported, but if the checksum does
 not match then we assume it's the end of the intent log chain.
 Using this design means the minimum number of writes needed to add
 an intent log record is just one write.

 So corruption of an intent log is not going to generate any errors.

 Neil.

 On 08/23/10 10:41, StorageConcepts wrote:
  Hello,
 
  we are currently extensively testing the DDRX1 drive for ZIL and we are going
  through all the corner cases.
 
  The headline above all our tests is: do we still need to mirror the ZIL with
  all current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't
  export the pool, and with the latest upstream you can also import a pool with a
  missing ZIL)? This question is especially interesting with RAM based
  devices, because they don't wear out, have a very low bit error rate and use
  one PCIx slot - which are rare. Price is another aspect here :)
 
  During our tests we found a strange behaviour of ZFS ZIL failures which are
  not device related, and we are looking for help from the ZFS gurus here :)
 
  The test in question is called offline ZIL corruption. The question is,
  what happens if my ZIL data is corrupted while a server is transported or
  moved and not properly shut down. For this we do:
 
  - Prepare 2 OS installations (ProductOS and CorruptOS)
  - Boot ProductOS and create a pool and add the ZIL
  - ProductOS: Issue synchronous I/O with an increasing TNX number (and print
  the latest committed transaction)
  - ProductOS: Power off the server and record the last committed transaction
  - Boot CorruptOS
  - Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL
   ~ 300 MB from start of disk, overwriting the first two disk labels)
  - Boot ProductOS
  - Verify that the data corruption is detected by checking the file with the
  transaction number against the one recorded
 
  We ran the test and it seems with modern snv_134 the pool comes up after the
  corruption with all being OK, while ~1 Transactions (this is some
  seconds of writes with DDRX1) are missing and nobody knows about this. We
  ran a scrub and the scrub does not even detect this. ZFS automatically repairs
  the labels on the ZIL, however no error is reported about the missing data.
 
  While it is clear to us that if we do not have a mirrored ZIL the data we
  have overwritten in the ZIL is lost, we are really wondering why ZFS does
  not REPORT this corruption, silently ignoring it instead.
 
  Is this a bug or ... ahem ... a feature :) ?
 
  Regards,
  Robert
    


--
StorageConcepts Europe GmbH
    Storage: Beratung. Realisierung. Support     

Markus Keil            k...@storageconcepts.de
                       http://www.storageconcepts.de
Wiener Straße 114-116  Telefon:   +49 (351) 8 76 92-21
01219 Dresden          Telefax:   +49 (351) 8 76 92-99
Handelregister Dresden, HRB 28281
Geschäftsführer: Robert Heinzmann, Gerd Jelinek
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

If I might add my $0.02: it appears that the ZIL is implemented as a
kind of circular log buffer. As I understand it, when a corrupt checksum
is detected, it is taken to be the end of the log, but this kind of
defeats the checksum's original purpose, which is to detect device
failure. Thus we would first need to change this behavior so the checksum is only
used for failure detection. This leaves the question of how to detect
the end of the log, which I think could be done by using a monotonically
incrementing counter on the ZIL entries. Once we find an entry whose
counter is not the previous counter plus one, we know we have reached the end of the sequence.

Now that we can use checksums to detect device failure, it would be
possible to implement ZIL-scrub, allowing an environment to detect ZIL
device degradation before it actually results in a catastrophe.
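
A hedged C sketch of that idea follows; nothing like this exists in zil.c today, and the block layout and names are invented. The sequence break marks the genuine end of the log, so the checksum is free to act purely as a failure detector and a corrupt mid-chain block can be reported instead of silently ending replay.

/* Hedged sketch of the suggestion above -- nothing like this exists in
 * zil.c today, and the layout and names are invented.  A monotonically
 * increasing sequence number marks the genuine end of the log, so the
 * checksum is free to act purely as a failure detector. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 6
#define PAYLOAD 24

struct log_blk {
    uint64_t seq;              /* monotonically increasing counter */
    uint64_t cksum;            /* embedded checksum over seq + payload */
    char     payload[PAYLOAD];
};

static uint64_t blk_sum(const struct log_blk *b)
{
    uint64_t c = b->seq;
    for (int i = 0; i < PAYLOAD; i++)
        c = c * 131 + (uint8_t)b->payload[i];
    return c;
}

static struct log_blk slog[NBLOCKS];

static void append(int idx, uint64_t seq, const char *rec)
{
    slog[idx].seq = seq;
    snprintf(slog[idx].payload, PAYLOAD, "%s", rec);
    slog[idx].cksum = blk_sum(&slog[idx]);
}

static void replay(void)
{
    uint64_t expect = 1;
    for (int i = 0; i < NBLOCKS; i++) {
        if (blk_sum(&slog[i]) != slog[i].cksum) {
            /* checksum is now purely a failure detector */
            fprintf(stderr, "ZIL block %d corrupt -- data lost\n", i);
            return;
        }
        if (slog[i].seq != expect)
            return;            /* sequence break = the real end of the log */
        printf("replay seq %llu: %s\n",
               (unsigned long long)slog[i].seq, slog[i].payload);
        expect++;
    }
}

int main(void)
{
    append(0, 1, "write A");
    append(1, 2, "write B");
    append(2, 7, "stale leftover block");   /* old log block, still checksums */

    replay();                    /* stops cleanly after B */

    slog[1].payload[0] ^= 0xff;  /* now corrupt a mid-chain block */
    replay();                    /* reported instead of ignored */
    return 0;
}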

- --
Saso

On 08/26/2010 03:22 PM, Eric Schrock wrote:
 
 On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:

 1) zil needs to report truncated transactions on zilcorruption
 
 As Neil outlined, this isn't possible while preserving current ZIL 
 performance.  There is no way to distinguish the last ZIL block without 
 incurring additional writes for every block.  If it's even possible to 
 implement this paranoid ZIL tunable, are you willing to take a 2-5x 
 performance hit to be able to detect this failure mode?
 
 - Eric
 
 --
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx2dUwACgkQRO8UcfzpOHD6QgCfWRBvqYxwKOqrFeaMyQ3nZDVX
Pu0AoJJHPybVT3GqvQbJPL8Xa58aC5P1
=pQJU
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Actually - I can't read the ZFS code, so the next assumptions are more or less based 
on brainware - excuse me in advance :)

How does ZFS detect up-to-date ZILs? With the tnx check of the uberblock 
- right ? 

In our corruption case, we had 2 valid uberblocks at the end and ZFS used 
those to import the pool; this is what the end uberblock is for. OK, so the 
uberblock contains the pointer to the start of the ZIL chain - right ? 

Assume we are adding the tnx number of the current transaction this ZIL entry is part 
of to the blocks written to the ZIL (specially packaged ZIL blocks). So the ZIL 
blocks are a little bit bigger than the data blocks, but the transaction 
count is the same. OK, for SSDs block alignment might be an issue ... agreed. 
For DRAM based ZILs this is not a problem - except for bandwidth.

Logic (see the sketch below): 

On ZIL import, check: 
  - If the pointer to the ZIL chain is empty:
    if yes - clean pool
    if not - we need to replay 

  - Now, if the block the root pointer points to is OK (checksum), the ZIL is 
used and replayed. At the end, the tnx of the last ZIL block must equal the pool tnx. 
If equal, then OK; if not, report an error about missing ZIL parts and switch to the 
mirror (if available). 
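
A rough C sketch of this proposed check; this is NOT existing ZFS behaviour, and every field and function name here (zil_blk, tnx, zil_import, replay_chain) is invented for illustration only. Each log block is stamped with the transaction number it belongs to; at import time the last replayed block is compared against the transaction number recorded for the pool, and the mirror half is consulted if they disagree.

/* Sketch of the proposal above -- not existing ZFS behaviour; all
 * names are invented. */
#include <stdio.h>
#include <stdint.h>

struct zil_blk {
    uint64_t tnx;        /* proposed: txn stamped into each log block */
    int      valid;      /* stands in for "embedded checksum verified" */
    int      next;       /* index of next block in the chain, -1 = end */
};

static uint64_t replay_chain(const struct zil_blk *log, int head)
{
    uint64_t last_tnx = 0;
    for (int i = head; i != -1 && log[i].valid; i = log[i].next)
        last_tnx = log[i].tnx;           /* replay the record, remember txn */
    return last_tnx;
}

static void zil_import(const struct zil_blk *log, int head, uint64_t pool_tnx,
                       const struct zil_blk *mirror)
{
    if (head == -1) {                    /* empty chain pointer */
        printf("clean pool, nothing to replay\n");
        return;
    }
    if (replay_chain(log, head) == pool_tnx) {
        printf("ZIL replayed completely\n");
        return;
    }
    /* the chain ended early: report it instead of staying silent */
    fprintf(stderr, "ZIL truncated -- trying the mirror\n");
    if (mirror != NULL && replay_chain(mirror, head) == pool_tnx)
        printf("recovered from the mirror\n");
    else
        fprintf(stderr, "some committed synchronous writes were lost\n");
}

int main(void)
{
    struct zil_blk good[3]    = {{10, 1, 1}, {11, 1, 2}, {12, 1, -1}};
    struct zil_blk damaged[3] = {{10, 1, 1}, {11, 0, 2}, {12, 1, -1}};

    zil_import(damaged, 0, 12, good);    /* block 1 is bad, mirror saves us */
    return 0;
}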

 As Neil outlined, this isn't possible while
 preserving current ZIL performance.  There is no way
 to distinguish the last ZIL block without incurring
 additional writes for every block.  If it's even
 possible to implement this paranoid ZIL tunable,
 are you willing to take a 2-5x performance hit to be
 able to detect this failure mode?
 
Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Darren J Moffat

On 26/08/2010 15:08, Saso Kiselkov wrote:

If I might add my $0.02: it appears that the ZIL is implemented as a
kind of circular log buffer. As I understand it, when a corrupt checksum


It is NOT circular, since that would imply a limited number of entries that get 
overwritten.



is detected, it is taken to be the end of the log, but this kind of
defeats the checksum's original purpose, which is to detect device
failure. Thus we would first need to change this behavior to only be
used for failure detection. This leaves the question of how to detect
the end of the log, which I think could be done by using a monotonously
incrementing counter on the ZIL entries. Once we find an entry where the
counter != n+1, then we know that the block is the last one in the sequence.


See the comment part way down zil_read_log_block about how we do 
something pretty much like that for checking the chain of log blocks:


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block

This is the checksum in the BP checksum field.

But before we even got there we checked the ZILOG2 checksum as part of 
doing the zio (in zio_checksum_verify() stage):


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error

A ZILOG2 checksum is a version of fletcher4 embedded in the block (at the 
start; the original ZILOG put it at the end).  If that failed - 
i.e. the block was corrupt - we would have returned an error back through 
the dsl_read() of the log block.
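
For readers unfamiliar with embedded checksums, here is a generic, self-contained C illustration of the pattern. It is not the actual fletcher4 or zio_checksum_error() code, and the struct and helper names are invented: the stored checksum lives inside the block, so verification copies it out, zeroes the field, recomputes over the whole block and compares.

/* Generic illustration of an embedded block checksum, in the spirit of
 * the ZILOG/ZILOG2 description above -- NOT the real ZFS code. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLKSZ 64

struct log_blk {
    uint64_t embedded_cksum;             /* ZILOG2 keeps this at the start */
    uint8_t  data[BLKSZ - sizeof(uint64_t)];
};

static uint64_t blk_sum(const void *p, size_t n)   /* fletcher4 stand-in */
{
    const uint8_t *b = p;
    uint64_t c = 0;
    while (n--)
        c = c * 131 + *b++;
    return c;
}

static void blk_fill(struct log_blk *b, const char *rec)
{
    memset(b, 0, sizeof(*b));
    snprintf((char *)b->data, sizeof(b->data), "%s", rec);
    b->embedded_cksum = 0;               /* field excluded from the sum */
    b->embedded_cksum = blk_sum(b, sizeof(*b));
}

static int blk_verify(const struct log_blk *b)
{
    struct log_blk tmp = *b;
    uint64_t expect = tmp.embedded_cksum;
    tmp.embedded_cksum = 0;
    return blk_sum(&tmp, sizeof(tmp)) == expect;   /* 1 = ok, 0 = corrupt */
}

int main(void)
{
    struct log_blk b;

    blk_fill(&b, "one intent log record");
    printf("clean block verifies:   %d\n", blk_verify(&b));
    b.data[3] ^= 0xff;                   /* flip a few bits */
    printf("corrupt block verifies: %d\n", blk_verify(&b));
    return 0;
}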


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread David Magda
On Wed, August 25, 2010 23:00, Neil Perrin wrote:
 On 08/25/10 20:33, Edward Ned Harvey wrote:

 It's commonly stated, that even with log device removal supported, the
 most common failure mode for an SSD is to blindly write without reporting
 any errors, and only detect that the device is failed upon read.  So ...
 If an SSD is in this failure mode, you won't detect it?  At bootup, the
 checksum will simply mismatch, and we'll chug along forward, having lost
 the data ... (nothing can prevent that) ... but we don't know that we've
 lost data?

 - Indeed, we wouldn't know we lost data.

Does a scrub go through the slog and/or L2ARC devices, or only the
primary storage components?

If it doesn't go through these secondary devices, that may be a useful
RFE, as one would ideally want to test the data on every component of a
storage system.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I see, thank you for the clarification. So it would be possible to have
something equivalent to main-storage self-healing on the ZIL, with a ZIL scrub
to activate it. Or is that already implemented as well? (Sorry for asking
these obvious questions, but I'm not familiar with the ZFS source code.)

- --
Saso

On 08/26/2010 04:31 PM, Darren J Moffat wrote:
 On 26/08/2010 15:08, Saso Kiselkov wrote:
 If I might add my $0.02: it appears that the ZIL is implemented as a
 kind of circular log buffer. As I understand it, when a corrupt checksum
 
 It is NOT circular since that implies limited number of entries that get
 overwritten.
 
 is detected, it is taken to be the end of the log, but this kind of
 defeats the checksum's original purpose, which is to detect device
 failure. Thus we would first need to change this behavior to only be
 used for failure detection. This leaves the question of how to detect
 the end of the log, which I think could be done by using a monotonously
 incrementing counter on the ZIL entries. Once we find an entry where the
 counter != n+1, then we know that the block is the last one in the
 sequence.
 
 See the comment part way down zil_read_log_block about how we do
 something pretty much like that for checking the chain of log blocks:
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block
 
 
 This is the checksum in the BP checksum field.
 
 But before we even got there we checked the ZILOG2 checksum as part of
 doing the zio (in zio_checksum_verify() stage):
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error
 
 
 A ZILOG2 checksum is a version of fletcher4 embedded in the block (at the
 start; the original ZILOG put it at the end).  If that failed -
 i.e. the block was corrupt - we would have returned an error back through
 the dsl_read() of the log block.
 

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx2f64ACgkQRO8UcfzpOHA7rACgoyydAq2hO/VIfdknRb09WWGJ
BkwAn2i3nPtWNnfXwyW2089YMb8FRkZP
=YMqL
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson

Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum
does
not match then we assume it's the end of the intent log chain.
Using this design means the minimum number of writes needed to add
an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.


I didn't know that.  Very interesting.  This raises another question ...

It's commonly stated, that even with log device removal supported, the most
common failure mode for an SSD is to blindly write without reporting any
errors, and only detect that the device is failed upon read.  So ... If an
SSD is in this failure mode, you won't detect it?  At bootup, the checksum
will simply mismatch, and we'll chug along forward, having lost the data ...
(nothing can prevent that) ... but we don't know that we've lost data?


If the drive's firmware isn't returning a write error of any kind 
then there isn't much that ZFS can really do here (regardless of whether 
this is an SSD or not). Turning every write into a read/write operation 
would totally defeat the purpose of the ZIL. It's my understanding that 
SSDs will eventually transition to read-only devices once they've 
exceeded their spare reallocation blocks. This should propagate to the 
OS as an EIO, which means that ZFS will instead store the ZIL data on the 
main storage pool.
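
A small C sketch of that fallback, under the simplified picture just described; the names are invented and this is not the real zio/zil allocation path. A *reported* write error on the slog lets the log block be placed in the main pool instead.

/* Hedged sketch only -- invented names, not the real ZFS code path. */
#include <stdio.h>
#include <errno.h>

enum vdev { SLOG, MAIN_POOL };

/* hypothetical low-level writer; pretend the slog now rejects writes */
static int write_log_block(enum vdev dev, const char *rec)
{
    if (dev == SLOG)
        return -EIO;                 /* worn-out SSD has gone read-only */
    printf("log record \"%s\" written to the main pool\n", rec);
    return 0;
}

static int zil_commit_record(const char *rec)
{
    if (write_log_block(SLOG, rec) == 0)
        return 0;
    /* the slog write failed loudly, so we can react: place the log
     * block in the main pool instead */
    return write_log_block(MAIN_POOL, rec);
}

int main(void)
{
    return zil_commit_record("txg 123: O_DSYNC write");
}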




Worse yet ... In preparation for the above SSD failure mode, it's commonly
recommended to still mirror your log device, even if you have log device
removal.  If you have a mirror, and the data on each half of the mirror
doesn't match each other (one device failed, and the other device is good)
... Do you read the data from *both* sides of the mirror, in order to
discover the corrupted log device, and correctly move forward without data
loss?


Yes, we read all sides of the mirror when we claim (i.e. read) the log 
blocks for a log device. This is exactly what a scrub would do for a 
mirrored data device.


- George



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson

David Magda wrote:

On Wed, August 25, 2010 23:00, Neil Perrin wrote:

Does a scrub go through the slog and/or L2ARC devices, or only the
primary storage components?


A scrub will go through slogs and primary storage devices. The L2ARC 
device is considered volatile and data loss is not possible should it fail.


- George
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson

Edward Ned Harvey wrote:


Add to that:

During scrubs, perform some reads on log devices (even if there's nothing to
read).


We do read from log devices if there is data stored on them.

In fact, during scrubs, perform some reads on every device (even if it's
actually empty.)


Reading from the data portion of an empty device wouldn't really show us 
much, as we're going to be reading a bunch of non-checksummed data. The 
best we can do is to probe the device's label region to determine its 
health. This is exactly what we do today.


- George



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Neil Perrin
 
 This is a consequence of the design for performance of the ZIL code.
 Intent log blocks are dynamically allocated and chained together.
 When reading the intent log we read each block and checksum it
 with the embedded checksum within the same block. If we can't read
 a block due to an IO error then that is reported, but if the checksum
 does
 not match then we assume it's the end of the intent log chain.
 Using this design means the minimum number of writes needed to add
 an intent log record is just one write.
 
 So corruption of an intent log is not going to generate any errors.

I didn't know that.  Very interesting.  This raises another question ...

It's commonly stated, that even with log device removal supported, the most
common failure mode for an SSD is to blindly write without reporting any
errors, and only detect that the device is failed upon read.  So ... If an
SSD is in this failure mode, you won't detect it?  At bootup, the checksum
will simply mismatch, and we'll chug along forward, having lost the data ...
(nothing can prevent that) ... but we don't know that we've lost data?

Worse yet ... In preparation for the above SSD failure mode, it's commonly
recommended to still mirror your log device, even if you have log device
removal.  If you have a mirror, and the data on each half of the mirror
doesn't match each other (one device failed, and the other device is good)
... Do you read the data from *both* sides of the mirror, in order to
discover the corrupted log device, and correctly move forward without data
loss?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Neil Perrin

On 08/25/10 20:33, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum
does
not match then we assume it's the end of the intent log chain.
Using this design means we use the minimum number of writes.

So corruption of an intent log is not going to generate any errors.



I didn't know that.  Very interesting.  This raises another question ...

It's commonly stated, that even with log device removal supported, the most
common failure mode for an SSD is to blindly write without reporting any
errors, and only detect that the device is failed upon read.  So ... If an
SSD is in this failure mode, you won't detect it?  At bootup, the checksum
will simply mismatch, and we'll chug along forward, having lost the data ...
(nothing can prevent that) ... but we don't know that we've lost data?
  


- Indeed, we wouldn't know we lost data.


Worse yet ... In preparation for the above SSD failure mode, it's commonly
recommended to still mirror your log device, even if you have log device
removal.  If you have a mirror, and the data on each half of the mirror
doesn't match each other (one device failed, and the other device is good)
... Do you read the data from *both* sides of the mirror, in order to
discover the corrupted log device, and correctly move forward without data
loss?

  


Hmm, I need to check, but if we get a checksum mismatch then I don't 
think we try other
mirror(s). This is automatic for the 'main pool', but of course the ZIL 
code is different
by necessity. This problem can of course be fixed. (It will be a week 
and a bit before I can report back on this, as I'm on vacation).

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum does
not match then we assume it's the end of the intent log chain.
Using this design means the minimum number of writes needed to add
an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.
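
A minimal C sketch of this end-of-chain convention follows. It is not the real zil.c code; the block layout and the names (log_blk, zil_append, zil_replay, the checksum stand-in) are invented for illustration. Each block carries its own embedded checksum and the pointer to the next block, so appending a record is a single write, and a block that fails its checksum is treated as the end of the log rather than reported as an error.

/* Sketch only -- not the real zil.c replay code; names invented. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 4
#define PAYLOAD 24

struct log_blk {
    uint64_t cksum;            /* embedded, covers the payload */
    int      next;             /* index of the next chained block, -1 = none */
    char     payload[PAYLOAD];
};

static uint64_t blk_sum(const char *p, size_t n)   /* fletcher stand-in */
{
    uint64_t c = 0;
    while (n--)
        c = c * 131 + (uint8_t)*p++;
    return c;
}

static struct log_blk slog[NBLOCKS];

static void zil_append(int idx, int next, const char *rec)
{
    /* one write per record: payload, next pointer and checksum all land
     * in the same block, nothing else needs updating */
    snprintf(slog[idx].payload, PAYLOAD, "%s", rec);
    slog[idx].next  = next;
    slog[idx].cksum = blk_sum(slog[idx].payload, PAYLOAD);
}

static void zil_replay(int head)
{
    for (int i = head; i != -1; i = slog[i].next) {
        if (blk_sum(slog[i].payload, PAYLOAD) != slog[i].cksum)
            return;            /* mismatch == assumed end of chain, no error */
        printf("replay: %s\n", slog[i].payload);
    }
}

int main(void)
{
    zil_append(0, 1, "txg 100 write A");
    zil_append(1, 2, "txg 100 write B");
    zil_append(2, -1, "txg 100 write C");

    slog[1].payload[0] ^= 0xff;   /* silent corruption mid-chain */
    zil_replay(0);                /* replays A only and stops quietly */
    return 0;
}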

Neil.

On 08/23/10 10:41, StorageConcepts wrote:
Hello, 

we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases. 


The headline above all our tests is: do we still need to mirror the ZIL with all 
current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't export the pool, 
and with the latest upstream you can also import a pool with a missing ZIL)? This question is 
especially interesting with RAM based devices, because they don't wear out, have a very 
low bit error rate and use one PCIx slot - which are rare. Price is another aspect here :)

During our tests we found a strange behaviour of ZFS ZIL failures which are not device related, and we are looking for help from the ZFS gurus here :) 

The test in question is called offline ZIL corruption. The question is, what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down. For this we do: 


- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL 
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL  
~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the 
transaction number against the one recorded

We ran the test and it seems with modern snv_134 the pool comes up after the 
corruption with all being OK, while ~1 Transactions (this is some seconds 
of writes with DDRX1) are missing and nobody knows about this. We ran a scrub 
and the scrub does not even detect this. ZFS automatically repairs the labels on 
the ZIL, however no error is reported about the missing data.

While it is clear to us that if we do not have a mirrored ZIL the data we have 
overwritten in the ZIL is lost, we are really wondering why ZFS does not REPORT 
this corruption, silently ignoring it instead.

Is this a bug or ... ahem ... a feature :) ?

Regards, 
Robert
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin

On 08/23/10 13:12, Markus Keil wrote:

Does that mean that when the beginning of the intent log chain gets corrupted, all
other intent log data after the corrupted area is lost, because the checksum of
the first corrupted block doesn't match? 
  


- Yes, but you wouldn't want to replay the following entries in case the 
log records in the missing log block were important (e.g. create file).

Mirroring the slogs is recommended to minimise concerns about slog 
corruption.



 
Regards,

Markus

Neil Perrin neil.per...@oracle.com hat am 23. August 2010 um 19:44
geschrieben:

  

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum does
not match then we assume it's the end of the intent log chain.
Using this design means the minimum number of writes needed to add
an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.

Neil.

On 08/23/10 10:41, StorageConcepts wrote:


Hello,

we are currently extensively testing the DDRX1 drive for ZIL and we are going
through all the corner cases.

The headline above all our tests is: do we still need to mirror the ZIL with
all current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't
export the pool, and with the latest upstream you can also import a pool with a
missing ZIL)? This question is especially interesting with RAM based
devices, because they don't wear out, have a very low bit error rate and use
one PCIx slot - which are rare. Price is another aspect here :)

During our tests we found a strange behaviour of ZFS ZIL failures which are
not device related, and we are looking for help from the ZFS gurus here :)

The test in question is called offline ZIL corruption. The question is,
what happens if my ZIL data is corrupted while a server is transported or
moved and not properly shut down. For this we do:

- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print
the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL
 ~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the
transaction number against the one recorded

We ran the test and it seems with modern snv_134 the pool comes up after the
corruption with all being OK, while ~1 Transactions (this is some
seconds of writes with DDRX1) are missing and nobody knows about this. We
ran a scrub and the scrub does not even detect this. ZFS automatically repairs
the labels on the ZIL, however no error is reported about the missing data.

While it is clear to us that if we do not have a mirrored ZIL the data we
have overwritten in the ZIL is lost, we are really wondering why ZFS does
not REPORT this corruption, silently ignoring it instead.

Is this a bug or ... ahem ... a feature :) ?

Regards,
Robert
   
  




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss