Hello, 

we are currently extensively testing the DDRX1 drive as a ZIL device and are 
going through all the corner cases. 

The headline question above all our tests is: do we still need to mirror the 
ZIL, given all the current fixes in ZFS (ZFS can recover from a ZIL failure as 
long as you don't export the pool, and with the latest upstream you can also 
import a pool with a missing ZIL)? This question is especially interesting with 
RAM-based devices, because they don't wear out, have a very low bit error rate, 
and each one uses a PCIx slot, which are rare. Price is another aspect here :)

During our tests we found a strange behaviour in how ZFS handles ZIL failures 
that are not device related, and we are looking for help from the ZFS gurus 
here :) 

The test in question is called "offline ZIL corruption". The question is: what 
happens if my ZIL data gets corrupted while a server is transported or moved 
and was not properly shut down? For this we do the following: 

- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS, create a pool and add the ZIL 
- ProductOS: Issue synchronous I/O with an increasing transaction number (and 
print the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL .... 
~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the 
transaction number against the one recorded
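
For illustration, the writer and the final check from the steps above could be 
sketched roughly like this in Python. This is only a sketch under assumptions, 
not necessarily the tool used in the tests: the function names, the file path 
and the fixed-width record format are hypothetical, and it assumes the printed 
numbers are recorded somewhere outside the server (e.g. a remote console) so 
they survive the power-off.

```python
import os


def write_counter(path, start, count, report=print):
    """Synchronously write increasing transaction numbers to `path`.

    Each number is reported only after fsync() has returned, so every
    reported number should be on stable storage (on ZFS, a synchronous
    write like this is what goes through the ZIL).
    Returns the last number that was committed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    last = None
    try:
        for txn in range(start, start + count):
            # Fixed-width record so each write replaces the previous one in place.
            record = ("%020d\n" % txn).encode("ascii")
            os.pwrite(fd, record, 0)
            os.fsync(fd)  # block until the write is reported as stable
            last = txn
            report(txn)   # hypothetical hook: print/log the committed number
    finally:
        os.close(fd)
    return last


def read_counter(path):
    """Read back the transaction number stored in the file."""
    with open(path, "rb") as f:
        return int(f.read().strip())


def missing_transactions(recorded, found):
    """Number of committed transactions that are no longer in the file."""
    return max(0, recorded - found)
```

In the real test the writer would simply run until the power is cut; after 
booting ProductOS again, `missing_transactions(recorded, read_counter(path))` 
should be 0 if nothing was lost, and anything greater than 0 is data that was 
acknowledged as committed but is gone.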

We ran the test, and it seems that with a recent snv_134 the pool comes up 
after the corruption with everything reported as OK, while ~10000 transactions 
(a few seconds of writes with the DDRX1) are missing and nobody knows about it. 
We ran a scrub, and even the scrub does not detect this. ZFS automatically 
repairs the labels on the ZIL, but no error is reported about the missing data.

While it is clear to us that without a mirrored ZIL the data we have 
overwritten in the ZIL is lost, we are really wondering why ZFS does not REPORT 
this corruption and instead silently ignores it.

Is this a bug or .. aehm ... a feature :) ?

Regards, 
Robert
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss