Re: [zfs-discuss] How Virtual Box handles the IO

2009-08-05 Thread Thomas Burgess
From what I understand, and from the threads I've followed here, there are
ways to do it, but there is no standardized tool yet; it's complicated and
handled on a per-case basis. People who pay for support have recovered
pools, though.

I'm sure they are working on it, and I would imagine it's a major goal.

On Wed, Aug 5, 2009 at 1:23 AM, James Hess no-re...@opensolaris.org wrote:

 So much for the "it's a consumer hardware problem" argument.
 I, for one, have to count it as a major drawback of ZFS that it doesn't
 provide a mechanism to get something of your pool back, by reconstruction
 or reversion, when a failure leaves the metadata inconsistent.

 A policy of data integrity taken to the extreme of blocking access to good
 data is not something OS users want.

 Users don't put up with this sort of thing from other filesystems...  some
 sort of improvement here is sorely needed.

 ZFS ought to retain enough information, and make an effort, to bring
 pool metadata back to a consistent state, even if that means some loss
 of data: a file may have to revert to an older state, or a file that
 was undergoing changes may now be unreadable because the log was
 inconsistent.

 This holds even if the user has to run zpool import with a
 recovery-mode option or something of that nature.

 It beats losing a TB of data on the pool that should be otherwise intact.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] How Virtual Box handles the IO

2009-08-04 Thread James Hess
So much for the "it's a consumer hardware problem" argument.
I, for one, have to count it as a major drawback of ZFS that it doesn't
provide a mechanism to get something of your pool back, by reconstruction
or reversion, when a failure leaves the metadata inconsistent.

A policy of data integrity taken to the extreme of blocking access to good data 
is not something OS users want.

Users don't put up with this sort of thing from other filesystems...  some sort 
of improvement here is sorely needed.

ZFS ought to retain enough information, and make an effort, to bring pool
metadata back to a consistent state, even if that means some loss of data:
a file may have to revert to an older state, or a file that was undergoing
changes may now be unreadable because the log was inconsistent.

This holds even if the user has to run zpool import with a recovery-mode
option or something of that nature.

It beats losing a TB of data on the pool that should be otherwise intact.


Re: [zfs-discuss] How Virtual Box handles the IO

2009-07-31 Thread Richard Elling

Thanks for following up with this, Russel.

On Jul 31, 2009, at 7:11 AM, Russel wrote:


After all the discussion here about VB, and all the finger pointing
I raised a bug on VB about flushing.

Remember I am using RAW disks via the SATA emulation in VB;
the disks are WD 2TB drives. Also remember the HOST machine
NEVER crashed or stopped, BUT the guest OS (OpenSolaris) was
hung, so I powered off the VIRTUAL host.

OK, this is what the VB engineer had to say after reading this and
another thread I had pointed him to. (He missed the fact I was
using RAW, not surprising as it's a rather long thread now!)

===
Just looked at those two threads, and from what I saw all vital  
information is missing - no hint whatsoever on how the user set up  
his disks, nothing about what errors should be dealt with and so on.  
So hard to say anything sensible, especially as people seem most  
interested in assigning blame to some product. ZFS doesn't deserve  
this, and VirtualBox doesn't deserve this either.


In the first place, there is absolutely no difference in how the IDE
and SATA devices handle the flush command. The documentation just
wasn't updated to talk about the SATA controller. Thanks for
pointing this out; it will be fixed in the next major release. If
you want the information straight away: just replace "piix3ide"
with "ahci", and all the other flushing behavior settings apply as
well. See a bit further below for what it buys you (or not).


What I haven't mentioned is the rationale behind the current  
behavior. The reason for ignoring flushes is simple: the biggest  
competitor does it by default as well, and one gets beaten up by  
every reviewer if VirtualBox is just a few percent slower than you  
know what. Forget about arguing with reviewers.


That said, a bit about what flushing can achieve - or not. Just keep  
in mind that VirtualBox doesn't really buffer anything. In the IDE  
case every read and write requests gets handed more or less straight  
(depending on the image format complexity) to the host OS. So there  
is absolutely nothing which can be lost if one assumes the host OS  
doesn't crash.


In the SATA case things are slightly more complicated. If you're  
using anything but raw disks or flat file VMDKs, the behavior is  
100% identical to IDE. If you use raw disks or flat file VMDKs, we  
activate NCQ support in the SATA device code, which means that the  
guest can push through a number of commands at once, and they get  
handled on the host via async I/O. Again - if the host OS works  
reliably there is nothing to lose.


The problem with this thought process is that since the data is not
on the medium, a fault that occurs between the flush request and
the bogus ack goes undetected. When the disk says the data is on
the medium, the OS trusts that it is there with no errors.

This problem also affects hardware RAID arrays which provide
nonvolatile caches.  If the array acks a write and flush, but the
data is not yet committed to medium, then if the disk fails, the
data must remain in nonvolatile cache until it can be committed
to the medium. A use case may help: suppose the power goes
out. Most arrays have enough battery to last for some time, but
if power isn't restored before the batteries discharge, then
there is a risk of data loss.

For ZFS, cache flush requests are not gratuitous. One critical
case is the uberblock or label update. ZFS does:
1. update labels 0 and 2
2. flush
3. check for errors
4. update labels 1 and 3
5. flush
6. check for errors

Making flush be a nop destroys the ability to check for errors
thus breaking the trust between ZFS and the data on medium.
 -- richard



The only thing flushing can potentially improve is the behavior
when the host OS crashes. But that depends on many assumptions about
what the respective OS does, what the filesystems do, etc.


Hope those facts can be the basis of a real discussion. Feel free to  
raise any issue you have in this context, as long as it's not purely  
hypothetical.


===


Re: [zfs-discuss] How Virtual Box handles the IO

2009-07-31 Thread Frank Middleton

Great to hear a few success stories! We have been experimentally
running ZFS on really crappy hardware and it has never lost a
pool. Running on VB with ZFS/iscsi raw disks we have yet to see
any errors at all. On sun4u with lsi sas/sata it is really rock
solid. And we've been going out of our way to break it because of
bad experiences with ntfs, ext2 and UFS as well as many disk
failures (ever had fsck run amok?).

On 07/31/09 12:11 PM, Richard Elling wrote:


Making flush be a nop destroys the ability to check for errors
thus breaking the trust between ZFS and the data on medium.
-- richard


Can you comment on the issue that the underlying disks were,
as far as we know, never powered down? My understanding is
that disks usually try to flush their caches as quickly as
possible to make room for more data. In this scenario things
were probably quiet after the guest crash, so whatever was in
the cache would likely have been flushed anyway, certainly by
the time the OP restarted VB and the guest.

Could you also comment on CR 6667683, which I believe is proposed
as a solution for recovery in this very rare case? I understand
that the ZILs are allocated out of the general pool. Is there
a ZIL for the ZILs, or does this make no sense?

As the one who started the whole ECC discussion, I don't think
anyone has ever claimed that lack of ECC caused this loss of
a pool or that it could. AFAIK lack of ECC can't be a problem
at all on RAIDZ vdevs, only with single drives or plain mirrors.
I've suggested an RFE for the mirrored case to double-buffer
the writes, but disabling checksums pretty much
fixes the problem if you don't have ECC, so it isn't worth
pursuing. You can disable checksum per file system, so this
is an elegant solution if you don't have ECC memory but
you do mirror. No mirror IMO is suicidal with any file system.

Has anyone ever actually lost a pool on Sun hardware other than
by losing too many replicas or operator error? As you have so
eloquently pointed out, building a reliable storage system is
an engineering problem. There are a lot of folks out there who
are very happy with ZFS on decent hardware. On crappy hardware
you get what you pay for...

Cheers -- Frank (happy ZFS evangelist)


Re: [zfs-discuss] How Virtual Box handles the IO

2009-07-31 Thread Neil Perrin

I understand  that the ZILs are allocated out of the general pool.


There is one intent log chain per dataset (file system or zvol).
The head of each log chain is kept in the main pool.
Without slog(s) we allocate (and chain) blocks from the
main pool. If separate intent log(s) exist, then blocks are allocated
and chained there. If we fail to allocate from the
slog(s), we revert to allocating from the main pool.


Is there a ZIL for the ZILs, or does this make no sense?


There is no ZIL for the ZILs. Note the ZIL is not a journal
(like ext3 or ufs logging). It simply contains records of
system calls (including data) that need to be replayed if the
system crashes and those records have not been committed
in a transaction group.

Hope that helps: Neil.



Re: [zfs-discuss] How Virtual Box handles the IO

2009-07-31 Thread Mike Gerdts
On Fri, Jul 31, 2009 at 7:58 PM, Frank
Middleton f.middle...@apogeect.com wrote:
 Has anyone ever actually lost a pool on Sun hardware other than
 by losing too many replicas or operator error? As you have so

Yes, I have lost a pool when running on Sun hardware.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-September/013233.html

Quite likely related to:

http://bugs.opensolaris.org/view_bug.do?bug_id=6684721

In other words, it was a buggy Sun component that didn't do the right
thing with cache flushes.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/