Re: [zfs-discuss] How Do I know if a ZFS snapshot is complete?

2010-04-28 Thread Peter Schuller
 I took a snapshot of one of my oracle filesystems this week, and when someone
 tried to add data to it, it filled up.

 I tried to remove some data, but the snapshot seemed to keep reclaiming it as
 I deleted it. I had taken the snapshot days earlier. Does this make sense?

Snapshots are complete as soon as the 'zfs snapshot' command returns. The
reason deletes do not free up space is that snapshots are just that -
snapshots of the file system at a point in time - which means the pool
has to retain all data that was referenced by the file system at the
moment the snapshot was taken.

To reclaim the space held by snapshots, the snapshots themselves have to
be destroyed (with zfs destroy).

-- 
/ Peter Schuller
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-13 Thread Peter Schuller
I realized I forgot to follow-up on this thread. Just to be clear, I
have confirmed that I am seeing what to me is undesirable behavior
even with the ARC being 1500 MB in size on an almost idle system (
0.5 mb/sec read load, almost 0 write load). Observe these recursive
searches through /usr/src/sys:

% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats  2.74s user 1.19s system 20% cpu 19.143 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.51s system 99% cpu
2.986 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.41s user 0.62s system 53% cpu
5.667 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.37s user 0.68s system 50% cpu
6.025 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.61s system 45% cpu
6.694 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.59s system 53% cpu
5.651 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.32s user 0.72s system 46% cpu
6.503 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.41s user 0.66s system 44% cpu
6.843 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.37s user 0.67s system 49% cpu
6.119 total

The first was entirely cold. For some reason the second was close to
CPU-bound, while the remainder were significantly disk-bound even if
not to the extent of the initial run. I correlated with 'iostat -x 1'
to confirm that I am in fact generating I/O (but no, I do not have
dtrace output).
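As a rough, purely illustrative Python equivalent of the repeated ack runs
above (the path is just an example), the following sketch times several
recursive read passes so cold- versus warm-cache behavior becomes visible:

import os, sys, time

# Times repeated recursive reads of a tree; the first pass is cold, later
# passes show how much of the data the cache actually retained.
def read_tree(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while f.read(1 << 20):
                        pass
            except OSError:
                pass                      # skip unreadable files

root = sys.argv[1] if len(sys.argv) > 1 else "/usr/src/sys"
for run in range(5):
    start = time.time()
    read_tree(root)
    print("pass %d: %.2fs" % (run, time.time() - start))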

Anyway, presumably the answer to my original question is no, and the
above isn't really very interesting other than to show that under some
circumstances you can see behavior that is decidedly non-optimal for
certain kinds of interactive desktop use. Whether this is the ARC in
general or something FreeBSD-specific, I don't know. But at this point
it does not appear to be a matter of ARC sizing, since the ARC is
sensibly large.

(I realize I should investigate properly and report back, but I'm not
likely to have time to dig into this now.)

-- 
/ Peter Schuller
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
Hello,

For desktop use, and presumably for rapidly changing non-desktop uses, I
find the ARC pretty annoying in its behavior. For example, this
morning I had to hit my launch-terminal key perhaps 50 times (roughly)
before it would start completing without disk I/O. There are plenty of
other examples as well, such as /var/db/pkg not being pulled
aggressively into cache, so that pkg_* operations (this is on
FreeBSD) are slower than they should be (I have to run pkg_info some
number of times before *it*, too, will complete without disk I/O).

I would be perfectly happy with pure LRU caching behavior or an
approximation thereof, and would therefore like to essentially
completely turn off all MFU-like weighting.

I have not investigated in great depth so it's possible this
represents an implementation problem rather than the actual intended
policy of the ARC. If the former, can someone confirm/deny? If the
latter, is there some way to tweak it? I have not found one (other
than changing the code). Is there any particular reason why such knobs
are not exposed? Am I missing something?

-- 
/ Peter Schuller
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
 It sounds like you are complaining about how FreeBSD has implemented zfs in
 the system rather than about zfs in general.  These problems don't occur
 under Solaris.  Zfs and the kernel need to agree on how to allocate/free
 memory, and it seems that Solaris is more advanced than FreeBSD in this
 area.  It is my understanding that FreeBSD offers special zfs tunables to
 adjust zfs memory usage.

It may be FreeBSD-specific, but note that I am not talking about the
amount of memory dedicated to the ARC and how it balances with free
memory on the system. I am talking about eviction policy. I could be
wrong, but I didn't think the FreeBSD ZFS port made significant changes there.

And note that part of the *point* of the ARC (at least according to
the original paper, though it has been a while since I read it), as
opposed to a pure LRU, is to do some weighting on frequency of access,
which is exactly consistent with what I'm observing (very quick eviction
and/or lack of insertion of data, particularly in the face of unrelated
long-term I/O having happened in the background). It would likely also
be the desired behavior for longer-running homogeneous disk access
patterns, where optimal use of the cache over a long period may be more
important than immediately reacting to a changing access pattern. So
it's not as if there is no reason to believe this can be about ARC
policy.

Why would this *not* occur on Solaris? It seems to me that it would
imply the ARC was broken on Solaris, since it is not *supposed* to be
a pure LRU by design. Again, there may very well be a FreeBSD-specific
issue here that is altering the behavior, and maybe the extremity of
what I am reporting is not supposed to be happening, but I believe
the issue is more involved than what you're implying in your response.

-- 
/ Peter Schuller
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
 The ARC is designed to use as much memory as is available up to a limit.  If
 the kernel allocator needs memory and there is none available, then the
 allocator requests memory back from the zfs ARC. Note that some systems have
 multiple memory allocators.  For example, there may be a memory allocator
 for the network stack, and/or for a filesystem.

Yes, but again I am concerned with what the ARC chooses to cache and
for how long, not how the ARC balances memory with other parts of the
kernel. At least, none of my observations lead me to believe the
latter is the problem here.

 might be pre-allocated.  I assume that you have already read the FreeBSD ZFS
 tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem
 section in the handbook
 (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made sure
 that your system is tuned appropriately.

Yes, I have been tweaking and fiddling and reading off and on since
ZFS was originally added to CURRENT.

This is not about tuning in that sense. The fact that the little data
necessary to start an 'urxvt' instance does not stay cached for even
1-2 seconds on an otherwise mostly idle system is either the
result of cache policy, an implementation bug (FreeBSD or otherwise),
or a matter of an *extremely* small cache size. I have observed this
behavior for a very long time across versions of both ZFS and FreeBSD,
and with different forms of ARC sizing tweaks.

It's entirely possible there are FreeBSD issues preventing the ARC from
sizing itself appropriately. What I am saying, though, is that all
indications are that data is not being selected for caching at all, or
else is evicted extremely quickly, unless sufficient access frequency
has been accumulated to, presumably, make the ARC decide to cache the
data.

This is entirely what I would expect from a caching policy that tries
to adapt to long-term access patterns and avoid premature cache
eviction by looking at frequency of access. I don't see what it is
that is so outlandish about my query. These are fundamental ways in
which caches of different types behave, and there is a legitimate
reason to not use the same cache eviction policy under all possible
workloads. The behavior I am seeing is consistent with a caching
policy that tries too hard (for my particular use case) to avoid
eviction in the face of short-term changes in access pattern.

 There have been a lot of eyeballs looking at how zfs does its caching, and a
 ton of benchmarks (mostly focusing on server throughput) to verify the
 design.  While there can certainly be zfs shortcomings (I have found
 several) these are few and far between.

That's a very general statement. I am talking about specifics here.
For example, you can have mountains of evidence showing that a
plain LRU is optimal (under some conditions). That doesn't change
the fact that if I want to prevent a sequential scan of a huge data set
from completely evicting everything in the cache, I cannot use a plain LRU.

In this case I'm looking for the reverse; i.e., increasing the
importance of recency, because my workload is such that it would
perform better than the behavior I am observing. Benchmarks are
irrelevant except insofar as they show that my problem is not with the
caching policy, since I am trying to address an empirically observed
behavior.

I *will* try to look at how the ARC sizes itself, as I'm unclear on
several things in the way memory is being reported by FreeBSD, but as
far as I can tell these are different issues. Sure, a bigger ARC might
hide the behavior I happen to see; but I want the cache to behave in a
way where I do not need gigabytes of extra ARC size to lure it into
caching the data necessary for 'urxvt' without having to start it 50
times in a row to accumulate statistics.

-- 
/ Peter Schuller
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
 In simple terms, the ARC is divided into a MRU and MFU side.
        target size (c) = target MRU size (p) + target MFU size (c-p)

 On Solaris, to get from the MRU to the MFU side, the block must be
 read at least once in 62.5 milliseconds.  For pure read-once workloads,
 the data won't move to the MFU side and the ARC will behave exactly like an
 (adaptable) MRU cache.

OK. That differs significantly from my understanding, though in
retrospect I should have realized it, given that the ARC stats contain
only references to MRU and MFU... I was previously under the impression
that the ZFS ARC had an LRU-ish side to complement the MFU side.
MRU+MFU changes things.

I will have to look into it in better detail to understand the
consequences. Is there a paper that describes the ARC as it is
implemented in ZFS (since it clearly diverges from the IBM ARC)?
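As a side note, a loose sketch of the MRU+MFU split described in the quote -
not the actual ZFS code, which also adapts the target sizes and keeps ghost
lists - could look like this; a block only moves to the MFU side once it is
hit again after the promotion window, so a single burst of reads does not
count as reuse:

import time
from collections import OrderedDict

PROMOTE_WINDOW = 0.0625  # seconds, mirroring the 62.5 ms mentioned above

class TinyArcSketch:
    def __init__(self, capacity):
        self.capacity = capacity
        self.mru = OrderedDict()   # key -> (data, first_access_time)
        self.mfu = OrderedDict()   # key -> data

    def access(self, key, load):
        now = time.monotonic()
        if key in self.mfu:                      # frequently used: keep hot
            self.mfu.move_to_end(key)
            return self.mfu[key]
        if key in self.mru:
            data, first = self.mru[key]
            if now - first >= PROMOTE_WINDOW:    # genuine reuse: promote
                del self.mru[key]
                self.mfu[key] = data
            else:                                # same burst: stays recency-only
                self.mru.move_to_end(key)
            return data
        data = load(key)                         # miss: insert on the MRU side
        self.mru[key] = (data, now)
        self._evict()
        return data

    def _evict(self):
        while len(self.mru) + len(self.mfu) > self.capacity:
            victim = self.mru if self.mru else self.mfu
            victim.popitem(last=False)           # drop the oldest entry

Under a read-once (or single-burst) workload nothing ever gets promoted, so
eviction degenerates to plain recency order - which matches the "behaves
exactly like an (adaptable) MRU cache" description quoted above.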

 I *will* try to look at how the ARC sizes itself, as I'm unclear on
 several things in the way memory is being reported by FreeBSD, but as

For what it's worth, I confirmed that the ARC was too small and that
there are clearly remaining issues with the interaction between the
ARC and the rest of the FreeBSD kernel. (I wasn't sure before, but I
confirmed I was looking at the right number.) I'll try to monitor more
carefully and see if I can figure out when the ARC shrinks and why it
doesn't grow back. Informally, my observations have always been that
things behave great for a while after boot, but degenerate over time.

In this case it was sitting at its minimum size, which was 214 MB. I
realize this is far below what is recommended or even designed for,
but it is clearly caching *something*, and I clearly *could* make it
cache urxvt and its dependencies by re-running it several tens of times
in rapid succession.

 I'm not convinced you have attributed the observation to the ARC
 behaviour.  Do you have dtrace (or other) data to explain what process
 is causing the physical I/Os?

In the urxvt case, I am basing my claim on informal observations.
I.e., hit terminal launch key, wait for disks to rattle, get my
terminal. Repeat. Only by repeating it very many times in very rapid
succession am I able to coerce it to be cached such that I can
immediately get my terminal. And what I mean by that is that it keeps
necessitating disk I/O for a long time, even on rapid successive
invocations. But once I have repeated it enough times it seems to
finally enter the cache.

(No dtrace unfortunately. I confess to not having learned dtrace yet,
in spite of thinking it's massively cool.)

However, I will of course accept that, given the minimal ARC size at
the time, I am moving completely away from the designed-for use case.
And if that is responsible, it is of course my own fault. Given
MRU+MFU I'll have to back off on my claims. Under the (incorrect)
assumption of LRU+MFU I felt the behavior was unexpected, even with a
small cache size. Given MRU+MFU, and without knowing further details
right now, I accept that the ARC may fundamentally need a bigger cache
size relative to the working set in order to be effective in the
way I am using it here. I was basing my expectations on LRU-style
behavior.

Thanks!

-- 
/ Peter Schuller
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-19 Thread Peter Schuller
 fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
 necessary for correct application operation EXCEPT as a workaround for
 lame filesystem re-ordering are a sure way to kill performance.

IMO the fundamental problem is that the only way to achieve a write
barrier is fsync() (disregarding direct I/O etc.). Again, I would just
like an fbarrier(), as I've mentioned on the list previously. It seems
to me that if this were adopted by some operating systems and
applications could start using it, things would sort themselves out
once file system/block device layers start actually implementing the
possible optimizations (instead of the naive fbarrier() -> fsync()).

As was noted in the previous thread on this topic, ZFS
effectively has an implicit fbarrier() between each write. Imagine
now if all the applications out there were automatically massively
faster on ZFS... but this won't happen until operating systems start
exposing the necessary interface.
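For context, the pattern that forces applications to call fsync() today is
the atomic-replace-via-rename idiom; a minimal Python sketch (purely
illustrative) shows that the fsync() is used only for ordering, which is
exactly the call an fbarrier() could replace:

import os

def atomic_replace(path, data):
    # Replace `path` so that a crash leaves either the old or the new
    # contents, never a truncated mix.  The fsync() is needed only to order
    # the data writes before the rename - it is used as an (expensive)
    # write barrier, not because durability is wanted at this point.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # barrier: data must hit the file before the rename
    os.rename(tmp, path)       # atomic swap of the directory entry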

What does one need to do to get something happening here? Other than
whine on mailing lists...

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpZpZZfcWwR3.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-19 Thread Peter Schuller
Uh, I should probably clarify some things (I was too quick to hit
send):

 IMO the fundamental problem is that the only way to achieve a write
 barrier is fsync() (disregarding direct I/O etc). Again I would just
 like an fbarrier() as I've mentioned on the list previously. It seems

Of course, if fbarrier() is analogous to fsync(), this does not actually
address the particular problem that is the main topic of this thread,
since there the fbarrier() would presumably apply only to I/O within
that file.

This particular case would only be helped if the fbarrier() were
global, or at least extended further than the particular file.

Fundamentally, I think a useful observation is that the only time you
ever care about persistence is when you make a contract with an
external party outside of your black box of I/O. Typical examples are
database commits and mail server queues. Anything within the black box
is only concerned with consistency.

In this particular case, fsync()/fbarrier() operate on the black
box of the file, with the directory being an external party. The
rename() operation on the directory entry depends on the state of the
individual file's black box, thus constituting an external dependency
and thus requiring persistence.

The question is whether it is necessarily a good idea to make the
black box be the entire file system. If it were, a lot of things would
be much easier. On the other hand, it also makes optimization more
difficult in many cases. For example, the latency of persisting 8 KB of
data could be very significant if there are large amounts of bulk
I/O happening in the same file system. So I definitely see the
motivation behind having persistence guarantees be non-global.

Perhaps it boils down to the files+directory model not necessarily
being the best one in all cases. Perhaps one would like to define
subtrees which have global fsync()/fbarrier() type semantics within
each respective subtree.

On the other hand, that sounds a lot like a ZFS file system, other
than the fact that ZFS file system creation is not something which is
exposed to the application programmer.

How about having file-system global barrier/persistence semantics, but
having a well-defined API for creating child file systems rooted at
any point in a hierarchy? It would allow global semantics and what
that entails, while allowing the bulk I/O happening in your 1 TB
PostgreSQL database to be segregated, in terms of performance impact,
from your KDE settings file system.

 What does one need to do to get something happening here? Other than
 whine on mailing lists...

And that came off much more rudely than intended. Clearly it's not an
implementation-effort issue, since the naive fbarrier() is basically
just a call to fsync(). However, I get the feeling there is little
motivation in the operating system community for addressing these
concerns, for whatever reason (IIRC it was only recently that some write
barrier/write caching issues started being seriously discussed in the
Linux kernel community, for example).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpMWdfQtjIuW.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-10 Thread Peter Schuller
 However, I just want to state a warning that ZFS is far from delivering what
 it promises, and so far, from my sum of experience, I can't recommend at all
 using zfs on a professional system.
 
 
 Or, perhaps, you've given ZFS disks which are so broken that they are 
 really unusable; it is USB, after all.

I had a cheap-o USB enclosure that definitely did ignore such
commands. On every txg commit I'd get a warning in dmesg (this was on
FreeBSD) about the device not implementing the relevant SCSI command.

This of course would affect filesystems other than ZFS as well. What is
worse, I was unable to completely disable write caching either, because
that, too, did not actually propagate to the underlying device when
attempted.

(I could not say for certain whether this was fundamental to the
device or in combination with a FreeBSD issue.)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpLGAnqq8jAy.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-10 Thread Peter Schuller
 YES! I recently discovered that VirtualBox apparently defaults to  
 ignoring flushes, which would, if true, introduce a failure mode  
 generally absent from real hardware (and eventually resulting in  
 consistency problems quite unexpected to the user who carefully  
 configured her journaled filesystem or transactional RDBMS!)

I recommend that everyone be extremely hesitant to assume that any
particular storage setup actually honors write barriers and cache
flushes. This is a recommendation I would give even when you purchase
non-cheap, battery-backed hardware RAID controllers (I won't mention
any names or details to avoid bashing, as I'm sure it's not specific to
the particular vendor I had problems with most recently).

You need the underlying device to do the right thing, the driver to do
the right thing, and the operating system in general to do the right
thing (which includes the file system, the block device layer if any,
etc. - for example, if you use md on Linux with RAID5/6, you're toast).

So again, I cannot stress this enough - do not assume things behave in a
non-broken fashion with respect to write barriers and flushes. I can't
speak to expensive integrated hardware solutions; I HOPE, though at
this point my level of paranoia does not allow me to assume, that if
you buy boxed systems from companies like Sun/HP/etc. you get decent
stuff. But I can definitely say that paying non-trivial amounts of
money for hardware is no guarantee that you won't get completely
broken behavior.

<speculation>
I think it boils down to the fact that 99% of the customers who aren't
doing integration of the individual components into overall packages
probably don't care/understand/bother with it, so as long as the
benchmarks say it's fast, they sell.
</speculation>

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpHedvKWqrb4.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-10 Thread Peter Schuller
 And again: Why should a 2 weeks old Seagate HDD suddenly be damaged, if there 
 was no shock, hit or any other event like that?

I have no information about your particular situation, but you have to
remember that ZFS uncovers problems that otherwise go unnoticed. Just
personally, on my private hardware (meaning a very limited set), I have
seen silent corruption issues several times. The most recent one I
discovered almost immediately because of ZFS. If it weren't for ZFS, I
would most likely have transferred my entire system
without noticing and suffered weird problems a couple of weeks later.

While I don't know what is going on in your case, blaming some problem
on the introduction of a piece of software/hardware/procedure without
identifying a causal relationship is a common mistake to
make.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgp5cozV6UwEf.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-10 Thread Peter Schuller
 on a UFS or reiserfs such errors could be corrected.

In general, UFS has zero capability to actually fix real corruption in
any reliable way.

What you normally do with fsck is repair *expected* inconsistencies
that the file system was *designed* to produce in the event of, e.g., a
sudden reboot or a crash. This is entirely different from repairing
arbitrary corruption. If ZFS says that a file has a checksum error,
that can very well be because there is a bug in ZFS. But it can also
be the case that there *is* actual on-disk (or in-transit) corruption
that ZFS has detected and turned into I/O errors back to the
application instead of producing bad data.

Now, it is probably entirely true that once you *do* have broken
hardware, or there is some other reason for corruption beyond what
you can design for, ZFS is less mature than traditional
file systems in terms of the availability of tools and procedures to
salvage whatever might actually be salvageable. That is a valid
criticism.

But you *have* to distinguish between repairing inconsistencies
specifically expected as part of regular operation in the event of a
crash or power outage, and problems arising from misbehaving hardware
or bugs in software. ZFS cannot magically overcome the latter, nor can
UFS/reiserfs/xfs/whatever else.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpOIkWS4ZEdB.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-10 Thread Peter Schuller
 ps This is a recommendation I would give even when you purchase
 ps non-cheap battery backed hardware RAID controllers (I won't
 ps mention any names or details to avoid bashing as I'm sure it's
 ps not specific to the particular vendor I had problems with most
 ps recently).
 
 This again?  If you're sure the device is broken, then I think others
 would like to know it, even if all devices are broken.  

The problem is that I even had help from the vendor in question, and
it was not for me personally but for a company, and I don't want to
use information obtained that way to do any public bashing.

But I have no particular indication that there is any problem with the
vendor in general; it was a combination of choices made by Linux
kernel developers and the behavior of the RAID controller. My
interpretation was that no one was there looking at the big picture,
and the end result was that if you followed the instructions
specifically given by the vendor, you would have a setup whereby you
would lose correctness whenever the BBU was
overheated/broken/disabled.

The alternative was to get completely piss-poor performance by not
being able to take advantage of the battery-backed nature of the cache
at all (which defeats most of the purpose of having the controller, if
you use it in any kind of transactional database environment or
similar).

 but, fine.  Anyway, how did you determine the device was broken? 

By performing timing tests, as mentioned in the other post that you
answered separately, and, after detecting the problem, confirming the
caching status at the different levels as claimed by
the administrative tool for the controller.

While timing tests cannot conclusively prove correct behavior, they can
definitely prove incorrect behavior in cases where your timings are
simply theoretically impossible given the physical nature of the
underlying drives.

 At
 least you can tell us that much without fear of retaliation (whether
 baseless or founded), and maybe others can use the same test to
 independently discover what you did which would be both fair and safe
 for you.

The test was trivial; in my case a ~10 line Python script or something
along those lines. Perhaps I should just go ahead and release
something which non-programmers can easily run and draw conclusions
from.
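Since the original script is not included here, the following is a minimal
sketch of that kind of test. The assumption is that a plain 7200 RPM disk
without NVRAM can physically complete at most a couple of hundred committed
synchronous writes per second, so a sustained fsync() rate far above that on
such a setup strongly suggests cache flushes never reach stable storage:

import os, sys, time

# Hypothetical reconstruction of the kind of ~10-line timing test mentioned
# above, not the original script.  Repeatedly overwrite and fsync() a small
# file and report the achieved rate.
path = sys.argv[1] if len(sys.argv) > 1 else "fsync-test.dat"
count = 1000
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
start = time.time()
for i in range(count):
    os.pwrite(fd, b"x" * 512, 0)   # small write at offset 0
    os.fsync(fd)                   # ask for the data to be on stable storage
elapsed = time.time() - start
os.close(fd)
print("%d fsyncs in %.2fs -> %.0f fsyncs/sec" % (count, elapsed, count / elapsed))

(A high rate is of course legitimate when a working battery-backed cache is
in the path; the red flag is seeing the same impossible rate with the BBU
disabled or absent.)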

 This is the real problem as I see it---a bunch of FUD, without any
 actual resolution beyond ``it's working, I _think_, and in any case
 the random beatings have stopped so D'OH-NT TOUCH *ANY*THING!  THAR BE
 DEMONZ IN THE BOWELS O DIS DISK SHELF!''

I'd love to go on a public rant, because I think the whole situation
was a perfect example of a case where a single competent person who
actually cares about correctness could have pinpointed this problem
trivially. But instead you have different camps doing their own stuff
and not considering the big picture.

 If anyone asks questions, they get no actual information, but a huge
 amount of blame heaped on the sysadmin.  Your post is a great example
 of the typical way this problem is handled because it does both: deny
 information and blame the sysadmin.  Though I'm really picking on you
 way too much here.  Hopefully everyone's starting to agree, though, we
 do need a real way out of this mess!

I'm not quite sure what you're referring to here. I'm not blaming any
sysadmin. I was trying to point out *TO* sysadmins, to help them, that
I recommend being paranoid about correctness.

If you mean the original poster in the thread having issues, I am not
blaming him *at all* in the post you responded to. It was strictly
meant as a comment in response to the poster who noted that he
discovered, to his surprise, the problems with VirtualBox. I wanted to
make the point that while I completely understand his surprise, I have
come to expect that these things are broken by default (regardless of
whether you're using virtualbox or not, or vendor X or Y etc), and
that care should be taken if you do want to have correctness when it
comes to write barriers and/or honoring fsync().

However, that said, as I stated in another post I wouldn't be
surprised if it turns out the USB device was ignoring sync
commands. But I have no idea what the case was for the original
poster, nor have I even followed the thread in detail enough to know
if that would even be a possible explanation for his problems.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpjMmPNbbRnF.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does your device honor write barriers?

2009-02-10 Thread Peter Schuller
 mortals ;)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpGm6lQE1f12.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does your device honor write barriers?

2009-02-10 Thread Peter Schuller
 is such
that you cannot trust the hardware/software involved, I suppose there
is no other way out other than testing and adjusting until it works.

 Here is another bit of FUD to worry about: the common advice for the
 lost SAN pools is, use multi-vdev pools.  Well, that creepily matches
 just the scenario I described: if you need to make a write barrier
 that's valid across devices, the only way to do it is with the
 SYNCHRONIZE CACHE persistence command, because you need a reply from
 Device 1 before you can release writes behind the barrier to Device 2.
 You cannot perform that optimisation I described in the last paragraph
 of pushing the barrier paast the high-latency link down into the
 device, because your initiator is the only thing these two devices
 have in common.  Keeping the two disks in sync would in effect force
 the initiator to interpret the SYNC command as in my second example.
 However if you have just one device, you could write the filesystem to
 use this hypothetical barrier command instead of the persistence
 command for higher performance, maybe significantly higher on
 high-latency SAN.  I don't guess that's actually what's going on
 though, just an interesting creepy speculation.

This would be another case where battery-backed (local to the machine)
NVRAM fundamentally helps even in a situation where you are only
concerned with the barrier, since there is no problem having a
battery-backed controller sort out the disk-local problems itself by
whatever combination of syncs/barriers, while giving instant barrier
support (by effectively implementing synch-and-wait) to the operating
system.

(Referring now to individual drives being battery-backed, not using a
hardware raid volume.)


-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org



pgpkpvy9lmPS6.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Verify files' checksums

2008-10-26 Thread Peter Schuller
 1) When I notice an error in a file that I've copied from a ZFS disk I
 want to know whether that error is also in the original file on my ZFS
 disk or if it's only in the copy.

This was already addressed, but let me do so slightly differently: one
of the major points of ZFS checksumming is that, in the absence of
software bugs or hardware memory corruption issues when the file is
read on the host, successfully reading a file is supposed to mean that
you got the correct version of the file (either from physical disk or
from cache, having previously been read from physical disk).

A scrub is still required if you want to make sure the file is okay
*ON DISK*, unless you can satisfy yourself that no relevant data is
cached somewhere (or unless someone can inform me of a way to nuke a
particular file and related resources from cache).

 Up to now I've been storing md5sums for all files, but keeping the
 files and their md5sums synchronized is a burden I could do without.

FWIW I wanted to mention here that if you care a lot about this, I'd
recommend something like par2[1] instead. It uses forward error
correction[2], allowing you to not only detect corruption, but also
correct it. You can choose your desired level of redundancy expressed
as a percentage of the file size.

[1] http://en.wikipedia.org/wiki/Parchive
[2] http://en.wikipedia.org/wiki/Forward_error_correction
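For comparison, the manual manifest workflow described above is roughly the
following sketch (detection only - unlike par2 there is no redundancy to
repair from; file and manifest names are just examples):

import hashlib, os, sys

def file_md5(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(root, manifest="MD5SUMS"):
    with open(manifest, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in sorted(names):
                p = os.path.join(dirpath, name)
                out.write("%s  %s\n" % (file_md5(p), p))

def verify_manifest(manifest="MD5SUMS"):
    bad = 0
    with open(manifest) as f:
        for line in f:
            digest, path = line.rstrip("\n").split("  ", 1)
            if file_md5(path) != digest:
                print("MISMATCH:", path)
                bad += 1
    return bad

if __name__ == "__main__":
    # e.g. "python manifest.py create /data" once, then "python manifest.py" later
    if len(sys.argv) > 2 and sys.argv[1] == "create":
        write_manifest(sys.argv[2])
    else:
        sys.exit(1 if verify_manifest() else 0)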

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



pgpTBfuHa8mpF.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over multiple iSCSI targets

2008-09-07 Thread Peter Schuller
 Exporting them as one huge iSCSI volume is good if you're paranoid about data 
 loss.  You can use raid5 or 6 on the Linux servers, and then mirror those 
 large volumes with ZFS.  The downside is that it's much harder to add 
 storage.  I don't know if iSCSI volumes can be expanded, so you might have to 
 break the mirror, create a larger iSCSI volume and resync all your data with 
 that approach.

Just be careful with respect to write barriers. The software RAID in
Linux does not support them with raid5/raid6, so you lose the
correctness aspect of ZFS that you otherwise get even without hardware
RAID controllers.

(Speaking of this, can someone speak to the general state of affairs
with iSCSI with respect to write barriers? I assume Solaris does it
correctly; what about the BSD/Linux stuff? Can one trust that the
iSCSI targets correctly implement cache flushing/write barriers?)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



pgpi9RuPOuyGm.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box

2008-03-19 Thread Peter Schuller
 If you are willing to go cheap you can get something that holds 8 drives
 for $70: buy a standard tower case with five internal 3.5 bays ($50), and
 one of these enclosures that fit in two 5.25 bays but give you three 3.5
 bays ($20).

I have one of these:

http://www.gtek.se/index.php?mode=itemid=2454

That 2798 SEK price is about $450, and you can fit up to 30 3.5" drives in the
extreme case.

This is true even when using the Supermicro SATA hot-swap bays that fit 5
drives in 3x5.25" bays. You can fit 6 of these in total, meaning 30 drives. Of
course the cost of the bays is added (~$60-$70, I believe, for the 5-bay
Supermicro; the Lian Li stuff is cheaper, but not hot-swap and such).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Q : change disks to get bigger pool

2008-01-21 Thread Peter Schuller
 I don't believe this is true.  Online replacement is smart enough
 to pick this up.  Where you have to re-import is when you change
 the size of a LUN without doing a zpool replace.

I see. I have never had the hardware necessary to actually try hotswaps, but I 
remember some people complaining here and on FreeBSD lists about not seeing 
the added space. In my case I always rebooted anyway so I could never tell 
the difference.

I stand corrected. Thanks!

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Q : change disks to get bigger pool

2008-01-20 Thread Peter Schuller
 So will the pool get bigger just by replacing all 4 disks one by one?

Yes, but a re-import (either by export/import or by reboot) is necessary 
before the new space will be usable.

 And if it will get larger how this should be done , fail disks one-by-one
 .. or ???

Use zpool replace.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS via Virtualized Solaris?

2008-01-07 Thread Peter Schuller
 From what I read, one of the main things about ZFS is Don't trust the
  underlying hardware.  If this is the case, could I run Solaris under
  VirtualBox or under some other emulated environment and still get the
  benefits of ZFS such as end to end data integrity?

 You could probably answer that question by changing the phrase to Don't
 trust the underlying virtual hardware!  ZFS doesn't care if the storage is
 virtualised or not.

But it is worth noting that, as with hardware RAID for example, if you intend
to take advantage of the self-healing properties of ZFS with multiple disks,
you must expose the individual disks through the virtualization environment
and use them individually in your mirror/raidz/raidz2 pool.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD performance

2007-12-19 Thread Peter Schuller
 I was just wondering that maybe there's a problem with just one
 disk...

No, this is something I have observed on at least four different systems, with 
vastly varying hardware. Probably just the effects of the known problem.

Thanks,

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD performance

2007-12-18 Thread Peter Schuller
 Sequential writing problem with process throttling - there's an open
 bug for it for quite a while. Try to lower txg_time to 1s - should
 help a little bit.

Yeah, my post was mostly to emphasize that on commodity hardware raidz2 does 
not even come close to being a CPU bottleneck. It wasn't a poke at the 
streaming performance. Very interesting to hear there's a bug open for it 
though.

 Can you also post iostat -xnz 1 while you're doing dd?
 and zpool status

This was FreeBSD, but I can provide iostat -x if you still want it for some 
reason. 

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD performance

2007-12-14 Thread Peter Schuller
 Use a faster processor or change to a mirrored configuration.
 raidz2 can become processor bound in the Reed-Solomon calculations
 for the 2nd parity set.  You should be able to see this in mpstat, and to
 a coarser grain in vmstat.

Hmm. Is the OP's hardware *that* slow? (I don't know enough about the Sun 
hardware models)

I have a 5-disk raidz2 (cheap SATA) here on my workstation, which is an X2 
3800+ (i.e., one of the earlier AMD dual-core offerings). Here's me dd'ing to 
a file on ZFS on FreeBSD running on that hardware:

promraid     741G   387G      0    380      0  47.2M
promraid     741G   387G      0    336      0  41.8M
promraid     741G   387G      0    424    510  51.0M
promraid     741G   387G      0    441      0  54.5M
promraid     741G   387G      0    514      0  19.2M
promraid     741G   387G     34    192  4.12M  24.1M
promraid     741G   387G      0    341      0  42.7M
promraid     741G   387G      0    361      0  45.2M
promraid     741G   387G      0    350      0  43.9M
promraid     741G   387G      0    370      0  46.3M
promraid     741G   387G      1    423   134K  51.7M
promraid     742G   386G     22    329  2.39M  10.3M
promraid     742G   386G     28    214  3.49M  26.8M
promraid     742G   386G      0    347      0  43.5M
promraid     742G   386G      0    349      0  43.7M
promraid     742G   386G      0    354      0  44.3M
promraid     742G   386G      0    365      0  45.7M
promraid     742G   386G      2    460  7.49K  55.5M

At this point the bottleneck looks architectural rather than CPU. None of the 
cores are saturated, and the CPU usage of the ZFS kernel threads is pretty 
low.

I say architectural because writes to the underlying devices are not 
sustained; they drop to almost zero for certain periods (this is more visible 
in iostat -x than it is in the zpool statistics). What I think is happening 
is that ZFS is too late to evict data from the cache, thus blocking the 
writing process. Once a transaction group with a bunch of data gets committed 
the application unblocks, but presumably ZFS waits for a little while before 
resuming writes.

Note that this is also being run on plain hardware; it's not even PCI Express. 
During throughput peaks, but not constantly, the bottleneck is probably the 
PCI bus.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Freebsd 7.0

2007-12-07 Thread Peter Schuller
   NAME          STATE     READ WRITE CKSUM
   fatty         DEGRADED     0     0 3.71K
     raidz2      DEGRADED     0     0 3.71K
       da0       ONLINE       0     0     0
       da1       ONLINE       0     0     0
       da2       ONLINE       0     0     0
       da3       ONLINE       0     0   300
       da4       ONLINE       0     0     0
       da5       ONLINE       0     0     0
       da6       ONLINE       0     0   253
       da7       ONLINE       0     0     0
       da8       ONLINE       0     0     0
       spare     DEGRADED     0     0     0
         da9     OFFLINE      0     0     0
         da11    ONLINE       0     0     0
       da10      ONLINE       0     0     0
   spares
     da11        INUSE     currently in use

 errors: 801 data errors, use '-v' for a list


 After I detach the spare da11 and bring da9 back online all the errors
 go away.

Theory:

Suppose da3 and da6 are either bad drives, have cabling issues, or are on a 
controller suffering corruption (different from the other drives).

If you now were to replace da9 by da11, the resilver operation would be 
reading from these drives, thus triggering checksum issues. Once you bring 
da9 back in, it is either entirely up to date or very close to up to date, so 
the amount of I/O required to resilver it is very small and may not trigger 
problems.

If this theory is correct, a scrub (zpool scrub fatty) should encounter 
checksum errors on da3 and da6.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-16 Thread Peter Schuller
 Brain damage seems a bit of an alarmist label. While you're certainly right
 that for a given block we do need to access all disks in the given stripe,
 it seems like a rather quaint argument: aren't most environments that
 matter trying to avoid waiting for the disk at all? Intelligent prefetch
 and large caches -- I'd argue -- are far more important for performance
 these days.

The concurrent small-I/O problem is fundamental, though. If you have an 
application where you care only about random concurrent reads, for example, 
you would not want to use raidz/raidz2 currently. No amount of smartness in 
the application gets around this. It *is* a relevant shortcoming of 
raidz/raidz2 compared to raid5/raid6, even if in many cases it is not 
significant.

If disk space is not an issue, striping across mirrors will be okay for random 
seeks. But if you also care about disk space, it's a showstopper unless you 
can throw money at the problem.
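A back-of-the-envelope illustration of the difference, with assumed numbers
(a ~100 IOPS disk and six disks total; not a benchmark):

# Each random read from a raidz/raidz2 vdev touches every data disk in the
# stripe, so the vdev delivers roughly single-disk IOPS; a stripe of mirrors
# can serve independent reads per mirror, from either half of each mirror.
disk_iops = 100          # assumed commodity 7200 RPM disk
disks = 6

raidz_iops  = disk_iops                     # ~one block read keeps all disks busy
mirror_iops = (disks // 2) * 2 * disk_iops  # 3 mirrors, both halves can serve reads

print(raidz_iops, mirror_iops)              # roughly 100 vs. 600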

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Response to phantom dd-b post

2007-11-12 Thread Peter Schuller
 proper UPS and a power outage 
should never happen unless you suck. (This has actually been advocated to me. 
Seriously.)

[2] Because of [1] and because of course you only run stable software that is 
well tested and will never be buggy. (This has been advocated. Seriously.)

[3] Because of [1].

[4] Because of [1], [2] and [3].

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS

2007-11-04 Thread Peter Schuller
 Your best bet is to call Tech Support and not Sales.  I've found LSI
 tech support to be very responsive to individual customers.

Thanks, I'll try them. I eventually noticed you could actually get their 
number under the "LSI offices" category of their find-your-contact web form, 
which otherwise looked like a reseller inventory.

 I recommend the SuperMicro card - but that is PCI-X and I think you're
 looking for PCI-Express? 

PCI is okay and nice, PCI-Express is nicer. PCI-X I don't want since it is 
only semi-compatible with PCI. E.g. the Marvell I have now works in one 
machine, not in another.

 works well with ZFS (SATA or SAS drives).  The newer cards are less
 expensive - but its not clear from the LSI website if they support
 JBOD operation or if you can form a mirror or stripe using only
 one drive and present it to ZFS as a single drive.

I am okay with a one-disk mirror/stripe in the worst case, as long as cache 
flushes and such get passed through.

I would definitely prefer JBOD, though, since single-disk virtual volumes tend 
to cause some additional headaches (like having two levels of volume 
management).

 Please let us know what you find out...

If I get anything confirmed from LSI I'll post an update.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



signature.asc
Description: This is a digitally signed message part.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-05 Thread Peter Schuller
 Is there a specific reason why you need to do the caching at the DB  
 level instead of the file system?  I'm really curious as i've got  
 conflicting data on why people do this.  If i get more data on real  
 reasons on why we shouldn't cache at the file system, then this could  
 get bumped up in my priority queue.

FWIW, a MySQL database was recently moved to a FreeBSD system with
ZFS. Performance ended up sucking because, for some reason, data did not
make it into the cache in a predictable fashion (the simple case of
repeated queries was not cached; so, for example, a very common query,
even when executed repeatedly on an idle system, would take more than
1 minute instead of the 0.10 seconds or so it takes when cached).

I ended up convincing the person running the DB to switch from MyISAM
(which does not seem to support DB-level caching of anything other than
indexes) to InnoDB, thus allowing use of the InnoDB buffer cache.

I don't know why it wasn't cached by the ZFS ARC to begin with (the size
of the ARC was definitely large enough - ~800 MB - and I know
the working set for this query was below 300 MB). Perhaps it has to do
with the ARC trying to be smart and avoiding having the cache flushed
out by useless data? I am not read up on the details of the ARC. But in
this particular case it was clear that a simple LRU would have been much
more useful - unless there was some other problem related to my setup or
the FreeBSD integration that somehow broke proper caching.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



pgpmGHbRPWivC.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Survivability of zfs root

2007-09-28 Thread Peter Schuller
 Now, what if that system had been using ZFS root? I have a
 hardware failure, I replace the raid card, the devid of the boot
 device changes.

I am not sure about Solaris, but on FreeBSD I always use glabel-labeled
devices in my ZFS pools, making them entirely location-independent.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



pgp4eTccU5E8q.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] reccomended disk configuration

2007-09-17 Thread Peter Schuller
 I also wanted to test a recovery of my pool, so I took my two-disk raidz pool 
 to a friend's FreeBSD box.  It seems both systems use zfs version 6, but the 
 import failed.  I noticed in the boot logs:
 
 GEOM: ad6: corrupt or invalid GPT detected.
 GEOM: ad6: GPT rejected -- may not be recoverable.
 
 Is that a solaris or freebsd problem do you think?

This has to do with the GPT
(http://en.wikipedia.org/wiki/GUID_Partition_Table) support rather
than ZFS. IIRC the GPTs written by Solaris are valid, just not
recognized properly by FreeBSD (but I am out of date and don't
remember the source of this information).

AFAIK the ZFS pools themselves are fully portable.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



pgpFiIJzXKiig.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] It is that time again... ZFS + Firewire/USB - and a specific enclosure

2007-08-06 Thread Peter Schuller
 However, I have been subscribed to the ZFS lists for some time now and
 done a Google check up for topics relating to this for some time, and
 the jury still seems to be out on whether or not this would be a good
 idea.

My only input is that when I used cheap-ass USB enclosures (for single disks), 
flush-cache commands never survived. I would get errors about unimplemented 
SCSI commands upon every ZFS transaction group commit. So it is a no-go for 
reliability (unless you can turn write caching off through the USB 
conversion - dunno, but even then you will instead suffer the performance 
penalty).

I don't know whether this was because of USB, or because it was cheap-ass 
crap, or something else. Perhaps FireWire works better, especially 
considering it was the de facto standard on Macs for a long time.

The other issue is that I never seem to get decent performance out of USB, and 
in general have found USB enclosures and even USB hubs to be very buggy 
(randomly breaking, particularly with I/O happening on multiple devices at 
the same time, and then requiring power cycling to start working again).

I started buying USB drives at one point to get around the problems with 
fitting many disks in consumer products (=stuff that doesn't cost $100), 
but eventually just gave up because of broken drive controllers (hello WD) 
and broken USB hardware. And don't think that a retailer will consider 
drives broken for the purpose of warranty when they exhibit identical 
problems on both Solaris and FreeBSD (with AFAIK independent USB 
implementations), but work in Windows...

If you have luck with that then please post some public info as I am sure I 
would not be the only one to be interested in it. Wonder what those 8-ways 
cost new...

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS Scalability/performance

2007-06-25 Thread Peter Schuller
 FreeBSD plays it safe too.  It's just that UFS, and other file systems on 
 FreeBSD, understand write caches and flush at appropriate times.

Do you have something to cite w.r.t. UFS here? Because as far as I know,
that is not correct. FreeBSD shipped with write caching turned off by
default for a while for this reason, but then changed it IIRC due to the
hordes of people complaining about performance.

I also have personal experience of corruption-after-powerfail that
indicate otherwise.

I also don't see any complaints about cache flushing on USB drives with
UFS, which I did see with ZFS every five seconds when it wanted to flush
the cache (and which failed, since the SCSI-USB bridge, or probably the
USB mass storage layer itself, does not support it).

Also, given the design of UFS and the need for synchronous writes on
updates, I would be surprised strictly based on performance observations
if it actually did flush caches.

The ability to get decent performance *AND* reliability on cheap disks
is one of the major reasons why I love ZFS :)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] data gone?

2007-06-12 Thread Peter Schuller
 I tried to add a third disk to the raidz array
 The third disk didn't get added to the raidz array, it was added to the pool, 
 but 'parallel' to the raidz

This is because it is not currently possible to add disks to a
raidz/raidz2. Adding storage is typically done by adding an additional
raidz/raidz2 array that you then stripe between.

I believe zpool should have warned you about trying to add a
non-redundant component alongside the redundant raidz, requiring you to
force (-f) the addition.
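
For illustration, this is roughly what I would expect (pool and device
names are made up, and the exact wording of the warning will differ):

  # a lone disk alongside an existing raidz vdev: refused without force
  zpool add tank c3t0d0

  # overrides the mismatched-redundancy warning - usually not what you want
  zpool add -f tank c3t0d0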



-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Re: ZFS consistency guarantee

2007-06-08 Thread Peter Schuller
 1. ZFS atomic operation that commits data.
 2. Writes come into the app.
 3. The db put in hotbackup mode.
 4. Snapshot taken on storage.
 5. ZFS atomic operation that commits data.

 So if i do a snap restore, ZFS might revert to point1, but from the db
 perspective, it is inconsistent and we would need to do a
 recovery..correct?.
 
 Right.  So you'll want to synchronize your snapshots with a database
 consistency.  Just like doing backups.

I have gotten the feeling that everyone is misunderstanding everyone
else in this thread ;)

My understanding is that a zfs snapshot that can be proven to have
happened after a particular write() (or link(), etc.) is guaranteed to
contain the data that was written. Anything else would massively
decrease the usefulness of snapshots.

Is this incorrect? If not, feel free to ignore the remainder of this E-Mail.

If it is, then I don't see why the filesystem would be reverted to (1).
It should in fact be guaranteed to revert to (4) (unless the creation of
the snapshot is itself not guaranteed to be persistent without an
explicit global sync by the administrator - but I doubt this is the
case?).

Regardless of the details of snapshots, I think the point that needs
making to the OP is that regardless of filesystem issues the data as
written to that filesystem by the application must always be consistent
from the perspective of the application, and that a snapshot just gives
you a snapshot of a filesystem for which any read will return whatever
it would have done exactly at the point of the snapshot. If the
application has not written the data, it will not be part of the
snapshot. Thus if the application has writes pending that are needed for
consistency, those writes must complete prior to snapshotting.
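
As a concrete sketch (the dataset name is made up, and the quiesce/resume
steps are whatever your database provides, e.g. hot-backup mode):

  # 1. quiesce the application / put the database in hot-backup mode
  # 2. take the snapshot - it captures everything written up to this point
  zfs snapshot pool/oradata@backup-20070608
  # 3. resume normal operation / end hot-backup mode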

The syncing, which I assume refers to fsync() and/or the sync command,
is about ensuring that the view of the filesystem (or usually a subset
of it) as seen by applications is actually committed to persistent
storage. This is done either to guarantee that some application-level
data is committed and will remain in the face of a crash (e.g. a banking
application does an SQL COMMIT), or as an overkill way of ensuring that
some I/O operation B physically happens after some I/O operation A, such
that in the event of a crash B will never appear on disk unless A does
too (as when a database maintains internal transactional consistency).

Now, assuming that snapshots work in the way I assume and ask about
above, the use of a zfs snapshot at a point in time when the application
has written consistent data to the filesystem is sufficient to guarantee
consistency in the event of a crash. Essentially the zfs snapshot can be
used to achieve the effect of fsync(), with the added benefit of being
able to administratively roll back to the previous version rather than
just guaranteeing that there is some consistent state to return back to.

(Incidentally, since, according to a post here on the list in response
to a related question I had, ZFS already guarantees ordering of writes
there is presumably some pretty significant performance improvements to
be had if a database was made aware of this and allowed a weaker form of
COMMIT where you drop the persistence requirement, but keep the
consistency requirement.)


-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Making 'zfs destroy' safer

2007-05-21 Thread Peter Schuller
 On the other hand personally I just don't see the need for this since
 the @ char isn't special to the shell so I don't see where the original
 problem came from.

I never actually *had* a problem, I am just nervous about it. And yes, @
is not special for classical shells, but it's still more special than
alphanumerics or '/', and probably more likely to *be* special in some
languages/alternate shells.

Then there is the seemingly trivial issue of the physical keyboard
layout. The most common layout tends to make you use the right shift
key to type the @, in a sequence where a slight slip of the finger
(hitting enter instead of shift) means *kaboom*: you have lost your
(and/or everybody else's) data.

One can of course protect against this by writing commands backwards and
such, which is what I do for cases like this along with SQL DELETE
statements, but to me it just feels unnecessarily dangerous.

 One thing that might help here, when not running as root or as a user
 with ZFS File System Management RBAC profile, is user delegation. This
 will allow you to run the script as user that can only do certain
 operations to filesystems that they own or have been delegated specific
 access to operate on.

On the other hand, a very minor modification to the command line tool
gets you a pretty significant payoff without complicating things. It
would improve the safety of the out-of-the-box tool, regardless of the
local policy for privilege delegation.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Making 'zfs destroy' safer

2007-05-21 Thread Peter Schuller
 I would much prefer to do
 
 for snap in $(zfs list -t snapshot -r foo/bar )
 do
   zfs destroy -t snapshot $snap
  done
 
 than not have the -t. Especially the further away the destroy is from the 
 generation of the list.  The extra -t would be belt and braces but that is 
 how I like my data protected.

Especially if we imagine that someone further down the line decides to
slightly modify the format of the zfs list -t snapshot output - such
as by having it give a hierarchical view with the roots being the
filesystem path...

This is of course a normal problem with shell scripting (unless the zfs
command is documented to guarantee backward compatible output?), but in
cases like this it really becomes critical.
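
A somewhat safer variant, assuming your zfs version supports the
scripting-oriented flags (-H drops headers and tab-separates, -o name
prints only the dataset name):

  for snap in $(zfs list -H -o name -t snapshot -r foo/bar); do
      zfs destroy "$snap"
  done

That at least removes the dependency on the human-readable layout,
though it still deserves the same caution before being pointed at real
data.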

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making 'zfs destroy' safer

2007-05-19 Thread Peter Schuller
 Rather than rehash this, again, from scratch.  Refer to a previous
 rehashing.
 http://www.opensolaris.org/jive/thread.jspa?messageID=15363;

I agree that adding a -f requirement and/or an interactive prompt is not
a good solution. As has already been pointed out, my suggestion is
different.

zfs destroy is very general. Often, generality is good (e.g. in
programming languages). But when dealing with something this dangerous
and by its very nature likely to be used on live production data, either
manually or in scripts (that are not subject to a release engineering
process), I think it is useful to make it possible to be more specific,
such that the possible repercussions of a mistake are limited.

As an analogy - would you want rm to automatically do rm -rf if
invoked on a directory? Most probably would not. The general solution
would be for rm to just do what you tell it to - remove whatever you
are pointing it to. But I think most would agree that things are safer
the way they work now.

That said, I am not suggesting removing existing functionality from
destroy, but rather providing a way to be more specific about your
intended actions in cases where you want to destroy snapshots or clones.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making 'zfs destroy' safer

2007-05-19 Thread Peter Schuller
 Apparently (and I'm not sure where this is documented), you can 'rmdir'
 a snapshot to remove it (in some cases).

Ok. That would be useful, though I also don't like that it breaks
standard rmdir semantics.

In any case it does not work in my case - but that was on FreeBSD.
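
For reference, I believe what is being referred to is removing the
snapshot through the .zfs control directory, along the lines of (path
invented):

  rmdir /tank/home/.zfs/snapshot/mysnap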

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Making 'zfs destroy' safer

2007-05-18 Thread Peter Schuller
Hello,

with the advent of clones and snapshots, one will of course start
creating them. Which also means destroying them.

Am I the only one who is *extremely* nervous about doing zfs destroy
some/filesystem@snapshot?

This goes both for manual use and for use in scripts. I am very paranoid
about this; especially because the @ sign might conceivably be
incorrectly interpreted by some layer of scripting, being a
non-alphanumeric character and highly atypical for filenames/paths.

What about having dedicated commands destroysnapshot, destroyclone,
or remove (less dangerous variant of destroy) that will never do
anything but remove snapshots or clones? Alternatively having something
along the lines of zfs destroy --nofs or zfs destroy --safe.

I realize this is borderline being in the same territory as special
casing rm -rf / and similar, which is generally not considered a good
idea.

But somehow the snapshot situation feels a lot more risky.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making 'zfs destroy' safer

2007-05-18 Thread Peter Schuller
 What about having dedicated commands destroysnapshot, destroyclone,
 or remove (less dangerous variant of destroy) that will never do
 anything but remove snapshots or clones? Alternatively having something
 along the lines of zfs destroy --nofs or zfs destroy --safe.

Another option is to allow something along the lines of:

zfs destroy snapshot:/path/to/filesystem@snapshot

Where the use of snapshot: would guarantee that non-snapshots are not
affected.
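
In the meantime, a tiny wrapper can enforce the same rule in scripts
(the name and behaviour are my own invention, not an existing command):

  destroysnapshot() {
      case $1 in
          *@*) zfs destroy "$1" ;;
          *)   echo "refusing: '$1' does not look like a snapshot" >&2
               return 1 ;;
      esac
  }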

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-13 Thread Peter Schuller
   That is interesting. Could this account for disproportionate kernel
   CPU usage for applications that perform I/O one byte at a time, as
   compared to other filesystems? (Nevermind that the application
   shouldn't do that to begin with.)
 
 I just quickly measured this (overwritting files in CHUNKS);
 This is a software benchmark (I/O is non-factor)
 
   CHUNK   ZFS vs UFS
 
   1B  4X slower
   1K  2X slower
   8K  25% slower
   32K equal
   64K 30% faster
 
 Quick and dirty but I think it paints a picture.
 I can't really answer your question though.

I should probably have said other filesystems on other platforms, I
did not really compare properly on the Solaris box. In this case it
was actually BitTorrent (the official python client) that was
completely CPU bound in kernel space, and tracing showed single-byte
I/O.

Regardless, the above stats are interesting and I suppose consistent
with what one might expect, from previous discussion on this list.
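
For anyone wanting to reproduce the chunk-size effect, something along
these lines should do in a POSIX-ish shell (the path and sizes are mine;
the table above is the previous poster's measurement):

  # rewrite the same ~64 MB file using different write sizes
  for bs in 1024 8192 32768 65536; do
      echo "bs=$bs"
      time dd if=/dev/zero of=/tank/ddtest bs=$bs count=$((67108864 / bs))
  done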

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Shrinking a zpool?

2007-02-13 Thread Peter Schuller
 No 'home user' needs shrink.

I strongly disagree with this.

The ability to shrink can be useful in many specific situations, but
in the more general sense, and this is in particular for home use, it
allows you to plan much less rigidly. You can add/remove drives left
and right at your leasure and won't work yourself into a corner with
regards to drive sizes, lack of drive connectivity, or drive
interfaces as you can always perform an incremental
migration/upgrade/downgrade.

 Every professional datacenter needs shrink.

Perhaps at some level. At the level of having 1-20 semi-structured
servers with 5-20 or so terrabytes each, you probably don't need is
all that much - even if it would be nice (speaking from experience).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Meta data corruptions on ZFS.

2007-02-13 Thread Peter Schuller
 This is expected because of the copy-onwrite nature of ZFS. During 
 truncate it is trying to allocate
 new disk blocks probably to write the new metadata and fails to find them.

I realize there is a fundamental issue with copy on write, but does
this mean ZFS does not maintain some kind of reservation to guarantee
you can always remove data?

If so I would consider this a major issue for general purpose use, and
if nothing else it should most definitely be clearly
documented. Accidentally filling up space is not at *all* uncommon in
many situations, be it home use or medium sized business type use. Yes
you should avoid it, but shit (always) happens.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Peter Schuller
 That said, actually implementing the underlying mechanisms may not be
 worth the trouble.  It is only a matter of time before disks have fast
 non-volatile memory like PRAM or MRAM, and then the need to do
 explicit cache management basically disappears.

I meant fbarrier() as a syscall exposed to userland, like fsync(), so
that userland applications can achieve ordered semantics without
synchronous writes. Whether or not ZFS in turn manages to eliminate
synchronous writes by some feature of the underlying storage mechanism
is a separate issue. But even if not, an fbarrier() exposes an
asynchronous method of ensuring relative order of I/O operations to
userland, which is often useful.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: How much do we really want zpool remove?

2007-01-25 Thread Peter Schuller
 The ability to shrink a pool by removing devices is the only reason my
 enterprise is not yet using ZFS, simply because it prevents us from easily
 migrating storage.

Being able to do this would be very high on my wishlist from the perspective 
of a home user.

But also from the perspective of a more serious user (though I am not involved 
in using ZFS in such a case - not yet anyway...) it is most definitely a very 
nice thing to be able to do, in various situations.

Example of things I would love to be able to do, which I would with such a 
feature:

* Easily convert between mirror/striping/raidz/raidz2  (no need to purchase 
twice the capacity for temporary storage during a conversion).

* Easily move storage between physical machines as needs change (assuming a 
situation where you want drives locally attached to the machines in question, 
and iSCSI and similar is not an option).

* Revert a stupid mistake: accidentally adding something to a pool that should 
not be there :)

* Easily - even live under the right circumstances - temporarily evacuate a 
disk in order to e.g. perform drive testing if suspicious behavior is present 
without a known cause.

* If a drive starts going bad and I do not have a spare readily available 
(typical home use situation), I may want to evacuate the semi-broken drive so 
that I do not lose redundancy until I can get another disk. May or may not 
be practical depending on current disk space usage of course.

* Some machine A needs a spare drive but there is none, and I have free disk 
space on machine B and B has matching drives. Evacuate a disk on B and use it as a 
replacement in A (again, typical home use situation). Once I obtain a new 
drive revert B's disk into B again, or alternatively keep it in A and use the 
new drive in B.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Synchronous Mount?

2007-01-24 Thread Peter Schuller
 Specifically, I was trying to compare ZFS snapshots with LVM snapshots on
 Linux. One of the tests does writes to an ext3FS (that's on top of an LVM
 snapshot) mounted synchronously, in order to measure the real
 Copy-on-write overhead. So, I was wondering if I could do the same with
 ZFS. Seems not.

Given that ZFS does COW for *all* writes, what does this test actually intend 
to show when running on ZFS? Am I missing something, or shouldn't writes to 
a clone be as fast as, or even faster than, writes to a non-clone? Given that 
COW is always performed, but in the case of the clone the old data is not 
removed.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] On-failure policies for pools

2007-01-23 Thread Peter Schuller
Hello,

There have been comparisons posted here (and in general out there on the net) 
for various RAID levels and the chances of e.g. double failures. One problem 
that is rarely addressed though, is the various edge cases that significantly 
impact the probability of loss of data.

In particular, I am concerned about the relative likelihood of bad sectors on 
a drive, vs. entire-drive failure. On a raidz where uptime is not important, 
I would not want a dead drive + a single bad sector on another drive to cause 
loss of data, yet dead drive + bad sector is going to be a lot more likely 
than two dead drives within the same time window.

In many situations it may not feel worth it to move to a raidz2 just to avoid 
this particular case.

I would like a pool policy that allowed one to specify that at the moment a 
disk fails (where fails = considered faulty), all mutable I/O would be 
immediately stopped (returning I/O errors to userspace I presume), and any 
transaction in the process of being committed is rolled back. The result is 
that the drive that just failed completely will not go out of date 
immediately.

If one then triggers a bad block on another drive while resilvering with a 
replacement drive, you know that you have the failed drive as a last resort 
(given that a full-drive failure is unlikely to mean the drive was physically 
obliterated; perhaps the controller circuitry can be replaced or certain 
physical components can be replaced). In the case of raidz2, you effectively 
have another half level of redundancy.

Also, with either raidz/raidz2 one can imagine cases where a machine is booted 
with one or two drives missing (due to cabling issues for example); 
guaranteeing that no pool is ever online for writable operations (thus making 
abscent drives out of date) until the administrative explicitly asks for it, 
would greatly reduce the probability of data loss due to a bad block in this 
case aswell.

In short, if true irrevocable data loss is limited (assuming no software 
issues) to the complete obliteration of all data on n drives (for n levels of 
redundancy), or alternatively to the unlikely event of bad blocks coinciding 
on multiple drives, wouldn't reliability be significantly increased in cases 
where this is an acceptable practice?

Opinions?

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-11 Thread Peter Schuller
 The ZFS administration guide mentions this recommendation, but does not
 give any hint as to why. A reader may assume/believe it's just general
 advice, based on someone's opinion that with more than 9 drives, the
 statistical probability of failure is too high for raidz (or raid5).
 It's a shame the statement in the guide is not further qualified to
 actually explain that there is a concrete issue at play.

 I don't know if ZFS MAN pages should teach people about RAID.

 If somebody doesn't understand RAID basics then some kind of tool
 where you just specify pool of disk and have to choose from: space
 efficient, performance, non-redundant and that's it - all the rest
 will be hidden.

But the guide *does* make a recommendation, yet does not qualify it. And if 
there is a problem specific to ZFS that is NOT just an obvious result of some 
general principle, that's very relevant for the ZFS administration guide IMO 
(and man pages for that matter).

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Peter Schuller
 It's just a common sense advise - for many users keeping raidz groups
 below 9 disks should give good enough performance. However if someone
 creates raidz group of 48 disks he/she probable expects also
 performance and in general raid-z wouldn't offer one.

There is at least one reason for wanting more drives in the same 
raidz/raid5/etc: redundancy.

Suppose you have 18 drives. Having two raidz groups of 9 drives each is 
going to mean you are more likely to lose data than having a single raidz2 
consisting of 18 drives, since in the former case yes - two drives can go 
down, but only if they are the *right* two drives. In the latter case any two 
drives can go down.
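
The same trade-off with a smaller number of disks, for concreteness
(device names invented):

  # two 3-disk raidz vdevs: two failures are survivable only if they
  # hit different vdevs
  zpool create tank raidz c1d0 c2d0 c3d0 raidz c4d0 c5d0 c6d0

  # one 6-disk raidz2 vdev: any two failures are survivable
  zpool create tank raidz2 c1d0 c2d0 c3d0 c4d0 c5d0 c6d0

Both layouts give four disks' worth of usable space; the difference is
purely in which failure combinations you survive.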

The ZFS administration guide mentions this recommendation, but does not give 
any hint as to why. A reader may assume/believe it's just general adviced, 
based on someone's opinion that with more than 9 drives, the statistical 
probability of failure is too high for raidz (or raid5). It's a shame the 
statement in the guide is not further qualified to actually explain that 
there is a concrete issue at play.

(I haven't looked into the archives to find the previously mentioned 
discussion.)

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10

2007-01-08 Thread Peter Schuller
  Is this expected behavior? Assuming concurrent reads (not synchronous and
  sequential) I would naively expect an ndisk raidz2 pool to have a
  normalized performance of n for small reads.

 q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
 where such behavior in a hardware RAID array lead to corruption which
 was detected by ZFS.  No free lunch today, either.
  -- richard

I appreciate the advantage of checksumming, believe me. Though I don't see why 
this is directly related to the small read problem, other than that this is how 
the current implementation happens to work.

Is there some fundamental reason why one could not (though I understand one 
*would* not) keep a checksum on a per-disk basis, so that in the normal case 
one really could read from just one disk, for a small read? I realize it is 
not enough for a block to be self-consistent, but theoretically couldn't the 
block which points to the block in question contain multiple checksums for 
the various subsets on different disks, rather than just the one checksum for 
the entire block?

Not that I consider this a major issue; but since you pointed me to that 
article in response to my statement above...

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS and ZFS, a fine combination

2007-01-08 Thread Peter Schuller
 http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync() 
and synchronous writes from the application perspective; it will do *NOTHING* 
to lessen the correctness guarantee of ZFS itself, including in the case of a 
power outage?

This makes it more reasonable to actually disable the zil. But still, 
personally I would like to be able to tell the NFS server to simply not be 
standards compliant, so that I can keep the correct semantics on the lower 
layer (ZFS), and disable the behavior at the level where I actually want it 
disabled (the NFS server).
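
For reference: as far as I know, disabling the ZIL is currently a global
kernel tunable rather than a per-pool setting. From memory it is
something like the following in /etc/system, followed by a reboot
(double-check before relying on it, since it affects every pool on the
host):

  set zfs:zil_disable = 1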

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Adding disk to a RAID-Z?

2007-01-08 Thread Peter Schuller
 I want to setup a ZFS server with RAID-Z.  Right now I have 3 disks.  In 6
 months, I want to add a 4th drive and still have everything under RAID-Z
 without a backup/wipe/restore scenario.  Is this possible?

You can add additional storage to the same pool effortlessly, such that the 
pool will be striped across two raidz vdevs. You cannot (AFAIK) expand the 
raidz itself. End result is 9 disks, with 7 disks worth of effective storage 
capacity. The ZFS administration guide contains examples of doing exactly 
this, except I believe the examples use mirrors.

ZFS administration guide:

http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf
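
For example (device names invented), starting from a 3-disk raidz:

  zpool create tank raidz c1d0 c2d0 c3d0
  # later: grow the pool by striping in a second raidz vdev
  zpool add tank raidz c4d0 c5d0 c6d0

The original raidz itself stays three disks wide; the new space comes
from the second vdev.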

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10

2007-01-03 Thread Peter Schuller
 I've been using a simple model for small, random reads.  In that model,
 the performance of a raidz[12] set will be approximately equal to a single
 disk.  For example, if you have 6 disks, then the performance for the
 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
 dynamic stripe of 2-way mirrors will have a normalized performance of 6.
 I'd be very interested to see if your results concur.

Is this expected behavior? Assuming concurrent reads (not synchronous and 
sequential) I would naively expect an n-disk raidz2 pool to have a normalized 
performance of n for small reads.

Is there some reason why a small read on a raidz2 is not statistically very 
likely to require I/O on only one device? Assuming a non-degraded pool of 
course.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very strange performance patterns

2006-12-27 Thread Peter Schuller
 Short version: Pool A is fast, pool B is slow. Writing to pool A is
 fast. Writing to pool B is slow. Writing to pool B WHILE writing to
 pool A is fast on both pools. Explanation?

[snip]

For the archives, it is interesting to note that when I do not perform a 
local dd to the device, but instead use rsync over ssh (thus being limited 
by network/cpu), performance is fine. Writes burst out every now and then:

--  -  -  -  -  -  -
fast51.7G   246G  0  0  0  0
raid0124M  1.16T  0  0  0  0
--  -  -  -  -  -  -
fast51.7G   246G  0  0  0  0
raid0124M  1.16T  0  0  0  0
--  -  -  -  -  -  -
fast51.7G   246G  0  0  0  0
raid0124M  1.16T  0 56  0  7.08M
--  -  -  -  -  -  -
fast51.7G   246G  0  0  0  0
raid0124M  1.16T  0115  0  14.4M
--  -  -  -  -  -  -
fast51.7G   246G  0  0  0  0
raid0154M  1.16T  0 90  0  1.04M
--  -  -  -  -  -  -
fast51.7G   246G  0  0  0  0
raid0154M  1.16T  0  0  0  0

Performance is still not visibly stellar, but then the actual transfer is 
only 4 megs or so per second and I don't know how flushes are handled by zfs. 
But it's most definitely faster than the 4 mb/second seen when dd:ing to the 
pool.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Very strange performance patterns

2006-12-20 Thread Peter Schuller
Hello,

Short version: Pool A is fast, pool B is slow. Writing to pool A is
fast. Writing to pool B is slow. Writing to pool B WHILE writing to
pool A is fast on both pools. Explanation?

Long version:

I have an existing two-disk pool consisting of two SATA drives. Call
this pool pool1. This has always been as fast as I would have
expected; 30-50 MB/second write, 105 MB/second read.

I have now added four additional drives (A, B, C and D) to the
machine, which I wanted to use for a raidz. For initial testing I chose
a striped pool just to see what kind of performance I would get.

The initial two drives (pool1) are on their own controller. A and B
are on a second controller and C and D on a third controller. All of
the controllers are SiL3512.

Here comes the very interesting bit. For the purpose of the diagram
below, good performance means 20-50 mb/sec write and ~70-80 mb/sec
read. Bad performance means 3-5 mb/sec write and ~70-80
mb/sec read.

disk layout in pool | other I/O  |  performance

A + B   | none   |  good
C + D   | none   |  good
A + B + C + D   | none   |  bad
A + B + C   | none   |  bad 
A + B + C + D   | write to pool1 |  goodish (!!!)

(some tested combinations omitted)

In other words: Initially it looked like write performance went down
the drain as soon as I combined drives from multiple controllers into
one pool, while performance was fine as long as I stayed within one
controller.

However, writing to the slow A+B+C+D pool *WHILE ALSO WRITING TO
POOL1* actually *INCREASES* performance. The writes to pool1 and the
otherwise slow pool are not quite up to the normal good level, but
that is probably to be expected even under normal circumstances.

CPU usage during the writing (when slow) is almost non-existent. There
is no spike similar to what you seem to get every five seconds or so
normally (during transaction commits?).

Also, at least once I saw the write performance on the slow pool
spike at 19 mb/second for a single one-second period (zpool iostat) when I
initiated the write; then it went down again and remained very
constant, not really varying outside 3.4-4.5. Often EXACTLY at 3.96.

writing and reading means dd:ing (to /dev/null, from /dev/zero)
with bs=$((1024*1024)).
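
In other words, something like this (the mountpoint and count are just
examples):

  # write test
  dd if=/dev/zero of=/speedtest/bigfile bs=$((1024*1024)) count=4096
  # read test
  dd if=/speedtest/bigfile of=/dev/null bs=$((1024*1024))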

Pools created with zpool create speedtest c4d0 c5d0 c6d0 c7d0 and
variations of that for the different combinations. The pool with all
four drives is 1.16T in size.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS and write caching (SATA)

2006-12-12 Thread Peter Schuller
Hello,

my understanding is that ZFS is specifically designed to work with write 
caching, by instructing drives to flush their caches when a write barrier is 
needed. In fact, it even turns write caching on explicitly on devices it 
manages.

My question is of a practical nature: will this *actually* be safe on the 
average consumer grade SATA drive? I have seen offhand references to PATA 
drives generally not being trustworthy when it comes to this (SCSI therefore 
being recommended), but I have not been able to find information on the 
status of typical SATA drives.

While I do intend to perform actual powerloss tests, it would be interesting 
to hear from anybody whether it is generally expected to be safe.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and write caching (SATA)

2006-12-12 Thread Peter Schuller
 While I do intend to perform actual powerloss tests, it would be
 interesting to hear from anybody whether it is generally expected to be
 safe.

 Well if disks honor cache flush commands then it should be reliable
 whether it's a SATA or SCSI disk.

Yes. Sorry, I could have stated my question more clearly. What I am specifically 
concerned about is exactly that - whether your typical SATA drive *will* 
honor cache flush commands, as I understand a lot of PATA drives did/do not.

Googling tends to give very little concrete information on this since very few 
people actually seem to care about this. Since I wanted to confirm my 
understanding of ZFS semantics w.r.t. write caching anyway, I thought I might 
as well also ask about the general tendency among drives since, if anywhere, 
people here might know.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss