Re: [zfs-discuss] [zfs] Petabyte pool?

2013-03-16 Thread Bob Friesenhahn

On Sat, 16 Mar 2013, Kristoffer Sheather @ CloudCentral wrote:


Well, off the top of my head:

2 x Storage Heads, 4 x 10G, 256 GB RAM, 2 x Intel E5 CPUs
8 x 60-bay JBODs with 60 x 4 TB SAS drives
RAIDZ2 stripe over the 8 x JBODs

That should fit within 1 rack comfortably and provide 1 PB of storage.


What does one do for power?  What are the power requirements when the 
system is first powered on?  Can drive spin-up be staggered between 
JBOD chassis?  Does the server need to be powered up last so that it 
does not time out on the zfs import?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun X4200 Question...

2013-03-11 Thread Bob Friesenhahn

On Mon, 11 Mar 2013, Tiernan OToole wrote:


I know this might be the wrong place to ask, but hopefully someone can point me 
in the right direction...
I got my hands on a Sun X4200. It's the original one, not the M2, and has 2 
single-core Opterons, 4 GB RAM and 4 x 73 GB SAS disks...
But I don't know what to install on it... I was thinking of SmartOS, but the 
site mentions Intel support for VT, but nothing for
AMD... The Opterons don't have VT, so I won't be using Xen, but the Zones may be 
useful... 


OpenIndiana or OmniOS seem like the most likely candidates.

You can run VirtualBox on OpenIndiana and it should be able to work 
without VT extensions.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Huge Numbers of Illegal Requests

2013-03-06 Thread Bob Friesenhahn

On Tue, 5 Mar 2013, Ed Shipe wrote:


On 2 different OpenIndiana 151a7 systems, I'm showing a huge number of Illegal 
Requests.  There are no other apparent
issues; performance is fine, etc., etc.  Everything works great - what are these 
illegal requests?  My Google-fu is
failing me...


My system used to exhibit this problem so I opened Illumos issue 2998 
(https://www.illumos.org/issues/2998).  The weird thing is that the 
problem went away and has not returned.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-03-05 Thread Bob Friesenhahn

On Mon, 4 Mar 2013, Matthew Ahrens wrote:


Magic rsync options used:

  -a --inplace --no-whole-file --delete-excluded

This causes rsync to overwrite the file blocks in place rather than writing
to a new temporary file first.  As a result, zfs COW produces primitive
deduplication of at least the unchanged blocks (by writing nothing) while
writing new COW blocks for the changed blocks.


If I understand your use case correctly (the application overwrites
some blocks with the same exact contents), ZFS will ignore these
no-op writes only on recent Open ZFS (illumos / FreeBSD / Linux)
builds with checksum=sha256 and compression!=off.  AFAIK, Solaris ZFS
will COW the blocks even if their content is identical to what's
already there, causing the snapshots to diverge.


With these rsync options, rsync will only overwrite a block if the 
contents of the block have changed.  Rsync's notion of a block is 
different from zfs's, so there is not a perfect overlap.


Rsync does need to read files on the destination filesystem to see if 
they have changed.  If the system has sufficient RAM (and/or L2ARC) 
then files may still be cached from the previous day's run.  In most 
cases only a small subset of the total files are updated (at least on 
my systems) so the caching requirements are small.  Files updated on 
one day are more likely to be the ones updated on subsequent days.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-03-05 Thread Bob Friesenhahn

On Tue, 5 Mar 2013, David Magda wrote:

It's also possible to reduce the amount that rsync has to walk the entire
file tree.

Most folks simply do an rsync --options /my/source/ /the/dest/, but if
you use zfs diff, and parse/feed the output of that to rsync, then the
amount of thrashing can probably be minimized.  Especially useful for file
hierarchies that have very many individual files, so you don't have to stat()
every single one.


Zfs diff only works for zfs filesystems.  If one is using zfs 
filesystems then rsync may not be the best option.  In the real world, 
data may be sourced from many types of systems and filesystems.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-02-27 Thread Bob Friesenhahn

On Wed, 27 Feb 2013, Ian Collins wrote:

Magic rsync options used:

-a --inplace --no-whole-file --delete-excluded

This causes rsync to overwrite the file blocks in place rather than
writing to a new temporary file first.  As a result, zfs COW produces
primitive deduplication of at least the unchanged blocks (by writing
nothing) while writing new COW blocks for the changed blocks.


Do these options impact performance or reduce the incremental stream sizes?


I don't see any adverse impact on performance and incremental stream 
size is quite considerably reduced.


The main risk is that if the disk fills up you may end up with a 
corrupted file rather than just an rsync error.  However, the 
snapshots help because an earlier version of the file is likely 
available.


I just use -a --delete and the snapshots don't take up much space (compared 
with the incremental stream sizes).


That is what I used to do before I learned better.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2013-02-26 Thread Bob Friesenhahn

On Tue, 26 Feb 2013, hagai wrote:


for what is worth..
I had the same problem and found the answer here -
http://forums.freebsd.org/showthread.php?t=27207


Given enough sequential I/O requests, zfs mirrors behave very much 
like RAID-0 for reads.  Sequential prefetch is very important in order 
to avoid the latencies.


While this script may not work perfectly as is for FreeBSD, it was 
very good at discovering a zfs performance bug (since corrected) and 
is still an interesting exercise to see how ZFS ARC caching 
helps for re-reads.  See 
http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh. 
The script will exercise an initial uncached read from disks, and then 
a (hopefully) cached re-read from disks.  I think that it serves as a 
useful benchmark.
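
The core of that test can be approximated by hand with a few shell 
commands (a rough sketch; the pool, dataset, and file names are made up):

  # write a file of random data, then unmount/mount the filesystem so the
  # first read cannot be satisfied from the ARC
  dd if=/dev/urandom of=/tank/test/random.dat bs=128k count=20000
  zfs umount tank/test && zfs mount tank/test
  # initial (uncached) read from the disks
  dd if=/tank/test/random.dat of=/dev/null bs=128k
  # re-read; most requests should now come from the ARC
  dd if=/tank/test/random.dat of=/dev/null bs=128k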


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-02-26 Thread Bob Friesenhahn

On Tue, 26 Feb 2013, Gary Driggs wrote:


On Feb 26, 2013, at 12:44 AM, Sašo Kiselkov wrote:

  I'd also recommend that you go and subscribe to z...@lists.illumos.org, 
since this list is going to get shut
  down by Oracle next month.


Whose description still reads, "everything ZFS running on illumos-based 
distributions".


Even FreeBSD's zfs is now based on zfs from Illumos.  FreeBSD and 
Linux zfs developers contribute fixes back to zfs in Illumos.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-02-26 Thread Bob Friesenhahn

On Tue, 26 Feb 2013, Richard Elling wrote:

Consider using different policies for different data. For traditional file 
systems, you
had relatively few policy options: readonly, nosuid, quota, etc. With ZFS, 
dedup and
compression are also policy options. In your case, dedup for your media is not 
likely
to be a good policy, but dedup for your backups could be a win (unless you're 
using
something that already doesn't back up duplicate data -- e.g. most backup 
utilities).
A way to approach this is to think of your directory structure and create file 
systems
to match the policies. For example:


I am finding that rsync with the right options (to directly 
block-overwrite) plus zfs snapshots is providing me with pretty 
amazing deduplication for backups without even enabling 
deduplication in zfs.  Now backup storage goes a very long way.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-02-26 Thread Bob Friesenhahn

On Wed, 27 Feb 2013, Ian Collins wrote:

I am finding that rsync with the right options (to directly
block-overwrite) plus zfs snapshots is providing me with pretty
amazing deduplication for backups without even enabling
deduplication in zfs.  Now backup storage goes a very long way.


We do the same for all of our legacy operating system backups.  Take a 
snapshot then do an rsync - an excellent way of maintaining incremental 
backups for those.


Magic rsync options used:

  -a --inplace --no-whole-file --delete-excluded

This causes rsync to overwrite the file blocks in place rather than 
writing to a new temporary file first.  As a result, zfs COW produces 
primitive deduplication of at least the unchanged blocks (by writing 
nothing) while writing new COW blocks for the changed blocks.
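
A minimal sketch of the daily cycle (the source path and the backup 
dataset/mountpoint names below are hypothetical):

  rsync -a --inplace --no-whole-file --delete-excluded \
      /data/projects/ /backuppool/projects/
  zfs snapshot backuppool/projects@`date +%Y-%m-%d`

Because only the changed blocks get rewritten, each day's snapshot mostly 
references blocks that are already on disk.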


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is there performance penalty when adding vdev to existing pool

2013-02-20 Thread Bob Friesenhahn

On Thu, 21 Feb 2013, Sašo Kiselkov wrote:


On 02/21/2013 12:27 AM, Peter Wood wrote:

Will adding another vdev hurt the performance?


In general, the answer is: no. ZFS will try to balance writes to
top-level vdevs in a fashion that assures even data distribution. If
your data is equally likely to be hit in all places, then you will not
incur any performance penalties. If, OTOH, newer data is more likely to
be hit than old data, then yes, newer data will be served from fewer
spindles. In that case it is possible to do a send/receive of the affected
datasets into new locations and then rename them.


You have this reversed.  The older data is served from fewer spindles 
than data written after the new vdev is added. Performance with the 
newer data should be improved.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL

2013-02-16 Thread Bob Friesenhahn

On Fri, 15 Feb 2013, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
wrote:



So, I hear, in a couple weeks' time, opensolaris.org is shutting down.  What 
does that mean for this mailing list?  Should we
all be moving over to something at illumos or something?


There is an 'illumos-zfs' list for illumos.  Please see 
http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists for 
the available lists.  Most open discussion of zfs occurs on the 
illumos list, although there is also useful discussion on the 
freebsd-fs list at freebsd.org.



I'm going to encourage somebody in an official capacity at opensolaris to 
respond...

I'm going to discourage unofficial responses, like, illumos enthusiasts etc 
simply trying to get people to jump this list.


Good for you.  I am sure that Larry will be contacting you soon.

Previously Oracle announced and invited people to join their 
discussion forums, which are web-based and virtually dead.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL

2013-02-16 Thread Bob Friesenhahn

On Sat, 16 Feb 2013, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
wrote:


From: Tim Cook [mailto:t...@cook.ms]

That would be the logical decision, yes.  Not to poke fun, but did you really
expect an official response after YEARS of nothing from Oracle?  This is the
same company that refused to release any Java patches until the DHS issued
a national warning suggesting that everyone uninstall Java.


Well, yes.  We do have Oracle employees who contribute to this 
mailing list.  It is not accurate or fair to stereotype the whole 
company.  Oracle by itself is as large as some cities or countries.


Yes, these remaining employees do so because they still can.  Except 
for those employees brave enough to post to Illumos/OpenIndiana lists 
(there are some), there will be no more avenues remaining for 
unmoderated two-way communication with the outside world.  There have 
been some cases where people said unfavorable things about Oracle on 
this list.  Oracle needs to control its message, and the principal form 
of communication will be via private support calls authorized by 
service contracts and authorized corporate publications.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Bob Friesenhahn

On Mon, 21 Jan 2013, Jim Klimov wrote:


Yes, maybe there were more cool new things per year popping up
with Sun's concentrated engineering talent and financing, but now
it seems that most players - wherever they work now - took a pause
from the marathon, to refine what was done in the decade before.
And this is just as important as churning out innovations faster
than people can comprehend or audit or use them.


I am on most of the mailing lists where zfs is discussed and it is 
clear that significant issues/bugs are continually being discovered 
and fixed.  Fixes come from both the Illumos community and from 
outside it (e.g. from FreeBSD).


Zfs is already quite feature rich.  Many of us would lobby for 
bug fixes and performance improvements over features.


Sašo Kiselkov's LZ4 compression additions may qualify as features, 
yet they also offer rather profound performance improvements.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Bob Friesenhahn

On Sat, 19 Jan 2013, Stephan Budach wrote:


Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs are 
resilvering (which they had gone through yesterday as well) and I never got 
any error while resilvering. I have been all over the setup to find any 
glitch or bad part, but I couldn't come up with anything significant.


Doesn't this sound improbable, wouldn't one expect to encounter other chksum 
errors while resilvering is running?


I can't attest to chksum errors since I have yet to see one on my 
machines (have seen several complete disk failures, or disks faulted 
by the system though).  Checksum errors are bad and not seeing them 
should be the normal case.


Resilver may in fact be just verifying that the pool disks are 
coherent via metadata.  This might happen if the fiber channel is 
flapping.


Regarding the dire fiber channel issue, are you using fiber channel 
switches or direct connections to the storage array(s)?  If you are 
using switches, are they stable or are they doing something terrible 
like resetting?  Do you have duplex connectivity?  Have you verified 
that your FC HBA's firmware is correct?


Did you check for messages in /var/adm/messages which might indicate 
when and how FC connectivity has been lost?
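
On a Solaris-derived system, a rough starting point for digging (not 
specific to any particular HBA or array):

  grep -i fctl /var/adm/messages   # FC transport complaints, if any
  fmdump -eV | more                # low-level error telemetry events
  fmadm faulty                     # anything already diagnosed as faulty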


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Bob Friesenhahn

On Sat, 19 Jan 2013, Jim Klimov wrote:


On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)


I don't think that zfs would call it scrubbing unless the user 
requested scrubbing.  Unplugging a USB drive which is part of a mirror 
for a short while results in considerable activity when it is plugged 
back in.  It is as if zfs does not trust the device which was 
temporarily unplugged and does a full validation of it.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Bob Friesenhahn

On Sat, 19 Jan 2013, Stephan Budach wrote:


Just ignore the timestamp, as it seems that the time is not set correctly, 
but the dates match my two issues from today and thursday, which accounts for 
three days. I didn't catch that before, but it seems to clearly indicate a 
problem with the FC connection…


But, what do I make of this information?


I don't know, but the issue/problem seems to be below the zfs level so 
you need to fix that lower level before worrying about zfs.


Did you check for messages in /var/adm/messages which might indicate when 
and how FC connectivity has been lost?
Well, this is the most scary part to me.  Neither fmdump nor dmesg showed 
anything that would indicate a connectivity issue - at least not the last 
time.


Weird.  I wonder if multipathing is working for you at all.  With my 
direct-connect setup, if a path is lost, then there is quite a lot of 
messaging to /var/adm/messages.  I also see a lot of messaging related 
to multipathing when the system boots and first starts using the 
array.  However, with the direct-connect setup, the HBA can report 
problems immediately if it sees a loss of signal.  Your issues might 
be on the other side of the switch (on the storage array side) so the 
local HBA does not see the problem and timeouts are used.  Make sure 
to check the logs in your storage array to see if it is encountering 
resets or flapping connectivity.


Do you have duplex switches so that there are fully-redundant paths, 
or is only one switch used?
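
To check whether mpxio actually sees redundant paths, something like the 
following helps (a sketch; the logical-unit name is a placeholder and the 
output depends on the HBA and array):

  mpathadm list lu                      # logical units under multipath control
  mpathadm show lu <logical-unit-name>  # path count, path states, failover mode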


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Bob Friesenhahn

On Wed, 16 Jan 2013, Thomas Nau wrote:


Dear all
I've a question concerning possible performance tuning for both iSCSI access
and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
SSDs and 128G of main memory

The iSCSI access pattern (1 hour daytime average) looks like the following
(Thanks to Richard Elling for the dtrace script)


If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
volblocksize of 4K?  This seems like the most obvious improvement.
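
Note that volblocksize can only be set when a zvol is created, so this means 
making a new zvol and migrating the data onto it.  A sketch with made-up 
names and sizes:

  zfs create -V 200G -o volblocksize=4k tank/xenvol-4k
  # copy the data across (from the initiator side, or with dd/zfs send),
  # then repoint the iSCSI target at the new zvol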


[ stuff removed ]


For disaster recovery we plan to sync the pool as often as possible
to a remote location. Running send/receive after a day or so seems to take
a significant amount of time wading through all the blocks and we hardly
see network average traffic going over 45MB/s (almost idle 1G link).
So here's the question: would increasing/decreasing the volblocksize improve
the send/receive operation and what influence might show for the iSCSI side?


Matching the volume block size to what the clients are actually using 
(due to their filesystem configuration) should improve performance 
during normal operations and should reduce the number of blocks which 
need to be sent in the backup by reducing write amplification due to 
overlap blocks.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Bob Friesenhahn

On Thu, 17 Jan 2013, Peter Wood wrote:


Unless there is some other way to test what/where these write operations are 
applied.


You can install Brendan Gregg's DTraceToolkit and use it to find out 
who and what is doing all the writing.  1.2GB in an hour is quite a 
lot of writing.  If this is going continuously, then it may be causing 
more fragmentation in conjunction with your snapshots.


See http://www.brendangregg.com/dtrace.html.
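
For example, a couple of the toolkit scripts that are useful here (run as 
root from the DTraceToolkit directory):

  ./iosnoop     # per-I/O trace showing PID, process name, size and path
  ./rwtop       # top-like summary of read/write activity by process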

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Bob Friesenhahn

On Thu, 17 Jan 2013, Peter Wood wrote:


Great points Jim. I have requested more information on how the gallery share is 
being used and any temporary data will
be moved out of there.
About atime, it is set to on right now and I've considered turning it off, but 
I wasn't sure if this will affect
incremental zfs send/receive.

'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?


Zfs send does not care about atime.  The access time is useless other 
than as a way to see how long it has been since a file was accessed.


For local access (not true for NFS), zfs is lazy about updating atime 
on disk, so it may not be updated on disk until the next 
transaction group is written (e.g. up to 5 seconds) and thus it does not 
represent much actual load.  Without this behavior, the system could 
become unusable.


For NFS you should disable atime on the NFS client mounts.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Bob Friesenhahn

On Thu, 17 Jan 2013, Bob Friesenhahn wrote:


For NFS you should disable atime on the NFS client mounts.


This advice was wrong.  It needs to be done on the server side.
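
On the server that is a per-dataset property, e.g. (dataset name is 
hypothetical):

  zfs set atime=off tank/export/home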

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-16 Thread Bob Friesenhahn

On Wed, 16 Jan 2013, Peter Wood wrote:


Running zpool iostat -v (attachment zpool-IOStat.png) shows 1.22K write 
operations on the drives and 661 on the
ZIL. Compared to the other server (which is in way heavier use than this one) 
these numbers are extremely high.

Any idea how to debug any further?


Do some filesystems contain many snapshots?  Do some filesystems use 
small zfs block sizes?  Have the servers been used the same way?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)

2012-12-13 Thread Bob Friesenhahn

On Wed, 12 Dec 2012, Jamie Krier wrote:



I am thinking about switching to an Illumos distro, but wondering if this 
problem may be present there
as well. 


I believe that Illumos was forked before this new virtual memory 
sub-system was added to Solaris.  There have not been such reports on 
Illumos or OpenIndiana mailing lists and I don't recall seeing this 
issue in the bug trackers.


Illumos is not so good at dealing with huge memory systems, but 
perhaps it is also more stable.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1

2012-12-12 Thread Bob Friesenhahn

On Wed, 12 Dec 2012, sol wrote:


Hello

I've got a ZFS box running perfectly with an 8-port SATA card
using the marvell88sx driver in opensolaris-2009.

However when I try to run Solaris-11 it won't boot.
If I unplug some of the hard disks it might boot
but then none of them show up in 'format'
and none of them have configured status in 'cfgadm'
(and there's an error or hang if I try to configure them).

Does anyone have any suggestions how to solve the problem?


Since you were previously using opensolaris-2009, have you considered 
trying OpenIndiana oi_151a7 instead?  You could experiment by booting 
from the live CD and seeing if your disks show up.
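
From the live environment a quick, non-destructive check is:

  format </dev/null   # list the disks the kernel can see, then exit
  cfgadm -al          # show attachment points and their state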


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1

2012-12-12 Thread Bob Friesenhahn

On Wed, 12 Dec 2012, sol wrote:


Thanks for the reply.
I've just tried openindiana and it behaves identically -
disks attached to the mv88sx6081 don't show up as disks.
(and APIC error interrupt (status0=0, status1=40) is emitted at boot.)

I've tried some changes to /etc/system with no success
(sata_func_enable=0x5, ahci_msi_enabled=0, sata_max_queue_depth=1)

Is there anything else I can try?


If the SATA card you are using is a JBOD-style card (i.e. disks are 
portable to a different controller), are you able/willing to swap it 
for one that Solaris is known to support well?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 6Tb Database with ZFS

2012-12-02 Thread Bob Friesenhahn

On Sat, 1 Dec 2012, Fung Zheng wrote:


Hello,

Thanks for your reply. I forgot to mention that the doc "Configuring ZFS for an 
Oracle Database" was followed; this
includes the primarycache, logbias and recordsize properties. All the best practices 
were followed and my only doubt is the
arc_max parameter. I want to know if 24 GB is good enough for a 6 TB database. 
Has someone implemented something
similar? What value was used for arc_max?


As I recall, you can tune zfs_arc_max while the system is running, so 
you can easily adjust this while your database is running and observe 
the behavior without rebooting.  It is possible that my recollection 
is wrong though.  If my recollection is correct, then it is not so 
important to know what is good enough before starting to put your 
database in service.
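
If that recollection holds, the usual way to poke the value on a live Solaris 
system is via mdb.  A sketch (0x600000000 is simply 24 GB in bytes; try live 
kernel tuning on a non-production box first):

  # set zfs_arc_max to 24 GB in the running kernel
  echo 'zfs_arc_max/Z 0x600000000' | mdb -kw
  # read back the current value
  echo 'zfs_arc_max/E' | mdb -k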


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove disk

2012-12-02 Thread Bob Friesenhahn

On Sat, 1 Dec 2012, Jan Owoc wrote:



When I would like to change the disk, I would also like to change the disk
enclosure; I don't want to use the old one.


You didn't give much detail about the enclosure (how it's connected,
how many disk bays it has, how it's used etc.), but are you able to
power off the system and transfer all the disks at once?



And what happens if I have 24 or 36 disks to change?  It would take months to do
that.


Those are the current limitations of zfs. Yes, with 12x2TB of data to
copy it could take about a month.



You can create a brand new pool with the new chassis and use 'zfs 
send' to send a full snapshot of each filesystem to the new pool. 
After the bulk of the data has been transferred, take new snapshots 
and send the remainder.  This expects that both pools can be available 
at once.
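
A rough sketch of that sequence (pool and snapshot names are hypothetical):

  zfs snapshot -r oldpool@migrate1
  zfs send -R oldpool@migrate1 | zfs receive -Fdu newpool
  # later, once the bulk copy finishes, catch up with an incremental
  zfs snapshot -r oldpool@migrate2
  zfs send -R -i oldpool@migrate1 oldpool@migrate2 | zfs receive -Fdu newpool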


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS QoS and priorities

2012-11-29 Thread Bob Friesenhahn

On Thu, 29 Nov 2012, Jim Klimov wrote:


I've heard a claim that ZFS relies too much on RAM caching, but
implements no sort of priorities (indeed, I've seen no knobs to
tune those) - so that if the storage box receives many different
types of IO requests with different administrative weights in
the view of admins, it can not really throttle some IOs to boost
others, when such IOs have to hit the pool's spindles.

For example, I might want to have corporate webshop-related
databases and appservers to be the fastest storage citizens,
then some corporate CRM and email, then various lower priority
zones and VMs, and at the bottom of the list - backups.

AFAIK, now such requests would hit the ARC, then the disks if
needed - in no particular order. Well, can the order be made
particular with current ZFS architecture, i.e. by setting
some datasets to have a certain NICEness or another priority
mechanism?


QoS poses a problem.  Zfs needs to write a transaction group at a 
time.  During part of the TXG write cycle, zfs does not return any 
data.  Zfs writes TXGs quite hard so they fill the I/O channel.  Even 
if one orders the reads during the TXG write cycle, zfs will not 
return any data for part of the time.


There are really only a few solutions when resources might be limited:

  1. Use fewer resources
  2. Use resources more wisely
  3. Add more resources until problem goes away

I think that current zfs strives for #1 and QoS is option #2.  Quite 
often, option #3 is effective because problems just go away once 
enough resources are available.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64

2012-10-26 Thread Bob Friesenhahn

On Fri, 26 Oct 2012, Jerry Kemp wrote:


Thanks for the SIIG pointer, most of the stuff I had archived from this
list pointed to LSI products.

I poked around on the site and reviewed SIIG's SATA and SAS HBA.  I also
hit up their search engine.  I'm not implying I did an all inclusive
search, but nothing I came across on their site indicated any type of
Solaris or *Solaris distro support.


What is important is whether Solaris supports the card.  I have no idea if 
Solaris supports any of their cards.



Did I miss something on the site?  Or maybe one of their sales people
let you know this stuff worked with Solaris?  Or should it just work
as long as it meets SAS or SATA standards?


They might not even know what Solaris is.  Actually, they might since 
this outfit previously made the USB/FireWire combo card used in SPARC 
and Intel Sun workstations.  It seems likely that SATA boards would 
work if they support the standard AHCI interface.  I would not take 
any chance with unknown SAS.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64

2012-10-25 Thread Bob Friesenhahn

On Thu, 25 Oct 2012, Sašo Kiselkov wrote:


Look for Dell's 6Gbps SAS HBA cards. They can be had new for $100 and
are essentially rebranded LSI 9200-8e cards. Always try to look for OEM
cards with LSI, because buying directly from them is incredibly expensive.


Do these support eSATA?  It seems unlikely.

I purchased an eSATA card (from SIIG, http://www.siig.com/) with the 
intention to try it with Solaris 10 to see if it would work but have 
not tried plugging it in yet.


It seems likely that a number of cheap eSATA cards may work.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64

2012-10-25 Thread Bob Friesenhahn

On Thu, 25 Oct 2012, Sašo Kiselkov wrote:


On 10/25/2012 04:09 PM, Bob Friesenhahn wrote:

On Thu, 25 Oct 2012, Sašo Kiselkov wrote:


Look for Dell's 6Gbps SAS HBA cards. They can be had new for $100 and
are essentially rebranded LSI 9200-8e cards. Always try to look for OEM
cards with LSI, because buying directly from them is incredibly
expensive.


Do these support eSATA?  It seems unlikely.


eSATA is just SATA with a different connector - all you need is a cheap
conversion cable or appropriate eSATA-SATA bracket, e.g.
http://www.satacables.com/html/sata-pci-brackets.html


While this can certainly work, according to Wikipedia 
(http://en.wikipedia.org/wiki/Esata#eSATA), eSATA is more than just 
SATA with a different connector.  eSATA specifies a higher voltage 
range (minimum voltage) than SATA.  It may be that an HBA already uses 
this range, or maybe not.  Text I read says that the maximum cable length 
is significantly reduced if an adaptor is used.


Also, I am curious to know how well hot-swap works with an 
enterprise-class SAS HBA and these cheap eSATA adaptors.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] all in one server

2012-09-18 Thread Bob Friesenhahn

On Tue, 18 Sep 2012, Erik Ableson wrote:


The bigger issue you'll run into will be data sizing as a year's 
worth of snapshot basically means that you're keeping a journal of 
every single write that's occurred over the year. If you are running


The above is not a correct statement.  The snapshot only preserves the 
file-level differences between the points in time.  A snapshot does 
not preserve every single write.  Zfs does not even send every 
single write to the underlying disk.  In some usage models, the same file 
may be re-written 100 times between snapshots, or might not ever 
appear in any snapshot.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zvol vs zfs send/zfs receive

2012-09-15 Thread Bob Friesenhahn

On Sat, 15 Sep 2012, Dave Pooser wrote:


  The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
That... doesn't look right. (Comparing zfs list -t snapshot and looking at
the 5.34 ref for the snapshot vs zfs list on the new system and looking at
space used.)

Is this a problem? Should I be panicking yet?


Does the old pool use 512 byte sectors while the new pool uses 4K 
sectors?  Is there any change to compression settings?


With volblocksize of 8k on disks with 4K sectors one might expect very 
poor space utilization because metadata chunks will use/waste a 
minimum of 4k.  There might be more space consumed by the metadata 
than the actual data.
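
One way to check what sector size each pool was built for is to look at the 
vdev ashift values (ashift=9 means 512-byte sectors, ashift=12 means 4K; the 
pool names here are placeholders):

  zdb -C oldpool | grep ashift
  zdb -C newpool | grep ashift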


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL iops expectations

2012-08-11 Thread Bob Friesenhahn

On Sat, 11 Aug 2012, Chris Nagele wrote:


So far, running gnu dd with 512b and oflag=sync, the most we can get is 8k
iops on the zil device. I even tried with some SSDs (Crucial M4,


If this is one dd program running, then all you are measuring is 
sequential IOPS.  That is, the next I/O will not start until the 
previous one has returned.  What you want to test is threaded IOPS 
with some number of threads (each one represents a client) running. 
You can use iozone to effectively test that.


This command runs with 16 threads and 8k blocks with a 2GB file:

  iozone -m -t 16 -T -O -r 8k -o -s 2G

If you 'dd' from /dev/zero then the test is meaningless since zfs is 
able to compress zeros.  If you 'dd' from /dev/random then the test is 
meaningless since the random generator is slow.



Is this the expected result? Should I be pushing for more? In IRC I
was told that I should be able to get 12k no problem. We are running
NFS in a heavily used environment with millions of very small files,
so low latency counts.


Your test method is not valid.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-07 Thread Bob Friesenhahn

On Tue, 7 Aug 2012, Sašo Kiselkov wrote:


MLC is so much cheaper that you can simply slap on twice as much and use
the rest for ECC, mirroring or simply overprovisioning sectors. The
common practice to extending the lifecycle of MLC is by short-stroking
it, i.e. using only a fraction of the capacity. E.g. a 40GB MLC unit
with 5-10k cycles per cell can be turned into a 4GB unit (with the
controller providing wear leveling) with effectively 50-100k cycles
(that's SLC land) for about a hundred bucks. Also, since I'm mirroring
it already with ZFS checksums to provide integrity checking, your
argument simply doesn't hold up.


Remember he also said that the current product is based principally on 
an FPGA.  This FPGA must be interfacing directly with the Flash device 
so it would need to be substantially redesigned to deal with MLC Flash 
(probably at least an order of magnitude more complex), or else a 
microcontroller would need to be added to the design, and firmware 
would handle the substantial complexities.  If the Flash device writes 
slower, then the power has to stay up longer.  If the Flash device 
reads slower, then it takes longer for the drive to come back on 
line.


Quite a lot of product would need to be sold in order to pay for both 
re-engineering and the cost of running a business.


Regardless, continual product re-development is necessary or else it 
will surely die.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-06 Thread Bob Friesenhahn

On Mon, 6 Aug 2012, Christopher George wrote:


Intel's brief also clears up a prior controversy of what types of
data are actually cached, per the brief it's both user and system
data!


I am glad to hear that both user AND system data is stored.  That is 
rather reassuring. :-)


Is your DDRDrive product still supported and moving?  Is it well 
supported for Illumos?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-06 Thread Bob Friesenhahn

On Mon, 6 Aug 2012, Stefan Ring wrote:


Intel's brief also clears up a prior controversy of what types of
data are actually cached, per the brief it's both user and system
data!


So you're saying that SSDs don't generally flush data to stable medium
when instructed to? So data written before an fsync is not guaranteed
to be seen after a power-down?

If that -- ignoring cache flush requests -- is the whole reason why
SSDs are so fast, I'm glad I haven't got one yet.


Testing has shown that many SSDs do not flush the data prior to 
claiming that they have done so.  The flush request may hasten the 
time until the next actual cache flush.


As far as I am aware, Intel does not sell any enterprise-class SSDs 
even though they have sold some models with 'E' in the name.  True 
enterprise SSDs can cost 5-10X the price of larger consumer models.


A battery-backed RAM cache with Flash backup can be a whole lot faster 
and still satisfy many users.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-04 Thread Bob Friesenhahn

On Fri, 3 Aug 2012, Neil Perrin wrote:


For the slog, you should look for a SLC technology SSD which saves 
unwritten data on power failure.  In Intel-speak, this is called Enhanced 
Power Loss Data Protection.  I am not running across any Intel SSDs which 
claim to match these requirements.


- That shouldn't be necessary. ZFS flushes the write cache for any device 
written before returning from the synchronous request to ensure data 
stability.


Yes, but the problem is that the write IOPS go way, way down (and 
device lifetime suffers) if the device is not able to perform write 
caching.  A consumer-grade device advertising 70K write IOPS is 
definitely not going to offer anything like that if it actually 
flushes its cache when requested.  A device with a reserve of energy 
sufficient to write its cache to backing FLASH on power fail will be 
able to defer cache flush requests.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-03 Thread Bob Friesenhahn

On Fri, 3 Aug 2012, Karl Rossing wrote:

I'm looking at 
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-ssd.html 
wondering what I should get.


Are people getting intel 330's for l2arc and 520's for slog?


For the slog, you should look for a SLC technology SSD which saves 
unwritten data on power failure.  In Intel-speak, this is called 
Enhanced Power Loss Data Protection.  I am not running across any 
Intel SSDs which claim to match these requirements.


Extreme write IOPS claims in consumer SSDs are normally based on large 
write caches which can lose even more data if there is a power 
failure.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-30 Thread Bob Friesenhahn

On Mon, 30 Jul 2012, Roy Sigurd Karlsbakk wrote:


Should OI/Illumos be able to boot cleanly without manual action with 
the SLOG devices gone?


If this is allowed, then data may be unnecessarily lost.

When the drives are not all in one chassis, then it is not uncommon 
for one chassis to not come up immediately, or be slow to come up when 
recovering from a power failure.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-29 Thread Bob Friesenhahn

On Sun, 29 Jul 2012, Jim Klimov wrote:


 Would extra copies on larger disks actually provide the extra
reliability, or only add overheads and complicate/degrade the
situation?


My opinion is that complete hard drive failure and block-level media 
failure are two totally different things.  Complete hard drive failure 
rates should not be directly related to total storage size whereas the 
probability of media failure per drive is directly related to total 
storage size.  Given this, and assuming that complete hard drive 
failure occurs much less often than partial media failure, using the 
copies feature should be pretty effective.



 Would the use of several copies cripple the write speeds?


It would reduce the write rate by 1/2, or by whatever factor matches the 
number of copies you have requested.
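
For reference, copies is a per-dataset property and only applies to data 
written after it is set (dataset name is made up):

  zfs set copies=2 tank/important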


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-29 Thread Bob Friesenhahn

On Sun, 29 Jul 2012, Jim Klimov wrote:


 For several times now I've seen statements on this list implying
that a dedicated ZIL/SLOG device catching sync writes for the log,
also allows for more streamlined writes to the pool during normal
healthy TXG syncs, than is the case with the default ZIL located
within the pool.


After reading what some others have posted, I should point out that zfs 
always has a ZIL (unless it is specifically disabled for testing). 
If it does not have a dedicated ZIL, then it uses the disks in the 
main pool to construct the ZIL.  Dedicating a device to the ZIL should 
not improve the pool storage layout because the pool already had a 
ZIL.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IO load questions

2012-07-24 Thread Bob Friesenhahn

On Tue, 24 Jul 2012, matth...@flash.shanje.com wrote:


~50,000 IOPS 4k random read.  200MB/sec, 30% CPU utilization on Nexenta, ~90% 
utilization on guest OS.  I’m guessing guest OS is bottlenecking.  Going to 
try physical hardware next week
~25,000 IOPS 4k random write.  100MB/sec, ~70% CPU utilization on Nexenta, 
~45% CPU utilization on guest OS.  Feels like Nexenta CPU is bottleneck. Load 
average of 2.5


A quick test with 128k recordsizes and 128k IO looked to be 400MB/sec 
performance, can’t remember CPU utilization on either side. Will retest and 
report those numbers.


It feels like something is adding more overhead here than I would expect on 
the 4k recordsizes/IO workloads.  Any thoughts where I should start on this? 
I’d really like to see closer to 10Gbit performance here, but it seems like 
the hardware isn’t able to cope with it?


All systems have a bottleneck.  You are highly unlikely to get close 
to 10Gbit performance with 4k random synchronous write.  25K IOPS 
seems pretty good to me.


The 2.4GHz clock rate of the 4-core Xeon CPU you are using is not 
terribly high.  Performance is likely better with a higher-clocked 
more modern design with more cores.


Verify that the zfs checksum algorithm you are using is a low-cost one 
and that you have not enabled compression or deduplication.
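
A quick way to verify those settings (dataset name is a placeholder):

  zfs get checksum,compression,dedup tank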


You did not tell us how your zfs pool is organized so it is impossible 
to comment more.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on 4k sectors

2012-07-23 Thread Bob Friesenhahn

On Mon, 23 Jul 2012, Anonymous Remailer (austria) wrote:


The question was relative to some older boxes running S10 and not planning
to upgrade the OS, keeping them alive as long as possible...


Recent Solaris 10 kernel patches are addressing drives with 4k 
sectors.  It seems that Solaris 10 will work with drives with 4k 
sectors so Solaris 10 users will not be stuck.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-21 Thread Bob Friesenhahn

On Sat, 21 Jul 2012, Jim Klimov wrote:

During this quick test I did not manage to craft a test which
would inflate a file in the middle without touching its other
blocks (other than using a text editor which saves the whole
file - so that is irrelevant), in order to see if ZFS can
insert smaller blocks in the middle of an existing file,
and whether it would reallocate other blocks to fit the set
recordsizes.


The POSIX filesystem interface does not support such a thing 
('insert').  Presumably the underlying zfs pool could support such a 
thing if there was a layer on top to request it. The closest 
equivalent in a POSIX filesystem would be if a previously-null block 
in a sparse file is updated to hold content.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Bob Friesenhahn

On Wed, 18 Jul 2012, Michael Traffanstead wrote:


I have an 8 drive ZFS array (RAIDZ2 - 1 Spare) using 5900rpm 2TB SATA drives 
with an hpt27xx controller under FreeBSD 10
(but I've seen the same issue with FreeBSD 9).

The system has 8gigs and I'm letting FreeBSD auto-size the ARC.

Running iozone (from ports), everything is fine for file sizes up to 8GB, but 
when it runs with a 16GB file the random write
performance plummets using 64K record sizes.


This is normal.  The problem is that with zfs 128k block sizes, zfs 
needs to re-read the original 128k block so that it can compose and 
write the new 128k block.  With sufficient RAM, this is normally 
avoided because the original block is already cached in the ARC.


If you were to reduce the zfs blocksize to 64k then the performance 
dive at 64k would go away but there would still be write performance 
loss at sizes other than a multiple of 64k.
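
For reference, recordsize is set per filesystem and only affects files 
written after the change (dataset name is hypothetical):

  zfs set recordsize=64k tank/iotest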


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any company willing to support a 7410 ?

2012-07-19 Thread Bob Friesenhahn

On Thu, 19 Jul 2012, Gordon Ross wrote:


On Thu, Jul 19, 2012 at 5:38 AM, sol a...@yahoo.com wrote:

Other than Oracle do you think any other companies would be willing to take
over support for a clustered 7410 appliance with 6 JBODs?

(Some non-Oracle names which popped out of google:
Joyent/Coraid/Nexenta/Greenbytes/NAS/RackTop/EraStor/Illumos/???)



I'm not sure, but I think there are people running NexentaStor on that h/w.
If not, then on something pretty close.  NS supports clustering, etc.


You would lose the fancy user interface and monitoring stuff that the 
Fishworks team developed for the product.  It would no longer be an 
appliance.


No doubt, Nexenta has developed new cool stuff for NexentaStor.

As others have said, only Oracle is capable of supporting the system 
as the original product.  It could be re-installed to become something 
else.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Bob Friesenhahn

On Fri, 20 Jul 2012, Jim Klimov wrote:


I am not sure if I misunderstood the question or Bob's answer,
but I have a gut feeling it is not fully correct: ZFS block
sizes for files (filesystem datasets) are, at least by default,
dynamically-sized depending on the contiguous write size as
queued by the time a ZFS transaction is closed and flushed to
disk. In case of RAIDZ layouts, this logical block is further


Zfs data block sizes are fixed size!  Only tail blocks are shorter.

The underlying representation (how the data block gets stored) depends 
on whether compression, raidz, deduplication, etc., are used.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-17 Thread Bob Friesenhahn

On Tue, 17 Jul 2012, Michael Hase wrote:


If you were to add a second vdev (i.e. stripe) then you should see very 
close to 200% due to the default round-robin scheduling of the writes.


My expectation would be > 200%, as 4 disks are involved. It may not be the 
perfect 4x scaling, but imho it should be (and is for a scsi system) more 
than half of the theoretical throughput. This is solaris or a solaris 
derivative, not linux ;-)


Here are some results from my own machine based on the 'virgin mount' 
test approach.  The results show less boost than is reported by a 
benchmark tool like 'iozone' which sees benefits from caching.


I get an initial sequential read speed of 657 MB/s on my new pool 
which has 1200 MB/s of raw bandwidth (if mirrors could produce 100% 
boost).  Reading the file a second time reports 6.9 GB/s.


The below is with a 2.6 GB test file but with a 26 GB test file (just 
add another zero to 'count' and wait longer) I see an initial read 
rate of 618 MB/s and a re-read rate of 8.2 GB/s.  The raw disk can 
transfer 150 MB/s.


% zpool status
   pool: tank
  state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
 still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
 pool will no longer be accessible on older software versions.
   scan: scrub repaired 0 in 0h10m with 0 errors on Mon Jul 16 04:30:48 2012
config:

 NAME  STATE READ WRITE CKSUM
 tank  ONLINE   0 0 0
   mirror-0ONLINE   0 0 0
 c7t5393E8CA21FAd0p0   ONLINE   0 0 0
 c11t5393D8CA34B2d0p0  ONLINE   0 0 0
   mirror-1ONLINE   0 0 0
 c8t5393E8CA2066d0p0   ONLINE   0 0 0
 c12t5393E8CA2196d0p0  ONLINE   0 0 0
   mirror-2ONLINE   0 0 0
 c9t5393D8CA82A2d0p0   ONLINE   0 0 0
 c13t5393E8CA2116d0p0  ONLINE   0 0 0
   mirror-3ONLINE   0 0 0
 c10t5393D8CA59C2d0p0  ONLINE   0 0 0
 c14t5393D8CA828Ed0p0  ONLINE   0 0 0

errors: No known data errors
% pfexec zfs create tank/zfstest
% pfexec zfs create tank/zfstest/defaults
% cd /tank/zfstest/defaults
% pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
% cd ..
% pfexec zfs umount tank/zfstest/defaults
% pfexec zfs mount tank/zfstest/defaults
% cd defaults
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
% pfexec dd if=/dev/rdsk/c7t5393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
% bc
scale=8
657/150
4.3800

It is very difficult to benchmark with a cache which works so well:

% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem: Disconnected command timeout for Target X

2012-07-17 Thread Bob Friesenhahn

On Tue, 17 Jul 2012, Roberto Scudeller wrote:


Hi all,

I'm using Opensolaris snv_134 with LSI Controllers and a motherboard 
supermicro, with 20 sata disks, zfs in raid-10 conf. I mounted this zfs_storage 
with
NFS.
I'm not opensolaris specialist. What're the commands to show hardware 
information? Like 'lshw' in linux but for opensolaris.


cfgadm, prtconf, prtpicl, prtdiag

zpool status

fmadm faulty

It sounds like you may have a broken cable or power supply failure to 
some disks.


Bob



The storage stopped working, but ping responds. SSH and NFS is out. When I open 
the console showing this messages:

Jul  2 13:00:27 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:00:27 storage    Disconnected command timeout for Target 4
Jul  2 13:01:28 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:01:28 storage    Disconnected command timeout for Target 3
Jul  2 13:02:28 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:02:28 storage    Disconnected command timeout for Target 2
Jul  2 13:03:29 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:03:29 storage    Disconnected command timeout for Target 1
Jul  2 13:04:29 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:04:29 storage    Disconnected command timeout for Target 0
Jul  2 13:05:40 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:05:40 storage    Disconnected command timeout for Target 6
Jul  2 13:06:40 storage scsi: [ID 107833 kern.warning] WARNING: 
/pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
Jul  2 13:06:40 storage    Disconnected command timeout for Target 5

Any ideas? Could help me?

--
Roberto Scudeller






--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-17 Thread Bob Friesenhahn

On Tue, 17 Jul 2012, Michael Hase wrote:


The below is with a 2.6 GB test file but with a 26 GB test file (just add 
another zero to 'count' and wait longer) I see an initial read rate of 618 
MB/s and a re-read rate of 8.2 GB/s.  The raw disk can transfer 150 MB/s.


To work around these caching effects just use a file > 2 times the size of 
ram, iostat then shows the numbers really coming from disk. I always test 
like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but 
quite impressive ;-)


Yes, in the past I have done benchmarking with file size 2X the size 
of memory.  This does not necessary erase all caching because the ARC 
is smart enough not to toss everything.


At the moment I have an iozone benchark run up from 8 GB to 256 GB 
file size.  I see that it has started the 256 GB size now.  It may be 
a while.  Maybe a day.


In the range of > 600 MB/s other issues may show up (pcie bus contention, hba 
contention, cpu load). And performance at this level could be just good 
enough, not requiring any further tuning. Could you recheck with only 4 disks 
(2 mirror pairs)? If you just get some 350 MB/s it could be the same problem 
as with my boxes. All sata disks?


Unfortunately, I already put my pool into use and can not conveniently 
destroy it now.


The disks I am using are SAS (7200 RPM, 1 GB) but return similar 
per-disk data rates as the SATA disks I use for the boot pool.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-17 Thread Bob Friesenhahn

On Tue, 17 Jul 2012, Michael Hase wrote:

To work around these caching effects just use a file > 2 times the size of 
ram, iostat then shows the numbers really coming from disk. I always test 
like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but 
quite impressive ;-)


Ok, the iozone benchmark finally completed.  The results do suggest 
that reading from mirrors substantially improves the throughput. 
This is interesting since the results differ (better than) from my 
'virgin mount' test approach:


Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 8G -g 256G

  KB  reclen   write rewritereadreread
 8388608  64  572933 1008668  6945355  7509762
 8388608 128 2753805 2388803  6482464  7041942
 8388608 256 2508358 2331419  2969764  3045430
 8388608 512 2407497 2131829  3021579  3086763
16777216  64  671365  879080  6323844  6608806
16777216 128 1279401 2286287  6409733  6739226
16777216 256 2382223 2211097  2957624  3021704
16777216 512 2237742 2179611  3048039  3085978
33554432  64  933712  699966  6418428  6604694
33554432 128  459896  431640  6443848  6546043
33554432 256  90  430989  2997615  3026246
33554432 512  427158  430891  3042620  3100287
67108864  64  426720  427167  6628750  6738623
67108864 128  419328  422581  153  6743711
67108864 256  419441  419129  3044352  3056615
67108864 512  431053  417203  3090652  3112296
   134217728  64  417668   55434   759351   760994
   134217728 128  409383  400433   759161   765120
   134217728 256  408193  405868   763892   766184
   134217728 512  408114  403473   761683   766615
   268435456  64  418910   55239   768042   768498
   268435456 128  408990  399732   763279   766882
   268435456 256  413919  399386   760800   764468
   268435456 512  410246  403019   766627   768739

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem: Disconnected command timeout for Target X

2012-07-17 Thread Bob Friesenhahn

On Tue, 17 Jul 2012, Roberto Scudeller wrote:


Hi Bob,

Thanks for the answers.

How do I test your theory?


I would use 'dd' to see if it is possible to transfer data from one of 
the problem devices.  Gain physical access to the system and check the 
signal and power cables to these devices closely.


Use 'iostat -xe' to see what error counts have accumulated.  Also 
'iostat -E'.
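
For example (the device name below is only a placeholder -- use the 
c#t#d# names that map to the timing-out targets):

  % pfexec dd if=/dev/rdsk/c4t2d0p0 of=/dev/null bs=1024k count=1000
  % iostat -xe 5 3
  % iostat -E

If the dd crawls or hangs on one device but not on its neighbours, that 
points at the drive, its slot, or its cabling rather than at the pool.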



In this case, I use common disks SATA 2, not Nearline SAS (NL SATA) or SAS. Do 
you think the disks SATA are the problem?


There have been reports of congestion leading to timeouts and resets 
when SATA disks are on expanders.  There have also been reports that 
one failing disk can cause problems when on expanders.  Regardless, if 
this system has been previously operating fine for some time, these 
errors would indicate a change in the hardware shared by all these 
devices.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Bob Friesenhahn

On Mon, 16 Jul 2012, Stefan Ring wrote:


I wouldn't expect mirrored read to be faster than single-disk read,
because the individual disks would need to read small chunks of data
with holes in-between. Regardless of the holes being read or not, the
disk will spin at the same speed.


It is normal for reads from mirrors to be faster than for a single 
disk because reads can be scheduled from either disk, with different 
I/Os being handled in parallel.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Bob Friesenhahn

On Mon, 16 Jul 2012, Stefan Ring wrote:


It is normal for reads from mirrors to be faster than for a single disk
because reads can be scheduled from either disk, with different I/Os being
handled in parallel.


That assumes that there *are* outstanding requests to be scheduled in
parallel, which would only happen with multiple readers or a large
read-ahead buffer.


That is true.  Zfs tries to detect the case of sequential reads and 
requests to read more data than the application has already requested. 
In this case the data may be prefetched from the other disk before the 
application has requested it.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Bob Friesenhahn

On Mon, 16 Jul 2012, Michael Hase wrote:


This is my understanding of zfs: it should load balance read requests even 
for a single sequential reader. zfs_prefetch_disable is the default 0. And I 
can see exactly this scaling behaviour with sas disks and with scsi disks, 
just not on this sata pool.


Is the BIOS configured to use AHCI mode or is it using IDE mode?

Are the disks 512 byte/sector or 4K?

Maybe it's a corner case which doesn't matter in real world applications? The 
random seek values in my bonnie output show the expected performance boost 
when going from one disk to a mirrored configuration. It's just the 
sequential read/write case, that's different for sata and sas disks.


I don't have a whole lot of experience with SATA disks but it is my 
impression that you might see this sort of performance if the BIOS was 
configured so that the drives were used as IDE disks.  If not that, 
then there must be a bottleneck in your hardware somewhere.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Bob Friesenhahn

On Tue, 17 Jul 2012, Michael Hase wrote:


So only one thing left: mirror should read 2x


I don't think that mirror should necessarily read 2x faster even 
though the potential is there to do so.  Last I heard, zfs did not 
include a special read scheduler for sequential reads from a mirrored 
pair.  As a result, 50% of the time, a read will be scheduled for a 
device which already has a read scheduled.  If this is indeed true, 
the typical performance would be 150%.  There may be some other 
scheduling factor (e.g. estimate of busyness) which might still allow 
zfs to select the right side and do better than that.


If you were to add a second vdev (i.e. stripe) then you should see 
very close to 200% due to the default round-robin scheduling of the 
writes.


It is really difficult to measure zfs read performance due to caching 
effects.  One way to do it is to write a large file (containing random 
data such as returned from /dev/urandom) to a zfs filesystem, unmount 
the filesystem, remount the filesystem, and then time how long it 
takes to read the file once.  The reason why this works is because 
remounting the filesystem restarts the filesystem cache.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Tue, 10 Jul 2012, Edward Ned Harvey wrote:


CPU's are not getting much faster.  But IO is definitely getting faster.  It's 
best to keep ahead of that curve.


It seems that per-socket CPU performance is doubling every year. 
That seems like faster to me.


If server CPU chipsets offer accelleration for some type of standard 
encryption, then that needs to be considered.  The CPU might not need 
to do the encryption the hard way.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.


If an algorithm is not 'secure' and zfs is not set to verify, doesn't 
that mean that a knowledgeable user will be able to cause intentional 
data corruption if deduplication is enabled?  A user with very little 
privilege might be able to cause intentional harm by writing the magic 
data block before some other known block (which produces the same 
hash) is written.  This allows one block to substitute for another.


It does seem that security is important because with a human element, 
data is not necessarily random.
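
If someone still wants deduplication despite this, a partial mitigation 
(a sketch; the pool name is only an example) is to force byte-for-byte 
verification so that a hash match alone can never substitute a block:

  % pfexec zfs set dedup=sha256,verify tank

The verify costs an extra read on every dedup hit, but it closes the 
collision-substitution hole regardless of the hash chosen.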


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Joerg Schilling wrote:


Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:


On Tue, 10 Jul 2012, Edward Ned Harvey wrote:


CPU's are not getting much faster.  But IO is definitely getting faster.  It's 
best to keep ahead of that curve.


It seems that per-socket CPU performance is doubling every year.
That seems like faster to me.


This would only apply, if you implement a multi threaded hash.


While it is true that the per-block hash latency does not improve much 
with new CPUs (and may even regress), given multiple I/Os at once, 
hashes may be be computed by different cores and so it seems that 
total system performance will scale with per-socket CPU performance. 
Even with a single stream of I/O, multiple zfs blocks will be read or 
written so mutiple block hashes may be computed at once on different 
cores.


Server OSs like Solaris have been focusing on improving total system 
throughput rather than single-threaded bandwidth.


I don't mean to discount the importance of this effort though.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:


The reason why I don't think this can be used to implement a practical
attack is that in order to generate a collision, you first have to know
the disk block that you want to create a collision on (or at least the
checksum), i.e. the original block is already in the pool. At that
point, you could write a colliding block which would get de-dup'd, but
that doesn't mean you've corrupted the original data, only that you
referenced it. So, in a sense, you haven't corrupted the original block,
only your own collision block (since that's the copy that doesn't get written).


This is not correct.  If you know the well-known block to be written, 
then you can arrange to write your collision block prior to when the 
well-known block is written.  Therefore, it is imperative that the 
hash algorithm make it clearly impractical to take a well-known block 
and compute a collision block.


For example, the well-known block might be part of a Windows 
anti-virus package, or a Windows firewall configuration, and 
corrupting it might leave a Windows VM open to malware attack.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris derivate with the best long-term future

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Eugen Leitl wrote:


It would be interesting to see when zpool versions 28 will
be available in the open forks. Particularly encryption is
a very useful functionality.


Illumos advanced to zpool version 5000 and this is available in the 
latest OpenIndiana development release.  Does that make you happy?


As far as which Solaris derivate has the best future, it is clear that 
Illumos has a lot of development energy right now and there is little 
reason to believe that this energy will cease.  Illumos-derived 
distributions may come and go but it looks like Illumos has a future, 
particularly once it frees itself from all Sun-derived binary 
components.


Oracle continues with Solaris 11 and does seem to be funding necessary 
driver and platform support.  User access to Solaris 11 may be 
abitrarily limited.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:


For example, the well-known block might be part of a Windows anti-virus
package, or a Windows firewall configuration, and corrupting it might
leave a Windows VM open to malware attack.


True, but that may not be enough to produce a practical collision for
the reason that while you know which bytes you want to attack, these
might not line up with ZFS disk blocks (especially the case with Windows
VMs which are stored in large opaque zvols) - such an attack would
require physical access to the machine (at which point you can simply
manipulate the blocks directly).


I think that well-known blocks are much easier to predict than you say 
because operating systems, VMs, and application software behave in 
predictable patterns.  However, deriving another useful block which 
hashes the same should be extremely difficult and any block hashing 
algorithm needs to assure that.  Having an excellent random 
distribution property is not sufficient if it is relatively easy to 
compute some other block producing the same hash.  It may be useful to 
compromise a known block even if the compromised result is complete 
garbage.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Richard Elling wrote:


The last studio release suitable for building OpenSolaris is available in the 
repo.
See the instructions 
at http://wiki.illumos.org/display/illumos/How+To+Build+illumos


Not correct as far as I can tell.  You should re-read the page you 
referenced.  Oracle rescinded (or lost) the special Studio releases 
needed to build the OpenSolaris kernel.  The only way I can see to 
obtain these releases is illegally.


However, Studio 12.3 (free download) produces user-space executables 
which run fine under Illumos.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Hung-Sheng Tsao (LaoTsao) Ph.D wrote:


Not correct as far as I can tell.  You should re-read the page you referenced.  
Oracle rescinded (or lost) the special Studio releases needed to build the 
OpenSolaris kernel.


you can still download 12 12.1 12.2, AFAIK through OTN


That is true (and I have done so).  Unfortunately the versions offered 
are not the correct ones to build the OpenSolaris kernel.  Special 
patched versions with particular date stamps are required.  The only 
way that I see to obtain these files any more is via distribution 
channels primarily designed to perform copyright violations.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-05 Thread Bob Friesenhahn

On Wed, 4 Jul 2012, Nico Williams wrote:


Oddly enough the manpages at the Open Group don't make this clear.  So
I think it may well be advisable to use msync(3C) before munmap() on
MAP_SHARED mappings.  However, I think all implementors should, and
probably all do (Linux even documents that it does) have an implied
msync(2) when doing a munmap(2).  It really makes no sense at all to
have munmap(2) not imply msync(3C).


As long as the system has a way to track which dirty pages map to 
particular files (Solaris historically does), it should not be 
necessary to synchronize the mapping to the underlying store simply 
due to munmap.  It may be more efficient not do to that.  The same 
pages may be mapped and unmapped many times by applications.  In fact, 
several applications may memory map the same file so they access the 
same pages and it seems wrong to flush to underlying store simply 
because one of the applications no longer references the page.


Since mmap() on zfs breaks the traditional coherent memory/filesystem 
that Solaris enjoyed prior to zfs, it may be that some rules should be 
different when zfs is involved because of its redundant use of memory 
(zfs ARC and VM page).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-05 Thread Bob Friesenhahn

On Wed, 4 Jul 2012, Stefan Ring wrote:


It really makes no sense at all to
have munmap(2) not imply msync(3C).


Why not? munmap(2) does basically the equivalent of write(2). In the
case of write, that is: a later read from the same location will see
the written data, unless another write happens in-between. If power


Actually, a write to memory for a memory mapped file is more similar 
to write(2).  If two programs have the same file mapped then the 
effect on the memory they share is instantaneous because it is 
the same physical memory.  A mmapped file becomes shared memory as 
soon as it is mapped at least twice.


It is pretty common for a system of applications to implement shared 
memory via memory mapped files with the mapped memory used for 
read/write.  This is a precursor to POSIX's shm_open(3RT) which 
produces similar functionality without a known file in the filesystem.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-04 Thread Bob Friesenhahn

On Tue, 3 Jul 2012, James Litchfield wrote:


Agreed - msync/munmap is the only guarantee.


I don't see that the munmap definition assures that anything is 
written to disk.  The system is free to buffer the data in RAM as 
long as it likes without writing anything at all.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-02 Thread Bob Friesenhahn

On Mon, 2 Jul 2012, Iwan Aucamp wrote:

I'm interested in some more detail on how the ZFS intent log behaves for updates 
done via a memory mapped file - i.e. will the ZIL log updates done to an 
mmap'd file or not ?


I would expect these writes to go into the intent log unless 
msync(2) is used on the mapping with the MS_SYNC option.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendation for home NAS external JBOD

2012-06-18 Thread Bob Friesenhahn

On Mon, 18 Jun 2012, Koopmann, Jan-Peter wrote:


looks nice! The only thing coming to mind is that according to the specifications the 
enclosure is 3Gbits only. If I choose
to put in an SSD with 6Gbits this would not be optimal. I looked at their site 
but failed to find 6GBit enclosures. But I will
keep looking since sooner or later they will provide it. 


I browsed the site and saw many 6GBit enclosures.  I also saw one with 
Nexenta (Solaris/zfs appliance) inside.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendation for home NAS external JBOD

2012-06-18 Thread Bob Friesenhahn

On Mon, 18 Jun 2012, Koopmann, Jan-Peter wrote:


I browsed the site and saw many 6GBit enclosures.  I also saw one with
Nexenta (Solaris/zfs appliance) inside.

I found several high end enclosures. Or ones with bundled RAID cards. But the 
equivalent of the one originally
suggested I was not able to find. However after looking at tons of sites for 
hours I might simply have missed it. If
you found one, can you please forward a link?


So you want high-end performance at a low-end price?

It seems unlikely that you will notice the difference between 3Gbit and 
6Gbit for a home application.


FLASH-based SSDs seem to burn-out pretty quickly if you don't use them 
carefully.  The situation is getting worse rather than better over 
time as FLASH geometries get smaller and they try to store more bits 
in one cell.  What was described as a bright new future is starting to 
look more like an end of the road to me.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendation for home NAS external JBOD

2012-06-18 Thread Bob Friesenhahn

On Mon, 18 Jun 2012, Carson Gaspar wrote:


What makes you think the Barracuda 7200.14 drives report 4k sectors? I gave 
up looking for 4kn drives, as everything I could find was 512e. I would 
_love_ to be wrong, as I have 8 4TB Hitachis on backorder that I would gladly 
replace with 4kn drives, even if I had to drop to 3TB density.


Why would you want native 4k drives right now?  Not much would work 
with such drives.


Maybe in a dedicated chassis (e.g. the JBOD) they could be of some 
use.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-30 Thread Bob Friesenhahn

On Tue, 29 May 2012, Iwan Aucamp wrote:

 - Is there a parameter similar to /proc/sys/vm/swappiness that can control 
how long unused pages in page cache stay in physical ram
if there is no shortage of physical ram? And if not, how long will unused pages 
in page cache stay in physical ram given there
is no shortage of physical ram?


Absent pressure for memory, no longer referenced pages will stay in 
memory forever.  They can then be re-referenced in memory.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-21 Thread Bob Friesenhahn

On Mon, 21 May 2012, Jim Klimov wrote:

This is so far a relatively raw idea and I've probably missed
something. Do you think it is worth pursuing and asking some
zfs developers to make a POC? ;)


I did read all of your text. :-)

This is an interesting idea and could be of some use but it would be 
wise to test it first a few times before suggesting it as a general 
course.  Zfs is still totally not foolproof.  I still see postings 
from time to time regarding pools which panic/crash the system 
(probably due to memory corruption).


Zfs will try to keep the data compacted at the beginning of the 
partition so if you have a way to know how far out it extends, then 
the initial 'dd' could be much faster when the pool is not close to 
full.


Zfs scrub does need to do many more reads than a resilver since it 
reads all data and metadata copies.  Triggering a resilver operation 
for the specific disk would likely hasten progress.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs_arc_max values

2012-05-17 Thread Bob Friesenhahn

On Thu, 17 May 2012, Paul Kraus wrote:


   Why are you trying to tune the ARC as _low_ as possible? In my
experience the ARC gives up memory readily for other uses. The only
place I _had_ to tune the ARC in production was a  couple systems
running an app that checks for free memory _before_ trying to allocate
it. If the ARC has all but 1 GB in use, the app (which is looking for


On my system I adjusted the ARC down due to running user-space 
applications with very bursty short-term large memory usage. 
Reducing the ARC assured that there would be no contention 
between zfs ARC and the applications.
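
For reference, the cap goes into /etc/system followed by a reboot (the 
value below is only an example -- size it so the bursty applications 
always have headroom):

  * cap the ZFS ARC at 4 GB
  set zfs:zfs_arc_max = 0x100000000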


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migration of a Thumper to bigger HDDs

2012-05-17 Thread Bob Friesenhahn

On Fri, 18 May 2012, Jim Klimov wrote:


Would there be substantial issues if we start out making
and filling the new raidz3 8+3 pool in SXCE snv_129 (with
zpool v22) or snv_130, and later upgrade the big zpool
along with the major OS migration, that can be avoided
by a preemptive upgrade to oi_151a or later (oi_151a3?)

Perhaps, some known pool corruption issues or poor data
layouts in older ZFS software releases?..


I can't attest as to potential issues, but the newer software surely 
fixes many bugs and it is also likely that the data layout improves in 
newer software.  Improved data layout would result in better 
performance.


It seems safest to upgrade the OS before moving a lot of data.  Leave 
a fallback path in case the OS upgrade does not work as expected.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migration of a Thumper to bigger HDDs

2012-05-16 Thread Bob Friesenhahn

On Wed, 16 May 2012, Jim Klimov wrote:


Your idea actually evolved for me into another (#7?), which
is simple and apparent enough to be ingenious ;)
DO use the partitions, but split the 2.73Tb drives into a
roughly 2.5Tb partition followed by a 250Gb partition of
the same size as vdevs of the original old pool. Then the
new drives can replace a dozen of original small disks one
by one, in a one-to-one fashion resilvering, with no worsening
of the situation in regard of downtime or original/new pools'
integrity tradeoffs (in fact, several untrustworthy old disks
will be replaced by newer ones).


I like this idea since it allows running two complete pools on the 
same disks without using files.  Due to using partitions, the disk 
write cache will be disabled unless you specifically enable it.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migration of a Thumper to bigger HDDs

2012-05-15 Thread Bob Friesenhahn
You forgot IDEA #6 where you take advantage of the fact that zfs can 
be told to use sparse files as partitions.  This is rather like your 
IDEA #3 but does not require that disks be partitioned.


This opens up many possibilities.  Whole vdevs can be virtualized to 
files on (i.e. moved onto) remaining physical vdevs.  Then the drives 
freed up can be replaced with larger drives and used to start a new 
pool.  It might be easier to upgrade the existing drives in the pool 
first so that there is assured to be vast amounts of free space and 
the drives get some testing.  There is not initially additional risk 
due to raidz1 in the pool since the drives will be about as full as 
before.


I am not sure what additional risks are involved due to using files.
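
Mechanically it is just something like this (paths, sizes and names are 
made up, and files as vdevs are really a test feature, so treat it only 
as a temporary migration step):

  % pfexec mkfile -n 250g /newpool/vdev-files/d0
  % pfexec zpool replace tank c0t3d0 /newpool/vdev-files/d0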

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver restarting several times

2012-05-11 Thread Bob Friesenhahn

On Fri, 11 May 2012, Jim Klimov wrote:


Hello all,

SHORT VERSION:

 What conditions can cause the reset of the resilvering
process? My lost-and-found disk can't get back into the
pool because of resilvers restarting...


I recall that with sufficiently old vintage zfs, resilver would 
restart if a snapshot was taken.  What sort of zfs is being used here?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-07 Thread Bob Friesenhahn

On Mon, 7 May 2012, Edward Ned Harvey wrote:


Apparently I pulled it down at some point, so I don't have a URL for you
anymore, but I did, and I posted.  Long story short, both raidzN and mirror
configurations behave approximately the way you would hope they do.  That
is...

Approximately, as compared to a single disk:  And I *mean* approximately,


Yes, I remember your results.

In a few weeks I should be setting up a new system with OpenIndiana 
and 8 SAS disks.  This will give me an opportunity to test again. 
Last time I got to play was back in Feburary 2008 and I did not bother 
to test raidz 
(http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf).


Most common benchmarking is sequential read/write and rarely 
read-file/write-file where 'file' is a megabyte or two and the file is 
different for each iteration.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow zfs send

2012-05-07 Thread Bob Friesenhahn

On Mon, 7 May 2012, Karl Rossing wrote:


On 12-05-07 12:18 PM, Jim Klimov wrote:

During the send you can also monitor zpool iostat 1 and usual
iostat -xnz 1 in order to see how busy the disks are and how
many IO requests are issued. The snapshots are likely sent in
the order of block age (TXG number), which for a busy pool may
mean heavy fragmentation and lots of random small IOs..
I have been able to verify that I can get a zfs send at 135MB/sec for a 
striped pool with 2 internal drives on the same server.


I see that there are a huge number of reads and hardly any writes.  Are 
you SURE that deduplication was not enabled for this pool?  This is 
the sort of behavior that one might expect if deduplication was 
enabled without enough RAM or L2 read cache.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-05 Thread Bob Friesenhahn

On Fri, 4 May 2012, Erik Trimble wrote:
predictable, and the backing store is still only giving 1 disk's IOPS.   The 
RAIDZ* may, however, give you significantly more throughput (in MB/s) than a 
single disk if you do a lot of sequential read or write.


Has someone done real-world measurements which indicate that raidz* 
actually provides better sequential read or write than simple 
mirroring with the same number of disks?  While it seems that there 
should be an advantage, I don't recall seeing posted evidence of such. 
If there was a measurable advantage, it would be under conditions 
which are unlikely in the real world.


The only thing totally clear to me is that raidz* provides better 
storage efficiency than mirroring and that raidz1 is dangerous with 
large disks.


Provided that the media reliability is sufficiently high, there are 
still many performance and operational advantages obtained from simple 
mirroring (duplex mirroring) with zfs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance on LSI 9240-8i?

2012-05-04 Thread Bob Friesenhahn

On Fri, 4 May 2012, Rocky Shek wrote:



If I were you, I will not use 9240-8I.

I will use 9211-8I as pure HBA with IT FW for ZFS.


Is there IT FW for the 9240-8i?

They seem to use the same SAS chipset.

My next system will have 9211-8i with IT FW.  Playing it safe.  Good 
enough for Nexenta is good enough for me.


Bob
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Bob Friesenhahn

On Mon, 30 Apr 2012, Ray Van Dolson wrote:


I'm trying to run some IOzone benchmarking on a new system to get a
feel for baseline performance.


Unfortunately, benchmarking with IOzone is a very poor indicator of 
what performance will be like during normal use.  Forcing the system 
to behave like it is short on memory only tests how the system will 
behave when it is short on memory.


Testing multi-threaded synchronous writes with IOzone might actually 
mean something if it is representative of your work-load.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Bob Friesenhahn

On Tue, 1 May 2012, Ray Van Dolson wrote:


Testing multi-threaded synchronous writes with IOzone might actually
mean something if it is representative of your work-load.


Sounds like IOzone may not be my best option here (though it does
produce pretty graphs).

bonnie++ actually gave me more realistic sounding numbers, and I've
been reading good thigns about fio.


None of these benchmarks is really useful other than to stress-test 
your hardware.  Assuming that the hardware is working properly, when 
you intentionally break the cache, IOzone should produce numbers 
similar to what you could have estimated from hardware specification 
sheets and an understanding of the algorithms.


Sun engineers used 'filebench' to do most of their performance testing 
because it allowed configuring the behavior to emulate various usage 
models.  You can get it from 
https://sourceforge.net/projects/filebench/.
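
A minimal interactive session looks roughly like this (the fileserver 
personality ships with filebench; the directory and runtime are 
arbitrary):

  % filebench
  filebench> load fileserver
  filebench> set $dir=/tank/fbtest
  filebench> run 60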


Zfs is all about caching so the cache really does need to be included 
(and not intentionally broken) in any realistic measurement of how the 
system will behave.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Bob Friesenhahn

On Wed, 25 Apr 2012, Rich Teer wrote:


Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server?  Much simplier 
than

your example, and all data is available on all machines/nodes.


This solution would limit bandwidth to that available from that single 
server.  With the cluster approach, the objective is for each machine 
in the cluster to primarily access files which are stored locally. 
Whole files could be moved as necessary.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Two disks giving errors in a raidz pool, advice needed

2012-04-22 Thread Bob Friesenhahn

On Mon, 23 Apr 2012, Manuel Ryan wrote:


Do you guys also think I should change disk 5 first or am I missing something ?


From your description, this sounds like the best course of action, but 
you should look at your system log files to see what sort of issues 
are being logged.  Also consult the output of 'iostat -xe' to see what 
low-level errors are being logged.



I'm not an expert with zfs so any insight to help me replace those disks 
without losing too much
data would be much appreciated :)


If this is really raidz1 then more data is definitely at risk if 
several disks seem to be failing at once.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 11/ZFS historical reporting

2012-04-16 Thread Bob Friesenhahn

On Mon, 16 Apr 2012, Tomas Forsman wrote:


On 16 April, 2012 - Anh Quach sent me these 0,4K bytes:


Are there any tools that ship w/ Solaris 11 for historical reporting on things 
like network activity, zpool iops/bandwidth, etc., or is it pretty much 
roll-your-own scripts and whatnot?


zpool iostat 5  is the closest built-in..


Otherwise, switch from Solaris 11 to SmartOS or Illumos.  Lots of good 
stuff going on there for monitoring and reporting.  The dtrace.conf 
conference seemed like it was pretty interesting.  See 
http://smartos.org/blog/.  Lots more good stuff at 
http://www.youtube.com/user/deirdres and elsewhere on Youtube.
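
If you have to stay on stock Solaris 11, a crude roll-your-own history 
is just a cron job appending samples (the path and schedule are made 
up):

  # log cumulative pool statistics once a minute; compute deltas later
  * * * * * (date; /usr/sbin/zpool iostat -v) >> /var/tmp/zpool-iostat.log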


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar

2012-04-06 Thread Bob Friesenhahn

On Fri, 6 Apr 2012, Marion Hakanson wrote:


The only caveat I've found is that the Nearline SAS Seagates go really
slow with the Solaris default multipath load-balancing setting
(round-robin).  Set it to none or some large block value and they go
fast.  This issue doesn't appear when used with the PERC H800's.


If the drives are exposed as individual LUNs, then it may be possible 
to arrange things so that 1/2 the drives are accessed (by default) 
down one path, and the other 1/2 down the other.  That way you get the 
effect of load-balancing without the churn which might be caused by 
dynamic load-balancing.  That is what I did for my storage here, but 
the preferences needed to be configured on the remote end.


It is likely possible to configure everything on the host end but 
Solaris has special support for my drive array so it used the drive 
array's preferences.
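
For the archives, the global knob Marion mentions lives in 
/kernel/drv/scsi_vhci.conf (the values are illustrative, a reboot is 
needed, and per-LUN overrides are also possible):

  load-balance="none";
  # or spread I/O by LBA region instead of strict round-robin:
  # load-balance="logical-block";
  # region-size=18;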


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test for holes in a file?

2012-03-27 Thread Bob Friesenhahn

On Mon, 26 Mar 2012, Mike Gerdts wrote:


If file space usage is less than file directory size then it must contain a
hole.  Even for compressed files, I am pretty sure that Solaris reports the
uncompressed space usage.


That's not the case.


You are right.  I should have tested this prior to posting. :-(

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test for holes in a file?

2012-03-26 Thread Bob Friesenhahn

On Mon, 26 Mar 2012, Andrew Gabriel wrote:

I just played and knocked this up (note the stunning lack of comments, 
missing optarg processing, etc)...

Give it a list of files to check...


This is a cool program, but programmers were asking (and answering) 
this same question 20+ years ago before there was anything like 
SEEK_HOLE.


If file space usage is less than file directory size then it must 
contain a hole.  Even for compressed files, I am pretty sure that 
Solaris reports the uncompressed space usage.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good tower server for around 1,250 USD?

2012-03-24 Thread Bob Friesenhahn

On Sat, 24 Mar 2012, Sandon Van Ness wrote:


This is a very nice chassis IMHO for a desktop machine:

http://www.supermicro.com/products/chassis/4U/743/SC743TQ-865-SQ.cfm


I own the same chassis.  However, when the system was delivered, it 
was quite loud.  The problem was isolated to using the crummy fans 
that Intel provided with the CPUs.  By replacing the Intel fans with 
better quality fans, now the system is whisper quiet.


My system has two 6-core Xeons (E5649) with 48GB of RAM.

It is able to run OpenIndiana quite well but is being used to run 
Linux as a desktop system.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good tower server for around 1,250 USD?

2012-03-23 Thread Bob Friesenhahn

On Fri, 23 Mar 2012, The Honorable Senator and Mrs. John Blutarsky wrote:



Obtaining an approved system seems very difficult.


Because of the list being out of date and so the systems are no longer
available, or because systems available now don't show up on the list?


Sun was slow to update the list and it is not clear if Oracle updates 
the list at all.



great. After reading the horror stories on the list I don't want to take a
chance and buy the wrong machine and then have ZFS fail or Oracle tell me
they don't support the machine.


I can't answer for Oracle.  There may be a chicken-and-egg problem 
since Oracle might not want to answer speculative questions but might 
be more concrete if you have a system in hand.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good tower server for around 1,250 USD?

2012-03-22 Thread Bob Friesenhahn

On Thu, 22 Mar 2012, The Honorable Senator and Mrs. John Blutarsky wrote:


This will be a do-everything machine. I will use it for development, hosting
various apps in zones (web, file server, mail server etc.) and running other
systems (like a Solaris 11 test system) in VirtualBox. Ultimately I would
like to put it under Solaris support so I am looking for something
officially approved. The problem is there are so many systems on the HCL I
don't know where to begin. One of the Supermicro super workstations looks


Almost all of the systems listed on the HCL are defunct and no longer 
purchasable except on the used market.  Obtaining an approved 
system seems very difficult. In spite of this, Solaris runs very well 
on many non-approved modern systems.


I don't know what that means as far as the ability to purchase Solaris 
support.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Bob Friesenhahn

On Thu, 22 Mar 2012, Jim Klimov wrote:


I think that a certain Bob F. would disagree, especially
when larger native sectors and ashist=12 come into play.
Namely, one scenario where this is important is automated
storage of thumbnails for websites, or some similar small
objects in vast amounts.


I don't know about that Bob F. but this Bob F. just took a look and 
noticed that thumbnail files for full-color images are typically 4KB 
or a bit larger.  Low-color thumbnails can be much smaller.


For a very large photo site, it would make sense to replicate just the 
thumbnails across a number of front-end servers and put the larger 
files on fewer storage servers because they are requested much less 
often and stream out better.  This would mean that those front-end 
thumbnail servers would primarily contain small files.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Server upgrade

2012-02-15 Thread Bob Friesenhahn

On Wed, 15 Feb 2012, David Dyer-Bennet wrote:

version fits my needs for example.)  Upgrading might perhaps save me from
changing all the user passwords (half a dozen, not a huge problem) and
software packages I've added.

(uname -a says SunOS fsfs 5.11 snv_134 i86pc i386 i86pc).

Or should I just export my pool and do a from-scratch install of
something?  (Then recreate the users and install any missing software.
I've got some cron jobs, too.)


I have read (on the OpenIndiana site) that there is an upgrade path 
from what you have to OpenIndiana.  They describe the procedure to 
use.  OpenIndiana does not yet include encryption support in zfs since 
encryption support was never released into OpenSolaris.


If I was you, I would try the upgrade to OpenIndiana first.
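
From memory the procedure boils down to repointing the publishers and 
updating into a new boot environment -- but follow the OpenIndiana wiki 
steps exactly, since the repository URL below may no longer be current:

  # pkg set-publisher --non-sticky opensolaris.org
  # pkg set-publisher -P -g http://pkg.openindiana.org/dev/ openindiana.org
  # pkg image-update

The old boot environment stays selectable from GRUB, which gives you a 
fallback if the update goes wrong.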

The alternative is paid and supported Oracle Solaris 11, which would 
require a from-scratch install, and may or may not even be an option 
for you.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk failing? High asvc_t and %b.

2012-02-01 Thread Bob Friesenhahn

On Wed, 1 Feb 2012, Jan Hellevik wrote:

The disk in question is c6t70d0 - it shows consistently higher %b and asvc_t
than the other disks in the pool. The output is from a 'zfs receive' after 
about 3 hours.
The two c5dx disks are the 'rpool' mirror, the others belong to the 'backup' 
pool.


Are all of the disks the same make and model?  What type of chassis 
are the disks mounted in?  Is it possible that the environment that 
this disk experiences is somehow different than the others (e.g. due 
to vibration)?



Should I be worried? And what other commands can I use to investigate further?


It is difficult to say if you should be worried.

Be sure to do 'iostat -xe' to see if there are any accumulating errors 
related to the disk.
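
For example (the device name is taken from your output and the interval 
is arbitrary), something like the following shows the s/w, h/w, trn and 
tot error counters alongside the usual service-time columns:

   # extended statistics plus error counters, sampled every 30 seconds
   iostat -xen 30 | egrep 'device|c6t70d0'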


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk failing? High asvc_t and %b.

2012-02-01 Thread Bob Friesenhahn

On Wed, 1 Feb 2012, Jan Hellevik wrote:


Are all of the disks the same make and model?


They are different makes - I try to make pairs of different brands to minimise 
risk.


Does your pairing maintain the same pattern of disk type across all 
the pairings?


Some modern disks use 4k sectors while others still use 512 bytes.  If 
the slow disk is a 4k sector model but the others are 512 byte models, 
then that would certainly explain a difference.


Assuming that a couple of your disks are still unused, you could try 
replacing the suspect drive with an unused drive (via zfs command) to 
see if the slowness goes away. You could also make that vdev a 
triple-mirror since it is very easy to add/remove drives from a mirror 
vdev.  Just make sure that your zfs syntax is correct so that you 
don't accidentally add a single-drive vdev to the pool (oops!). 
These sorts of things can be tested with zfs commands without 
physically moving/removing drives or endangering your data.
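
As a sketch only (c7t0d0 stands in for whichever unused disk you pick, 
and 'backup' is your pool name), the experiments look roughly like this:

   # attach a spare to the existing mirror, making it a three-way mirror;
   # it can be removed again later with 'zpool detach'
   zpool attach backup c6t70d0 c7t0d0

   # or swap out the suspect disk entirely
   zpool replace backup c6t70d0 c7t0d0

   # the mistake to avoid: 'zpool add' (rather than 'attach') would create a
   # new single-disk top-level vdev; 'zpool add -n' previews the result
   # without changing the pool
   zpool add -n backup c7t0d0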


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need hint on pool setup

2012-01-31 Thread Bob Friesenhahn

On Tue, 31 Jan 2012, Thomas Nau wrote:


Dear all
We have two JBODs with 20 or 21 drives available per JBOD hooked up
to a server. We are considering the following setups:

RAIDZ2 made of 4 drives
RAIDZ2 made of 6 drives

The first option wastes more disk space but can survive a JBOD failure
whereas the second is more space effective but the system goes down when
a JBOD goes down. Each of the JBODs comes with dual controllers, redundant
fans and power supplies, so do I need to be paranoid and use option #1?
Of course it also gives us more IOPS, but high-end logging devices should take
care of that.


I think that the answer depends on the impact to your business if data 
is temporarily unavailable.  If your business cannot survive data being 
unavailable (for hours or even a week), then the more conservative 
approach may be warranted.


If you have a service contract which assures that a service tech will 
show up quickly with replacement hardware in hand, then this may also 
influence the decision which should be made.


Another consideration is that since these JBODs connect to a server, 
the data will also be unavailable when the server is down.  The server 
being down may in fact be a more significant factor than a JBOD being 
down.
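
For what it is worth, option #1 only survives a JBOD failure if each 
4-drive RAIDZ2 vdev takes exactly two disks from each enclosure, so that 
losing a whole JBOD costs no vdev more than raidz2's two-disk tolerance. 
A sketch with made-up device names (c1* in one JBOD, c2* in the other):

   # each raidz2 vdev is split 2+2 across the two JBODs (names are examples)
   zpool create tank \
      raidz2 c1t0d0 c1t1d0 c2t0d0 c2t1d0 \
      raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 \
      raidz2 c1t4d0 c1t5d0 c2t4d0 c2t5d0
   # ...and so on for the remaining drives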


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread Bob Friesenhahn

On Wed, 25 Jan 2012, Anonymous Remailer (austria) wrote:



I've been watching the heat control issue carefully since I had to take a
job offshore (cough reverse H1B cough) in a place without adequate AC and I
was able to get them to ship my servers and some other gear. Then I read
Intel is guaranteeing their servers will work up to 100 degrees F ambient
temps in the pricing wars to sell servers, he who goes green and saves data


Most servers seem to be specified to run up to 95 degrees F, with some 
particularly dense ones specified to handle only 90.  Network switching 
gear is usually specified to handle 105.


My own equipment typically experiences up to 83 degrees during the 
peak of summer (but quite a lot more if the AC fails).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

