Re: [zfs-discuss] SSD ZIL/L2ARC partitioning

2012-11-14 Thread Neil Perrin

On 11/14/12 03:24, Sašo Kiselkov wrote:

On 11/14/2012 11:14 AM, Michel Jansens wrote:

Hi,

I've ordered a new server with:
- 4x600GB Toshiba 10K SAS2 Disks
- 2x100GB OCZ DENEVA 2R SYNC eMLC SATA (no expander so I hope no
SAS/SATA problems). Specs:
http://www.oczenterprise.com/ssd-products/deneva-2-r-sata-6g-2.5-emlc.html

I want to use the 2 OCZ SSDs as mirrored intent log devices, but as the
intent log needs quite a small amount of the disks (10GB?), I was
wondering if I can use the rest of the disks as L2ARC?

I have a few questions about this:

-Is 10GB enough for a log device?

A log device, essentially, only needs to hold a single
transaction's-worth of small sync writes,


Actually it needs to hold three transaction groups' worth.
There are 3 phases to ZFS's transaction group model: open, quiescing and
syncing.
Nowadays the sync phase is targeted at 5s, so the log device needs to be able to
hold up to 15s of synchronous data.


  so unless you write more than
that, you'll be fine. In fact, DDRdrive's X1 is only 4GB and works just
fine.


Agreed, 10GB should be fine for your system.
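As a rough, hedged sizing illustration (the numbers are made up, not from this thread):
a workload that sustains 100MB/s of synchronous writes would need about
100MB/s x 15s = 1.5GB of log space, so a 10GB slog leaves plenty of headroom.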

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Neil Perrin

  
  
On 10/04/12 05:30, Schweiss, Chip wrote:
Thanks for all the input. It seems information on the
  performance of the ZIL is sparse and scattered. I've spent
  significant time researching this the past day. I'll summarize
  what I've found. Please correct me if I'm wrong.
  
The ZIL can have any number of SSDs attached, either mirrored
  or individually. ZFS will stripe across these in a raid0 or
  raid10 fashion depending on how you configure them.
  


The ZIL code chains blocks together and these are allocated
round-robin among the slogs or,
if none exist, among the main pool devices.


  
To determine the true maximum streaming performance of the
  ZIL, setting sync=disabled will only use the in-RAM ZIL. This
  gives up power protection for synchronous writes.
  


There is no RAM ZIL. If sync=disabled then all writes are
asynchronous and are written
as part of the periodic ZFS transaction group (txg) commit that
occurs every 5 seconds.


  
Many SSDs do not help protect against power failure because
  they have their own ram cache for writes. This effectively
  makes the SSD useless for this purpose and potentially
  introduces a false sense of security. (These SSDs are fine
  for L2ARC)

  


The ZIL code issues a write cache flush to all devices it has
written before returning
from the system call. I've heard that not all devices obey the
flush, but we consider those
broken hardware. I don't have a list of devices to avoid.


  

  

Mirroring SSDs is only helpful if one SSD fails at the time
  of a power failure. This leaves several unanswered questions.
  How good is ZFS at detecting that an SSD is no longer a
  reliable write target? The chance of silent data corruption
  is well documented for spinning disks. What chance of data
  corruption does this introduce with up to 10 seconds of data
  written on SSD? Does ZFS read the ZIL during a scrub to
  determine if our SSD is returning what we write to it?

  


If the ZIL code gets a block write failure it will force the txg to
commit before returning.
It will depend on the drivers and IO subsystem as to how hard it
tries to write the block.


  

  

Zpool versions 19 and higher should be able to survive a ZIL
  failure, only losing the uncommitted data.  However, I
  haven't seen good enough information that I would necessarily
  trust this yet.

  


This has been available for quite a while and I haven't heard of any
bugs in this area.


  

  Several threads seem to suggest a ZIL throughput limit of
  1Gb/s with SSDs. I'm not sure if that is current, but I
  can't find any reports of better performance. I would
  suspect that DDR drive or Zeus RAM as ZIL would push past
  this.
  


1GB/s seems very high, but I don't have any numbers to share.


  

  
  Anyone care to post their performance numbers on current
hardware with E5 processors, and ram based ZIL solutions?
  Thanks to everyone who has responded and contacted me directly
on this issue.
  -Chip
  
  On Thu, Oct 4, 2012 at 3:03 AM, Andrew
Gabriel andrew.gabr...@cucumber.demon.co.uk
wrote:

  
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
  wrote:
  

  From: zfs-discuss-boun...@opensolaris.org
  [mailto:zfs-discuss-
  boun...@opensolaris.org] On
  Behalf Of Schweiss, Chip
  
  How can I determine for sure that my ZIL is my
  bottleneck? If it is the
  bottleneck, is it possible to keep adding mirrored
  pairs of SSDs to the ZIL to
  make it faster? Or should I be looking for a DDR
  drive, ZeusRAM, etc.


Temporarily set sync=disabled
Or, depending on your application, leave it that way
permanently. I know, for the work I do, most systems I
support at most locations have sync=disabled. It all
depends on the workload.
  
  

  
  Noting of course that this means that in the case of an
  unexpected system outage or loss of connectivity to the disks,
  synchronous writes since the last txg commit will be lost,
  even though the applications will believe they are 

Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Neil Perrin

On 10/04/12 15:59, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

The ZIL code chains blocks together and these are allocated round-robin
among the slogs or,
if none exist, among the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more 
bandwidth by adding multiple slog devices?


In general - yes, but it really depends. Multiple synchronous writes of any size
across multiple file systems will fan out across the log devices. That is
because there is a separate independent log chain for each file system.

Also, large synchronous writes (e.g. 1MB) within a specific file system will be
spread out.
The ZIL code will try to allocate a block to hold all the records it needs to
commit, up to the largest block size - which currently for you should be 128KB.
Anything larger will allocate a new block - on a different device if there are
multiple devices.

However, lots of small synchronous writes to the same file system might not
use more than one 128K block, and so would not benefit from multiple slog devices.
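A hedged sketch of how multiple log devices are attached (pool and device names
are made up):

    zpool add tank log c3t0d0 c3t1d0           # two independent slogs; log blocks are allocated round-robin across them
    zpool add tank log mirror c3t0d0 c3t1d0    # or a single mirrored slog, for redundancy rather than bandwidth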

Neil.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-03 Thread Neil Perrin

On 08/03/12 19:39, Bob Friesenhahn wrote:

On Fri, 3 Aug 2012, Karl Rossing wrote:


I'm looking at 
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-ssd.html
 wondering what I should get.

Are people getting intel 330's for l2arc and 520's for slog?


For the slog, you should look for a SLC technology SSD which saves unwritten data on 
power failure.  In Intel-speak, this is called Enhanced Power Loss Data 
Protection.  I am not running across any Intel SSDs which claim to match these 
requirements.


- That shouldn't be necessary. ZFS flushes the write cache of any device it has
written before returning
from the synchronous request, to ensure data stability.




Extreme write IOPS claims in consumer SSDs are normally based on large write 
caches which can lose even more data if there is a power failure.

Bob


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Log disk with all ssd pool?

2011-10-28 Thread Neil Perrin





On 10/28/11 00:04, Mark Wolek wrote:

  
  
  
  
Still kicking around this idea and didn't see it
addressed in any of the threads before the forum closed.

If one made an all-SSD pool, would a log/cache
drive just slow you down? Would the ZIL slow you down? Thinking of rotating
MLC drives with SandForce controllers every few years to avoid losing a
drive to "sorry, no more writes allowed" scenarios.
  
  Thanks
  Mark
  


Interesting question. I don't think there's a straightforward answer.
Oracle uses write-optimised log devices and read-optimised cache
devices in its appliances. However, assuming all the SSDs are the same,
then I suspect neither a log nor a cache device would help:

Log
If there is a log then it alone is used, and it can be written to in
parallel with periodic TXG commit writes to the other pool devices. If
that log were part of the pool then the ZIL code would spread the load
among all pool devices, but would compete with TXG commit writes. My
gut feeling is that the latter (log blocks spread across the pool) would
be the higher performing option, though.
I think, a long time ago, I experimented with designating one disk out
of the pool as a log and saw degradation in synchronous performance.
That seems to be the equivalent of your SSD question.

Cache
Similarly, for cache devices the reads would compete with TXG commit
writes, but otherwise performance ought to be higher.

Neil.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Log disk with all ssd pool?

2011-10-28 Thread Neil Perrin




On 10/28/11 00:54, Neil Perrin wrote:

  
  
On 10/28/11 00:04, Mark Wolek wrote:
  




Still kicking around this idea and didn't see it
addressed in any of the threads before the forum closed.

If one made an all-SSD pool, would a log/cache
drive just slow you down? Would the ZIL slow you down? Thinking of rotating
MLC drives with SandForce controllers every few years to avoid losing a
drive to "sorry, no more writes allowed" scenarios.

Thanks
Mark

  
  
Interesting question. I don't think there's a straightforward answer.
Oracle uses write-optimised log devices and read-optimised cache
devices in its appliances. However, assuming all the SSDs are the same,
then I suspect neither a log nor a cache device would help:

  Log
If there is a log then it alone is used, and it can be written to in
parallel with periodic TXG commit writes to the other pool devices. If
that log were part of the pool then the ZIL code would spread the load
among all pool devices, but would compete with TXG commit writes. My
gut feeling is that the latter (log blocks spread across the pool) would
be the higher performing option, though.
I think, a long time ago, I experimented with designating one disk out
of the pool as a log and saw degradation in synchronous performance.
That seems to be the equivalent of your SSD question.

  Cache
Similarly, for cache devices the reads would compete with TXG commit
writes, but otherwise performance ought to be higher.
  
Neil.

Did some quick tests with disks to check if my memory was correct.
'sb' is a simple program to spawn a number of threads to fill a file of
a certain size
with non-zero writes of a specified size. Bandwidth is also important.

1. Simple 2 disk system.
 32KB synchronous writes filling 1GB with 20 threads

zpool create whirl 2 disks; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 20
 Elapsed time 95s 10.8MB/s

zpool create whirl disk log disk ; zfs set
recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 20
 Elapsed time 151s 6.8MB/s

2. Higher end 6 disk system.
 32KB synchronous writes filling 1GB with 100 threads

zpool create whirl 6 disks; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
 Elapsed time 33s 31MB/s

zpool create whirl 5 disks log 1disk; zfs set
recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
 Elapsed time 147s 7.0MB/s

and for interest:
zpool create whirl 5 disk log SSD; zfs set
recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
 Elapsed time 8s 129MB/s

3. Higher end smaller writes
 2K synchronous writes filling 128MB with 100 threads

zpool create whirl 6 disks; zfs set recordsize=1k whirl
st1 -n /whirl/f -f 134217728 -b 2048 -t 100
 Elapsed time 16s 8.2MB/s

zpool create whirl 5 disks log 1 disk
zfs set recordsize=1k whirl
ds8 -n /whirl/f -f 134217728 -b 2048 -t 100
 Elapsed time 24s 5.5MB/s




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Log disk with all ssd pool?

2011-10-28 Thread Neil Perrin




On 10/28/11 11:21, Mark Wolek wrote:

  
  
  
  
Having
the log disk slowed it down a lot in your tests (when it wasn't an SSD),
30MB/s vs 7. Is this also a 100% write / 100% sequential workload?
Forcing sync?
  


100% synchronous write. Writes are random but ZFS will write them
sequentially on disk.

  
  
  
It's
gotten to the point where I can buy a 120G SSD for less than or the same
price as a 146G SAS disk. Sure, the MLC drives have limited lifetime, but
at $150 (and dropping) just replace them every few years to be safe,
work out a rotation/rebuild cycle; it's tempting, I suppose. If we do
end up buying all SSDs it becomes really easy to test whether we should use
a log or not!
  


Would highly recommend some form of zpool redundancy (mirroring or
raidz).

  
   
  
  
  
  
  From:
zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil
Perrin
  Sent: Friday, October 28, 2011 11:38 AM
  To: zfs-discuss@opensolaris.org
  Subject: Re: [zfs-discuss] Log disk with all ssd pool?
  
  
  
  On 10/28/11 00:54, Neil Perrin wrote: 
  
On 10/28/11 00:04, Mark Wolek wrote: 
Still kicking around this idea and didn't see it
addressed in any of the threads before the forum closed.

If one made an all-SSD pool, would a log/cache
drive just slow you down? Would the ZIL slow you down? Thinking of rotating
MLC drives with SandForce controllers every few years to avoid losing a
drive to "sorry, no more writes allowed" scenarios.
  
  Thanks
  Mark
  
Interesting question. I don't think there's a straightforward answer.
Oracle uses write-optimised log devices and read-optimised cache
devices in its appliances. However, assuming all the SSDs are the same,
then I suspect neither a log nor a cache device would help:

  Log
If there is a log then it alone is used, and it can be written to in
parallel with periodic TXG commit writes to the other pool devices. If
that log were part of the pool then the ZIL code would spread the load
among all pool devices, but would compete with TXG commit writes. My
gut feeling is that the latter (log blocks spread across the pool) would
be the higher performing option, though.
I think, a long time ago, I experimented with designating one disk out
of the pool as a log and saw degradation in synchronous performance.
That seems to be the equivalent of your SSD question.

  Cache
Similarly, for cache devices the reads would compete with TXG commit
writes, but otherwise performance ought to be higher.
  
Neil.
Did
some quick tests with disks to check if my memory was correct.
'sb' is a simple program to spawn a number of threads to fill a file of
a certain size
with non-zero writes of a specified size. Bandwidth is also important.
  
1. Simple 2 disk system.
 32KB synchronous writes filling 1GB with 20 threads
  
zpool create whirl 2 disks; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 20
 Elapsed time 95s 10.8MB/s
  
zpool create whirl disk log disk ; zfs set
recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 20
 Elapsed time 151s 6.8MB/s
  
2. Higher end 6 disk system.
 32KB synchronous writes filling 1GB with 100 threads
  
zpool create whirl 6 disks; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
 Elapsed time 33s 31MB/s
  
zpool create whirl 5 disks log 1disk; zfs set
recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
 Elapsed time 147s 7.0MB/s
  
and for interest:
zpool create whirl 5 disk log SSD; zfs set
recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
 Elapsed time 8s 129MB/s
  
3. Higher end smaller writes
 2K synchronous writes filling 128MB with 100 threads
  
zpool create whirl 6 disks; zfs set recordsize=1k whirl
st1 -n /whirl/f -f 134217728 -b 2048 -t 100
 Elapsed time 16s 8.2MB/s
  
zpool create whirl 5 disks log 1 disk
zfs set recordsize=1k whirl
ds8 -n /whirl/f -f 134217728 -b 2048 -t 100
 Elapsed time 24s 5.5MB/s
  
  
  




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC

2011-09-19 Thread Neil Perrin

On 9/19/11 11:45 AM, Jesus Cea wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I have a new question: the interaction between dataset encryption and L2ARC
and ZIL.

1. I am pretty sure (but not completely sure) that data stored in the
ZIL is encrypted, if the destination dataset uses encryption. Can
anybody confirm?



If the dataset (file system/zvol) is encrypted then
the user data in the ZIL is also encrypted. The ZIL metadata
used to parse blocks and records is kept in the clear
(in order to claim the blocks), but the user data itself is
encrypted.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC

2011-08-30 Thread Neil Perrin

On 08/30/11 08:31, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jesus Cea


10. What happens if my 1GB of ZIL is too optimistic? Will ZFS use the

disks, or will it stop writers until the ZIL is flushed to the HDs?



Good question.  I don't know.
  


- It will use the pool disks.

Thanks Edward for answering the rest.

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Neil Perrin
In general the blog's conclusion is correct. When file systems get full
there is

fragmentation (this happens to all file systems), and for ZFS the pool uses gang
blocks made up of smaller blocks when there are insufficient large blocks.
However, the ZIL never allocates or uses gang blocks. It directly allocates
blocks (outside of the zio pipeline) using zio_alloc_zil() ->
metaslab_alloc().

Gang blocks are only used by the main pool when the pool transaction
group (txg) commit occurs.  Solutions to the problem include:
   - add a separate intent log
   - add more top level devices (hopefully replicated)
   - delete unused files/snapshots etc. within the pool...

Neil.


On 08/01/11 08:29, Josh Simon wrote:

Hello,

One of my coworkers was sent the following explanation from Oracle as
to why one of our backup systems was conducting a scrub so slowly. I figured
I would share it with the group.


http://wildness.espix.org/index.php?post/2011/06/09/ZFS-Fragmentation-issue-examining-the-ZIL 



PS: Thought it was kind of odd that Oracle would direct us to a blog, 
but the post is very thorough.


Thanks,

Josh Simon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-16 Thread Neil Perrin

On 06/16/11 20:26, Daniel Carosone wrote:

On Thu, Jun 16, 2011 at 09:15:44PM -0400, Edward Ned Harvey wrote:
  

My personal preference, assuming 4 disks, since the OS is mostly reads and
only a little bit of writes, is to create a 4-way mirrored 100G partition
for the OS, and the remaining 900G of each disk (or whatever) becomes either
a stripe of mirrors or raidz, as appropriate in your case, for the
storagepool.



Is it still the case, as it once was, that allocating anything other
than whole disks as vdevs forces NCQ / write cache off on the drive
(either or both, forget which, guess write cache)?


It was once the case that using a slice as a vdev forced the write cache
off,
but I just tried it and found it wasn't disabled - at least with the
current source.

In fact it looks like we no longer change the setting.
You may want to experiment yourself on your ZFS version (see below for
how to check).


 


If so, can this be forced back on somehow to regain performance when
known to be safe?  
  


Yes: format -e -> select disk -> cache -> write -> display/enable/disable

I think the original assumption was that zfs-in-a-partition likely
implied the disk was shared with ufs, rather than another async-safe
pool.


- Correct.


Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Neil Perrin

On 05/02/11 14:02, Nico Williams wrote:

Also, sparseness need not be apparent to applications.  Until recent
improvements to lseek(2) to expose hole/non-hole offsets, the only way
to know about sparseness was to notice that a file's reported size is
more than the file's reported filesystem blocks times the block size.
Sparse files in Unix go back at least to the early 80s.

If a filesystem protocol, such as CIFS (I've no idea if it supports
sparse files), were to not support sparse files, all that would mean
is that the server must report a number of blocks that matches a
file's size (assuming the protocol in question even supports any
notion of reporting a file's size in blocks).

There are really two ways in which a filesystem protocol could support
sparse files: a) by reporting file size in bytes and blocks, b) by
reporting lists of file offsets demarcating holes from non-holes.  (b)
is a very new idea; Lustre may be the only filesystem that I know of that
supports this (see the Linux FIEMAP APIs), though work is in progress
to add this to NFSv4.
  


I enhanced the lseek interface a while back to return information
about sparse

files, by adding two new interfaces: SEEK_HOLE and SEEK_DATA. See

man -s2 lseek
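A quick, hedged way to see sparseness from the shell (the path is hypothetical;
this only contrasts logical size with allocated blocks, it does not show the
SEEK_HOLE offsets themselves):

    dd if=/dev/zero of=/tank/sparse bs=1 count=1 seek=1073741823   # write one byte at a ~1GB offset
    ls -l /tank/sparse    # reports the logical file size (~1GB)
    du -h /tank/sparse    # reports the space actually allocated, far smaller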

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-30 Thread Neil Perrin

On 04/30/11 01:41, Sean Sprague wrote:



: xvm-4200m2-02 ;
 

I can do the echo | mdb -k.  But what is that : xvm-4200 command?
   


My guess is that is a very odd shell prompt ;-)

- Indeed
   ':' means what follows is a comment (at least to /bin/ksh)
   'xvm-4200m2-02' is the comment  - actually the system name (not very 
inventive)

   ';' ends the comment.

I use this because I can cut and paste entire lines back to the shell.

Sorry for the confusion: Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Neil Perrin

On 4/28/11 12:45 PM, Edward Ned Harvey wrote:

From: Erik Trimble [mailto:erik.trim...@oracle.com]

OK, I just re-looked at a couple of things, and here's what I /think/ is
the correct numbers.

I just checked, and the current size of this structure is 0x178, or 376
bytes.

Each ARC entry, which points to either an L2ARC item (of any kind,
cached data, metadata, or a DDT line) or actual data/metadata/etc., is
defined in the struct arc_buf_hdr :

http://src.opensolaris.org/source/xref/onnv/onnv-
gate/usr/src/uts/common/fs/zfs/arc.c#431

Its current size is 0xb0, or 176 bytes.

These are fixed-size structures.

heheheh...  See what I mean about all the conflicting sources of
information?  Is it 376 and 176?  Or is it 270 and 200?
Erik says it's fixed-size.  Richard says The DDT entries vary in size.

So far, what Erik says is at least based on reading the source code, with a
disclaimer of possibly misunderstanding the source code.  What Richard says
is just a statement of supposed absolute fact without any backing.

In any event, thank you both for your input.  Can anyone answer these
authoritatively?  (Neil?)   I'll send you a pizza.  ;-)



- I wouldn't consider myself an authority on the dedup code.
The size of these structures will vary according to the release you're running. 
You can always find out the size for a particular system using ::sizeof within
mdb. For example, as super user :

: xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
sizeof (ddt_entry_t) = 0x178
: xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
sizeof (arc_buf_hdr_t) = 0x100
: xvm-4200m2-02 ;

This shows yet another size. Also there are more changes planned within
the ARC. Sorry, I can't talk about those changes, nor about when you'll
see them.

However, that's not the whole story. It looks like the arc_buf_hdr_t structures
use their own kmem cache so there should be little wastage, but the
ddt_entry_t structures are allocated from the generic kmem caches and so will
probably have some round-up and unused space. Caches for small buffers
are aligned to 64 bytes. See kmem_alloc_sizes[] and the comment at:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#920
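As a hedged illustration (assuming the standard kmem_alloc_sizes[] table, which
includes a 384-byte cache): a 376-byte ddt_entry_t would be served from the
kmem_alloc_384 cache, wasting roughly 8 bytes per entry. You can check that the
generic cache exists on your system with:

    echo ::kmem_cache | mdb -k | grep kmem_alloc_384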

Pizza: Mushroom and anchovy - er, just kidding.

Neil.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Neil Perrin

On 04/25/11 11:55, Erik Trimble wrote:

On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:


And one more comment:  Based on what's below, it seems that the DDT 
gets stored on the cache device and also in RAM.  Is that correct?  
What if you didn't have a cache device?  Shouldn't it *always* be in 
ram?  And doesn't the cache device get wiped every time you reboot?  
It seems to me like putting the DDT on the cache device would be 
harmful...  Is that really how it is?


 

Nope. The DDT is stored only in one place: cache device if present, 
/or/ RAM otherwise (technically, ARC, but that's in RAM).  If a cache 
device is present, the DDT is stored there, BUT RAM also must store a 
basic lookup table for the DDT (yea, I know, a lookup table for a 
lookup table).


No, that's not true. The DDT is just like any other ZFS metadata and can
be split over the ARC,
the cache device (L2ARC) and the main pool devices. An infrequently
referenced DDT block will get

evicted from the ARC to the L2ARC, and then eventually evicted from the L2ARC as well.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot remove zil device

2011-03-31 Thread Neil Perrin

On 03/31/11 12:28, Roy Sigurd Karlsbakk wrote:

http://pastebin.com/nD2r2qmh


Here is zpool status and zpool version



The only thing I wonder about here, is why you have two striped log devices. I 
didn't even know that was supported.
  


Yes, it's supported. ZFS will round-robin writes to the log devices.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] BOOT, ZIL, L2ARC one one SSD?

2011-01-04 Thread Neil Perrin

On 12/25/10 19:32, Bill Werner wrote:

Understood Edward, and if this was a production data center, I wouldn't be 
doing it this way.  This is for my home lab, so spending hundreds of dollars on 
SSD devices isn't practical.

Can several datasets share a single ZIL and a single L2ARC, or must each
dataset have its own?
  
The ZIL and L2ARC devices are per pool and thus shared amongst
all datasets.
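A hedged example of attaching the pool-wide devices in question (pool and disk
names are hypothetical):

    zpool add tank log c4t0d0      # one log (slog) device, used by every dataset in 'tank'
    zpool add tank cache c4t1d0    # one L2ARC cache device, likewise pool-wide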


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ashift and vdevs

2010-12-01 Thread Neil Perrin

On 12/01/10 22:14, Miles Nordin wrote:

Also did anyone ever clarify whether the slog has an ashift?  or is it
forced-512?  or derived from whatever vdev will eventually contain the
separately-logged data?  I would expect generalized immediate Caring
about that since no slogs except ACARD and DDRDrive will have 512-byte
sectors.
  

The minimum slog write size is

#define ZIL_MIN_BLKSZ 4096

and all writes are also rounded to multiples of ZIL_MIN_BLKSZ.
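So, as a rough illustrative example: a log write of a few hundred bytes still
consumes a full 4096-byte log block, and one of roughly 5KB would be rounded up
to 8192 bytes.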

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does dedup work over iSCSI?

2010-10-22 Thread Neil Perrin

On 10/22/10 15:34, Peter Taps wrote:

Folks,

Let's say I have a volume being shared over iSCSI. The dedup has been turned on.

Let's say I copy the same file twice under different names at the initiator 
end. Let's say each file ends up taking 5 blocks.

For dedupe to work, each block for a file must match the corresponding block 
from the other file. Essentially, each pair of block being compared must have 
the same start location into the actual data.
  


No,  ZFS doesn't care about the file offset, just that the checksum of 
the blocks matches.


For a shared filesystem, ZFS may internally ensure that the block starts match. However, over iSCSI, the initiator does not even know about the whole block mechanism that zfs has. It is just sending raw bytes to the target. This makes me wonder if dedup actually works over iSCSI. 


Can someone please enlighten me on what I am missing?

Thank you in advance for your help.

Regards,
Peter
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does dedup work over iSCSI?

2010-10-22 Thread Neil Perrin

On 10/22/10 17:28, Peter Taps wrote:

Hi Neil,

if the file offset does not match, the chances that the checksum would match, 
especially sha256, is almost 0.

May be I am missing something. Let's say I have a file that contains 11 letters 
- ABCDEFGHIJK. Let's say the block size is 5.

For the first file, the block contents are ABCDE, FGHIJ, and K.

For the second file, let's say the blocks are  ABCD, EFGHI, and JK.

The chance that any checksum would match is very small. The chance that any
checksum+verify would match is even smaller.

Regards,
Peter


The block size and contents have to match for ZFS dedup.
See http://blogs.sun.com/bonwick/entry/zfs_dedup

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is dedupditto property on zpool?

2010-09-24 Thread Neil Perrin

On 09/24/10 11:26, Peter Taps wrote:

Folks,

One of the zpool properties that is reported is dedupditto. However, there is 
no documentation available, either in man pages or anywhere else on the Internet. What 
exactly is this property?

Thank you in advance for your help.

Regards,
Peter
  

I found it documented in man zpool:

dedupditto=number

Sets a threshold for the number of copies. If the reference
count for a deduplicated block goes above this threshold,
another ditto copy of the block is stored automatically.
The default value is 0.



It seems a bit counter-intuitive to start with. The purpose of dedup is
to remove

copies of blocks. However, if there are, say, 50 references to the same
block and that block gets checksum errors then all 50 references are bad.
So this is another form of redundancy: it tells ZFS to store an additional
copy once a block reaches a specific number of references.
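For example (the pool name is hypothetical), to keep an extra copy of any
deduplicated block once it is referenced more than 100 times:

    zpool set dedupditto=100 tank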

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS COW and simultaneous read write of files

2010-09-22 Thread Neil Perrin

On 09/22/10 11:22, Moazam Raja wrote:

Hi all, I have a ZFS question related to COW and scope.

If user A is reading a file while user B is writing to the same file,
when do the changes introduced by user B become visible to everyone?
Is there a block level scope, or file level, or something else?

Thanks!
  


Assuming the user is using read and write against ZFS files:
ZFS has reader/writer range locking within files.
If thread A is trying to read the same section that thread B is writing,
it will
block until the data is written. Note, written in this case means
written into the ZFS
cache and not to the disks. If thread A requires that changes to the
file be stable (on disk)

before reading, it can use the little-known O_RSYNC flag.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is l2cache setting?

2010-09-22 Thread Neil Perrin

On 09/22/10 11:23, Peter Taps wrote:

Folks,

While going through zpool source code, I see a configuration option called 
l2cache. What is this option for? It doesn't seem to be documented.

Thank you in advance for your help.

Regards,
Peter
  

man zpool
under Cache Devices section
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is l2cache setting?

2010-09-22 Thread Neil Perrin

On 09/22/10 13:40, Peter Taps wrote:

Neil,

Thank you for your help.

However, I don't see anything about l2cache under Cache devices man pages.

To be clear, there are two different vdev types defined in zfs source code - cache and l2cache. 
I am familiar with cache devices. I am curious about l2cache devices.

Regards,
Peter
  
They are one and the same. It's a bit confusing, but 'cache' was the 
external name given to l2cache (level 2 cache) vdevs.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?

2010-09-18 Thread Neil Perrin

On 09/17/10 23:31, Ian Collins wrote:

On 09/18/10 04:46 PM, Neil Perrin wrote:

On 09/17/10 18:32, Edward Ned Harvey wrote:

From: Neil Perrin [mailto:neil.per...@oracle.com]



you lose information.  Not your whole pool.  You lose up to
30 sec of writes
   

The default is  now 5 seconds (zfs_txg_timeout).
 


When did that become default?


It was changed more recently than I remember, in snv_143, as part of a
set of
bug fixes: 6494473, 6743992, 6936821, 6956464. They were integrated 
on 6/8/10.



   Should I *ever* say 30 sec anymore?
   


Well, for versions before snv_143, 30 seconds is correct.  I was just
giving a heads up that it has changed.



In the context of this thread, was the change integrated in update 9?


- No. It looks like it's destined for Update 10.
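For anyone wanting to check their own system, a hedged way to inspect the
tunable on a live kernel (run as root):

    echo zfs_txg_timeout/D | mdb -k    # prints the current txg timeout in seconds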
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?

2010-09-17 Thread Neil Perrin

On 09/17/10 18:32, Edward Ned Harvey wrote:

From: Neil Perrin [mailto:neil.per...@oracle.com]


you lose information.  Not your whole pool.  You lose up to 
30 sec of writes
  

The default is  now 5 seconds (zfs_txg_timeout).



When did that become default?


It was changed more recently than I remember, in snv_143, as part of a
set of
bug fixes: 6494473, 6743992, 6936821, 6956464. They were integrated on 
6/8/10.



  Should I *ever* say 30 sec anymore?
  


Well, for versions before snv_143, 30 seconds is correct.  I was just
giving a heads up that it has changed.


In my world, the oldest machine is 10u6.  (Except one machine named
dinosaur that is sol8)


  

I believe George responded on that thread that we do handle log mirrors
correctly.
That is, if one side fails to checksum a block we do indeed check the
other side.
I should have been more cautious with my concern. I think I said I
don't know if we handle
it correctly, and George confirmed we do. Sorry for the false alarm.



Great.  ;-)  Thank you.

So the recommendation is still to mirror log devices, because the
recommendation will naturally be ultra-conservative.  ;-)  The risk is far
smaller now than it was before.  So make up your own mind.  If you are
willing to risk 5sec or 30sec of data in the situation of (a) undetected
failed log device *and* (b) ungraceful system crash, then you are willing to
run with unmirrored log devices.  In no situation does the filesystem become
inconsistent or corrupt.  In the worst case, you have a filesystem which is
consistent with a valid filesystem state, a few seconds before the system
crash.  (Assuming you have a zpool recent enough to support log device
removal.)

  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Neil Perrin

Arne,

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.

If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow but there is no other option for the ZIL. I'm surprised the wait is
30 seconds.
I would expect much less, but finding room for the rest of the txg data
and metadata

would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I
believe we give a recommended

maximum fill level, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, now the writes get served continuously as it should be.

What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Neil Perrin

I should also have mentioned that if the pool has a separate log device
then this shouldn't happen. Assuming the slog is big enough, it
should have enough blocks to not be forced into using main pool
device blocks.


Neil.

On 09/09/10 10:36, Neil Perrin wrote:

Arne,

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.

If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow but there is no other option for the ZIL. I'm surprised the wait
is 30 seconds.
I would expect much less, but finding room for the rest of the txg
data and metadata

would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I
believe we give a recommended

maximum fill level, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) 
blocks
until the txg is flushed. This means a write takes up to 30 seconds. 
During
this time, the nfs calls block, occupying all NFS server threads. 
With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to 
wait
until the writes finish, bringing the performance of the server 
effectively

down to zero.
It may be that the trigger for this behavior is around 95%. I managed 
to bring
the pool down to 95%, now the writes get served continuously as it 
should be.


What is the explanation for this behaviour? Is it intentional and can 
the

threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Neil Perrin

On 08/25/10 20:33, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum
does
not match then we assume it's the end of the intent log chain.
Using this design means we use the minimum number of writes.

So corruption of an intent log is not going to generate any errors.



I didn't know that.  Very interesting.  This raises another question ...

It's commonly stated, that even with log device removal supported, the most
common failure mode for an SSD is to blindly write without reporting any
errors, and only detect that the device is failed upon read.  So ... If an
SSD is in this failure mode, you won't detect it?  At bootup, the checksum
will simply mismatch, and we'll chug along forward, having lost the data ...
(nothing can prevent that) ... but we don't know that we've lost data?
  


- Indeed, we wouldn't know we lost data.


Worse yet ... In preparation for the above SSD failure mode, it's commonly
recommended to still mirror your log device, even if you have log device
removal.  If you have a mirror, and the data on each half of the mirror
doesn't match each other (one device failed, and the other device is good)
... Do you read the data from *both* sides of the mirror, in order to
discover the corrupted log device, and correctly move forward without data
loss?

  


Hmm, I need to check, but if we get a checksum mismatch then I don't 
think we try other
mirror(s). This is automatic for the 'main pool', but of course the ZIL 
code is different
by necessity. This problem can of course be fixed. (It will be  a week 
and a bit before I can

report back on this, as I'm on vacation).

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum does
not match then we assume it's the end of the intent log chain.
Using this design means the minimum number of writes needed to add
an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.

Neil.

On 08/23/10 10:41, StorageConcepts wrote:
Hello, 

we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases.


The headline above all our tests is: do we still need to mirror the ZIL with all
current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't export the pool,
and with the latest upstream you can also import a pool with a missing ZIL)? This question is
especially interesting with RAM-based devices, because they don't wear out, have a very
low bit error rate and use one PCIx slot - and those are rare. Price is another aspect here :)

During our tests we found a strange behaviour of ZFS ZIL failures which are not device related, and we are looking for help from the ZFS gurus here :)

The test in question is called offline ZIL corruption. The question is, what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down. For this we do: 


- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)

- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL  
~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the 
transaction number against the one recorded

We ran the test and it seems with modern snv_134 the pool comes up after the
corruption with all being ok, while ~1 Transactions (this is some seconds
of writes with DDRX1) are missing and nobody knows about this. We ran a scrub 
and scrub does not even detect this. ZFS automatically repairs the labels on 
the ZIL, however no error is reported about the missing data.

While it is clear to us that if we do not have a mirrored zil, the data we have 
overwritten in the zil is lost, we are really wondering why ZFS does not REPORT 
about this corruption, silently ignoring it.

Is this a bug or .. aehm ... a feature :) ?

Regards, 
Robert
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin

On 08/23/10 13:12, Markus Keil wrote:

Does that mean that when the beginning of the intent log chain gets corrupted, all
other intent log data after the corruption area is lost, because the checksum of
the first corrupted block doesn't match? 
  


- Yes, but you wouldn't want to replay the following entries in case the
log records

in the missing log block were important (e.g. a file create).

Mirroring the slogs is recommended to minimise concerns about slog
corruption.



 
Regards,

Markus

Neil Perrin neil.per...@oracle.com hat am 23. August 2010 um 19:44
geschrieben:

  

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum does
not match then we assume it's the end of the intent log chain.
Using this design means the minimum number of writes needed to add
an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.

Neil.

On 08/23/10 10:41, StorageConcepts wrote:


Hello,

we are currently extensively testing the DDRX1 drive for ZIL and we are going
through all the corner cases.

The headline above all our tests is: do we still need to mirror the ZIL with
all current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't
export the pool, and with the latest upstream you can also import a pool with a
missing ZIL)? This question is especially interesting with RAM-based
devices, because they don't wear out, have a very low bit error rate and use
one PCIx slot - and those are rare. Price is another aspect here :)

During our tests we found a strange behaviour of ZFS ZIL failures which are
not device related, and we are looking for help from the ZFS gurus here :)

The test in question is called offline ZIL corruption. The question is,
what happens if my ZIL data is corrupted while a server is transported or
moved and not properly shut down. For this we do:

- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print
the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL
 ~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the
transaction number against the one recorded

We ran the test and it seems with modern snv_134 the pool comes up after the
corruption with all being ok, while ~1 Transactions (this is some
seconds of writes with DDRX1) are missing and nobody knows about this. We
ran a scrub and scrub does not even detect this. ZFS automatically repairs
the labels on the ZIL, however no error is reported about the missing data.

While it is clear to us that if we do not have a mirrored zil, the data we
have overwritten in the zil is lost, we are really wondering why ZFS does
not REPORT about this corruption, silently ignoring it.

Is this a bug or .. aehm ... a feature :) ?

Regards,
Robert
   
  


--
StorageConcepts Europe GmbH
    Storage: Beratung. Realisierung. Support     


Markus Keil            k...@storageconcepts.de
                       http://www.storageconcepts.de
Wiener Straße 114-116  Telefon:   +49 (351) 8 76 92-21
01219 Dresden          Telefax:   +49 (351) 8 76 92-99
Handelsregister Dresden, HRB 28281
Geschäftsführer: Robert Heinzmann, Gerd Jelinek
--
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-09 Thread Neil Perrin

On 07/09/10 19:40, Erik Trimble wrote:

On 7/9/2010 5:18 PM, Brandon High wrote:
On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey 
solar...@nedharvey.com mailto:solar...@nedharvey.com wrote:


The default ZFS block size is 128K.  If you have a filesystem
with 128G used, that means you are consuming 1,048,576 blocks,
each of which must be checksummed.  ZFS uses adler32 and sha256,
which means 4bytes and 32bytes ...  36 bytes * 1M blocks = an
extra 36 Mbytes and some fluff consumed by enabling dedup.

 


I suspect my numbers are off, because 36Mbytes seems impossibly
small.  But I hope some sort of similar (and more correct) logic
will apply.  ;-)


I think that DDT entries are a little bigger than what you're using. 
The size seems to range between 150 and 250 bytes depending on how 
it's calculated, call it 200b each. Your 128G dataset would require 
closer to 200M (+/- 25%) for the DDT if your data was completely 
unique. 1TB of unique data would require 600M - 1000M for the DDT.


The numbers are fuzzy of course, and assume only 128k blocks. Lots of
small files will increase the memory cost of dedupe, and using it on 
a zvol that has the default block size (8k) would require 16 times 
the memory.


-B




Go back and read several threads last month about ZFS/L2ARC memory 
usage for dedup. In particular, I've been quite specific about how to 
calculate estimated DDT size.  Richard has also been quite good at 
giving size estimates (as well as explaining how to see current block 
size usage in a filesystem).



The structure in question is this one:

ddt_entry

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I'd have to fire up an IDE to track down all the sizes of the 
ddt_entry structure's members, but I feel comfortable using Richard's 
270 bytes-per-entry estimate.




It must have grown a bit because on 64-bit x86 a ddt_entry is currently
0x178 = 376 bytes:


# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic 
cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs sata sd ip hook neti 
sockfs arp usba fctl random cpc fcip nfs lofs ufs logindmux ptm sppp ipc ]

> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-07-02 Thread Neil Perrin

On 07/02/10 00:57, Erik Trimble wrote:

On 7/1/2010 10:17 PM, Neil Perrin wrote:

On 07/01/10 22:33, Erik Trimble wrote:

On 7/1/2010 9:23 PM, Geoff Nordli wrote:

Hi Erik.

Are you saying the DDT will automatically look to be stored in an 
L2ARC device if one exists in the pool, instead of using ARC?


Or is there some sort of memory pressure point where the DDT gets 
moved from ARC to L2ARC?


Thanks,

Geoff


Good question, and I don't know.  My educated guess is the latter 
(initially stored in ARC, then moved to L2ARC as size increases).


Anyone?



The L2ARC just holds blocks that have been evicted from the ARC due
to memory pressure. The DDT is no different than any other object
(e.g. file). So when looking for a block ZFS checks first in the ARC 
then

the L2ARC and if neither succeeds reads from the main pool.

- Anyone.





That's what I assumed.  One further thought, though.  Is the DDT
treated as a single entity - so it's *all* either in the ARC or in the 
L2ARC?  Or does it move one entry at a time into the L2ARC as it fills 
the ARC?





It's not treated as a single entity but handled a block at a time.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-07-02 Thread Neil Perrin

On 07/02/10 11:14, Erik Trimble wrote:

On 7/2/2010 6:30 AM, Neil Perrin wrote:

On 07/02/10 00:57, Erik Trimble wrote:
That's what I assumed.  One further thought, though.  Is the DDT is 
treated as a single entity - so it's *all* either in the ARC or in 
the L2ARC?  Or does it move one entry at a time into the L2ARC as it 
fills the ARC?



It's not treated as a single entity but handled a block at a time.

Neil.


Where 1 block = ?
I'm assuming that more than one DDT entry will fit in a block (since
DDT entries are ~270 bytes) - but, how big does the block get?  
Depending on the total size of the DDT? Or does it use fixed-sized 
blocks (I'd assume the smallest block possible, in this case)?
- Yes, a pool block will contain many DDT entries. They are stored as
ZAP entries.
I assume, but I'm not sure, that ZAP blocks grow to the maximum SPA block
size (currently 128KB).





Which reminds me: the current DDT is stored on disk - correct? - so 
that when I boot up, ZFS loads a complete DDT into the ARC when the 
pool is mounted?  Or is it all constructed on the fly?


- It's read as needed on the fly.

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-07-01 Thread Neil Perrin

On 07/01/10 22:33, Erik Trimble wrote:

On 7/1/2010 9:23 PM, Geoff Nordli wrote:

Hi Erik.

Are you saying the DDT will automatically look to be stored in an 
L2ARC device if one exists in the pool, instead of using ARC?


Or is there some sort of memory pressure point where the DDT gets 
moved from ARC to L2ARC?


Thanks,

Geoff
   


Good question, and I don't know.  My educated guess is the latter 
(initially stored in ARC, then moved to L2ARC as size increases).


Anyone?



The L2ARC just holds blocks that have been evicted from the ARC due
to memory pressure. The DDT is no different than any other object
(e.g. file). So when looking for a block ZFS checks first in the ARC then
the L2ARC and if neither succeeds reads from the main pool.

- Anyone.
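A hedged way to watch where cached blocks actually sit (the kstat statistic
names are assumed from the arcstats kstat of that era):

    kstat -p zfs:0:arcstats:size       # bytes currently held in the ARC
    kstat -p zfs:0:arcstats:l2_size    # bytes currently held on the L2ARC device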


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Neil Perrin

On 06/14/10 12:29, Bob Friesenhahn wrote:

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


It is good to keep in mind that only small writes go to the dedicated
slog. Large writes go to main store. A succession of that many small
writes (to fill RAM/2) is highly unlikely. Also, the zil is not
read back unless the system is improperly shut down.


I thought all sync writes, meaning everything NFS and iSCSI, went 
into the slog - IIRC the docs say so.


Check a month or two back in the archives for a post by Matt Ahrens. 
It seems that larger writes (32k?) are written directly to main 
store.  This is probably a change from the original zfs design.


Bob


If there's a slog then the data, regardless of size, gets written to the 
slog.

If there's no slog and the data size is greater than 
zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) 
then the data is written as a block into the pool and the block pointer 
is written into the log record. This is the WR_INDIRECT write type.

So Matt and Roy are both correct.

But wait, there's more complexity!:

If logbias=throughput is set we always use WR_INDIRECT.

If we just wrote more than 1MB for a single zil commit and there's more 
than 2MB waiting then we start using the main pool.

Clear as mud?  This is likely to change again...
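
To put those rules in one place, here's a rough sketch of the decision in
pseudo-C. This is illustrative only, not the actual ZIL source; apart from
WR_INDIRECT and zfs_immediate_write_sz, which are named above, everything
here is made up:

#include <stddef.h>

typedef enum {
        PUT_DATA_IN_LOG_BLOCK,  /* data embedded in the intent log block   */
        WR_INDIRECT             /* data block in pool, pointer in the log  */
} placement_t;

static placement_t
choose_placement(size_t len, int have_slog, int logbias_throughput)
{
        const size_t zfs_immediate_write_sz = 32 * 1024;

        if (logbias_throughput)
                return (WR_INDIRECT);
        if (!have_slog && len > zfs_immediate_write_sz)
                return (WR_INDIRECT);
        return (PUT_DATA_IN_LOG_BLOCK);
}

/*
 * Separately: even with a slog, once a single zil commit has written more
 * than ~1MB and more than ~2MB is still waiting, further log blocks for
 * that commit are allocated from the main pool.
 */
static int
fall_back_to_main_pool(size_t written, size_t waiting)
{
        return (written > (size_t)1 << 20 && waiting > (size_t)2 << 20);
}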

Neil.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Neil Perrin

On 06/14/10 19:35, Erik Trimble wrote:

On 6/14/2010 12:10 PM, Neil Perrin wrote:

On 06/14/10 12:29, Bob Friesenhahn wrote:

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


It is good to keep in mind that only small writes go to the dedicated
slog. Large writes to to main store. A succession of that many small
writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
read back unless the system is improperly shut down.


I thought all sync writes, meaning everything NFS and iSCSI, went 
into the slog - IIRC the docs says so.


Check a month or two back in the archives for a post by Matt Ahrens. 
It seems that larger writes (32k?) are written directly to main 
store.  This is probably a change from the original zfs design.


Bob


If there's a slog then the data, regardless of size, gets written to 
the slog.

If there's no slog and the data size is greater than 
zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) 
then the data is written as a block into the pool and the block pointer 
is written into the log record. This is the WR_INDIRECT write type.

So Matt and Roy are both correct.

But wait, there's more complexity!:

If logbias=throughput is set we always use WR_INDIRECT.

If we just wrote more than 1MB for a single zil commit and there's 
more than 2MB waiting then we start using the main pool.

Clear as mud?  This is likely to change again...

Neil.



How do I monitor the amount of live (i.e. non-committed) data in the 
slog?  I'd like to spend some time with my setup, seeing exactly how 
much I tend to use.


I think monitoring the capacity when running 'zpool iostat -v pool 1' 
should be fairly accurate.
A simple d script can be written to determine how often the ZIL (code) 
fails to get a slog block and has to resort to allocating from the main pool.

One recent change reduced the amount of data written and possibly the 
slog block fragmentation.

This is zpool version 23: Slim ZIL. So be sure to experiment with that.




I'd suspect that very few use cases call for more than a couple (2-4) 
GB of slog...


I agree this is typically true. Of course it depends on your workload. 
The amount of slog data will reflect the uncommitted synchronous txg data, 
and the size of each txg will depend on memory size.

This area is also undergoing tuning.


I'm trying to get hard numbers as I'm working on building a 
DRAM/battery/flash slog device in one of my friend's electronics 
prototyping shops.  It would be really nice if I could solve 99% of 
the need with 1 or 2 2GB SODIMMs and the chips from a cheap 4GB USB 
thumb drive...




Sounds like fun. Good luck.

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working

2010-06-12 Thread Neil Perrin

On 06/12/10 17:13, zfsnoob4 wrote:

Thanks. As I discovered from that post, VB does not have cache flush enabled by 
default. Ignoreflush must be explicitly turned off.

VBoxManage setextradata VMNAME 
VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush 0

where VMNAME is the name of your virtual machine.


Although I tried that, it returned with no output (indicating it worked), but 
it still won't detect a pool that has been destroyed. Is there any way to 
detect if flushes are working from inside the OS? Maybe a command that tells 
you if cacheflush is enabled?

Thanks.
  
You also need the -D flag. I could successfully import. This was 
running the latest bits:


: trasimene ; mkdir /pf
: trasimene ; mkfile 100m /pf/a /pf/b /pf/c
: trasimene ; zpool create whirl /pf/a /pf/b log /pf/c
: trasimene ; zpool destroy whirl
: trasimene ; zpool import -D -d /pf
  pool: whirl
    id: 1406684148029707587
 state: ONLINE (DESTROYED)
action: The pool can be imported using its name or numeric identifier.
config:

        whirl     ONLINE
          /pf/a   ONLINE
          /pf/b   ONLINE
        logs
          /pf/c   ONLINE
: trasimene ; zpool import -D -d /pf whirl
: trasimene ; zpool status whirl
  pool: whirl
 state: ONLINE
 scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        whirl     ONLINE       0     0     0
          /pf/a   ONLINE       0     0     0
          /pf/b   ONLINE       0     0     0
        logs
          /pf/c   ONLINE       0     0     0

errors: No known data errors
: trasimene ;


It would, of course, have been easier if you'd been using real devices
but I understand you want to experiment first...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working

2010-06-11 Thread Neil Perrin

On 06/11/10 22:07, zfsnoob4 wrote:

Hey,

I'm running some test right now before setting up my server. I'm running 
Nexenta Core 3.02 (RC2, based on opensolaris build 134 I believe) in Virtualbox.

To do the test, I'm creating three empty files and then making a raidz mirror:
mkfile -n 1g /foo
mkfile -n 1g /foo1
mkfile -n 1g /foo2

Then I make a zpool:
zpool create testpool raidz /foo /foo1 /foo2

Now I destroy the pool and attempt to restore it:
zpool destroy testpool

But when I try to list available imports, the list is empty:
zpool import -D
returns nothing.

zpool import testpool
also returns nothing.

Even if I try to export the pool (so before destroying it):
zpool export testpool

I see it disappear from the zpool list, but I can't import it (commands return 
nothing).

Is this due to the fact that I'm using test files instead of real drives?


- Yes.

zpool import will by default look in /dev/dsk.
You need to specify the directory (using -d dir) if your pool devices are
located elsewhere. See man zpool.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-26 Thread Neil Perrin

On 05/26/10 07:10, sensille wrote:

Recently, I've been reading through the ZIL/slog discussion and
have the impression that a lot of folks here are (like me)
interested in getting a viable solution for a cheap, fast and
reliable ZIL device.
I think I can provide such a solution for about $200, but it
involves a lot of development work.
The basic idea: the main problem when using a HDD as a ZIL device
are the cache flushes in combination with the linear write pattern
of the ZIL. This leads to a whole rotation of the platter after
each write, because after the first write returns, the head is
already past the sector that will be written next.
My idea goes as follows: don't write linearly. Track the rotation
and write to the position the head will hit next. This might be done
by a re-mapping layer or integrated into ZFS. This works only because
ZIL devices are basically write-only. Reads from this device will be
horribly slow.

I have done some testing and am quite enthusiastic. If I take a
decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise
the synchronous write performance from 166 writes/s to about
2000 writes/s (!). 2000 IOPS is more than sufficient for our
production environment.

Currently I'm implementing a re-mapping driver for this. The
reason I'm writing to this list is that I'd like to find support
from the zfs team, find sparring partners to discuss implementation
details and algorithms and, most important, find testers!

If there is interest it would be great to build an official project
around it. I'd be willing to contribute most of the code, but any
help will be more than welcome.

So, anyone interested? :)

--
Arne Jansen



Yes, I agree this seems very appealing. I have investigated and
observed similar results. Just allocating larger intent log blocks but
only writing to say the first half of them has seen the same effect.
Despite the impressive results, we have not pursued this further mainly
because of its maintainability. There is quite a variance between
drives so, as mentioned, feedback profiling of the device is needed
in the working system. The layering of the Solaris IO subsystem doesn't
provide the feedback necessary and the ZIL code is layered on the SPA/DMU.
Still it should be possible. Good luck!

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Neil Perrin

On 04/10/10 09:28, Edward Ned Harvey wrote:


Neil or somebody?  Actual ZFS developers?  Taking feedback here?   ;-)

 

While I was putting my poor little server through cruel and unusual 
punishment as described in my post a moment ago, I noticed something 
unexpected:


 

I expected that while I'm stressing my log device by infinite sync 
writes, my primary storage devices would also be busy(ish).  Not 
really busy, but not totally idle either.  Since the primary storage 
is a stripe of spindle mirrors, obviously it can handle much more 
sustainable throughput than the individual log device, but the log 
device can respond with smaller latency.  What I noticed was this:


 

For several seconds, **only** the log device is busy.  Then it stops, 
and for maybe 0.5 secs **only** the primary storage disks are busy.  
Repeat, recycle.




These are the txgs getting pushed out.


 

I expected to see the log device busy nonstop.  And the spindle disks 
blinking lightly.  As long as the spindle disks are idle, why wait for 
a larger TXG to be built?  Why not flush out smaller TXG's as long as 
the disks are idle?


Sometimes it's more efficient to batch up requests. Fewer blocks are 
written. As you mentioned, you weren't stressing the system heavily.
ZFS will perform differently when under pressure. It will shorten the 
time between txgs if the data arrives quicker.


  But worse yet ... During the 1-second (or 0.5 second) that the 
spindle disks are busy, why stop the log device?  (Presumably also 
stopping my application that's doing all the writing.)


Yes, this has been observed by many people. There are two sides to this 
problem, related to the CPU and IO used while pushing a txg:

6806882 need a less brutal I/O scheduler
6881015 ZFS write activity prevents other threads from running in a 
timely manner


The CPU side (6881015) was fixed relatively recently in snv_129.

 

This means, if I'm doing zillions of **tiny** sync writes, I will get 
the best performance with the dedicated log device present.  But if 
I'm doing large sync writes, I would actually get better performance 
without the log device at all.  Or else ... add just as many log 
devices as I have primary storage devices.  Which seems kind of crazy.


Yes, you're right, there are times when it's better to bypass the slog 
and use the pool disks, which can deliver better bandwidth.

The algorithm for where and what the ZIL writes has become quite complex:

- There was another change recently to bypass the slog if 1MB had been 
sent to it and 2MB were waiting to be sent.
- There's a new property logbias which, when set to throughput, directs 
the ZIL to send all of its writes to the main pool devices, thus freeing 
the slog for more latency sensitive work (ideal for database data files).
- If synchronous writes are large (greater than 32K) and block aligned 
then the blocks are written directly to the pool and a small record is 
written to the log. Later, when the txg commits, the blocks are just 
linked into the txg. However, this processing is not done if there are 
any slogs because I found it didn't perform as well. Probably ought to 
be re-evaluated.
- There are further tweaks being suggested which might make it to a 
ZIL near you soon.


Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Neil Perrin

On 04/10/10 14:55, Daniel Carosone wrote:

On Sat, Apr 10, 2010 at 11:50:05AM -0500, Bob Friesenhahn wrote:
Huge synchronous bulk writes are pretty rare since usually the 
bottleneck is elsewhere, such as the ethernet.



Also, large writes can go straight to the pool, and the zil only logs
the intent to commit those blocks (ie, link them into the zfs data
structure).   I don't recall what the threshold for this is, but I
think it's one of those Evil Tunables.
  

This is zfs_immediate_write_sz which is 32K.  However this only happens
currently if you don't have any slogs. If logbias is set to throughput then
all writes go straight to the pool regardless of zfs_immediate_write_sz.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Neil Perrin

On 04/07/10 09:19, Bob Friesenhahn wrote:

On Wed, 7 Apr 2010, Robert Milkowski wrote:


it is only read at boot if there is uncommitted data on it - during 
normal reboots zfs won't read data from the slog.


How does zfs know if there is uncomitted data on the slog device 
without reading it?  The minimal read would be quite small, but it 
seems that a read is still required.


Bob


If there's ever been synchronous activity then there is an empty tail block 
(stubby) that will be read even after a clean shutdown.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Neil Perrin

On 04/07/10 10:18, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Bob Friesenhahn

It is also worth pointing out that in normal operation the slog is
essentially a write-only device which is only read at boot time.  The
writes are assumed to work if the device claims success.  If the log
device fails to read (oops!), then a mirror would be quite useful.



An excellent point.

BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect failure of log devices in advance, that are claiming to write
correctly, but which are really unreadable.


A scrub will read the log blocks but only for unplayed logs.
Because of the transient nature of the log, and because it operates
outside of the transaction group model, it's hard to read the in-flight 
log blocks to validate them.

There have previously been suggestions to read slogs periodically.
I don't know if there's a CR raised for this though.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Neil Perrin

On 04/05/10 11:43, Andreas Höschler wrote:

Hi Khyron,

No, he did *not* say that a mirrored SLOG has no benefit, 
redundancy-wise.

He said that YOU do *not* have a mirrored SLOG.  You have 2 SLOG devices
which are striped.  And if this machine is running Solaris 10, then 
you cannot remove a log device because those updates have not made their 
way into Solaris 10 yet.  You need pool version >= 19 to remove log devices, 
and S10 does not currently have patches to ZFS to get to a pool version >= 19.

If your SLOG above were mirrored, you'd have "mirror" under "logs".  And you
probably would have "log" not "logs" - notice the "s" at the end meaning plural,
meaning multiple independent log devices, not a mirrored pair of logs which
would effectively look like 1 device.


Thanks for the clarification! This is very annoying. My intent was to 
create a log mirror. I used


zpool add tank log c1t6d0 c1t7d0

and this was obviously false. Would

zpool add tank mirror log c1t6d0 c1t7d0


zpool add tank log mirror c1t6d0 c1t7d0

You can also do it on the create:

zpool create tank <pool devs> log mirror c1t6d0 c1t7d0



have done what I intended to do? If so it seems I have to tear down 
the tank pool and recreate it from scratch!? Can I simply use


zpool destroy -f tank

to do so?


Shouldn't need the -f




Thanks,

 Andreas

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Neil Perrin

On 04/02/10 08:24, Edward Ned Harvey wrote:

The purpose of the ZIL is to act like a fast log for synchronous
writes.  It allows the system to quickly confirm a synchronous write
request with the minimum amount of work.  



Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Can anyone claim "I can
answer this question, I wrote that code, or at least have read it"?


I'm one of the ZFS developers. I wrote most of the zil code.
Still, I don't have all the answers. There are a lot of knowledgeable people
on this alias. I usually monitor this alias and sometimes chime in
when there's some misinformation being spread, but sometimes the volume 
is just too high.

Since I started this reply there have been 20 new posts on this thread alone!


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls? 


- The intent log (separate device(s) or not) is only used by fsync, 
O_DSYNC, O_SYNC, O_RSYNC.

NFS commits are seen by ZFS as fsyncs.
Note sync(1m) and sync(2s) do not use the intent log. They force 
transaction group (txg) commits on all pools. So zfs goes beyond the 
requirement for sync(), which only requires that the writing be scheduled, 
not necessarily completed, before returning.
The zfs interpretation is rather expensive, but the weaker one seemed 
broken, so we fixed it.
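
As a concrete illustration of the distinction (a minimal sketch; the path
names are made up and error handling is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        char buf[8192];
        memset(buf, 'x', sizeof (buf));

        /*
         * Asynchronous write: cached in the ARC and written out with a
         * later txg commit; the intent log is not involved.
         */
        int afd = open("/tank/fs/async.dat", O_WRONLY | O_CREAT, 0644);
        if (afd == -1 || write(afd, buf, sizeof (buf)) == -1)
                perror("async write");

        /*
         * Synchronous write: O_DSYNC forces an intent log commit before
         * write(2) returns; fsync(2) on an ordinary fd has the same effect.
         */
        int sfd = open("/tank/fs/sync.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (sfd == -1 || write(sfd, buf, sizeof (buf)) == -1)
                perror("sync write");
        if (afd != -1 && fsync(afd) == -1)  /* force the earlier async data stable */
                perror("fsync");

        if (afd != -1)
                close(afd);
        if (sfd != -1)
                close(sfd);
        return 0;
}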


Is it ever used to accelerate async writes?



The zil is not used to accelerate async writes.


Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.


Threads can be pre-empted in the OS at any time. So even though thread A 
issued W1 before thread B issued W2, the order is not guaranteed to arrive 
at ZFS as W1, W2.

Multi-threaded applications have to handle this.

If this was a single thread issuing W1 then W2, then yes, the order is 
guaranteed regardless of whether W1 or W2 are synchronous or asynchronous.
Of course if the system crashes then the async operations might not be 
there.



I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?


- Kind of. The uberblock contains the root of the txg.



At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?


A txg is for the whole pool which can contain many filesystems.
The latest txg defines the current state of the pool and each individual fs.


My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.


Correct (except replace sync() with O_DSYNC, etc).
This also assumes hardware that, for example, correctly handles the 
flushing of its caches.



  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.


The ZIL doesn't make such guarantees. It's the DMU that handles transactions
and their grouping into txgs. It ensures that writes are committed in order
by its transactional nature.

The function of the zil is to merely ensure that synchronous operations are
stable and replayed after a crash/power fail onto the latest txg.


Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

No, disabling the ZIL does not disable the DMU.


Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you 

Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Neil Perrin

On 03/30/10 20:00, Bob Friesenhahn wrote:

On Tue, 30 Mar 2010, Edward Ned Harvey wrote:


But the speedup of disabling the ZIL altogether is
appealing (and would
probably be acceptable in this environment).


Just to make sure you know ... if you disable the ZIL altogether, and 
you
have a power interruption, failed cpu, or kernel halt, then you're 
likely to

have a corrupt unusable zpool, or at least data corruption.  If that is
indeed acceptable to you, go nuts.  ;-)


I believe that the above is wrong information as long as the devices 
involved do flush their caches when requested to.  Zfs still writes 
data in order (at the TXG level) and advances to the next transaction 
group when the devices written to affirm that they have flushed their 
cache.  Without the ZIL, data claimed to be synchronously written 
since the previous transaction group may be entirely lost.


If the devices don't flush their caches appropriately, the ZIL is 
irrelevant to pool corruption.


Bob

Yes Bob is correct - that is exactly how it works.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance benchmarks in various configurations

2010-02-19 Thread Neil Perrin



If I understand correctly, ZFS nowadays will only flush data to
non volatile storage (such as a RAID controller NVRAM), and not
all the way out to disks. (To solve performance problems with some
storage systems, and I believe that it also is the right thing
to do under normal circumstances.)

Doesn't this mean that if you enable write back, and you have
a single, non-mirrored raid-controller, and your raid controller
dies on you so that you lose the contents of the nvram, you have
a potentially corrupt file system?


ZFS requires that all writes be flushed to non-volatile storage.
This is needed for both transaction group (txg) commits to ensure pool integrity
and for the ZIL to satisfy the synchronous requirement of fsync/O_DSYNC etc.
If the caches weren't flushed then it would indeed be quicker but the pool
would be susceptible to corruption. Sadly some hardware doesn't honour
cache flushes and this can cause corruption.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-09 Thread Neil Perrin



On 02/09/10 08:18, Kjetil Torgrim Homme wrote:

Richard Elling richard.ell...@gmail.com writes:


On Feb 8, 2010, at 9:10 PM, Damon Atkins wrote:


I would have thought that if I write 1k then the ZFS txg times out in
30secs, then the 1k will be written to disk in a 1k record block, and
then if I write 4k then 30secs later when the txg happens another 4k record
size block will be written, and then if I write 130k a 128k and 2k
record block will be written.

Making the file have record sizes of
1k+4k+128k+2k

Close. Once the max record size is achieved, it is not reduced.  So
the allocation is: 1KB + 4KB + 128KB + 128KB


I think the above is easily misunderstood.  I assume the OP means
append, not rewrites, and in that case (with recordsize=128k):

* after the first write, the file will consist of a single 1 KiB record.
* after the first append, the file will consist of a single 5 KiB
  record.


Good so far.


* after the second append, one 128 KiB record and one 7 KiB record.


A long time ago we used to write short tail blocks, but not any more.
So after the 2nd append we actually have 2 128KB blocks.



in each of these operations, the *whole* file will be rewritten to a new
location, but after a third append, only the tail record will be
rewritten.


So after the third append we'd actually have 3 128KB blocks. The first doesn't
need to be re-written.
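
A toy model of the append behaviour described here (illustration only,
assuming recordsize=128K and the current no-short-tail-block behaviour):

#include <stddef.h>
#include <stdio.h>

#define RECORDSIZE      (128 * 1024)

static void
report(size_t filesize)
{
        if (filesize <= RECORDSIZE) {
                /* small files live in a single block sized to the file */
                printf("%zu bytes -> 1 block of %zu bytes\n", filesize, filesize);
        } else {
                /* past recordsize, every block is a full 128K record */
                size_t nblocks = (filesize + RECORDSIZE - 1) / RECORDSIZE;
                printf("%zu bytes -> %zu blocks of %d bytes\n",
                    filesize, nblocks, RECORDSIZE);
        }
}

int
main(void)
{
        report(1024);                       /* first 1k write        */
        report(1024 + 4096);                /* after the 4k append   */
        report(1024 + 4096 + 128 * 1024);   /* after the 128k append */
        return 0;
}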

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL to disk

2010-01-15 Thread Neil Perrin



On 01/15/10 12:59, Jeffry Molanus wrote:

Sometimes people get confused about the ZIL and separate logs. For
sizing purposes,
the ZIL is a write-only workload.  Data which is written to the ZIL is
later asynchronously
written to the pool when the txg is committed.


Right; the txg needs time to transfer the ZIL.


I think you misunderstand the function of the ZIL. It's not a journal,
and doesn't get transferred to the pool as part of a txg. It's only ever 
written, except that after a crash it's read to do replay. See:

http://blogs.sun.com/perrin/entry/the_lumberjack





The ZFS write performance for this configuration should consistently
be greater than 80 IOPS.  We've seen measurements in the 600 write
IOPS range.  Why?  Because ZFS writes tend to be contiguous. Also,
with the SATA disk write cache enabled, bursts of writes are handled
quite nicely.
 -- richard


Is there a method to determine this value before pool configuration? Some sort 
of rule of thumb? It would be sad if you configure the pool and have to 
reconfigure later on because you discover the pool can't handle the txg 
commits from SSD to disk fast enough. In other words; with Y as expected load 
you would require a minimum of X mirror devs or X raid-z vdevs in order to have 
a pool with enough bandwidth/IO to flush the ZIL without stalling the system.


Jeffry


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-13 Thread Neil Perrin

Hi Adam,

So was FW aware of this or in contact with these guys?
Also are you requesting/ordering any of these cards to evaluate?

The device seems kind of small at 4GB, and uses a double wide PCI Express slot.

Neil.

On 01/13/10 12:27, Adam Leventhal wrote:

Hey Chris,


The DDRdrive X1 OpenSolaris device driver is now complete,
please join us in our first-ever ZFS Intent Log (ZIL) beta test 
program.  A select number of X1s are available for loan,
preferred candidates would have a validation background 
and/or a true passion for torturing new hardware/driver :-)


We are singularly focused on the ZIL device market, so a test
environment bound by synchronous writes is required.  The
beta program will provide extensive technical support and a
unique opportunity to have direct interaction with the product
designers.


Congratulations! This is great news for ZFS. I'll be very interested to
see the results members of the community can get with your device as part
of their pool. COMSTAR iSCSI performance should be dramatically improved
in particular.

Adam

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on ssd

2009-12-11 Thread Neil Perrin



On 12/11/09 14:56, Bill Sommerfeld wrote:

On Fri, 2009-12-11 at 13:49 -0500, Miles Nordin wrote:

sh == Seth Heeren s...@zfs-fuse.net writes:

sh If you don't want/need log or cache, disable these? You might
sh want to run your ZIL (slog) on ramdisk.

seems quite silly.  why would you do that instead of just disabling
the ZIL?  I guess it would give you a way to disable it pool-wide
instead of system-wide.

A per-filesystem ZIL knob would be awesome.


for what it's worth, there's already a per-filesystem ZIL knob: the
logbias property.  It can be set either to "latency" or "throughput".


That's a bit different. logbias controls whether the intent log blocks
go to the main pool or the log devices (if they exist). I think Miles
was requesting a per fs knob to disable writing any log blocks.
A proposal for this exists that suggests a new sync property: everything
synchronous; everything not synchronous (ie zil disabled on the fs); and the
current behaviour (the default). The RFE is:

6280630 zil synchronicity
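
For reference, the three behaviours that proposal describes could be
pictured roughly as follows (an illustrative enum only, not the RFE's
actual names or values):

typedef enum {
        SYNC_STANDARD,  /* current behaviour: honour O_DSYNC/fsync etc. */
        SYNC_ALWAYS,    /* treat every write as synchronous             */
        SYNC_DISABLED   /* zil disabled for that file system            */
} zil_sync_prop_t;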

My problem with implementing this is that people might actually use it!
Well actually my concern is more that it will be misused.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Planed ZFS-Features - Is there a List or something else

2009-12-09 Thread Neil Perrin



On 12/09/09 13:52, Glenn Lagasse wrote:

* R.G. Keen (k...@geofex.com) wrote:

I didn't see remove a simple device anywhere in there.

Is it:
too hard to even contemplate doing, 
or

too silly a thing to do to even consider letting that happen
or 
too stupid a question to even consider

or
too easy and straightforward to do the procedure I see recommended (export the 
whole pool, destroy the pool, remove the device, remake the pool, then reimport 
the pool) to even bother with?


You missed:

Too hard to do correctly with current resource levels and other higher
priority work.

As always, volunteers I'm sure are welcome. :-)



This gives the impression that development is not actively working
on it. This is not true. As has been said often it is a difficult problem
and has been actively worked on for a few months now. I don't think
we are prepared to give a date as to when it will be delivered though.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Transaction consistency of ZFS

2009-12-06 Thread Neil Perrin



On 12/06/09 10:11, Anurag Agarwal wrote:

Hi,

My reading of write code of ZFS (zfs_write in zfs_vnops.c), is that all 
the writes in zfs are logged in the ZIL.


Each write gets recorded in memory in case it needs to be forced out
later (eg fsync()), but is not written to the on-disk log until then,
or until the transaction group which contains the write commits,
in which case the in-memory transaction is discarded.


And if that indeed is the case, 
then yes, ZFS does guarantee the sequential consistency, even when there 
is a power outage or server crash. You might lose some writes if the ZIL has 
not committed to disk. But that would not change the sequential 
consistency guarantee.


There is no need to do a fsync or open the file with O_SYNC. It should 
work as it is.


I have not done any experiments to verify this, so please take my 
observation with pinch of salt.

Any ZFS developers to verify or refute this.

Regards,
Anurag.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Transaction consistency of ZFS

2009-12-06 Thread Neil Perrin



I'll try to find out whether ZFS always binds the same file to the same
open transaction group.


Not sure what you mean by this. Transactions (eg writes) will go into
the current open transaction group (txg). Subsequent writes may enter
the same or a future txg. Txgs are obviously committed in order.
So writes are not committed out of order. The txg commit is all or nothing,
so on a crash you get to see all the transactions in that txg or none.
I think this answers your original question/concern.


If so, I guess my assumption here would be true.
Seems like there is only one opening transaction group at anytime.
Can anybody give me a definitive answer here?


ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing.
Transactions enter in Open. Quiescing is where a new Open stage has
started and waits for transactions that have yet to commit to finish.
Syncing is where all the completed transactions are pushed to the pool
in an atomic manner with the last write being the root of the new tree
of blocks (uberblock).

All the guarantees assume good hardware. As part of the new uberblock update
we flush the write caches of the pool devices. If this is broken all bets
are off.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on ssd

2009-12-05 Thread Neil Perrin



On 12/05/09 01:36, anu...@kqinfotech.com wrote:

Hi,

What you say is probably right with respect to L2ARC, but logging (ZIL or 
database log) is required for consistency purposes.


No, the ZIL is not required for consistency. The pool is fully consistent 
without
the ZIL. See  http://blogs.sun.com/perrin/entry/the_lumberjack for more details.

Neil. 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Separate Zil on HDD ?

2009-12-03 Thread Neil Perrin



On 12/03/09 09:21, mbr wrote:

Hello,

Bob Friesenhahn wrote:

On Thu, 3 Dec 2009, mbr wrote:


What about the data that was on the ZILlog SSD at the time of failure: is
a copy of the data still in the machine's memory, from where it can be 
used to put the transaction to the stable storage pool?


The intent log SSD is used as 'write only' unless the system reboots, 
in which case it is used to support recovery.  The system memory is 
used as the write path in the normal case.  Once the data is written 
to the intent log, then the data is declared to be written as far as 
higher level applications are concerned.


thank you Bob for the clarification.
So I don't need a mirrored ZILlog for security reasons, all the information
is still in memory and will be used from there by default if only the 
ZILlog SSD fails.


Mirrored log devices are advised to improve reliability. As previously mentioned,
if during writing a log device fails or is temporarily full then we use the 
main pool devices to chain the log blocks. If we get read errors when trying 
to replay the intent log (after a crash/power fail) then the admin is given 
the option to ignore the log and continue, or somehow fix the device (eg 
re-attach) and then retry.
Multiple log devices would provide extra reliability here.
We do not look in memory for the log records if we can't get the records
from the log blocks.



If the intent log SSD fails and the system spontaneously reboots, then 
data may be lost.


I can live with the data loss as long as the machine comes up with the 
faulty ZILlog SSD but otherwise without probs and with a clean zpool.


The log records are not required for consistency of the pool (it's not a 
journal).



Has the following error no consequences?

 Bug ID 6538021
 Synopsis   Need a way to force pool startup when zil cannot be replayed
 State  3-Accepted (Yes, that is a problem)
 Link   
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6538021


Er that bug should probably be closed as a duplicate.
We now have this functionality.



Michael.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is write(2) made durable atomically?

2009-11-30 Thread Neil Perrin





Under the hood in ZFS, writes are committed using either shadow paging or
logging, as I understand it. So I believe that I mean to ask whether a
write(2), pushed to ZPL, and pushed on down the stack, can be split into
multiple transactions? Or, instead, is it guaranteed to be committed in a
single transaction, and so committed atomically?


A write made through the ZPL (zfs_write()) will be broken into transactions
that contain at most 128KB of user data. So a large write could well be split
across transaction groups, and thus committed separately. 
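
A toy illustration of that chunking (not zfs_write() itself; the names
here are made up):

#include <stddef.h>
#include <stdio.h>

#define MAX_TX_DATA     (128 * 1024)

static void
illustrate_write(size_t offset, size_t len)
{
        int tx = 0;

        while (len > 0) {
                size_t chunk = len < MAX_TX_DATA ? len : MAX_TX_DATA;
                /*
                 * Each chunk is its own transaction and may end up in a
                 * different txg, so a crash can leave a partial write.
                 */
                printf("tx %d: offset %zu, %zu bytes\n", tx++, offset, chunk);
                offset += chunk;
                len -= chunk;
        }
}

int
main(void)
{
        illustrate_write(0, 1 << 20);   /* a 1MB write becomes 8 transactions */
        return 0;
}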


Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and NFS

2009-11-20 Thread Neil Perrin



On 11/18/09 12:21, Joe Cicardo wrote:

Hi,

My customer says:


The application has NFS directories with millions of files in a directory, 
and this can't be changed.
We are having issues with the EMC appliance and RPC timeouts on the NFS 
lookup. What I am looking at doing is moving one of the major NFS exports 
to a Sun 25k using VCS to cluster a ZFS RAIDZ that is then NFS exported.


For performance I am looking at disabling ZIL, since these files have 
almost identical names.


I think there's some confusion about the function of the ZIL
because having files with identical names is irrelevant to the ZIL.
Perhaps the customer is thinking of the DNLC, which is a cache of name lookups.
The ZIL does handle changes to these NFS files though, as the NFS protocol
requires they be on stable storage after most NFS operations.

We don't recommend disabling the ZIL as this can lead to
user data integrity issues. This is not the same as zpool corruption.
One way to speed the ZIL up is to use a SSD as a separate log device.

You can check how much activity is going through the ZIL by running zilstat:

  http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does ZFS work with SAN-attached devices?

2009-10-13 Thread Neil Perrin


Also, ZFS does things like putting the ZIL data (when not on a dedicated 
device) at the outer edge of disks, that being faster.


No, ZFS does not do that. It will chain the intent log from blocks allocated
from the same metaslabs that the pool is allocating from.
This actually works out well because there isn't a large seek back to the
beginning of the device. When the pool gets near full then there will be
a noticeable slowness - but then all file systems' performance suffers
when searching for space.

When the log is on a separate device it uses the same allocation scheme but
those blocks will tend to be allocated at the outer edge of the disk.
They only exist for a short time before getting freed, so the same
blocks gets re-used.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Neil Perrin



On 09/25/09 16:19, Bob Friesenhahn wrote:

On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufactures list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems 
wrong to me.  Previously we were advised that the slog is basically a 
log of uncommitted system calls so the size of the data chunks written 
to the slog should be similar to the data sizes in the system calls.


Log blocks are variable in size dependent on what needs to be committed.
The minimum size is 4KB and the max 128KB. Log records are aggregated
and written together as much as possible.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to verify if the ZIL is disabled

2009-09-23 Thread Neil Perrin



On 09/23/09 10:59, Scott Meilicke wrote:
How can I verify if the ZIL has been disabled or not? 


I am trying to see how much benefit I might get by using an SSD as a ZIL. I 
disabled the ZIL via the ZFS Evil Tuning Guide:

echo zil_disable/W0t1 | mdb -kw


- this only temporarily disables the zil until the reboot.
In fact it has no effect unless file systems are remounted as
the variable is only looked at on mount.



and then rebooted. However, I do not see any benefits for my NFS workload.


To set zil_disable from boot put the following in /etc/system and reboot:

set zfs:zil_disable=1

Neil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] lots of zil_clean threads

2009-09-21 Thread Neil Perrin

Nils,

A zil_clean() is started for each dataset after every txg.
This includes snapshots (which is perhaps a bit inefficient).
Still, zil_clean() is fairly lightweight if there's nothing
to do (grab a non-contended lock; find nothing on a list;
drop the lock & exit).

Neil.

On 09/21/09 08:08, Nils Goroll wrote:

Hi All,

out of curiosity: Can anyone come up with a good idea about why my 
snv_111 laptop computer should run more than 1000 zil_clean threads?


ff0009a9dc60 fbc2c0300 tq:zil_clean
ff0009aa3c60 fbc2c0300 tq:zil_clean
ff0009aa9c60 fbc2c0300 tq:zil_clean
ff0009aafc60 fbc2c0300 tq:zil_clean
ff0009ab5c60 fbc2c0300 tq:zil_clean
ff0009abbc60 fbc2c0300 tq:zil_clean
ff0009ac1c60 fbc2c0300 tq:zil_clean
  ::threadlist!grep zil_clean| wc -l
1037

Thanks, Nils

P.S.: Please don't spend too much time on this, for me, this question is 
really academic - but I'd be grateful for any good answers.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] lots of zil_clean threads

2009-09-21 Thread Neil Perrin

Thinking more about this I'm confused about what you are seeing.
The function dsl_pool_zil_clean() will serialise separate calls to
zil_clean() within a pool. I don't expect you have 1037 pools on your laptop!
So I don't know what's going on. What is the typical call stack for those
zil_clean() threads?

Neil.

On 09/21/09 08:53, Neil Perrin wrote:

Nils,

A zil_clean() is started for each dataset after every txg.
This includes snapshots (which is perhaps a bit inefficient).
Still, zil_clean() is fairly lightweight if there's nothing
to do (grab a non-contended lock; find nothing on a list;
drop the lock & exit).

Neil.

On 09/21/09 08:08, Nils Goroll wrote:

Hi All,

out of curiosity: Can anyone come up with a good idea about why my 
snv_111 laptop computer should run more than 1000 zil_clean threads?


ff0009a9dc60 fbc2c0300 tq:zil_clean
ff0009aa3c60 fbc2c0300 tq:zil_clean
ff0009aa9c60 fbc2c0300 tq:zil_clean
ff0009aafc60 fbc2c0300 tq:zil_clean
ff0009ab5c60 fbc2c0300 tq:zil_clean
ff0009abbc60 fbc2c0300 tq:zil_clean
ff0009ac1c60 fbc2c0300 tq:zil_clean
  ::threadlist!grep zil_clean| wc -l
1037

Thanks, Nils

P.S.: Please don't spend too much time on this, for me, this question 
is really academic - but I'd be grateful for any good answers.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Neil Perrin



On 09/04/09 09:54, Scott Meilicke wrote:

Roch Bourbonnais Wrote:
100% random writes produce around 200 IOPS with a 4-6 second pause
around every 10 seconds. 

This indicates that the bandwidth you're able to transfer
through the protocol is about 50% greater than the bandwidth
the pool can offer to ZFS. Since this is not sustainable, you
see here ZFS trying to balance the 2 numbers.


When I have tested using 50% reads, 60% random using iometer over NFS,
I can see the data going straight to disk due to the sync nature of NFS.
But I also see writes coming to a stand still every 10 seconds or so,
which I have attributed to the ZIL dumping to disk. Therefore I conclude
that it is the process of dumping the ZIL to disk that (mostly?) blocks
writes during the dumping.


The ZIL does not work like that. It is not a journal.

Under a typical write load, write transactions are batched and
written out in a transaction group (txg). This txg sync occurs
every 30s under light load but more frequently or continuously
under heavy load.

When writing synchronous data (eg NFS) the transactions get written immediately
to the intent log and are made stable. When the txg later commits the
intent log blocks containing those committed transactions can be
freed. So as you can see there is no periodic dumping of
the ZIL to disk. What you are probably observing is the periodic txg
commit.

Hope that helps: Neil. 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ssd for zil on a dell 2950

2009-08-22 Thread Neil Perrin



On 08/20/09 06:41, Greg Mason wrote:
Something our users do quite a bit of is untarring archives with a lot 
of small files. Also, many small, quick writes are also one of the many 
workloads our users have.


Real-world test: our old Linux-based NFS server allowed us to unpack a 
particular tar file (the source for boost 1.37) in around 2-4 minutes, 
depending on load. This machine wasn't special at all, but it had fancy 
SGI disk on the back end, and was using the Linux-specific async NFS 
option.


I'm glad you mentioned this option. It turns all synchronous requests
from the client into async allowing the server to immediately return
without making the data stable. This is the equivalent of setting zil_disable.
Async used to be the default behaviour. It must have been a shock to Linux
users when suddenly NFS slowed down when synchronous became the default!
I wonder what the perf numbers were without the async option.



We turned up our X4540s, and this same tar unpack took over 17 minutes! 
We disabled the ZIL for testing, and we dropped this to under 1 minute. 
With the X25-E as a slog, we were able to run this test in 2-4 minutes, 
same as the old storage.


That's pretty impressive. So with an X25-E slog ZFS is as fast synchronously as
your previous hardware was asynchronously - but with no risk of data 
corruption.
Of course the hardware is different so it's not really apples to apples.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Neil Perrin



On 08/07/09 10:54, Scott Meilicke wrote:
ZFS absolutely observes synchronous write requests (e.g. by NFS or a 
database). The synchronous write requests do not benefit from the 
long write aggregation delay so the result may not be written as 
ideally as ordinary write requests. Recently zfs has added support 
for using a SSD as a synchronous write log, and this allows zfs to 
turn synchronous writes into more ordinary writes which can be written 
more intelligently while returning to the user with minimal latency.


Bob, since the ZIL is used always, whether a separate device or not,
won't writes to a system without a separate ZIL also be written as
intelligently as with a separate ZIL?


- Yes. ZFS uses the same code path (intelligence?) to write out the data
from NFS - regardless of whether there's a separate log (slog) or not.



Thanks,
Scott

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How Virtual Box handles the IO

2009-07-31 Thread Neil Perrin

I understand  that the ZILs are allocated out of the general pool.


There is one intent log chain per dataset (file system or zvol).
The head of each log is kept in the main pool.
Without slog(s) we allocate (and chain) blocks from the
main pool. If separate intent log(s) exist then blocks are allocated
and chained there. If we fail to allocate from the
slog(s) then we revert to allocation from the main pool.


Is there a ZIL for the ZILs, or does this make no sense?


There is no ZIL for the ZILs. Note the ZIL is not a journal
(like ext3 or ufs logging). It simply contains records of
system calls (including data) that need to be replayed if the
system crashes and those records have not been committed
in a transaction group.
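
To make "a record of a system call" concrete, here is a purely illustrative
sketch of what such a record conceptually carries. This is NOT the real
layout from sys/zil.h, just a picture of the idea:

#include <stdint.h>

typedef struct illustrative_log_record {
        uint64_t        txtype;   /* which call: create, remove, write, ... */
        uint64_t        seq;      /* replay order within the log chain      */
        uint64_t        object;   /* the object (file) the call applied to  */
        uint64_t        offset;   /* for writes                             */
        uint64_t        length;   /* for writes                             */
        /* small writes embed their data; large ones reference a pool block */
} illustrative_log_record_t;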

Hope that helps: Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool iostat and iostat discrepancy

2009-06-20 Thread Neil Perrin

On 06/20/09 11:14, tester wrote:

Hi,

Does anyone know the difference between zpool iostat  and iostat?


dd if=/dev/zero of=/test/test1/trash count=1 bs=1024k;sync

zpool only shows 236K of IO and 13 write ops, whereas iostat correctly shows a meg 
of activity.


The zfs numbers are per second as well. So 236K * 5 = 1180K
Running 'zpool iostat -v test 1' would make this clearer.

The iostat output below also shows 237K (88+37+112) being written per second.
I'm not sure why any reads occurred though. When I did a quick
experiment there were no reads.

Enabling compression gives much better numbers when writing zeros!

Neil.



zpool iostat -v test 5

                                 capacity     operations    bandwidth
pool                            used  avail   read  write   read  write
------------------------------ -----  -----  -----  -----  -----  -----
test                           1.14M   100G      0     13      0   236K
  c8t60060E800475F50075F50525d0  182K  25.0G      0      4      0  36.8K
  c8t60060E800475F50075F50526d0  428K  25.0G      0      4      0  87.7K
  c8t60060E800475F50075F50540d0  558K  50.0G      0      4      0   111K
------------------------------ -----  -----  -----  -----  -----  -----

iostat -xnz [devices] 5

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    2.4    6.0    6.8   88.2  0.0  0.0    0.0    1.0   0   0 c8t60060E800475F50075F50540d0
    2.4    5.4    6.8   37.0  0.0  0.0    0.0    0.9   0   0 c8t60060E800475F50075F50526d0
    2.4    5.0    6.8  112.0  0.0  0.0    0.0    0.9   0   0 c8t60060E800475F50075F50525d0

dtrace also concurs with iostat
 
  device                                                  bytes   IOPS
  ====================================================  =======  =====
  /devices/scsi_vhci/s...@g60060e800475f50075f50525:a    224416     35
  /devices/scsi_vhci/s...@g60060e800475f50075f50526:a    486560     37
  /devices/scsi_vhci/s...@g60060e800475f50075f50540:a    608416     33

Thanks

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Degraded log device in zpool status output

2009-04-18 Thread Neil Perrin

Will,

This is bug:

6710376 log device can show incorrect status when other parts of pool are 
degraded

This is just an error in the reporting. There was nothing actually wrong with
the log device. It is picking up the degraded status from the rest of the pool.
The bug was fixed only yesterday and checked into snv_114.

Neil.

On 04/18/09 23:52, Will Murnane wrote:

I have a pool, huge, composed of one six-disk raidz2 vdev and a log
device.  I failed to plug in one disk when I took the machine down to
plug in the log device, and booted all the way before I realized this,
so the raidz2 vdev was rightly listed as degraded.  Then I brought the
machine down, plugged the disk in, and brought it back up.  I ran
zpool scrub huge to make sure that the missing disk was completely
synced.  After a few minutes, zpool status huge showed this:
$ zpool status huge
  pool: huge
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 0h8m, 1.19% done, 11h15m to go
config:

        NAME        STATE     READ WRITE CKSUM
        huge        DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c4t4d0  DEGRADED     0     0    15  too many errors
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
        logs        DEGRADED     0     0     0
          c7d1      ONLINE       0     0     0

errors: No known data errors

I understand that not all of the blocks may have been synced onto
c4t4d0 (the missing disk), so some checksum errors are normal there.
But the log disk reports no errors, and its sole component reports
none either, yet the log device is marked as degraded.  To see what
would happen, I executed this:
$ pfexec zpool clear huge c4t4d0
$ zpool status huge
  pool: huge
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 0h12m, 1.87% done, 10h32m to go
config:

        NAME        STATE     READ WRITE CKSUM
        huge        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     2
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
        logs        ONLINE       0     0     0
          c7d1      ONLINE       0     0     0

errors: No known data errors

So clearing the errors from one device has an effect on the status of
another device?  Is this expected behavior, or is something wrong with
my log device?  I'm running snv_111.

Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great

2009-04-10 Thread Neil Perrin



On 04/10/09 20:15, Toby Thain wrote:


On 10-Apr-09, at 5:05 PM, Mark J Musante wrote:


On Fri, 10 Apr 2009, Patrick Skerrett wrote:

degradation) when these write bursts come in, and if I could buffer 
them even for 60 seconds, it would make everything much smoother.


ZFS already batches up writes into a transaction group, which 
currently happens every 30 seconds.


Isn't that 5 seconds?


It used to be, and it may still be for what you are running.
However, Mark is right, it is now 30 seconds. In fact 30s is
the maximum. The actual time will depend on load. If the pool
is heavily used then the txgs fire more frequently.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great

2009-04-09 Thread Neil Perrin

Patrick,

The ZIL is only used for synchronous requests like O_DSYNC/O_SYNC and
fsync(). Your iozone command must be doing some synchronous writes.
All the other tests (dd, cat, cp, ...) do everything asynchronously.
That is they do not require the data to be on stable storage on
return from the write. So asynchronous writes get cached in memory
(the ARC) and written out periodically (every 30 seconds or less)
when the transaction group commits.

The ZIL would be heavily used if your system were a NFS server.
Databases also do synchronous writes.
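
To make the distinction concrete, here is a minimal sketch - not from the original
thread, and the file names are made up - contrasting an asynchronous write, which
only dirties the ARC, with a synchronous one, which the ZIL must commit to stable
storage before the call returns:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[8192];
            int fd;

            memset(buf, 'x', sizeof (buf));

            /* Asynchronous: data is cached in the ARC and reaches disk
             * whenever the transaction group commits. */
            fd = open("/fireball/async.dat", O_WRONLY | O_CREAT, 0644);
            if (fd == -1) { perror("open"); exit(1); }
            (void) write(fd, buf, sizeof (buf));
            (void) close(fd);

            /* Synchronous: O_DSYNC (or an explicit fsync) forces the ZIL
             * to put the data on stable storage before returning. */
            fd = open("/fireball/sync.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
            if (fd == -1) { perror("open"); exit(1); }
            (void) write(fd, buf, sizeof (buf));
            (void) fsync(fd);   /* redundant with O_DSYNC; shown for contrast */
            (void) close(fd);

            return (0);
    }

Only the second half generates ZIL activity, which is why a benchmark that issues
synchronous writes lights up the slog while plain dd/cat/cp do not.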

Neil.

On 04/09/09 15:13, Patrick Skerrett wrote:

Hi folks,

I would appreciate it if someone can help me understand some weird 
results I'm seeing with trying to do performance testing with an SSD 
offloaded ZIL.



I'm attempting to improve my infrastructure's burstable write capacity 
(ZFS based WebDav servers), and naturally I'm looking at implementing 
SSD based ZIL devices.
I have a test machine with the crummiest hard drive I can find installed 
in it, Quantum Fireball ATA-100 4500RPM 128K cache, and an Intel X25-E 
32gig SSD drive.
I'm trying to do A-B comparisons and am coming up with some very odd 
results:


The first test involves doing IOZone write testing on the fireball 
standalone, the SSD standalone, and the fireball with the SSD as a log 
device.


My test command is:  time iozone -i 0 -a -y 64 -q 1024 -g 32M

Then I check the time it takes to complete this operation in each scenario:

Fireball alone - 2m15s (told you it was crappy)
SSD alone - 0m3s
Fireball + SSD zil - 0m28s

This looks great! Watching 'zpool iostat-v' during this test further 
proves that the ZIL device is doing the brunt of the heavy lifting 
during this test. If I can get these kind of write results in my prod 
environment, I would be one happy camper.




However, ANY other test I can think of to run on this test machine shows 
absolutely no performance improvement of the Fireball+SSD Zil over the 
Fireball by itself. Watching zpool iostat -v shows no activity on the 
ZIL at all whatsoever.

Other tests I've tried to run:

A scripted batch job of 10,000 -
dd if=/dev/urandom of=/fireball/file_$i.dat bs=1k count=1000

A scripted batch job of 10,000 -
cat /sourcedrive/$file > /fireball/$file

A scripted batch job of 10,000 -
cp /sourcedrive/$file /fireball/$file

And a scripted batch job moving 10,000 files onto the fireball using 
Apache Webdav mounted on the fireball (similar to my prod environment):

curl -T /sourcedrive/$file http://127.0.0.1/fireball/




So what is IOZone doing differently than any other write operation I can 
think of???



Thanks,

Pat S.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Neil Perrin

I'd like to correct a few misconceptions about the ZIL here.

On 03/06/09 06:01, Jim Dunham wrote:
ZFS the filesystem is always on disk consistent, and ZFS does maintain 
filesystem consistency through coordination between the ZPL (ZFS POSIX 
Layer) and the ZIL (ZFS Intent Log).


Pool and file system consistency is more a function of the DMU & SPA.

Unfortunately for SNDR, ZFS caches 
a lot of an applications filesystem data in the ZIL, therefore the data 
is in memory, not written to disk,


ZFS data is actually cached in the ARC. The ZIL code keeps in-memory records
of system call transactions in case a fsync() occurs.

so SNDR does not know this data 
exists. ZIL flushes to disk can be seconds behind the actual application 
writes completing,


It's the DMU/SPA that handles the transaction group commits (not the ZIL).
Currently these occur every 30 seconds, or more frequently on a loaded system.

and if SNDR is running asynchronously, these 
replicated writes to the SNDR secondary can be additional seconds behind 
the actual application writes.


Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no 
'supported' way to get ZFS to empty the ZIL to disk on demand.


The sync(2) system call is implemented differently in ZFS.
For UFS it initiates a flush of cached data to disk, but does
not wait for completion. This satisfies the POSIX requirement but
never seemed right. For ZFS we wait for all transactions
to complete and commit to stable storage (including flushing any
disk write caches) before returning. So any asynchronous data
in the ARC is written.

Alternatively, a lockfs will flush just a file system to stable storage
but in this case just the intent log is written. (Then later when
the txg commits those intent log records are discarded).
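
(A hedged example of that, with a placeholder mount point:

    # lockfs -f /tank/myfs

This flushes just that one file system, whereas sync(2) covers everything.)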

For some basic info on the ZIL see:
http://blogs.sun.com/perrin/entry/the_lumberjack

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Neil Perrin



On 03/06/09 08:10, Jim Dunham wrote:

Andrew,


Jim Dunham wrote:
ZFS the filesystem is always on disk consistent, and ZFS does 
maintain filesystem consistency through coordination between the ZPL 
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for 
SNDR, ZFS caches a lot of an applications filesystem data in the ZIL, 
therefore the data is in memory, not written to disk, so SNDR does 
not know this data exists. ZIL flushes to disk can be seconds behind 
the actual application writes completing, and if SNDR is running 
asynchronously, these replicated writes to the SNDR secondary can be 
additional seconds behind the actual application writes.


Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no 
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?


It is my understanding that the ZFS intent log (ZIL) satisfies POSIX 
requirements for synchronous transactions,


True.


thus filesystem consistency.


No. The filesystems in the pool are always consistent with or without
the ZIL.  The ZIL is not the same as a journal (or the log in UFS).

The ZFS adaptive replacement cache (ARC) is where uncommitted filesystem 
data is being cached. So although unwritten filesystem data is allocated 
from the DMU and retained in the ARC, it is the ZIL which influences 
filesystem metadata and data consistency on disk.


No. It just ensures the synchronous requests (O_DSYNC, fsync() etc)
are on stable storage in case a crash/power fail occurs before
the dirty ARC is written when the txg commits.



In either case, creating a snapshot should get both flushed to disk, I 
think?


No. A ZFS snapshot is a control path, versus data path, operation and (to 
the best of my understanding, and testing) has no influence over POSIX 
filesystem consistency. See the discussion here: 
http://www.opensolaris.org/jive/click.jspa?searchID=1695691&messageID=124809


Invoking a ZFS snapshot will assure the ZFS snapshot is consistent on 
the replicated disk, but not all actively opened files.


A simple test I performed to verify this, was to append to a ZFS file 
(no synchronous filesystem options being set) a series of blocks with a 
block order pattern contained within. At some random point in this 
process, I took a ZFS snapshot, immediately dropped SNDR into logging 
mode. When importing the ZFS storage pool on the SNDR remote host, I 
could see the ZFS snapshot just taken, but neither the snapshot version 
of the file, or the file itself contained all of the data previously 
written to it.


That seems like a bug in ZFS to me. A snapshot ought to contain all data
that has been written (whether synchronous or asynchronous) prior to the 
snapshot.



I then retested, but opened the file with O_DSYNC, and when following 
the same test steps above, both the snapshot version of the file, and 
the file itself contained all of the data previously written to it.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Neil Perrin



On 03/06/09 14:51, Miles Nordin wrote:

np == Neil Perrin neil.per...@sun.com writes:


np Alternatively, a lockfs will flush just a file system to
np stable storage but in this case just the intent log is
np written. (Then later when the txg commits those intent log
np records are discarded).

In your blog it sounded like there's an in-RAM ZIL through which
_everything_ passes, and parts of this in-RAM ZIL are written to the
on-disk ZIL as needed.


Thats correct.


so maybe I was using the word ZIL wrongly in
my last post.


I understood what you meant.



are you saying, lockfs will divert writes that would normally go
straight to the pool, to pass through the on-disk ZIL instead?


- Not instead but as well. The ZIL (code) will write immediately
to the stable intent logs, then later the data cached in the ARC
will be written as part of the pool transaction group (txg).
As soon as that happens the intent log blocks can be re-used.



assuming any separate slog isn't destroyed while the power's off,
lockfs and sync should get you the same end result after an unclean
shutdown, right?


Right.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Neil Perrin

Having a separate intent log on good hardware will not prevent corruption
on a pool with bad hardware. By good I mean hardware that correctly
flushes its write caches when requested.

Note, a pool is always consistent (again when using good hardware).
The function of the intent log is not to provide consistency (like a journal),
but to speed up synchronous requests like fsync and O_DSYNC.

Neil.

On 02/13/09 06:29, Jiawei Zhao wrote:

While mobility could be lost, USB storage still has the advantage of being cheap
and easy to install compared to installing internal disks in a PC. So if I just want to
use it to provide ZFS storage space for a home file server, can a small intent log
located on an internal SATA disk prevent the pool corruption caused by a power cut?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris and zfs versions

2009-02-12 Thread Neil Perrin

Mark,

I believe creating a older version pool is supported:

zpool create -o version=vers whirl c0t0d0

I'm not sure what version of ZFS in Solaris 10 you are running.
Try running zpool upgrade and replacing vers above with that version number.

Neil.

: trasimene ; zpool create -o version=11 whirl c0t0d0
: trasimene ; zpool get version whirl
NAME   PROPERTY  VALUE    SOURCE
whirl  version   11       local
: trasimene ; zpool upgrade
This system is currently running ZFS pool version 14.

The following pools are out of date, and can be upgraded.  After being
upgraded, these pools will no longer be accessible by older software versions.

VER  POOL
---  
11   whirl

Use 'zpool upgrade -v' for a list of available versions and their associated
features.
: trasimene ;


On 02/12/09 11:42, Mark Winder wrote:
We’ve been experimenting with zfs on OpenSolaris 2008.11.  We created a 
pool in OpenSolaris and filled it with data.  Then we wanted to move it 
to a production Solaris 10 machine (generic_137138_09) so I “zpool 
exported” in OpenSolaris, moved the storage, and “zpool imported” in 
Solaris 10.  We got:



Cannot import ‘deadpool’: pool is formatted using a newer ZFS version


We would like to be able to move pools back and forth between the OS’s.  
Is there a way we can upgrade Solaris 10 to the same supported zfs 
version (or create downgraded pools in OpenSolaris)?



Thanks!

Mark

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Max size of log device?

2009-02-08 Thread Neil Perrin


On 02/08/09 11:50, Vincent Fox wrote:
 So I have read in the ZFS Wiki:
 
 #  The minimum size of a log device is the same as the minimum size of device in
 pool, which is 64 Mbytes. The amount of in-play data that might be stored on a log
 device is relatively small. Log blocks are freed when the log transaction (system call)
 is committed.
 # The maximum size of a log device should be approximately 1/2 the size of physical
 memory because that is the maximum amount of potential in-play data that can be stored.
 For example, if a system has 16 Gbytes of physical memory, consider a maximum log device
 size of 8 Gbytes.
 
 What is the downside of an over-large log device?

- Wasted disk space.

 
 Let's say I have  a 3310 with 10 older 72-gig 10K RPM drives and RAIDZ2 them.
 Then I throw an entire 72-gig 15K RPM drive in as slog.
 
 What is behind this maximum size recommendation?

- Just guidance on what might be used in the most stressed environment.
Personally I've never seen anything like the maximum used but it's
theoretically possible. 

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS core contributor nominations

2009-02-02 Thread Neil Perrin
Looks reasonable
+1

Neil.

On 02/02/09 08:55, Mark Shellenbaum wrote:
 The time has come to review the current Contributor and Core contributor 
 grants for ZFS.  Since all of the ZFS core contributors grants are set 
 to expire on 02-24-2009 we need to renew the members that are still 
 contributing at core contributor levels.   We should also add some new 
 members to both Contributor and Core contributor levels.
 
 First the current list of Core contributors:
 
 Bill Moore (billm)
 Cindy Swearingen (cindys)
 Lori M. Alt (lalt)
 Mark Shellenbaum (marks)
 Mark Maybee (maybee)
 Matthew A. Ahrens (ahrens)
 Neil V. Perrin (perrin)
 Jeff Bonwick (bonwick)
 Eric Schrock (eschrock)
 Noel Dellofano (ndellofa)
 Eric Kustarz (goo)*
 Georgina A. Chua (chua)*
 Tabriz Holtz (tabriz)*
 Krister Johansen (johansen)*
 
 All of these should be renewed at Core contributor level, except for 
 those with a *.  Those with a * are no longer involved with ZFS and 
 we should let their grants expire.
 
 I am nominating the following to be new Core Contributors of ZFS:
 
 Jonathan W. Adams (jwadams)
 Chris Kirby
 Lin Ling
 Eric C. Taylor (taylor)
 Mark Musante
 Rich Morris
 George Wilson
 Tim Haley
 Brendan Gregg
 Adam Leventhal
 Pawel Jakub Dawidek
 Ricardo Correia
 
 For Contributor I am nominating the following:
 Darren Moffat
 Richard Elling
 
 I am voting +1 for all of these (including myself)
 
 Feel free to nominate others for Contributor or Core Contributor.
 
 
 -Mark
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache and cache flush

2009-01-29 Thread Neil Perrin


On 01/29/09 21:32, Greg Mason wrote:
 This problem only manifests itself when dealing with many small files 
 over NFS. There is no throughput problem with the network.
 
 I've run tests with the write cache disabled on all disks, and the cache 
 flush disabled. I'm using two Intel SSDs for ZIL devices.
 
 This setup is faster than using the two Intel SSDs with write caches 
 enabled on all disks, and with the cache flush enabled.
 
 My test would run around 3.5 to 4 minutes, now it is completing in 
 about 2.5 minutes. I still think this is a bit slow, but I still have 
 quite a bit of testing to perform. I'll keep the list updated with my 
 findings.
 
 I've already established both via this list and through other research 
 that ZFS has performance issues over NFS when dealing with many small 
 files. This seems to maybe be an issue with NFS itself, where 
 NVRAM-backed storage is needed for decent performance with small files. 
 Typically such an NVRAM cache is supplied by a hardware raid controller 
 in a disk shelf.
 
 I find it very hard to explain to a user why an upgrade is a step down 
 in performance. For the users these Thors are going to serve, such a 
 drastic performance hit is a deal breaker...

Perhaps I missed something, but what was your previous setup?
I.e. what did you upgrade from? 

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lackluster ZFS performance trials using various ZIL and L2ARC configurations...

2009-01-16 Thread Neil Perrin
I don't believe that iozone does any synchronous calls (fsync/O_DSYNC/O_SYNC),
so the ZIL and separate logs (slogs) would be unused.

I'd recommend performance testing by configuring filebench to
do synchronous writes:

http://opensolaris.org/os/community/performance/filebench/

Neil.

On 01/15/09 00:36, Gray Carper wrote:
 Hey, all!
 
 Using iozone (with the sequential read, sequential write, random read, 
 and random write categories), on a Sun X4240 system running OpenSolaris 
 b104 (NexentaStor 1.1.2, actually), we recently ran a number of relative 
 performance tests using a few ZIL and L2ARC configurations (meant to try 
 and uncover which configuration would be the best choice). I'd like to 
 share the highlights with you all (without bogging you down with raw 
 data) to see if anything strikes you.
 
 Our first (baseline) test used a ZFS pool which had a self-contained ZIL 
 and L2ARC (i.e. not moved to other devices, the default configuration). 
 Note that this system had both SSDs and SAS drives attached to the 
 controller, but only the SAS drives were in use.
 
 In the second test, we rebuilt the ZFS pool with the ZIL on a 32GB SSD 
 and the L2ARC on four 146GB SAS drives. Random reads were significantly 
 worse than the baseline, but all other categories were slightly better.
 
 In the third test, we rebuilt the ZFS pool with the ZIL on a 32GB SSD 
 and the L2ARC on four 80GB SSDs. Sequential reads were better than the 
 baseline, but all other categories were worse.
 
 In the fourth test, we rebuilt the ZFS pool with no separate ZIL, but 
 with the L2ARC on four 146GB SAS drives. Random reads were significantly 
 worse than the baseline and all other categories were about the same as 
 the baseline.
 
 As you can imagine, we were disappointed. None of those configurations 
 resulted in any significant improvements, and all of the configurations 
 resulted in at least one category being worse. This was very much not 
 what we expected.
 
 For the sake of sanity checking, we decided to run the baseline case 
 again (ZFS pool which had a self-contained ZIL and L2ARC), but this time 
 remove the SSDs completely from the box. Amazingly, the simple presence 
 of the SSDs seemed to be a negative influence - the new SSD-free test 
 showed improvement in every single category when compared to the 
 original baseline test.
 
 So, this has led us to the conclusion that we shouldn't be mixing SSDs 
 with SAS drives on the same controller (at least, not the controller we 
 have in this box). Has anyone else seen problems like this before that 
 might validate that conclusion? If so, we think we should probably build 
 an SSD JBOD, hook it up to the box, and re-run the tests. This leads us 
 to another question: Does anyone have any recommendations for 
 SSD-performant controllers that have great OpenSolaris driver support?
 
 Thanks!
 -Gray
 -- 
 Gray Carper
 MSIS Technical Services
 University of Michigan Medical School
 gcar...@umich.edu mailto:gcar...@umich.edu  |  skype:  graycarper  | 
  734.418.8506
 http://www.umms.med.umich.edu/msis/
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs panic

2009-01-13 Thread Neil Perrin
I'm sorry about the problems. We try to be responsive to fixing bugs and
implementing new features that people are requesting for ZFS.
It's not always possible to get it right. In this instance I don't think the
bug was reproducible, and perhaps that's why it hasn't received the attention
it deserves. As far as I know yours is the second reported instance.
It may be that the problem has been fixed and that's why
we haven't seen it in-house. However, that's just speculation, and
some serious investigation is needed.

Neil.

On 01/13/09 06:39, Krzys wrote:
 To be honest I am quite surprised, as the bug you are referring to was submitted 
 early in 2008 and last updated over the summer. Quite surprised that Sun did not 
 come up with a fix for it so far. ZFS is certainly gaining some popularity at my 
 workplace, and we were thinking of using it instead of Veritas, but I am not sure 
 what to do with it now.. what if we have systems that we quite depend on and we 
 hit a similar issue? How could we solve it? Is calling Sun support going to help 
 in such a case? This particular system is my playground and I do not care about 
 it to that extent, but if I had another system of much greater importance and I 
 got into such a situation it would be quite scary... :(
 
 On Mon, 12 Jan 2009, Neil Perrin wrote:
 
 This is a known bug:

 6678070 Panic from vdev_mirror_map_alloc()
 http://bugs.opensolaris.org/view_bug.do?bug_id=6678070

 Neil.

 On 01/12/09 21:12, Krzys wrote:
 any idea what could cause my system to panic? I get my system rebooted 
 daily at various times. very strange, but its pointing to zfs. I have U6 
 with all latest patches.


 Jan 12 05:47:12 chrysek unix: [ID 836849 kern.notice]
 Jan 12 05:47:12 chrysek ^Mpanic[cpu1]/thread=30002c8d4e0:
 Jan 12 05:47:12 chrysek unix: [ID 799565 kern.notice] BAD TRAP: type=28 
 rp=2a10285c790 addr=7b76a0a8 mmu_fsr=0
 Jan 12 05:47:12 chrysek unix: [ID 10 kern.notice]
 Jan 12 05:47:12 chrysek unix: [ID 839527 kern.notice] zfs:
 ...
 ...
 ...
 374706 pages dumped, compression ratio 3.50,
 Jan 12 05:48:51 chrysek genunix: [ID 851671 kern.notice] dump succeeded
 Jan 12 05:49:40 chrysek genunix: [ID 540533 kern.notice] ^MSunOS Release 
 5.10 Version Generic_13-02 64-bit
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Several questions concerning ZFS

2009-01-12 Thread Neil Perrin


On 01/12/09 20:45, Simon wrote:
 Hi Experts,
 
 IHAC who is using Solaris 10 + ZFS; two questions they're concerned about:
 
 - The ZIL (ZFS intent log) is enabled by default for ZFS. There is varied
 storage purchased by the customer (such as EMC CX/DMX series, HDS AMS/USP
 series, etc.); the customer wonders whether there is any impact on this
 storage if the ZIL is enabled, and if so, what are the negative factors?

As far as I know there hasn't been any performance reports comparing
various devices specifically for ZFS or the ZIL. Richard Elling may know?

 
 - Under what circumstances should I disable the ZIL?

- It's not recommended to ever disable it! It was originally added as a
switch to allow the new ZIL to be disabled if it proved unstable. It should
have been removed shortly afterwards. If the system loses power or crashes
then some recent synchronous changes that were claimed to be safely on disk
might not be. If you know this then I suppose you could take advantage of the
speed and redo the recent changes. For instance, it has been suggested that 
Solaris
binaries be built with the ZIL disabled. If the system crashes then the build
would be started again from scratch. Panics are sufficiently rare and the build
time can be cut significantly (eg 30%). However, these are somewhat
contrived circumstances and I wouldn't recommend ever configuring a customer's
system with the ZIL disabled.

A safer option is to turn off disk write cache flushing (set 
zfs:zfs_nocacheflush=1).
This should only be done if it's known *all* zpool devices are non-volatile.
This has almost the same performance effect as disabling the ZIL, as it's
the actual writing of the bits to the rotating rust that takes most of the time.
This also helps speed up other writes - ie committing transaction groups.
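
For reference, the two common ways this was set (a hedged sketch from the tuning
guidance of the time; only apply it when *all* pool devices have non-volatile caches):

    * in /etc/system, takes effect at the next boot
    set zfs:zfs_nocacheflush = 1

    # on a live system via mdb (assumes the variable name is unchanged)
    echo zfs_nocacheflush/W0t1 | mdb -kw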

 
 - If the devices used by the zpool come from the above-listed external storage
 (EMC or HDS), what LUN size is suggested?
 As current practice, the customer uses 100G for EMC CX, 55G/110G for EMC DMX,
 52G for HDS USP V, and 100G for HDS AMS as the LUN size; the filesystem over
 the LUNs is UFS.

Sorry - I don't know.

 
 Any reply are much appreciated,thanks in advance.
 
 Best Rgds,
 Simon
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs panic

2009-01-12 Thread Neil Perrin
This is a known bug:

6678070 Panic from vdev_mirror_map_alloc()
http://bugs.opensolaris.org/view_bug.do?bug_id=6678070

Neil.

On 01/12/09 21:12, Krzys wrote:
 any idea what could cause my system to panic? I get my system rebooted daily 
 at 
 various times. very strange, but its pointing to zfs. I have U6 with all 
 latest 
 patches.
 
 
 Jan 12 05:47:12 chrysek unix: [ID 836849 kern.notice]
 Jan 12 05:47:12 chrysek ^Mpanic[cpu1]/thread=30002c8d4e0:
 Jan 12 05:47:12 chrysek unix: [ID 799565 kern.notice] BAD TRAP: type=28 
 rp=2a10285c790 addr=7b76a0a8 mmu_fsr=0
 Jan 12 05:47:12 chrysek unix: [ID 10 kern.notice]
 Jan 12 05:47:12 chrysek unix: [ID 839527 kern.notice] zfs:
 Jan 12 05:47:12 chrysek unix: [ID 983713 kern.notice] integer divide zero 
 trap:
 Jan 12 05:47:12 chrysek unix: [ID 381800 kern.notice] addr=0x7b76a0a8
 Jan 12 05:47:12 chrysek unix: [ID 101969 kern.notice] pid=18941, 
 pc=0x7b76a0a8, 
 sp=0x2a10285c031, tstate=0x4480001606, context=0x1
 Jan 12 05:47:12 chrysek unix: [ID 743441 kern.notice] g1-g7: 7b76a07c, 1, 0, 
 0, 
 241b2a, 16, 30002c8d4e0
 Jan 12 05:47:12 chrysek unix: [ID 10 kern.notice]
 Jan 12 05:47:12 chrysek genunix: [ID 723222 kern.notice] 02a10285c4b0 
 unix:die+9c (28, 2a10285c790, 7b76a0a8, 0, 2a10285c570, 1)
 Jan 12 05:47:12 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 000a 0028 000a 0801
 Jan 12 05:47:12 chrysek   %l4-7: 02a10285cd18 02a10285cd3c 
 0006 0109a000
 Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c590 
 unix:trap+644 (2a10285c790, 1, 0, 0, 180c000, 30002c8d4e0)
 Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
  06002c5b9130 0028 0600118fa088
 Jan 12 05:47:13 chrysek   %l4-7:  00db 
 004480001606 00010200
 Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c6e0 
 unix:ktl0+48 (0, 70021d50, 349981, 180c000, 10394e8, 2a10285c8e8)
 Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 0007 1400 004480001606 0101bedc
 Jan 12 05:47:13 chrysek   %l4-7: 0600110bd630 0600110be400 
  02a10285c790
 Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c830 
 zfs:spa_get_random+c (0, 0, d15c4746ef9ddd65, 0, , 8)
 Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 01ff 7b772a00 000e 
 Jan 12 05:47:13 chrysek   %l4-7: 00020801 ee00 
 060031b23680 
 Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c8f0 
 zfs:vdev_mirror_map_alloc+b8 (60012ec20e0, 30006a9a3c8, 1, 30006a9a370, 0, 
 ff)
 Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
    
 Jan 12 05:47:13 chrysek   %l4-7:   
  0600112cc080
 Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285c9a0 
 zfs:vdev_mirror_io_start+4 (30006a9a370, 0, 0, 30006a9a3c8, 0, 7b772bc4)
 Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
  0001  7b7a4688
 Jan 12 05:47:14 chrysek   %l4-7: 7b7a4400  
  
 Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285ca80 
 zfs:zio_execute+74 (30006a9a370, 7b783f70, 78, f, 1, 70496c00)
 Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 030083edb728 00c44002 00038000 70496d88
 Jan 12 05:47:14 chrysek   %l4-7: 00efc006  
 0801 8000
 Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285cb30 
 zfs:arc_read+724 (1, 600112cc080, 30075baba00, 200, 0, 300680b9288)
 Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 0001 70496060 0006 0801
 Jan 12 05:47:14 chrysek   %l4-7: 02a10285cd18  
 030083edb728 02a10285cd3c
 Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285cc40 
 zfs:dbuf_prefetch+13c (60035ce1050, 70496c00, 30075baba00, 0, 0, 3007578b0a0)
 Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 000a 0028 000a 0801
 Jan 12 05:47:14 chrysek   %l4-7: 02a10285cd18 02a10285cd3c 
 0006 
 Jan 12 05:47:15 chrysek genunix: [ID 723222 kern.notice] 02a10285cd50 
 zfs:dmu_zfetch_fetch+2c (60035ce1050, 8b67, 100, 100, cd, 8c34)
 Jan 12 05:47:15 chrysek genunix: [ID 179002 kern.notice]   %l0-3: 
 7049d098 4000 7049d000 7049d188
 Jan 12 05:47:15 chrysek   %l4-7: 06d8 00db 
 7049d178 

Re: [zfs-discuss] Problems at 90% zpool capacity 2008.05

2009-01-06 Thread Neil Perrin


On 01/06/09 21:25, Nicholas Lee wrote:
 Since zfs is so smart in other areas, is there a particular reason why a 
 high water mark is not calculated and the available space not reset to this?
 
 I'd far rather have a zpool of 1000GB that said it only had 900GB but 
 did not have corruption as it ran out of space.
 
 Nicholas

Is there any evidence of corruption at high capacity or just
a lack of performance? All file systems will slow down when
near capacity, as they struggle to find space and then have to
spread writes over the disk. Our priorities are integrity first,
followed somewhere by performance.

I vaguely remember a time when UFS had limits to prevent
ordinary users from consuming past a certain limit, allowing
only the super-user to use it. Not that I'm advocating that
approach for ZFS.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs_nocacheflush, nvram, and root pools

2008-12-03 Thread Neil Perrin
On 12/02/08 03:47, River Tarnell wrote:
 hi,
 
 i have a system connected to an external DAS (SCSI) array, using ZFS.  the
 array has an nvram write cache, but it honours SCSI cache flush commands by
 flushing the nvram to disk.  the array has no way to disable this behaviour.  
 a
 well-known behaviour of ZFS is that it often issues cache flush commands to
 storage in order to ensure data integrity; while this is important with normal
 disks, it's useless for nvram write caches, and it effectively disables the
 cache.
 
 so far, i've worked around this by setting zfs_nocacheflush, as described at
 [1], which works fine.  but now i want to upgrade this system to Solaris 10
 Update 6, and use a ZFS root pool on its internal SCSI disks (previously, the
 root was UFS).  the problem is that zfs_nocacheflush applies to all pools,
 which will include the root pool.
 
 my understanding of ZFS is that when run on a root pool, which uses slices
 (instead of whole disks), ZFS won't enable the write cache itself.  i also
 didn't enable the write cache manually.  so, it _should_ be safe to use
 zfs_nocacheflush, because there is no caching on the root pool.
 
 am i right, or could i encounter problems here?

Yes you are right and this should work. You may want to check that
the write cache is disabled on the root pool disks
using 'format -e' + cache + write_cache + display.
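
Roughly, the interactive path looks like this (menu names from memory, so treat
it as a hedged sketch):

    # format -e
      ... select the disk by number ...
    format> cache
    cache> write_cache
    write_cache> display
    Write Cache is disabled
    write_cache> disable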

 
 (the system is an NFS server, which means lots of synchronous writes (and
 therefore ZFS cache flushes), so i *really* want the performance benefit from
 using the nvram write cache.)

Indeed, performance would be bad without it.

 
   - river.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs znode changes getting lost

2008-11-26 Thread Neil Perrin
I suspect ZFS is unaware that anything has changed in the
z_phys so it never gets written out. You probably need
to create a dmu transaction and call dmu_buf_will_dirty(zp->z_dbuf, tx);
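
For concreteness, a hedged sketch of that pattern, loosely modeled on how other
znode updates are wrapped in the ZFS source of this era (the zfsvfs and error
names are assumed from the surrounding ioctl code and may differ):

    dmu_tx_t *tx;
    int error;

    tx = dmu_tx_create(zfsvfs->z_os);
    dmu_tx_hold_bonus(tx, zp->z_id);     /* the znode lives in the dnode bonus buffer */
    error = dmu_tx_assign(tx, TXG_WAIT);
    if (error) {
            dmu_tx_abort(tx);
            return (error);
    }
    dmu_buf_will_dirty(zp->z_dbuf, tx);  /* mark the buffer dirty in this transaction */
    zp->z_phys->new_field = 2;
    dmu_tx_commit(tx);                   /* written out with the next txg */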

Neil.

On 11/26/08 03:36, shelly wrote:
 In place of the padding in the zfs znode I added a new field, stored an integer 
 value, and am able to see the saved information. 
 
 But after a reboot it is not there.  Since I was able to access it before the 
 reboot, it must be in memory. I think I need to save it to disk.
 How does one force a zfs znode to disk?
 Right now I don't do anything special for it. I just made an ioctl, accessed the 
 znode and made changes.
 
 example in zfs_ioctl
 
 case add_new:
         zp = VTOZ(vp);
         zp->z_phys->new_field = 2;
         return (0);
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL performance on traditional HDDs

2008-11-20 Thread Neil Perrin


On 11/20/08 12:52, Danilo Poccia wrote:
 Hi,
 
 I was wondering is there is a performance gain for an OLTP-like workload 
 in putting the ZFS Intent Log (ZIL) on traditional HDDs.
 

It's probably always best to benchmark it yourself, but my
experience has shown that it's better to only have a separate log
when the log devices are faster. Without a separate log (slog) the log
is allocated dynamically from the pool and at a location where
current allocations are happening for transaction groups.
So there is little head movement needed to write the log.
There may be a problem when the pool is very full and fragmented
as log block allocation will be all over the place and seek latency will
be high. However, this is a problem for the whole pool.
Also, the log is spread across the pool devices so the more devices
in the pool the faster the intent log can be written when the load is heavy.

Hope that helps: Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] s10u6--will using disk slices for zfs logs improve nfs performance?

2008-11-13 Thread Neil Perrin
I wouldn't expect any improvement using a separate disk slice for the Intent Log
unless that disk was much faster and was otherwise largely idle. If it was heavily
used then I'd expect quite the performance degradation as the disk head bounces
around between slices. Separate intent logs are really recommended for fast
devices (SSDs or NVRAM).

When you're comparing against UFS is the write cache disabled (use format -e)?
Otherwise UFS is unsafe. 

To get a apples to apples perf comparison, you can compare either:

Safe mode
-
ZFS with default settings (zil_disable=0  zfs_nocacheflush=0)
against UFS with write cache disabled. Ie the safe mode.

Unsafe mode - unless device is volatile.
---
ZFS with zil_disable=0  zfs_nocacheflush=1
against UFS with write cache enabled.

From my reading of one your comparisons, ZFS takes 10s vs 15s for UFS
(unsafe mode)

Neil.

On 11/13/08 16:23, Doug wrote:
 I've got an X4500/thumper that is mainly used as an NFS server.
 
 It has been discussed in the past that NFS performance with ZFS can be slow 
 (when running tar to expand an archive with lots of files, for example.)  
 My understanding is the reason that zfs/nfs is slow in this case is because 
 it is doing the correct/safe thing of waiting for the files to be written 
 to disk.
 
 I can (and have) improved nfs/zfs performance by about 15x by adding set 
 zfs:zil_disable=1 or zfs:zfs_nocacheflush=1 to /etc/system but this is 
 unsafe (though a common workaround?)
 
 But, I have never understood why zfs/nfs is so much slower than ufs/nfs in 
 the case of expanding a tar archive.  Is ufs/nfs not properly committing the 
 data to disk?
 
 Anyway, with the just released Solaris 10 10/08, zpool has been upgraded to 
 version 10 which includes option of using a separate storage device for the 
 ZIL.
 It had been my impression that you would need to use an flash disk/SSD to 
 store the ZIL to improve performance, but Richard Elling mentioned in a 
 earlier post that you could use a regular disk slice for this also (see 
 http://www.opensolaris.org/jive/thread.jspa?threadID=80213tstart=15)
 
 On an X4500 server, I had a zpool of 8 disks arranged in RAID 10.  I 
 installed a flash archive of s10u6 on the server then ran zpool upgrade.  
 Next, I used 
 zpool add log to add a 50GB slice on the boot disk for the zfs intent log.  
 
 But, I didn't see any improvement in NFS performance in running gtar zxf 
 Python-2.5.2.tgz (Python language source code)  It took 0.6sec to run on the 
 local system (no NFS) and 2min20sec over NFS.  If I disable the ZIL, the 
 command runs in about 10sec on the NFS client.  (It runs in about 15 seconds 
 over NFS to a UFS slice on the NFS server.)  The separate intent log didn't 
 seem to do anything in this case.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DNLC and ARC

2008-10-30 Thread Neil Perrin
Leal,

ZFS uses the DNLC. It still provides the fastest lookup of directory, name to 
vnode.
The DNLC is kind of LRU. An async process will use a rotor to move
through the hash chains and select the LRU entry but will select first
negative cache entries and vnodes only referenced by the DNLC.
Underlying this ZFS uses the ZAP and Fat ZAP to store the mappings.

ZFS does not use the 2nd level DNLC which allows caching of directories.
This is only used by UFS to avoid a linear search of large directories.
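
The tunables are the generic Solaris ones rather than anything ZFS-specific; a
hedged example (the value is arbitrary):

    * /etc/system - enlarge the DNLC
    set ncsize = 262144

    # then watch hit/miss behaviour
    kstat -n dnlcstats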

Neil.

On 10/30/08 04:50, Marcelo Leal wrote:
 Hello,
  In ZFS the DNLC concept is gone, or is in ARC too? I mean, all the cache in 
 ZFS is ARC right? 
  I was thinking if we can tune the DNLC in ZFS like in UFS.. if we have too 
 *many* files and
  directories, i guess we can have a better performance having all the 
 metadata cached, and that
  is even more important in NFS operations.  
  DNLC is LRU right? And ARC should be totally dynamic, but as in another 
 thread here,
  i think reading a *big* file can mess with the whole thing. Can we hold an 
 area in memory
  for DNLC cache, or that is not the ARC way?
 
  thanks,
 
  Leal.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DNLC and ARC

2008-10-30 Thread Neil Perrin


On 10/30/08 11:00, Marcelo Leal wrote:
 Hello Neil,
 
 Leal,

 ZFS uses the DNLC. It still provides the fastest
 lookup of directory, name to vnode.
 
  Ok, so the whole concept remains true? We can tune the DNLC and expect the 
 same behaviour on ZFS?

Yes.

 
 The DNLC is kind of LRU. An async process will use a
 rotor to move
 through the hash chains and select the LRU entry but
 will select first
 negative cache entries and vnodes only referenced by
 the DNLC.
 Underlying this ZFS uses the ZAP and Fat ZAP to store
 the mappings.
 
  Here i did not understand very well. You are saying that ZFS uses DNLC just 
 for one level?

Yes, the DNLC also supports entire directory caches, however ZFS doesn't
use this as it's better organised on disk not to be linear.
Normally name lookups check the normal/original (1st level) DNLC then if
that fails the entire directory name cache (2nd level) is checked.

 
 ZFS does not use the 2nd level DNLC which allows
 caching of directories.
 This is only used by UFS to avoid a linear search of
 large directories.
 
  What is the ZFS way here? One of the points of my question is exactly 
 that... in an environment with many directories with *many* files, i think 
 ZFS would have the *same* problems too. 
  So, having directories cache on DNLC could be a good solution. Can you 
 explain how ZFS handles the performance in directories with hundreds of files?
  There is a lot of docs around UFS/DNLC, but for now i think the only doc 
 about ZFS/ARC and DNLC is the source code. ;-) 
 
 Neil.
 
  Thanks a lot! 
  I was thinking of tuning the DNLC to hold as much metadata (directories and files) 
 as I can, to minimize lookups/stats etc. (in NFS there are a lot of getattr 
 ops). So we could have *all* the metadata cached, and use what remains in 
 memory to cache data.
  Maybe that kind of tuning would be useful for just a few workloads, but 
 it could be a *huge* enhancement for those workloads.
 
  Leal
   -- posix rules --
 [http://www.posix.brte.com.br/blog] 
 
 On 10/30/08 04:50, Marcelo Leal wrote:
 Hello,
  In ZFS the DNLC concept is gone, or is in ARC too?
 I mean, all the cache in ZFS is ARC right? 
  I was thinking if we can tune the DNLC in ZFS like
 in UFS.. if we have too *many* files and
  directories, i guess we can have a better
 performance having all the metadata cached, and that
  is even more important in NFS operations.  
  DNLC is LRU right? And ARC should be totally
 dynamic, but as in another thread here,
  i think reading a *big* file can mess with the
 whole thing. Can we hold an area in memory
  for DNLC cache, or that is not the ARC way?

  thanks,

  Leal.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot remove slog device from zpool

2008-10-28 Thread Neil Perrin
Ethan,

It is still not possible to remove a slog from a pool. This is bug:

6574286 removing a slog doesn't work

The error message:

cannot remove c4t15d0p0: only inactive hot spares or cache devices can be 
removed

is correct and this is the same as documented in the zpool man page:

 zpool remove pool device ...

 Removes the specified device from the pool. This command
 currently  only  supports  removing hot spares and cache
 devices.

It's actually relatively easy to implement removal of slogs. We simply flush the
outstanding transactions and start using the main pool for the Intent Logs.
Thus the vacated device can be removed.
However, we wanted to make sure it fit into the framework for
the removal of any device. This a much harder problem which we
have made progress, but it's not there yet...

Neil.


On 10/26/08 11:41, Ethan Erchinger wrote:
 Sorry for the first incomplete send,  stupid Ctrl-Enter. :-)
 
 Hello,
 
 I've looked quickly through the archives and haven't found mention of 
 this issue.  I'm running SXCE (snv_99), which uses zfs version 13.  I 
 had an existing zpool:
 --
 [EMAIL PROTECTED] ~]$ zpool status -v data
   pool: data
  state: ONLINE
  scrub: none requested
 config:
 
 NAME   STATE READ WRITE CKSUM
 data   ONLINE   0 0 0
   mirror   ONLINE   0 0 0
 c4t1d0p0   ONLINE   0 0 0
 c4t9d0p0   ONLINE   0 0 0
   ...
 cache
   c4t15d0p0ONLINE   0 0 0
 
 errors: No known data errors
 
 --
 
 The cache device (c4t15d0p0) is an Intel SSD.  To test zil, I removed 
 the cache device, and added it as a log device:
 --
 [EMAIL PROTECTED] ~]$ pfexec zpool remove data c4t15d0p0
 [EMAIL PROTECTED] ~]$ pfexec zpool add data log c4t15d0p0
 [EMAIL PROTECTED] ~]$ zpool status -v data
   pool: data
  state: ONLINE
  scrub: none requested
 config:
 
 NAME   STATE READ WRITE CKSUM
 data   ONLINE   0 0 0
   mirror   ONLINE   0 0 0
 c4t1d0p0   ONLINE   0 0 0
 c4t9d0p0   ONLINE   0 0 0
   ...
 logs   ONLINE   0 0 0
   c4t15d0p0ONLINE   0 0 0
 
 errors: No known data errors
 --
 
 The device is working fine.  I then said, that was fun, time to remove 
 and add as cache device.  But that doesn't seem possible:
 --
 [EMAIL PROTECTED] ~]$ pfexec zpool remove data c4t15d0p0
 cannot remove c4t15d0p0: only inactive hot spares or cache devices can 
 be removed
 --
 
 I've also tried using detach, offline, each failing in other more 
 obvious ways.  The manpage does say that those devices should be 
 removable/replaceable.  At this point the only way to reclaim my SSD 
 device is to destroy the zpool.
 
 Just in-case you are wondering about versions:
 --
 [EMAIL PROTECTED] ~]$ zpool upgrade data
 This system is currently running ZFS pool version 13.
 
 Pool 'data' is already formatted using the current version.
 [EMAIL PROTECTED] ~]$ uname -a
 SunOS opensolaris 5.11 snv_99 i86pc i386 i86pc
 --
 
 Any ideas?
 
 Thanks,
 Ethan
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis

2008-10-22 Thread Neil Perrin


On 10/22/08 10:26, Constantin Gonzalez wrote:
 Hi,
 
 On a busy NFS server, performance tends to be very modest for large amounts
 of small files due to the well known effects of ZFS and ZIL honoring the
 NFS COMMIT operation[1].
 
 For the mature sysadmin who knows what (s)he does, there are three
 possibilities:
 
 1. Live with it. Hard, if you see 10x less performance than could be and your
 users complain a lot.
 
 2. Use a flash disk for a ZIL, a slog. Can add considerable extra cost,
 especially if you're using an X4500/X4540 and can't swap out fast SAS
 drives for cheap SATA drives to free the budget for flash ZIL drives.[2]
 
 3. Disable ZIL[1]. This is of course evil, but one customer pointed out to me
 that if a tar xvf were writing locally to a ZFS file system, the writes
 wouldn't be synchronous either, so there's no point in forcing NFS users
 to having a better availability experience at the expense of performance.
 
 
 So, if the sysadmin draws the informed and conscious conclusion that (s)he
 doesn't want to honor NFS COMMIT operations, what are options less disruptive
 than disabling ZIL completely?
 
 - I checked the NFS tunables from:
http://dlc.sun.com/osol/docs/content/SOLTUNEPARAMREF/chapter3-1.html
But could not find a tunable that would disable COMMIT honoring.
Is there already an RFE asking for a share option that disable's the
translation of COMMIT to synchronous writes?

- None that I know of...
 
 - The ZIL exists on a per filesystem basis in ZFS. Is there an RFE already
that asks for the ability to disable the ZIL on a per filesystem basis?

Yes: 6280630 zil synchronicity

Though personally I've been unhappy with the exposure that zil_disable has got.
It was originally meant for debug purposes only. So providing an official
way to make synchronous behaviour asynchronous is to me dangerous.

 
Once Admins start to disable the ZIL for whole pools because the extra
performance is too tempting, wouldn't it be the lesser evil to let them
disable it on a per filesystem basis?
 
 Comments?
 
 
 Cheers,
 Constantin
 
 [1]: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
 [2]: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis

2008-10-22 Thread Neil Perrin
 But the slog is the ZIL, formally a *separate* intent log.

No the slog is not the ZIL!

Here's the definition of the terms as we've been trying to use them:

ZIL:
    The body of code that supports synchronous requests, which writes
    out to the Intent Logs.
Intent Log:
    A stable storage log. There is one per file system & zvol.
slog:
    An Intent Log on a separate stable device - preferably high speed.

We don't really have a name for an Intent Log when it's embedded in the main
pool. I have in the past used the term clog for chained log. Originally before
slogs existed, it was just the Intent Log.
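
At the command line the distinction looks like this (pool and device names are
placeholders):

    # zpool add tank log c3t0d0                  # a single slog
    # zpool add tank log mirror c3t0d0 c3t1d0    # an N-way mirrored slog
    # zpool add tank log c3t0d0 c3t1d0           # two slogs; log writes striped across them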

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis

2008-10-22 Thread Neil Perrin
On 10/22/08 13:56, Marcelo Leal wrote:
  But the slog is the ZIL, formally a *separate* intent log.

 No the slog is not the ZIL!
  Ok, when you did write this:
  I've been slogging for a while on support for separate intent logs (slogs) 
 for ZFS.
  Without slogs, the ZIL is allocated dynamically from the main pool.
 
  You were talking about "The body of code" in the statement "the ZIL is 
 allocated"? 
  So i have misunderstood you...
 
  Leal.

I guess I need to fix that!
Anyway the slog is not the ZIL it's one of the two
currently possible Intent Log types.

Sorry for the confusion: Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs cp hangs when the mirrors are removed ..

2008-10-15 Thread Neil Perrin
Karthik,

The pool failmode property as implemented governs the behaviour when all
the devices needed are unavailable. The default behaviour is to wait
(block) until the IO can continue - perhaps by re-enabling the device(s).
The behaviour you expected can be achieved by 'zpool set failmode=continue pool',
as shown in the link you indicated below.

Neil.

On 10/15/08 22:38, Karthik Krishnamoorthy wrote:
 Hello All,
 
   Summary:
   
   cp command for mirrored zfs hung when all the disks in the mirrored
   pool were unavailable.
   
   Detailed description:
   ~
   
   The cp command (copy a 1GB file from nfs to zfs) hung when all the disks
   in the mirrored pool (both c1t0d9 and c2t0d9) were removed physically.
   
 NAME        STATE     READ WRITE CKSUM
 test        ONLINE       0     0     0
   mirror    ONLINE       0     0     0
     c1t0d9  ONLINE       0     0     0
     c2t0d9  ONLINE       0     0     0
   
   We think if all the disks in the pool are unavailable, cp command should
   fail with error (not cause hang).
   
   Our request:
   
   Please investigate the root cause of this issue.
  
   How to reproduce:
   ~
   1. create a zfs mirrored pool
   2. execute cp command from somewhere to the zfs mirrored pool.
   3. remove both of the disks physically while the cp command is working
 => hang happens (the cp command never returns and we can't kill it)
 
 One engineer pointed me to this page  
 http://opensolaris.org/os/community/arc/caselog/2007/567/onepager/ and 
 indicated that if all the mirrors are removed zfs enters a hang like 
 state to prevent the kernel from going into a panic mode and this type 
 of feature would be an RFE.
 
 My questions are
 
 Are there any documentation of the mirror configuration of zfs that 
 explains what happens when the underlying
 drivers detect problems in one of the mirror devices?
 
 It seems that the traditional views of mirror or raid-2 would expect 
 that the
 mirror would be able to proceed without interruption and that does not 
 seem to be this case in ZFS. 
 
 What is the purpose of the mirror, in zfs?  Is it more like an instant
 backup?  If so, what can the user do to recover, when there is an
 IO error on one of the devices?
 
 
 Appreciate any pointers and help,
 
 Thanks and regards,
 Karthik
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL NVRAM partitioning?

2008-09-05 Thread Neil Perrin
On 09/05/08 14:42, Narayan Venkat wrote:
 I understand that if you want to use ZIL, then the requirement is one or more 
 ZILs per pool.

A little clarification of ZFS terms may help here. The term ZIL is somewhat
overloaded. I think what you mean here is a separate log device (slog), because
intent logs are always present in ZFS. Without a slog, the logs are present in
the main pool. There is one log per file system and it allocates blocks in the
main pool to form a chain. When a slog is defined, then it can be made up of
multiple devices (in which case the writes are striped across the devices) or
it can be in the form of an N-way mirror - to provide redundancy.
 
 With an SSD you can partition the disk to allow usage of a single disk for 
 multiple ZILs
 Can we do the same thing with an PCIe-based NVRAM card
 (like http://www.vmetro.com/category4304.html)?

I don't think there's a Solaris supported driver for that device.
However, any Solaris device, whether a partition or not, will work
with ZFS provided it's at least 64MB. Its performance is another matter.
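
For example (slice names made up), one NVRAM or SSD device carved into slices can
back the logs of more than one pool:

    # zpool add poolA log c4t2d0s0
    # zpool add poolB log c4t2d0s1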

 
 Thanks 
 Narayan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-06 Thread Neil Perrin
Ross,

Thanks, I have updated the bug with this info.

Neil.

Ross Smith wrote:
 Hmm... got a bit more information for you to add to that bug I think.
  
 Zpool import also doesn't work if you have mirrored log devices and 
 either one of them is offline.
  
 I created two ramdisks with:
 # ramdiskadm -a rc-pool-zil-1 256m
 # ramdiskadm -a rc-pool-zil-2 256m
  
 And added them to the pool with:
 # zpool add rc-pool log mirror /dev/ramdisk/rc-pool-zil-1 
 /dev/ramdisk/rc-pool-zil-2
  
 I can reboot fine, the pool imports ok without the ZIL and I have a 
 script that recreates the ramdisks and adds them back to the pool:
 #!/sbin/sh
 state=$1
 case $state in
 'start')
echo 'Starting Ramdisks'
/usr/sbin/ramdiskadm -a rc-pool-zil-1 256m
/usr/sbin/ramdiskadm -a rc-pool-zil-2 256m
echo 'Attaching to ZFS ZIL'
/usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-1
/usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-2
;;
 'stop')
;;
 esac
  
 However, if I export the pool, and delete one ramdisk to check that the 
 mirroring works fine, the import fails:
 # zpool export rc-pool
 # ramdiskadm -d rc-pool-zil-1
 # zpool import rc-pool
 cannot import 'rc-pool': one or more devices is currently unavailable
  
 Ross
 
 
   Date: Mon, 4 Aug 2008 10:42:43 -0600
   From: [EMAIL PROTECTED]
   Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool...
   To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
   CC: zfs-discuss@opensolaris.org
  
  
  
   Richard Elling wrote:
Ross wrote:
I'm trying to import a pool I just exported but I can't, even -f 
 doesn't help. Every time I try I'm getting an error:
cannot import 'rc-pool': one or more devices is currently 
 unavailable
   
Now I suspect the reason it's not happy is that the pool used to 
 have a ZIL :)
   
   
Correct. What you want is CR 6707530, log device failure needs some 
 work
http://bugs.opensolaris.org/view_bug.do?bug_id=6707530
which Neil has been working on, scheduled for b96.
  
   Actually no. That CR mentioned the problem and talks about splitting out
   the bug, as it's really a separate problem. I've just done that and 
 here's
   the new CR which probably won't be visible immediately to you:
  
   6733267 Allow a pool to be imported with a missing slog
  
   Here's the Description:
  
   ---
   This CR is being broken out from 6707530 log device failure needs 
 some work
  
   When Separate Intent logs (slogs) were designed they were given equal 
 status in the pool device tree.
   This was because they can contain committed changes to the pool.
   So if one is missing it is assumed to be important to the integrity 
 of the
   application(s) that wanted the data committed synchronously, and thus
   a pool cannot be imported with a missing slog.
   However, we do allow a pool to be missing a slog on boot up if
   it's in the /etc/zfs/zpool.cache file. So this sends a mixed message.
  
   We should allow a pool to be imported without a slog if -f is used
   and to not import without -f but perhaps with a better error message.
  
   It's the guidsum check that actually rejects imports with missing 
 devices.
   We could have a separate guidsum for the main pool devices (non 
 slog/cache).
   ---
  
 
 
 
 Find out how to make Messenger your very own TV! Try it Now! 
 http://clk.atdmt.com/UKM/go/101719648/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs crash CR6727355 marked incomplete

2008-08-06 Thread Neil Perrin


Michael Hale wrote:
 A bug report I've submitted for a zfs-related kernel crash has been  
 marked incomplete and I've been asked to provide more information.
 
 This CR has been marked as incomplete by User 1-5Q-2508
 for the reason Need More Info.  Please update the CR
 providing the information requested in the Evaluation and/or Comments  
 field.
 
 However, when I pull up 6727355 in the bugs.opensolaris.org, it  
 doesn't allow me to make any edits, nor do I see an evaluation or  
 comments field - am I doing something wrong?

1. The Comments field asks that the core dump be made readable by our
   zfs group, and the CR was made incomplete until the person who
   saved the core does this.
2. You do not see this because the Comments is not readable outside
   of Sun as it is used to contain customer information.
3. Finally there is no Evaluation yet.

Bottom line is that you can ignore the Need more info - it wasn't
directed at you. Sorry about the confusion. I guess the kinks in the
system aren't ironed out yet. Usually if we need more info we
will email you directly.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-04 Thread Neil Perrin


Richard Elling wrote:
 Ross wrote:
 I'm trying to import a pool I just exported but I can't, even -f doesn't 
 help.  Every time I try I'm getting an error:
 cannot import 'rc-pool':  one or more devices is currently unavailable

 Now I suspect the reason it's not happy is that the pool used to have a ZIL 
 :)
   
 
 Correct.  What you want is CR 6707530, log device failure needs some work
 http://bugs.opensolaris.org/view_bug.do?bug_id=6707530
 which Neil has been working on, scheduled for b96.

Actually no. That CR mentioned the problem and talks about splitting out
the bug, as it's really a separate problem. I've just done that and here's
the new CR which probably won't be visible immediately to you:

6733267 Allow a pool to be imported with a missing slog

Here's the Description:

---
This CR is being broken out from 6707530 log device failure needs some work

When Separate Intent logs (slogs) were designed they were given equal status in 
the pool device tree.
This was because they can contain committed changes to the pool.
So if one is missing it is assumed to be important to the integrity of the
application(s) that wanted the data committed synchronously, and thus
a pool cannot be imported with a missing slog.
However, we do allow a pool to be missing a slog on boot up if
it's in the /etc/zfs/zpool.cache file. So this sends a mixed message.

We should allow a pool to be imported without a slog if -f is used
and to not import without -f but perhaps with a better error message.

It's the guidsum check that actually rejects imports with missing devices.
We could have a separate guidsum for the main pool devices (non slog/cache).
---

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

