Re: [zfs-discuss] hard drive write cache

2006-05-26 Thread Neil Perrin



ZFS enables the write cache and flushes it when committing transaction
groups; this ensures that a transaction group either appears on disk in
its entirety or not at all.


It also flushes the disk write cache before returning from every
synchronous request (e.g. fsync, O_DSYNC). This is done after
writing out the intent log blocks.
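
One rough way to watch those cache-flush requests being issued is sketched
below; it assumes the fbt provider and the vdev_disk_ioctl_done routine
(which also shows up in a dtrace trace later in this archive), so treat it
as illustrative rather than a supported interface:

# count disk write-cache flushes issued by ZFS; Ctrl-C prints the total
dtrace -n 'fbt::vdev_disk_ioctl_done:entry { @["write cache flushes"] = count(); }'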

Neil


Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-21 Thread Neil Perrin

Well this does look more and more like a duplicate of:

6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS


Neil


Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-21 Thread Neil Perrin



Torrey McMahon wrote On 06/21/06 10:29,:

Roch wrote:


Sean Meighan writes:
  The vi we were doing was a 2 line file. If you just vi a new file,
  add one line and exit, it would take 15 minutes in fdsynch. On
  recommendation of a workaround we set

set zfs:zil_disable=1

After the reboot the fdsynch is now < 0.1 seconds. Now I have no
idea if it was this setting or the fact that we went through a reboot.
Whatever the root cause, we are now back to a well behaved file system.
 


Well behaved... in appearance only!

Maybe it's nice to validate the hypothesis, but you should not
run with this option set, ever: it disables O_DSYNC and
fsync() and I don't know what else.

Bad idea, bad.




Why is this option available then? (Yes, that's a loaded question.)


I wouldn't call it an option, but an internal debugging switch that I
originally added to allow progress when initially integrating the ZIL.
As Roch says, it really shouldn't ever be set (as it negates POSIX
synchronous semantics), nor should it be mentioned to a customer.
In fact I'm now inclined to remove it; however, it does still have a use,
as it helped root-cause this problem.

Neil


Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-21 Thread Neil Perrin



Robert Milkowski wrote On 06/21/06 11:09,:

Hello Neil,

Why is this option available then? (Yes, that's a loaded question.)


NP I wouldn't call it an option, but an internal debugging switch that I
NP originally added to allow progress when initially integrating the ZIL.
NP As Roch says it really shouldn't be ever set (as it does negate POSIX
NP synchronous semantics). Nor should it be mentioned to a customer.
NP In fact I'm inclined to now remove it - however it does still have a use
NP as it helped root cause this problem.

Isn't it similar to unsupported fastfs for ufs?


It is similar in the sense that it speeds up the file system.
Using fastfs can be much more dangerous though, as it can lead
to a badly corrupted file system: writing of metadata is delayed
and done out of order. Disabling the ZIL, by contrast, does not affect
the integrity of the file system. The transaction group model of ZFS gives
consistency in the event of a crash/power failure. However, any data that
was promised to be on stable storage may not be there unless the transaction
group has committed (an operation that is started every 5s).

We once had plans to add a mount option to allow the admin
to control the ZIL. Here's a brief section of the RFE (6280630):

sync={deferred,standard,forced}

Controls synchronous semantics for the dataset.

When set to 'standard' (the default), synchronous operations
such as fsync(3C) behave precisely as defined in
fcntl.h(3HEAD).

When set to 'deferred', requests for synchronous semantics
are ignored.  However, ZFS still guarantees that ordering
is preserved -- that is, consecutive operations reach stable
storage in order.  (If a thread performs operation A followed
by operation B, then the moment that B reaches stable storage,
A is guaranteed to be on stable storage as well.)  ZFS also
guarantees that all operations will be scheduled for write to
stable storage within a few seconds, so that an unexpected
power loss only takes the last few seconds of change with it.

When set to 'forced', all operations become synchronous.
No operation will return until all previous operations
have been committed to stable storage.  This option can be
useful if an application is found to depend on synchronous
semantics without actually requesting them; otherwise, it
will just make everything slow, and is not recommended.

Of course we would need to stress the dangers of setting 'deferred'.
What do you guys think?
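
Purely for illustration, command-line usage of the proposed property might
look something like the sketch below if the RFE were implemented; the syntax
and the pool/dataset names are hypothetical, nothing here exists today:

# hypothetical per-dataset control from RFE 6280630
zfs set sync=deferred tank/scratch   # ignore synchronous requests (dangerous)
zfs set sync=standard tank/home      # default POSIX behaviour
zfs set sync=forced tank/dump        # make every operation synchronous
zfs get sync tank/scratch            # inspect the current setting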

Neil.


Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-24 Thread Neil Perrin

Chris,

The data will be written twice on ZFS using NFS. This is because NFS,
on closing the file, internally uses fsync to cause the writes to be
committed. This causes the ZIL to immediately write the data to the intent log.
Later the data is also committed as part of the pool's transaction group
commit, at which point the intent log blocks are freed.

It does seem inefficient to doubly write the data. In fact, for blocks
larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 was fixed)
we write the data block directly and also an intent log record with the block pointer.
During txg commit we link this block into the pool tree. By experimentation
we found 32K to be the (current) cutoff point. As the nfsds write at most 32K,
they do not benefit from this.
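
A crude way to see this effect from the server side is sketched below ('tank'
is a stand-in pool name); it is essentially the comparison Chris describes
further down:

# watch server-side pool write bandwidth while the NFS copy runs
zpool iostat tank 10
# with <=32K synchronous writes going through the intent log, the pool
# figure will be roughly double the throughput the NFS client reports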

Anyway this is an area we are actively working on.

Neil.

Chris Csanady wrote On 06/23/06 23:45,:

While dd'ing to an nfs filesystem, half of the bandwidth is unaccounted
for.  What dd reports amounts to almost exactly half of what zpool iostat
or iostat show; even after accounting for the overhead of the two mirrored
vdevs.  Would anyone care to guess where it may be going?

(This is measured over 10 second intervals.  For 1 second intervals,
the bandwidth to the disks jumps around from 40MB/s to 240MB/s)

With a local dd, everything adds up.  This is with a b41 server, and a
MacOS 10.4 nfs client.  I have verified that the bandwidth at the network
interface is approximately that reported by dd, so the issue would appear
to be within the server.

Any suggestions would be welcome.

Chris


--

Neil


Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-26 Thread Neil Perrin



Robert Milkowski wrote On 06/25/06 04:12,:

Hello Neil,

Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP Chris,

NP The data will be written twice on ZFS using NFS. This is because NFS
NP on closing the file internally uses fsync to cause the writes to be
NP committed. This causes the ZIL to immediately write the data to the intent 
log.
NP Later the data is also written committed as part of the pools transaction 
group
NP commit, at which point the intent block blocks are freed.

NP It does seem inefficient to doubly write the data. In fact for blocks
NP larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 fixed)
NP we write the data block and also an intent log record with the block 
pointer.
NP During txg commit we link this block into the pool tree. By experimentation
NP we found 32K to be the (current) cutoff point. As the nfsd at most write 32K
NP they do not benefit from this.

Is 32KB easily tuned (mdb?)?


I'm not sure. NFS folk?


I guess not but perhaps.

And why only for blocks larger than zfs_immediate_write_sz?


When data is large enough (currently 32K) it's more efficient to directly
write the block, and additionally save the block pointer in a ZIL record.
Otherwise it's more efficient to copy the data into a large log block
potentially along with other writes.

--

Neil


Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-27 Thread Neil Perrin



Robert Milkowski wrote On 06/27/06 03:00,:

Hello Chris,

Tuesday, June 27, 2006, 1:07:31 AM, you wrote:

CC On 6/26/06, Neil Perrin [EMAIL PROTECTED] wrote:



Robert Milkowski wrote On 06/25/06 04:12,:


Hello Neil,

Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP Chris,

NP The data will be written twice on ZFS using NFS. This is because NFS
NP on closing the file internally uses fsync to cause the writes to be
NP committed. This causes the ZIL to immediately write the data to the intent 
log.
NP Later the data is also written committed as part of the pools transaction 
group
NP commit, at which point the intent block blocks are freed.

NP It does seem inefficient to doubly write the data. In fact for blocks
NP larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 fixed)
NP we write the data block and also an intent log record with the block 
pointer.
NP During txg commit we link this block into the pool tree. By experimentation
NP we found 32K to be the (current) cutoff point. As the nfsd at most write 32K
NP they do not benefit from this.

Is 32KB easily tuned (mdb?)?


I'm not sure. NFS folk?



CC I think he is referring to the zfs_immediate_write_sz variable, but

Exactly, I was asking about this not NFS.


Sorry for the confusion. The zfs_immediate_write_sz variable was meant for
internal use and not really intended for public tuning. However, yes, it could
be tuned dynamically at any time using mdb, or set in /etc/system.
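
For completeness, a sketch of both approaches follows; the value shown is just
an example and the variable is not a supported interface, so tune at your own
risk (on a 32-bit kernel use /W rather than /Z):

# change it on a live system (takes effect for subsequent writes)
echo 'zfs_immediate_write_sz/Z 0t16384' | mdb -kw

# or add a line like this to /etc/system to make it persistent:
#   set zfs:zfs_immediate_write_sz=16384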

--

Neil


Re: [zfs-discuss] Supporting ~10K users on ZFS

2006-06-27 Thread Neil Perrin



[EMAIL PROTECTED] wrote On 06/27/06 17:17,:

We have over 1 filesystems under /home in strongspace.com and it works fine.

 I forget but there was a bug or there was an improvement made around nevada
 build 32 (we're currently at 41) that made the initial mount on reboot
 significantly faster.

Before that it was around 10-15 minutes. I wonder if that improvement didn't make it into sol10U2?

That fix (bug 6377670) made it into build 34 and S10_U2.



-Jason

Sent via BlackBerry from Cingular Wireless  


-Original Message-
From: eric kustarz [EMAIL PROTECTED]
Date: Tue, 27 Jun 2006 15:55:45 
To:Steve Bennett [EMAIL PROTECTED]

Cc:zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Supporting ~10K users on ZFS

Steve Bennett wrote:



OK, I know that there's been some discussion on this before, but I'm not sure 
that any specific advice came out of it. What would the advice be for 
supporting a largish number of users (10,000 say) on a system that supports 
ZFS? We currently use vxfs and assign a user quota, and backups are done via 
Legato Networker.





Using lots of filesystems is definitely encouraged - as long as doing so 
makes sense in your environment.




From what little I currently understand, the general advice would seem to be to 
assign a filesystem to each user, and to set a quota on that. I can see this 
being OK for small numbers of users (up to 1000 maybe), but I can also see it 
being a bit tedious for larger numbers than that.


I just tried a quick test on Sol10u2:
  for x in 0 1 2 3 4 5 6 7 8 9;  do for y in 0 1 2 3 4 5 6 7 8 9; do
  zfs create testpool/$x$y; zfs set quota=1024k testpool/$x$y
  done; done
[apologies for the formatting - is there any way to preformat text on this 
forum?]
It ran OK for a minute or so, but then I got a slew of errors:
  cannot mount '/testpool/38': unable to create mountpoint
  filesystem successfully created, but not mounted

So, OOTB there's a limit that I need to raise to support more than approx 40 
filesystems (I know that this limit can be raised, I've not checked to see 
exactly what I need to fix). It does beg the question of why there's a limit 
like this when ZFS is encouraging use of large numbers of filesystems.





There is no 40 filesystem limit.  You most likely had a pre-existing
file/directory in testpool with the same name as the filesystem you tried
to create.


fsh-hake# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
testpool77K  7.81G  24.5K  /testpool
fsh-hake# echo hmm > /testpool/01
fsh-hake# zfs create testpool/01
cannot mount 'testpool/01': Not a directory
filesystem successfully created, but not mounted
fsh-hake#



If I have 10,000 filesystems, is the mount time going to be a problem?
I tried:
  for x in 0 1 2 3 4 5 6 7 8 9;  do for y in 0 1 2 3 4 5 6 7 8 9; do
  zfs umount testpool/001; zfs mount testpool/001
  done; done
This took 12 seconds, which is OK until you scale it up - even if we assume 
that mount and unmount take the same amount of time, so 100 mounts will take 6 
seconds, this means that 10,000 mounts will take 5 minutes. Admittedly, this is 
on a test system without fantastic performance, but there *will* be a much 
larger delay on mounting a ZFS pool like this over a comparable UFS filesystem.





So this really depends on why and when you're unmounting filesystems.  I 
suspect it won't matter much since you won't be unmounting/remounting 
your filesystems.




I currently use Legato Networker, which (not unreasonably) backs up each 
filesystem as a separate session - if I continue to use this I'm going to have 
10,000 backup sessions on each tape backup. I'm not sure what kind of 
challenges restoring this kind of beast will present.

Others have already been through the problems with standard tools such as 'df' 
becoming less useful.





Is there a specific problem you had in mind regarding 'df'?



One alternative is to ditch quotas altogether - but even though disk is cheap, it's not 
free, and regular backups take time (and tapes are not free either!). In any case, 10,000 
undergraduates really will be able to fill more disks than we can afford to provision. We tried 
running a Windows fileserver back in the days when it had no support for per-user quotas; we did 
some ad-hockery that helped to keep track of the worst offenders (albeit after the event), but what 
really killed us was the uncertainty over whether some idiot would decide to fill all available 
space with vital research data (or junk, depending on your point of view).

I can see the huge benefits that ZFS quotas and reservations can bring, but I 
can also see that there is a possibility that there are situations where ZFS 
could be useful, but the lack of 'legacy' user-based quotas make it 
impractical. If the ZFS developers really are not going to implement user 
quotas is there any advice on what someone like me could do - at the moment I'm 
presuming that I'll just have to leave ZFS 

Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-28 Thread Neil Perrin



Robert Milkowski wrote On 06/28/06 15:52,:

Hello Neil,

Wednesday, June 21, 2006, 8:15:54 PM, you wrote:


NP Robert Milkowski wrote On 06/21/06 11:09,:


Hello Neil,


Why is this option available then? (Yes, that's a loaded question.)


NP I wouldn't call it an option, but an internal debugging switch that I
NP originally added to allow progress when initially integrating the ZIL.
NP As Roch says it really shouldn't be ever set (as it does negate POSIX
NP synchronous semantics). Nor should it be mentioned to a customer.
NP In fact I'm inclined to now remove it - however it does still have a use
NP as it helped root cause this problem.

Isn't it similar to unsupported fastfs for ufs?



NP It is similar in the sense that it speeds up the file system.
NP Using fastfs can be much more dangerous though as it can lead
NP to a badly corrupted file system as writing meta data is delayed
NP and written out of order. Whereas disabling the ZIL does not affect
NP the integrity of the fs. The transaction group model of ZFS gives
NP consistency in the event of a crash/power fail. However, any data that
NP was promised to be on stable storage may not be unless the transaction
NP group committed (an operation that is started every 5s).

NP We once had plans to add a mount option to allow the admin
NP to control the ZIL. Here's a brief section of the RFE (6280630):

NP  sync={deferred,standard,forced}

NP  Controls synchronous semantics for the dataset.

NP  When set to 'standard' (the default), synchronous 
operations
NP  such as fsync(3C) behave precisely as defined in
NP  fcntl.h(3HEAD).

NP  When set to 'deferred', requests for synchronous semantics
NP  are ignored.  However, ZFS still guarantees that ordering
NP  is preserved -- that is, consecutive operations reach 
stable
NP  storage in order.  (If a thread performs operation A 
followed
NP  by operation B, then the moment that B reaches stable 
storage,
NP  A is guaranteed to be on stable storage as well.)  ZFS also
NP  guarantees that all operations will be scheduled for write 
to
NP  stable storage within a few seconds, so that an unexpected
NP  power loss only takes the last few seconds of change with 
it.

NP  When set to 'forced', all operations become synchronous.
NP  No operation will return until all previous operations
NP  have been committed to stable storage.  This option can be
NP  useful if an application is found to depend on synchronous
NP  semantics without actually requesting them; otherwise, it
NP  will just make everything slow, and is not recommended.

NP Of course we would need to stress the dangers of setting 'deferred'.
NP What do you guys think?

I think it would be really useful.
I found myself many times in situation that such features (like
fastfs) were my last resort help.


The overwhelming consensus was that it would be useful, so I'll go ahead and
put that on my to-do list.



The same with txg_time - in some cases tuning it could probably be
useful. Instead of playing with mdb it would be much better put into
zpool/zfs or other util (and if possible made per fs not per host).


This one I'm less sure about. I have certainly tuned txg_time myself to
force certain situations, but I wouldn't be happy exposing the inner workings
of ZFS - which may well change.

Neil


Re: [zfs-discuss] Re: zvol Performance

2006-07-17 Thread Neil Perrin

This is change request:

6428639 large writes to zvol synchs too much, better cut down a little

which I have a fix for, but it hasn't been put back.

Neil.

Jürgen Keil wrote On 07/17/06 04:18,:

Further testing revealed
that it wasn't an iSCSI performance issue but a zvol
issue.  Testing on a SATA disk locally, I get these
numbers (sequentual write):

UFS: 38MB/s
ZFS: 38MB/s
Zvol UFS: 6MB/s
Zvol Raw: ~6MB/s

ZFS is nice and fast but Zvol performance just drops
off a cliff.  Suggestion or observations by others
using zvol would be extremely helpful.   



# zfs create -V 1g data/zvol-test
# time dd if=/data/media/sol-10-u2-ga-x86-dvd.iso 
of=/dev/zvol/rdsk/data/zvol-test bs=32k count=1
1+0 records in
1+0 records out
0.08u 9.37s 2:21.56 6.6%

That's ~ 2.3 MB/s.

I do see *frequent* DKIOCFLUSHWRITECACHE ioctls
(one flush write cache ioctl after writing ~36KB of data, needs ~6-7 
milliseconds per flush):


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02778, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 5736778 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e027c0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6209599 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02808, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6572132 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02850, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6732316 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02898, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6175876 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e028e0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6251611 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02928, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7756397 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02970, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6393356 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e029b8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6147003 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a00, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6247036 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a48, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6061991 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a90, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6284297 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02ad8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6174818 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02b20, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6245923 
nsec, error 0



dtrace with stack backtraces:


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec10, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6638189 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec58, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7881400 
nsec, error 0



Re: [zfs-discuss] How to best layout our filesystems

2006-07-26 Thread Neil Perrin



Brian Hechinger wrote On 07/26/06 06:49,:

On Tue, Jul 25, 2006 at 03:54:22PM -0700, Eric Schrock wrote:


If you give zpool(1M) 'whole disks' (i.e. no 's0' slice number) and let
it label and use the disks, it will automatically turn on the write
cache for you.



What if you can't give ZFS whole disks?  I run snv_38 on the Optiplex
GX620 on my desk at work and I run snv_40 on the Latitude D610 that I
carry with me.  In both cases the machines only have one disk, so I need
to split it up for UFS for the OS and ZFS for my data.  How do I turn on
write cache for partial disks?

-brian


You can't enable write caching for just part of the disk.
We don't enable it for slices because UFS (and other
file systems) don't do write cache flushing and so
could get corruption on a power failure. I suppose if you know
the disk only contains zfs slices then write caching could be
manually enabled using format -e -> cache -> write_cache -> enable.
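
Roughly, that interactive sequence looks like the sketch below (menus
abbreviated; double-check the prompts on your own system first):

# format -e
(select the disk from the menu)
format> cache
cache> write_cache
write_cache> enable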

Neil


Re: [zfs-discuss] zil_disable

2006-08-07 Thread Neil Perrin

Not quite: zil_disable is inspected at file system mount time.
It's also looked at dynamically on every write for zvols.
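
For completeness, the mechanics look roughly like this - test rigs only, since
(as discussed earlier in these threads) it negates O_DSYNC/fsync guarantees,
and 'tank/fs' is a stand-in name:

# flip the debugging switch on the live kernel
echo 'zil_disable/W 1' | mdb -kw
# file systems pick it up at mount time, so remount; zvols see it immediately
zfs umount tank/fs && zfs mount tank/fs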

Neil.

Robert Milkowski wrote On 08/07/06 10:07,:

Hello zfs-discuss,

  Just a note to everyone experimenting with this - if you change it
  online it only has an effect when pools are exported and then imported.


  ps. I didn't use it for my last posted benchmarks - with it I get about
   35,000IOPS and 0.2ms latency - but it's meaningless.




Re: [zfs-discuss] zil_disable

2006-08-08 Thread Neil Perrin

Robert Milkowski wrote:

Hello Neil,

Monday, August 7, 2006, 6:40:01 PM, you wrote:

NP Not quite, zil_disable is inspected on file system mounts.

I guess you're right that umount/mount will suffice - I just hadn't time
to check it, and export/import worked.

Anyway is there a way for file systems to make it active without
unmount/mount in current nevada?


No, sorry.

Neil


Re: [zfs-discuss] zil_disable

2006-08-08 Thread Neil Perrin

Robert Milkowski wrote:

Hello Eric,

Monday, August 7, 2006, 6:29:45 PM, you wrote:

ES Robert -

ES This isn't surprising (either the switch or the results).  Our long term
ES fix for tweaking this knob is:

ES 6280630 zil synchronicity

ES Which would add 'zfs set sync' as a per-dataset option.  A cut from the
ES comments (which aren't visible on opensolaris):

ES sync={deferred,standard,forced}

ES Controls synchronous semantics for the dataset.
ES 
ES When set to 'standard' (the default), synchronous

ES operations such as fsync(3C) behave precisely as defined
ES in fcntl.h(3HEAD).

ES When set to 'deferred', requests for synchronous
ES semantics are ignored.  However, ZFS still guarantees
ES that ordering is preserved -- that is, consecutive
ES operations reach stable storage in order.  (If a thread
ES performs operation A followed by operation B, then the
ES moment that B reaches stable storage, A is guaranteed to
ES be on stable storage as well.)  ZFS also guarantees that
ES all operations will be scheduled for write to stable
ES storage within a few seconds, so that an unexpected
ES power loss only takes the last few seconds of change
ES with it.

ES When set to 'forced', all operations become synchronous.
ES No operation will return until all previous operations
ES have been committed to stable storage.  This option can
ES be useful if an application is found to depend on
ES synchronous semantics without actually requesting them;
ES otherwise, it will just make everything slow, and is not
ES recommended.

ES There was a thread describing the usefulness of this (for builds where
ES all-or-nothing over a long period of time), but I can't find it.

I remember the thread. Do you know if anyone is currently working on
it and when is it expected to be integrated into snv?


I'm slated to work on it after I finish up some other ZIL bugs and performance
fixes.

Neil


Re: [zfs-discuss] Re: ZFS RAID10

2006-08-10 Thread Neil Perrin

Robert Milkowski wrote:

Hello Matthew,

Thursday, August 10, 2006, 6:55:41 PM, you wrote:

MA On Thu, Aug 10, 2006 at 06:50:45PM +0200, Robert Milkowski wrote:


btw: wouldn't it be possible to write block only once (for synchronous
IO) and than just point to that block instead of copying it again?



MA We actually do exactly that for larger (>32k) blocks.

Why such limit (32k)?


By experimentation that was the cutoff where it was found to be
more efficient. It was recently reduced from 64K with a more
efficient dmu_sync() implementation.
Feel free to experiment with the dynamically changeable tunable:

ssize_t zfs_immediate_write_sz = 32768;

--

Neil


Re: [zfs-discuss] Re: ZFS RAID10

2006-08-10 Thread Neil Perrin

Robert Milkowski wrote:

Hello Neil,

Thursday, August 10, 2006, 7:02:58 PM, you wrote:

NP Robert Milkowski wrote:


Hello Matthew,

Thursday, August 10, 2006, 6:55:41 PM, you wrote:

MA On Thu, Aug 10, 2006 at 06:50:45PM +0200, Robert Milkowski wrote:



btw: wouldn't it be possible to write block only once (for synchronous
IO) and than just point to that block instead of copying it again?



MA We actually do exactly that for larger (>32k) blocks.

Why such limit (32k)?



NP By experimentation that was the cutoff where it was found to be
NP more efficient. It was recently reduced from 64K with a more
NP efficient dmu-sync() implementaion.
NP Feel free to experiment with the dynamically changable tunable:

NP ssize_t zfs_immediate_write_sz = 32768;


I've just checked using dtrace on one of our production NFS servers that
90% of the time arg5 in zfs_log_write() is exactly 32768, and the rest
is always smaller.

With the default value of 32768 it means that for NFS servers it
will always copy the data, as I've just checked in the code and there is:

245 if (len > zfs_immediate_write_sz) {

So in the NFS server case the above will never be true (with default NFS
server settings).

Wouldn't an NFS server benefit from lowering zfs_immediate_write_sz to
32767?


Yes, NFS (with its default 32K maximum write size) would benefit if WR_INDIRECT
writes (using dmu_sync()) were faster, but that wasn't the case when
last benchmarked. I'm sure there are some cases currently where
tuning zfs_immediate_write_sz will help certain workloads.
Anyway, I think this whole area deserves more thought.
If you experiment with tuning zfs_immediate_write_sz, then please share
any performance data for your application/benchmark(s).
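
To see where a workload sits relative to the cutoff, here is a dtrace sketch
along the lines of what Robert did (assuming arg5 of zfs_log_write() is still
the write length, as he observed):

# distribution of write sizes reaching the intent log code; Ctrl-C to print
dtrace -n 'fbt::zfs_log_write:entry { @["bytes per zfs_log_write"] = quantize(arg5); }'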

Thanks: Neil


Re: [zfs-discuss] fdatasync

2006-08-10 Thread Neil Perrin

Myron Scott wrote:

Is there any difference between fdatasync and fsync on ZFS?


No. ZFS does not log data and metadata separately; rather,
it logs essentially the system call records, e.g. writes, mkdir,
truncate, setattr, etc. So fdatasync and fsync are identical
on ZFS.


Re: [zfs-discuss] Significant pauses during zfs writes

2006-08-14 Thread Neil Perrin

Yes, James is right: this is normal behaviour. Unless the writes are
synchronous (O_DSYNC) or explicitly flushed (fsync()), they
are batched up, written out and committed as a transaction
every txg_time (5 seconds).
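
You can see that batching from the outside with something as simple as the
line below (the pool name is a placeholder):

# expect the write activity to arrive in bursts roughly every 5 seconds
zpool iostat tank 1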

Neil.

James C. McPherson wrote:

Bob Evans wrote:


Just getting my feet wet with zfs.  I set up a test system (Sunblade
1000, dual channel scsi card, disk array with 14x18GB 15K RPM SCSI
disks) and was trying to write a large file (10 GB) to the array to
see how it performed.  I configured the raid using raidz.

During the write, I saw the disk access lights come on, but I noticed
a peculiar behavior.  The system would write to the disk, but then
pause for a few seconds, then continue, then pause for a few seconds.


I saw the same behavior when I made a smaller raidz using 4x36 GB
scsi drives in a different enclosure.

Since I'm new to zfs, and realize that I'm probably missing
something, I was hoping somebody might help shed some light on my
problem.



Hi Bob,
I'm pretty sure that's not a problem that you're seeing, just
ZFS' normal behaviour. Writes are coalesced as much as possible,
so the pauses that you observed are most likely going to be
the system waiting for suitable IOs to be gathered up and sent
out to your storage.

If you want to examine this a bit more then might I suggest the
DTrace Toolkit's iosnoop utility.


best regards,
James C. McPherson



--

Neil


Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID

2006-08-14 Thread Neil Perrin

Robert Milkowski wrote:


ps. however I'm really concerned with ZFS behavior when a pool is
almost full, there're lot of write transactions to that pool and
server is restarted forcibly or panics. I observed that file systems
on that pool will mount in 10-30 minutes each during zfs mount -a, and
one CPU is completely consumed. It's during system start-up so basically
whole system boots waits for it. It means additional 1 hour downtime.
This is something really unexpected for me and unfortunately no one
was really interested in my report - I know people are busy. But still
if it hits other users when zfs pools will be already populated people
won't be happy. For more details see my post here with subject: zfs
mount stuck in zil_replay.


That problem must have fallen through the cracks. Yes we are busy, but
we really do care about your experiences and bugs. I have just raised
a bug to cover this issue:

6460107 Extremely slow mounts after panic - searching space maps during replay

Thanks for reporting this and helping make ZFS better.

Neil


Re: [zfs-discuss] Re: Bizzare problem with ZFS filesystem

2006-09-15 Thread Neil Perrin

It is highly likely you are seeing a duplicate of:

6413510 zfs: writing to ZFS filesystem slows down fsync() on
other files in the same FS

which was fixed recently in build 48 of Nevada.
The symptoms are very similar. That is, an fsync from the vi would, prior
to the bug being fixed, have to force out all other data through the
intent log.

Neil.


Anantha N. Srirama wrote On 09/13/06 15:58,:
One more piece of information. I was able to ascertain the slowdown happens only when ZFS is used heavily; meaning lots of inflight I/O. This morning when the system was quiet my writes to the /u099 filesystem was excellent and it has gone south like I reported earlier. 


I am currently awaiting the completion of a write to /u099, well over 60 
seconds. At the same time I was able create/save files in /u001 without any 
problems. The only difference between the /u001 and /u099 is the size of the 
filesystem (256GB vs 768GB).

Per your suggestion I ran a 'zfs set' command and it completed after a wait of 
around 20 seconds while my file save from vi against /u099 is still pending!!!
 
 


Re: [zfs-discuss] Importing ZFS filesystems across architectures...

2006-09-21 Thread Neil Perrin



Philip Brown wrote On 09/21/06 20:28,:

Eric Schrock wrote:


If you're using EFI labels, yes (VTOC labels are not endian neutral).
ZFS will automatically convert endianness from the on-disk format, and
new data will be written using the native endianness, so data will be
gradually be rewritten to avoid the byteswap overhead.



now, when you say data, you just mean metadata, right?


Yes. ZFS has no knowledge of the layout of any structured records
written by applications, so it can't byteswap user data.

Neil



Re: [zfs-discuss] panic string assistance

2006-10-03 Thread Neil Perrin

ZFS will currently panic on a write failure to a non-replicated pool.
In the case below the Intent Log (though it could have been any module)
could not write an intent log block. Here's a previous response from Eric
Schrock explaining how ZFS intends to handle this:



Yes, there are three incremental fixes that we plan in this area:

6417772 need nicer message on write failure

This just cleans up the failure mode so that we get a nice
FMA failure message and can distinguish this from a random
failed assert.

6417779 ZFS: I/O failure (write on ...) -- need to reallocate writes

In a multi-vdev pool, this would take a failed write and attempt
to do the write on another toplevel vdev.  This would all but
eliminate the problem for multi-vdev pools.

6322646 ZFS should gracefully handle all devices failing (when writing)

This is the real fix.  Unfortunately, it's also really hard.
Even if we manage to abort the current transaction group,
dealing with the semantics of a filesystem which has lost an
arbitrary amount of change and notifying the user in a
meaningful way is difficult at best.

Hope that helps.

- Eric

Frank Leers wrote On 10/03/06 15:10,:
Could someone offer insight into this panic, please?  



panic string:   ZFS: I/O failure (write on unknown off 0: zio
6000c5fbc0
0 [L0 ZIL intent log] 1000L/1000P DVA[0]=1:249b68000:1000 zilog uncompre
ssed BE contiguous birth=318892 fill=0 cksum=3b8f19730caa4327:9e102
 panic kernel thread: 0x2a1015d7cc0  PID: 0  on CPU: 530 
cmd: sched
t_procp: 0x187c780(proc_sched)
  p_as: 0x187e4d0(kas)
  zone: global
t_stk: 0x2a1015d7ad0  sp: 0x18aa901  t_stkbase: 0x2a1015d2000
t_pri: 99(SYS)  pctcpu: 0.00
t_lwp: 0x0psrset: 0  last CPU: 530  
idle: 0 ticks (0 seconds)

start: Wed Sep 20 18:17:22 2006
age: 1788 seconds (29 minutes 48 seconds)
tstate: TS_ONPROC - thread is being run on a processor
tflg:   T_TALLOCSTK - thread structure allocated from stk
T_PANIC - thread initiated a system panic
tpflg:  none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
TS_SIGNALLED - thread was awakened by cv_signal()
pflag:  SSYS - system resident process

pc:  0x105f7f8  unix:panicsys+0x48:   call  unix:setjmp
startpc: 0x119fa64  genunix:taskq_thread+0x0:   save%sp, -0xd0, %sp

unix:panicsys+0x48(0x7b6e53a0, 0x2a1015d77c8, 0x18ab2d0, 0x1, , , 0x4480001601, 
, , , , , , , 0x7b6e53a0, 0x2a1015d77c8)

unix:vpanic_common+0x78(0x7b6e53a0, 0x2a1015d77c8, 0x7b6e3bf8, 0x7080bc30, 0x708
0bc70, 0x7080b840)
unix:panic+0x1c(0x7b6e53a0, 0x7080bbf0, 0x7080bbc0, 0x7b6e4428, 0x0, 0x6000c5fbc
00, , 0x5)
zfs:zio_done+0x284(0x6000c5fbc00)
zfs:zio_next_stage(0x6000c5fbc00) - frame recycled
zfs:zio_vdev_io_assess+0x178(0x6000c5fbc00, 0x6000c586da0, 0x7b6c79f0)
genunix:taskq_thread+0x1a4(0x6000bc5ea38, 0x0)
unix:thread_start+0x4()



Re: [zfs-discuss] fsflush and zfs

2006-10-13 Thread Neil Perrin

ZFS ignores the fsflush. Here's a snippet of the code in zfs_sync():

/*
 * SYNC_ATTR is used by fsflush() to force old filesystems like UFS
 * to sync metadata, which they would otherwise cache indefinitely.
 * Semantically, the only requirement is that the sync be initiated.
 * The DMU syncs out txgs frequently, so there's nothing to do.
 */
if (flag & SYNC_ATTR)
return (0);

However, for a user initiated sync(1M) and sync(2), ZFS does force
all outstanding data/transactions synchronously to disk.
This goes beyond the requirement of sync(2), which says I/O is initiated
but not waited on (i.e. asynchronous).

Neil.

ttoulliu2002 wrote On 10/13/06 00:06,:

Is there any change regarding fsflush such as autoup tunable for zfs ?

Thanks
 
 


Re: [zfs-discuss] Snapshots impact on performance

2006-10-16 Thread Neil Perrin



Matthew Ahrens wrote On 10/16/06 09:07,:

Robert Milkowski wrote:


Hello zfs-discuss,

  S10U2+patches. ZFS pool of about 2TB in size. Each day snapshot is
  created and 7 copies are kept. There's quota set for a file system
  however there's always at least 50GB of free space in a file system
  (and much more in a pool). ZFS file system is exported over NFS.

  Snapshots consume about 280GB of space.
We have noticed so performance problems on nfs clients to this file
  system even during times with smaller load. Rising quota didn't
  help. However removing oldest snapshot automatically solved
  performance problems.

  I do not have more details - sorry.

  Is it expected for snapshots to have very noticeable performance
  impact on file system being snapshoted?



No, this behavior is unexpected.  The only way that snapshots should 
have a performance impact on access to the filesystem is if you are 
running low on space in the pool or quota (which it sounds like you are 
not).


Can you describe what the performance problems were?  What was the 
workload like?  What problem did you identify?  How did it improve when 
you 'zfs destroy'-ed the oldest snapshot?  Are you sure that the oldest 
snapshot wasn't pushing you close to your quota?


--matt


I could well believe there would be a hiccup on the rest of the pool when
the snapshot is taken. Each snapshot calls txg_wait_synced
four times: a few related to the ZIL and one from dsl_sync_task_group_wait().


Re: [zfs-discuss] Re: Porting ZFS file system to FreeBSD.

2006-10-27 Thread Neil Perrin

Pawel,

I second that praise. Well done!

Attached is a copy of ziltest. You will have to adapt this a bit
to your environment. In particular it uses bringover to pull a subtree
of our source and then builds and later runs it. This tends to create
a fair number of transactions with various dependencies.
You'll obviously have to update the paths and tools.
However, at least initially, I'd recommend you simplify things by
perhaps having the only test be the creation of a file.

The basic flow behind ziltest is:
1. Create an empty file system FS1
2. Freeze FS1
3. Perform various user commands that create files, directories, etc
4. Copy FS1 to FS2
5. Unmount and unfreeze FS1
6. Remount FS1 (resulting in replay of log)
7. Compare FS1 & FS2 and complain if not equal

Hope this helps and good luck: Neil.

Eric Schrock wrote On 10/27/06 10:18,:

Congrats, Pawel.  This is truly an impressive piece of work.  As you're
probably aware, Noel integrated the patches you provided us into build
51.  Hopefully that got rid of some spurious differences between the
code bases.

We do have a program called 'ziltest' that Neil can probably provide for
you that does a good job stressing the ZIL.   We also have a complete
test suite (functional and stress), but it would be non-trivial to port,
and I don't know what the current status is for open sourcing the test
suites in general.

Let us know if there's anything else we can help with.

- Eric

On Fri, Oct 27, 2006 at 05:41:49AM +0200, Pawel Jakub Dawidek wrote:


Here is another update:

After way too much time spend on fighting the buffer cache I finally
made mmap(2)ed reads/writes to work and (which is also very important)
keep regular reads/writes working.

Now I'm able to build FreeBSD's kernel and userland with both sources
and objects placed on ZFS file system.

I also tried to crash it with fsx, fsstress and postmark, but no luck,
it works stable.

On the other hand I'm quite sure there are many problems in ZPL still,
but fixing mmap(2) allows me to move forward.

As a side note - ZVOL seems to be fully functional.

I need to find a way to test ZIL, so if you guys at SUN have some ZIL
tests like uncleanly stopped file system, which at mount time will
exercise entire ZIL functionality where we can verify that my FS was
fixed properly that would be great.

PS. There is still a lot to do, so please, don't ask me for patches yet.

--
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!



--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
#!/bin/ksh -x
#
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License, Version 1.0 only
# (the License).  You may not use this file except in compliance
# with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets [] replaced with your own identifying
# information: Portions Copyright [] [name of copyright owner]
#
# CDDL HEADER END
#
#
# Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
# ident @(#)ziltest 1.2 06/01/30 SMI
#
# - creates a 150MB pool in /tmp
# - Should take about a minute (depends on access to the gate for bringover).
# - You can change the gate to local by setting and exporting ZILTEST_GATE
#

PATH=/usr/bin
PATH=$PATH:/usr/sbin
PATH=$PATH:/usr/ccs/bin
#PATH=$PATH:/net/slug.eng/opt/export/`uname -p`/opt/SUNWspro/SOS8/bin
#PATH=$PATH:/net/anthrax.central/export/tools/onnv-tools/SUNWspro/SOS8/bin
PATH=$PATH:/net/haulass.central/export/tools/onnv-tools/SUNWspro/SOS8/bin
#PATH=$PATH:/net/slug.eng/opt/onbld/bin
PATH=$PATH:/opt/onbld/bin
export PATH

#
# SETUP
#
ZILTEST_GATE=${ZILTEST_GATE-/net/haulass.central/export/clones/onnv}
CMD=`basename $0`
POOL=ziltestpool.$$
DEVSIZE=${DEVSIZE-150m}
POOLDIR=/tmp
POOLFILE=$POOLDIR/ziltest_poolfile.$$
FS=$POOL/fs
ROOT=/$FS
COPY=/tmp/${POOL}
KEEP=no

cleanup() {
zfs destroy $FS
zpool iostat $POOL
print
zpool status $POOL
zpool destroy $POOL
rm -rf $COPY
rm $POOLFILE
}

bail() {
test $KEEP = no && cleanup
print $1
exit 1
}

test $# -eq 0 || bail "usage: $CMD"

mkfile $DEVSIZE $POOLFILE || bail can't make 

Re: [zfs-discuss] Re: Re: ZFS hangs systems during copy

2006-10-27 Thread Neil Perrin



Jürgen Keil wrote On 10/27/06 11:55,:

This is:
6483887 without direct management, arc ghost lists can run amok



That seems to be a new bug?
http://bugs.opensolaris.org does not yet find it.



It's not so new as it was created on 10/19, but as you say bug
search doesn't find it. However, you can access it directly:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6483887


Re: [zfs-discuss] linux versus sol10

2006-11-08 Thread Neil Perrin



Robert Milkowski wrote On 11/08/06 08:16,:

Hello Paul,

Wednesday, November 8, 2006, 3:23:35 PM, you wrote:

PvdZ On 7 Nov 2006, at 21:02, Michael Schuster wrote:



listman wrote:

hi, i found a comment comparing linux and solaris but wasn't sure  
which version of solaris was being referred. can the list confirm  
that this issue isn't a problem with solaris10/zfs??
Linux also supports asynchronous directory updates which can make  
a significant performance improvement when branching. On Solaris  
machines, inode creation is very slow and can result in very long  
iowait states.


I think this cannot be commented on in a useful fashion without  
more information this supposed issue. AFAIK, neither ufs nor zfs  
create inodes (at run time), so this is somewhat hard to put into  
context.


get a complete description of what this is about, then maybe we can  
give you a useful answer.





PvdZ This could be related to Linux trading reliability for speed by doing
PvdZ async metadata updates.
PvdZ If your system crashes before your metadata is flushed to disk your  
PvdZ filesystem might be hosed and a restore

PvdZ from backups may be needed.

you can achieve something similar with fastfs on ufs file systems and
setting zil_disable to 1 on ZFS.


There's a difference for both of these.

UFS now has logging (journalling) as the default, and so any crashes/power failures
will keep the integrity of the metadata intact (i.e. no fsck/restore).

ZFS has no problem either, as it fully transacts both data and metadata
and should never see corruption with the intent log disabled or enabled.


Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48

2006-11-09 Thread Neil Perrin



Tomas Ögren wrote On 11/09/06 09:59,:


1. DNLC-through-ZFS doesn't seem to listen to ncsize.

The filesystem currently has ~550k inodes and large portions of it is
frequently looked over with rsync (over nfs). mdb said ncsize was about
68k and vmstat -s  said we had a hitrate of ~30%, so I set ncsize to
600k and rebooted.. Didn't seem to change much, still seeing hitrates at
about the same and manual find(1) doesn't seem to be that cached
(according to vmstat and dnlcsnoop.d).
When booting, the following message came up, not sure if it matters or not:
NOTICE: setting nrnode to max value of 351642
NOTICE: setting nrnode to max value of 235577

Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is that
it has its own implementation which is integrated with the rest of the
ZFS cache which throws out metadata cache in favour of data cache.. or
something..


A more complete and useful set of dnlc statistics can be obtained via
kstat -n dnlcstats. As well as the soft limit on dnlc entries (ncsize),
the current number of cached entries is also useful:

echo ncsize/D | mdb -k
echo dnlc_nentries/D | mdb -k

NFS does have a maximum number of rnodes, which is calculated from the
memory available. It doesn't look like nrnode_max can be overridden.

Having said that I actually think your problem is lack of memory.
For each ZFS vnode held by the DNLC it uses a *lot* more memory
than say UFS. Consequently it has to purge dnlc entries and I
suspect with only 1GB that the ZFS ARC doesn't allow many dnlc entries.
I don't know if that number is maintained anywhere, for you to check.
Mark?

Neil.


Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48

2006-11-09 Thread Neil Perrin



Tomas Ögren wrote On 11/09/06 13:47,:


On 09 November, 2006 - Neil Perrin sent me these 1,6K bytes:

 


Tomas Ögren wrote On 11/09/06 09:59,:

   


1. DNLC-through-ZFS doesn't seem to listen to ncsize.

The filesystem currently has ~550k inodes and large portions of it is
frequently looked over with rsync (over nfs). mdb said ncsize was about
68k and vmstat -s  said we had a hitrate of ~30%, so I set ncsize to
600k and rebooted.. Didn't seem to change much, still seeing hitrates at
about the same and manual find(1) doesn't seem to be that cached
(according to vmstat and dnlcsnoop.d).
When booting, the following message came up, not sure if it matters or not:
NOTICE: setting nrnode to max value of 351642
NOTICE: setting nrnode to max value of 235577

Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is that
it has its own implementation which is integrated with the rest of the
ZFS cache which throws out metadata cache in favour of data cache.. or
something..
 


A more complete and useful set of dnlc statistic can be obtained via
kstat -n dnlcstats. As well as soft the limit on dnlc entries (ncsize)
the current number of cached entries is also useful:
   



This is after ~28h uptime:

module: unixinstance: 0
name:   dnlcstats   class:misc
   crtime  47.5600948
   dir_add_abort   0
   dir_add_max 0
   dir_add_no_memory   0
   dir_cached_current  4
   dir_cached_total107
   dir_entries_cached_current  4321
   dir_fini_purge  0
   dir_hits11000
   dir_misses  172814
   dir_reclaim_any 25
   dir_reclaim_last16
   dir_remove_entry_fail   0
   dir_remove_space_fail   0
   dir_start_no_memory 0
   dir_update_fail 0
   double_enters   234918
   enters  59193543
   hits36690843
   misses  59384436
   negative_cache_hits 1366345
   pick_free   0
   pick_heuristic  57069023
   pick_last   2035111
   purge_all   1
   purge_fs1   0
   purge_total_entries 3748
   purge_vfs   187
   purge_vp95
   snaptime99177.711093


vmstat -s:
96080561 total name lookups (cache hits 38%)

 


echo ncsize/D | mdb -k
echo dnlc_nentries/D | mdb -k
   



ncsize: 60
dnlc_nentries:  19230

Not quite the same..

 


Having said that I actually think your problem is lack of memory.
For each ZFS vnode held by the DNLC it uses a *lot* more memory
than say UFS. Consequently it has to purge dnlc entries and I
suspect with only 1GB that the ZFS ARC doesn't allow many dnlc entries.
I don't know if that number is maintained anywhere, for you to check.
Mark?
   



Current memory usage (for some values of usage ;):
# echo ::memstat|mdb -k
Page Summary                Pages     MB   %Tot
------------------------------------------------
Kernel                      95584    746    75%
Anon                        20868    163    16%
Exec and libs                1703     13     1%
Page cache                   1007      7     1%
Free (cachelist)               97      0     0%
Free (freelist)              7745     60     6%

Total                      127004    992
Physical                   125192    978


/Tomas
 


This memory usage shows nearly all of memory consumed by the kernel,
and probably by ZFS. ZFS can't add any more DNLC entries, due to lack of
memory, without purging others. This can be seen from the number of
dnlc_nentries being way less than ncsize.
I don't know if there's a DMU or ARC bug to reduce the memory footprint
of their internal structures for situations like this, but we are aware
of the issue.

Neil.


Re: [zfs-discuss] zfs mount stuck in zil_replay

2006-11-09 Thread Neil Perrin

Hi Robert,

Yes, it could be related, or even the bug. Certainly the replay
was (prior to this bug fix) extremely slow.  I don't really have enough
information to determine if it's the exact problem, though after
re-reading your original post I strongly suspect it is.

I also putback a companion fix which should be helpful in determining
if zil_replay is making progress, or hung:

6486496 zil_replay() useful debug
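
Until that debug support is available, one rough way to check whether a replay
is progressing rather than hung is sketched below; it assumes the
zil_replay_log_record callback name, so treat it as a guess:

# if log records are still being replayed, the count keeps growing
dtrace -n 'fbt::zil_replay_log_record:entry { @records = count(); } tick-5s { printa(@records); }'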

Neil.

Robert Milkowski wrote On 11/09/06 18:10,:


Hello Neil,

I can see http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6478388
integrated. I guess it could be related to problem I described here,
right?

 




Re: [zfs-discuss] Re: Managed to corrupt my pool

2006-12-05 Thread Neil Perrin

Jim,

I'm not at all sure what happened to your pool.
However, I can answer some of your questions.

Jim Hranicky wrote On 12/05/06 11:32,:

So the questions are:

- is this fixable? I don't see an inum I could run
find on to remove, 


I think the pool is busted. Even the message printed in your
previous email is bad:

  DATASET  OBJECT  RANGE
  15   0   lvl=4294967295 blkid=0

as level is way out of range.


  and I can't even do a zfs volinit anyway:
  nextest-01# zfs volinit
cannot iterate filesystems: I/O error


I'm not sure why you're using zfs volinit which I believe creates
the zvol links, but this further shows problems.



- would not enabling zil_disable have prevented this?


No, the intent log is not needed for pool integrity.
It only ensures that the synchronous semantics of O_DSYNC/fsync are obeyed.



Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43

2006-12-07 Thread Neil Perrin

Ben,

The attached dscript might help in determining the zfs_create issue.
It prints:
- a count of all functions called from zfs_create
- the average wall clock time of the 30 highest functions
- the average cpu time of the 30 highest functions

Note, please ignore warnings of the following type:

dtrace: 1346 dynamic variable drops with non-empty dirty list
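
The attachment itself isn't reproduced in this archive; a stripped-down sketch
of the call-counting part might look like the following (a reconstruction, not
the original script):

dtrace -n '
fbt::zfs_create:entry  { self->in = 1; }
fbt::zfs_create:return { self->in = 0; }
fbt:::entry /self->in/ { @calls[probefunc] = count(); }
END { trunc(@calls, 30); printa(@calls); }'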

Neil.

Ben Rockwood wrote On 12/07/06 06:01,:

I've got a Thumper doing nothing but serving NFS.  Its using B43 with 
zil_disabled.  The system is being consumed in waves, but by what I don't know. 
 Notice vmstat:

 3 0 0 25693580 2586268 0 0  0  0  0  0  0  0  0  0  0  926   91  703  0 25 75
 21 0 0 25693580 2586268 0 0 0  0  0  0  0  0  0 13 14 1720   21 1105  0 92  8
 20 0 0 25693580 2586268 0 0 0  0  0  0  0  0  0 17 18 2538   70  834  0 100 0
 25 0 0 25693580 2586268 0 0 0  0  0  0  0  0  0  0  0  745   18  179  0 100 0
 37 0 0 25693552 2586240 0 0 0  0  0  0  0  0  0  7  7 1152   52  313  0 100 0
 16 0 0 25693592 2586280 0 0 0  0  0  0  0  0  0 15 13 1543   52  767  0 100 0
 17 0 0 25693592 2586280 0 0 0  0  0  0  0  0  0  2  2  890   72  192  0 100 0
 27 0 0 25693572 2586260 0 0 0  0  0  0  0  0  0 15 15 3271   19 3103  0 98  2
 0 0 0 25693456 2586144 0 11 0  0  0  0  0  0  0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2  0  0  0  0  0  0  0  0  0 2470  103 2900  0 27 73
 0 0 0 25693448 2586136 0 0  0  0  0  0  0  0  0  0  0 1062  105  822  0 26 74
 0 0 0 25693448 2586136 0 0  0  0  0  0  0  0  0  0  0 1076   91  857  0 25 75
 0 0 0 25693448 2586136 0 0  0  0  0  0  0  0  0  0  0  917  126  674  0 25 75

These spikes of sys load come in waves like this.  While there are close to a 
hundred systems mounting NFS shares on the Thumper, the amount of traffic is 
really low.  Nothing to justify this.  We're talking less than 10MB/s.

NFS is pathetically slow.  We're using NFSv3 TCP shared via ZFS sharenfs on a 
3Gbps aggregation (3*1Gbps).

I've been slamming my head against this problem for days and can't make 
headway.  I'll post some of my notes below.  Any thoughts or ideas are welcome!

benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect.

Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed 
dns was specified in /etc/nsswitch.conf which won't work given that no DNS 
servers are accessable from the storage or private networks, but again, no improvement. 
In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled 
the dns/client service in SMF.

Turning back to CPU usage, we can see the activity is all SYStem time and comes 
in waves:

[private:/tmp] root# sar 1 100

SunOS private.thumper1 5.11 snv_43 i86pc12/07/2006

10:38:05%usr%sys%wio   %idle
10:38:06   0  27   0  73
10:38:07   0  27   0  73
10:38:09   0  27   0  73
10:38:10   1  26   0  73
10:38:11   0  26   0  74
10:38:12   0  26   0  74
10:38:13   0  24   0  76
10:38:14   0   6   0  94
10:38:15   0   7   0  93
10:38:22   0  99   0   1  --
10:38:23   0  94   0   6  --
10:38:24   0  28   0  72
10:38:25   0  27   0  73
10:38:26   0  27   0  73
10:38:27   0  27   0  73
10:38:28   0  27   0  73
10:38:29   1  30   0  69
10:38:30   0  27   0  73

And so we consider whether or not there is a pattern to the frequency. The 
following is sar output from any lines in which sys is above 90%:

10:40:04%usr%sys%wio   %idleDelta
10:40:11   0  97   0   3
10:40:45   0  98   0   2   34 seconds
10:41:02   0  94   0   6   17 seconds
10:41:26   0 100   0   0   24 seconds
10:42:00   0 100   0   0   34 seconds
10:42:25   (end of sample) 25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100

10:45:43 runq-sz %runocc swpq-sz %swpocc
10:45:5127.0  85 0.0   0
10:45:57 1.0  20 0.0   0
10:46:02 2.0  60 0.0   0
10:46:1319.8  99 0.0   0
10:46:2317.7  99 0.0   0
10:46:3424.4  99 0.0   0
10:46:4122.1  97 0.0   0
10:46:4813.0  96 0.0   0
10:46:5525.3 102 0.0   0

Looking at the per-CPU breakdown:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   00   324  224000  1540 00 100   0   0
  10   00   1140  2260   10   130860   1   0  99
  20   00   162  138  1490540 00  

Re: [zfs-discuss] Re: ZFS Storage Pool advice

2006-12-12 Thread Neil Perrin

Are you looking purely for performance, or for the added reliability that ZFS 
can give you?

If the latter, then you would want to configure across multiple LUNs in either 
a mirrored or RAID configuration. This does require sacrificing some storage in 
exchange for the peace of mind that any “silent data corruption” in the array 
or storage fabric will be not only detected but repaired by ZFS.


From a performance point of view, what will work best depends greatly on your 
application I/O pattern, how you would map the application’s data to the 
available ZFS pools if you had more than one, how many channels are used to 
attach the disk array, etc.  A single pool can be a good choice from an 
ease-of-use perspective, but multiple pools may perform better under certain 
types of load (for instance, there’s one intent log per pool, so if the intent 
log writes become a bottleneck then multiple pools can help).


Bad example, as there's actually one intent log per file system!


This also depends on how the LUNs are configured within the EMC array

If you can put together a test system, and run your application as a benchmark, 
you can get an answer. Without that, I don’t think anyone can predict which 
will work best in your particular situation.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring ZFS

2006-12-12 Thread Neil Perrin



Tom Duell wrote On 12/12/06 17:11,:

Group,

We are running a benchmark with 4000 users
simulating a hospital management system
running on Solaris 10 6/06 on USIV+ based
SunFire 6900 with 6540 storage array.

Are there any tools for measuring internal
ZFS activity to help us understand what is going
on during slowdowns?


dtrace can be used in numerous ways to examine
every part of ZFS and Solaris. lockstat(1M) (which actually
uses dtrace underneath) can also be used to see the cpu activity
(try lockstat -kgIW -D 20 sleep 10).

You can also use iostat (eg iostat -xnpcz) to look at disk activity.
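For example, to see how long each transaction group commit takes during a
slowdown (assuming the spa_sync symbol is traceable on your build), a
one-liner like this can be left running:

# dtrace -n 'fbt:zfs:spa_sync:entry { self->ts = timestamp; }
  fbt:zfs:spa_sync:return /self->ts/ {
    @["spa_sync time (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'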



We have 192GB of RAM and while ZFS runs
well most of the time, there are times where
the system time jumps up to 25-40%
as measured by vmstat and iostat.  These
times coincide with slowdowns in file access
as measured by a side program that simply
reads a random block in a file... these response
times can exceed 1 second or longer.


ZFS commits transaction groups every 5 seconds.
I suspect this flurry of activity is due to that.
Commiting can indeed take longer than a second.

You might be able to show this by changing it with:

# echo txg_time/W 10 | mdb -kw

then the activity should be longer but less frequent.
I don't however recommend you keep it at that value.
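To check the current setting and put it back afterwards, something along
these lines should work (the value is in seconds, 5 being the default):

# echo txg_time/D | mdb -k
# echo txg_time/W 0t5 | mdb -kw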




Any pointers greatly appreaciated!

Tom





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Some ZFS questions

2006-12-17 Thread Neil Perrin

CT Will I be able to tune the DMU flush rate, now set at 5 seconds?

echo 'txg_time/D 0t1' | mdb -kw


Er, that 'D' should be a 'W'.
Having said that I don't think we recommend messing with the transaction
group commit timing.
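That is, the corrected one-liner (with the same caveat about not doing this
outside of testing) would be:

# echo 'txg_time/W 0t1' | mdb -kw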
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Difference between ZFS and UFS with one LUN from a SAN

2006-12-22 Thread Neil Perrin



Robert Milkowski wrote On 12/22/06 13:40,:

Hello Torrey,

Friday, December 22, 2006, 9:17:46 PM, you wrote:

TM Roch - PAE wrote:


The fact that most FS do not manage the disk write caches
does mean you're at risk of data lost for those FS.




TM Does ZFS? I thought it just turned it on in the places where we had 
TM previously turned if off.


ZFS send flush cache command after each transaction group so it's sure
transaction is on stable storage.


... and after every fsync, O_DSYNC, etc that writes out intent log blocks.
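If you want to see those flushes happening, and your build exposes the
zil_flush_vdevs symbol (an assumption - the name may differ between
releases), a quick count can be had with:

# dtrace -n 'fbt:zfs:zil_flush_vdevs:entry { @["zil cache flushes"] = count(); }'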
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solid State Drives?

2007-01-05 Thread Neil Perrin

I'm currently working on putting the ZFS intent log on separate devices
which could include separate disks and nvram/solid state devices.
This would help any application using fsync/O_DSYNC - in particular
DB and NFS. From prototyping, considerable performance improvements have
been seen.

Neil.

Kyle McDonald wrote On 01/05/07 08:10,:
I know there's been much discussion on the list lately about getting HW 
arrays to use (or not use) their caches in a way that helps ZFS the most.


Just yesterday I started seeing articles on NAND Flash Drives, and I 
know other Solid Stae Drive technologies have been around for a while 
and many times are used for transaction logs or other ways of 
accelerating FS's.


If these devices become more prevalent, and/or cheaper I'm curious what 
ways ZFS could be made to best take advantage of them?


One Idea I had was for each pool allow me to designate a mirror or RaidZ 
of these devices just for the transaction logs. Since they're faster 
than normal disks, My uneducated guess is that they could boost 
performance.


I suppose it doesn't eliminate the problems with the real drive (or 
array) caches
though. You still need to know that the data is on the real drives 
before you can wipe that transaction from the transaction log right?


Well... I'd still like to hear the experts ideas on how this could (or 
won't ever?) help ZFS out? Would changes to ZFS be required?


-Kyle


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solid State Drives?

2007-01-05 Thread Neil Perrin



Robert Milkowski wrote On 01/05/07 11:45,:

Hello Neil,

Friday, January 5, 2007, 4:36:05 PM, you wrote:

NP I'm currently working on putting the ZFS intent log on separate devices
NP which could include separate disks and nvram/solid state devices.
NP This would help any application using fsync/O_DSYNC - in particular
NP DB and NFS. From prototyping, considerable performance improvements have
NP been seen.

Can you share any results from prototype testing?


I'd prefer not to just yet as I don't want to raise expectations unduly.
When testing I was using a simple local benchmark, whereas
I'd prefer to run something more official such as TPC.
I'm also missing a few required features in the protoype which
may affect performance.

Hopefully I can provide some results soon, but even those will
be unofficial.

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Puzzling ZFS behavior with COMPRESS option

2007-01-08 Thread Neil Perrin



Anantha N. Srirama wrote On 01/08/07 13:04,:

Our setup:

- E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06)
- 2 2Gbps FC HBA
- EMC DMX storage
- 50 x 64GB LUNs configured in 1 ZFS pool
- Many filesystems created with COMPRESS enabled; specifically I've one that is 
768GB

I'm observing the following puzzling behavior:

- We are currently creating a large (1.4TB) and sparse dataset; most of the 
dataset contains repeating blanks (default/standard SAS dataset behavior.)
- ls -l reports the file size as 1.4+TB and du -sk reports the actual on disk 
usage at around 65GB.
- My I/O on the system is pegged at 150+MB/S as reported by zpool iostat and 
I've confirmed the same with iostat.

This is very confusing
 
- ZFS is doing very good compression as reported by the ratio of on disk versus as reported size of the file (1.4TB vs 65GB)

- [b]Why on God's green earth am I observing such high I/O when indeed ZFS is 
compressing?[/b] I can't believe that the program is actually generating I/O at 
the rate of (150MB/S * compressratio).

Any thoughts?




One possibility is that the data is written synchronously (uses O_DSYNC,
fsync, etc), and so the ZFS Intent Log (ZIL) will write that uncompressed
data to stable storage in case of a crash/power fail before the txg
is committed.
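A quick way to confirm that the writer really is synchronous is to count
fsync calls while the job runs (O_DSYNC writes won't show up here, only
fsync/fdsync):

# dtrace -n 'syscall::fdsync:entry { @[execname] = count(); }'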

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Heavy writes freezing system

2007-01-17 Thread Neil Perrin



Rainer Heilke wrote On 01/17/07 15:44,:

It turns out we're probably going to go the UFS/ZFS route, with 4 filesystems 
(the DB files on

 UFS with Directio).


It seems that the pain of moving from a single-node ASM to a RAC'd ASM is 
great, and not worth it.

 The DBA group decided doing the migration to UFS for the DB files now, and
 then to a RAC'd ASM later, will end up being the easiest, safest route.


Rainer
Still curious as to if and when this bug will get fixed...


If you're referring to bug 6413510 that Anantha mentioned then my
earlier post today answered that:

 This problem was fixed in snv_48 last September and will be
 in S10_U4.

Neil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Re: Heavy writes freezing system

2007-01-17 Thread Neil Perrin



Anton B. Rang wrote On 01/17/07 20:31,:

Yes, Anantha is correct that is the bug id, which could be responsible
for more disk writes than expected.



I believe, though, that this would explain at most a factor of 2

 of write expansion (user data getting pushed to disk once in the
 intent log, then again in its final location).

Agreed.


If the writes are

 relatively large, there'd be even less expansion, because the ZIL
 will write a large enough block of data (would this be 128K?)

Anything over zfs_immediate_write_sz (currently 32KB) is written
in this way.
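You can check the threshold on a running system with mdb; something like
the following should do it (the format character may need adjusting to the
variable's width on your kernel):

# echo zfs_immediate_write_sz/D | mdb -k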

 into a  block which can be used as its final location. (If I'm
 understanding some earlier conversations right; haven't looked at
 the code lately.)


Anton
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bug id 6381203

2007-01-28 Thread Neil Perrin

Hi Leon,

This was fixed in March 2006, and is in S10_U2.

Neil.

Leon Koll wrote On 01/28/07 08:58,:

Hello,
what is the status of the bug 6381203 fix in S10 u3 ?
(deadlock due to i/o while assigning (tc_lock held))

Was it integrated? Is there a patch?

Thanks,
[i]-- leon[/i]
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS inode equivalent

2007-01-31 Thread Neil Perrin

No it's not the final version or even the latest!
The current on disk format version is 3. However, it hasn't
diverged much and the znode/acl stuff hasn't changed.


Neil.

James Blackburn wrote On 01/31/07 14:31,:

Or look at pages 46-50 of the ZFS on-disk format document:

http://opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf



There's an final version?  That link appears to be broken (and the
lastest version linked from the ZFS docs area
http://opensolaris.org/os/community/zfs/docs/ is dated 0822).

James
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS checksums - block or file level

2007-02-01 Thread Neil Perrin

ZFS checksums are at the block level.

Nathan Essex wrote On 02/01/07 08:27,:

I am trying to understand if zfs checksums apply at a file or a block level.  
We know that zfs provides end to end checksum integrity, and I assumed that 
when I write a file to a zfs filesystem, the checksum was calculated at a file 
level, as opposed to say, a block level.  However, I have noticed that when I 
create an emulated volume, that volume has a checksum property, set to the same 
default as a normal zfs filesystem.  I can even change the checksum value as 
normal, see below:

# /usr/sbin/zfs create -V 50GB -b 128KB mypool/myvol

# /usr/sbin/zfs set checksum=sha256 mypool/myvol

Now on this emulated volume, I could place any number of structures that are 
not zfs filesystems, say raw database volumes, or ufs, qfs, etc.  Since these 
do not perform end to end checksums, can someone explain to me what the zfs 
checksum would be doing at this point?
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [storage-discuss] Re[2]: [zfs-discuss] se3510 and ZFS

2007-02-06 Thread Neil Perrin



Robert Milkowski wrote On 02/06/07 11:43,:

Hello eric,

Tuesday, February 6, 2007, 5:55:23 PM, you wrote:



IIRC Bill posted here some tie ago saying the problem with write cache
on the arrays is being worked on.



ek Yep, the bug is:
ek 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE  
ek CACHE to

ek SBC-2 devices

Thanks. I see a workaround there (I saw it earlier but it doesn't
apply to 3510) and I have a question - setting zil_disable to 1
won't actually completely disable cache flushing, right? (still every
txg group completes cache would be flushed)??


ek We have a case going through PSARC that will make things works  
ek correctly with regards to flushing the write cache and non-volatile  
ek caches.


There's actually a tunable to disable cache flushes:
zfs_nocacheflush and in older code (like S10U3) it's zil_noflush.


Yes, but we didn't want to publicise this internal switch. (I would
not call it a tunable). We (or at least I) are regretting publicising
zil_disable, but using zfs_nocacheflush is worse. If the device is
volatile then we can get pool corruption. An uberblock could get written
before all of its tree.

Note, zfs_nocacheflush and zil_noflush are not the same.
Setting zil_noflush stopped zil flushes of the write cache, whereas
zfs_nocacheflush will additionally stop flushing for txgs.




H...


ek The tricky part is getting vendors to actually support SYNC_NV bit.   
ek If you your favorite vendor/array doesn't support it, feel free to  
ek give them a call...


Is there any work being done to ensure/check that all arrays Sun sells
do support it?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Efficiency when reading the same file blocks

2007-02-25 Thread Neil Perrin



Jeff Davis wrote On 02/25/07 20:28,:

if you have N processes reading the same file sequentially (where file size is 
much greater than physical memory) from the same starting position, should I 
expect that all N processes finish in the same time as if it were a single 
process?


Yes I would expect them to finish the same time. There should be no additional 
reads because

the data will be in the ZFS cache (ARC).
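If you want to confirm that the extra readers are being satisfied from the
ARC, recent builds export an 'arcstats' kstat (assuming it is present on
your release):

# kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses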

Given your question are you about to come back with a case where you are not 
seeing this?


Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Today PANIC :(

2007-02-28 Thread Neil Perrin

Gino,

We have seen this before but only very rarely and never got a good crash dump.
Coincidentally, we saw it only yesterday on a server here, and are currently
investigating it. Did you also get a dump we can access? That would help.
If not, can you tell us what zfs version you were running?
At the moment I'm not even sure how you can recover from it. Sorry about this problem.

FYI this is bug:

http://bugs.opensolaris.org/view_bug.do?bug_id=6458218

Neil.

Gino Ruopolo wrote On 02/28/07 02:17,:

Feb 28 05:47:31 server141 genunix: [ID 403854 kern.notice] assertion failed: ss 
== NULL, file: ../../common/fs/zfs/space_map.c, line: 81
Feb 28 05:47:31 server141 unix: [ID 10 kern.notice]
Feb 28 05:47:31 server141 genunix: [ID 802836 kern.notice] fe8000d559f0 
fb9acff3 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55a70 
zfs:space_map_add+c2 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55aa0 
zfs:space_map_free+22 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55ae0 
zfs:space_map_vacate+38 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55b40 
zfs:zfsctl_ops_root+2fdbc7e7 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55b70 
zfs:vdev_sync_done+2b ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55bd0 
zfs:spa_sync+215 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55c60 
zfs:txg_sync_thread+115 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55c70 
unix:thread_start+8 ()
Feb 28 05:47:31 server141 unix: [ID 10 kern.notice]
Feb 28 05:47:31 server141 genunix: [ID 672855 kern.notice] syncing file 
systems...
Feb 28 05:47:32 server141 genunix: [ID 733762 kern.notice]  1
Feb 28 05:47:33 server141 genunix: [ID 904073 kern.notice]  done 


What happened this time? Any suggest?

thanks,
gino
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror question

2007-03-23 Thread Neil Perrin

Yes, this is supported now. Replacing one half of a mirror with a larger device;
letting it resilver; then replacing the other half does indeed get a larger 
mirror.
I believe this is described somewhere but I can't remember where now.
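The sequence is roughly the following (pool and device names are only
placeholders; wait for each resilver to finish before the next step):

# zpool replace tank c1t0d0 c2t0d0
# zpool status tank
# zpool replace tank c1t1d0 c2t1d0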

Neil.

Richard L. Hamilton wrote On 03/23/07 20:45,:

If I create a mirror, presumably if possible I use two or more identically 
sized devices,
since it can only be as large as the smallest.  However, if later I want to 
replace a disk
with a larger one, and detach the mirror (and anything else on the disk), 
replace the
disk (and if applicable repartition it), since it _is_ a larger disk (and/or 
the partitions
will likely be larger since they mustn't be smaller, and blocks per cylinder 
will likely differ,
and partitions are on cylinder boundaries), once I reattach everything, I'll 
now have
two different sized devices in the mirror.  So far, the mirror is still the 
original size.
But what if I later replace the other disks with ones identical to the first 
one I replaced?
With all the devices within the mirror now the larger size, will the mirror and 
the zpool
of which it is a part expand?  And if that won't happen automatically, can it 
(without
inordinate trickery, and online, i.e. without backup and restore) be forced to 
do so?
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: asize is 300MB smaller than lsize - why?

2007-03-24 Thread Neil Perrin



Matthew Ahrens wrote On 03/24/07 12:13,:

Kangurek wrote:


Thanks for info.
My idea was to traverse changing filesystem, now I see that it will 
not work.

I will try to traverse snapshots. Zreplicate will:
1. do snapshot @replicate_leatest and
2. send data to snapshot @replicate_leatest
3. wait X sec   ( X = 20 )
4. remove @replicate_previous,  rename @replicate_latest to 
@replicate_previous

5. repeat from 1.

I'm sure it will work, but taking snapshots will be slow on loaded 
filesystem.

Do you have any idea how to speed up operations on snapshots.
1. remove @replicate_previous
2. rename @replicate_leatest to @replicate_previous
3. create @replicate_leatest



You can avoid the rename by doing:

zfs create @A
again:
zfs destroy @B
zfs create @B
zfs send @A @B
zfs destroy @A
zfs create @A
zfs send @B @A
goto again

I'm not sure exactly what will be slow about taking snapshots, but one 
aspect might be that we have to suspend the intent log (see call to 
zil_suspend() in dmu_objset_snapshot_one()).  I've been meaning to 
change that for a while now -- just let the snapshot have the 
(non-empty) zil header in it, but don't use it (eg. if we rollback or 
clone, explicitly zero out the zil header).  So you might want to look 
into that.


I've always thought the slowness was due to the txg_wait_synced().
I just counted 5 for one snapshot:

[0] $c
zfs`txg_wait_synced+0xc(30005c51dc0, 0, 7aa610d3, 70170800, ...)
zfs`zil_commit_writer+0x34c(30010c55200, 151, 151, 1, 3fe, 7aa84600)
zfs`zil_commit+0x68(30010c55200, 151, 0, 30010c5527c, 151, 0)
zfs`zil_suspend+0xc0(30010c55200, 2a1010db240, 0, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0] $c
zfs`txg_wait_synced+0xc(30005c51dc0, 3, 151, c00431549f, 3fe, 7aa84600)
zfs`zil_destroy+0xc(30010c55200, 0, 0, 30010c5527c, 30014b32e00, 0)
zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400,...)

[0] $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36f8, 30593b0, 1f8, 1f8, 180c000)
zfs`zil_destroy+0x1b0(30010c55200, 0, 701d5760, 30010c5527c, ...)
zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0] $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36f9, 30593b0, 1f8, 1f8, 180c000)
zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, 7aa60700, ...)
zfs`dmu_objset_snapshot+0x100(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0] $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36fa, 30593b0, 1f8, 1f8, 180c000)
zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, ...)
zfs`dsl_sync_task_do+0x28(30005c51dc0, 0, 7aa2d898, 300028f7680,...)
zfs`spa_history_log+0x30(300028f7680, 3000dee1490, 0, 7aa2d800, 1, 18)
zfs`zfs_ioc_pool_log_history+0xd8(7aa64c00, 0, 17, 18, 3000dee1490, 7aa64c00)
zfs`zfsdev_ioctl+0x12c(701cf768, 701cf660, ffbfe850, 108, 701cf400,...)



--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: asize is 300MB smaller than lsize - why?

2007-03-24 Thread Neil Perrin



Matthew Ahrens wrote On 03/24/07 12:36,:

Neil Perrin wrote:

I'm not sure exactly what will be slow about taking snapshots, but 
one aspect might be that we have to suspend the intent log (see call 
to zil_suspend() in dmu_objset_snapshot_one()).  I've been meaning to 
change that for a while now -- just let the snapshot have the 
(non-empty) zil header in it, but don't use it (eg. if we rollback or 
clone, explicitly zero out the zil header).  So you might want to 
look into that.



I've always thought the slowness was due to the txg_wait_synced().
I just counted 5 for one snapshot:



Yeah, well 3 of the 5 are for zil_suspend(), so I think you've proved my 
point :-)


I believe that the one from spa_history_log() will go away with MarkS's 
delegated admin work, leaving just the one actually do it 
txg_wait_synced().


Bottom line, it shouldn be possible to make zfs snapshot take 5x less 
time, without an extraordinary effort.


I'm not sure. Doing one will take the same time as more than one (assuming same 
txg)
but at least one is needed to ensure all transactions prior to the snapshot are 
committed.


Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Size taken by a zfs symlink

2007-04-03 Thread Neil Perrin

Hi Robert,

Robert Milkowski wrote On 04/02/07 17:48,:

Right now a symlink should consume one dnode (320 bytes)


dnode_phys_t are actually 512 bytes:

 ::sizeof dnode_phys_t
sizeof (dnode_phys_t) = 0x200

 if the name it point to is less than 67 bytes, otherwise a data block is 
allocated

additionally to dnode (and more IOs will be needed to read it).
And of course an entry in a directory is needed as for normal file.


- Right

Cheers: Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync

2007-04-26 Thread Neil . Perrin

cedric briner wrote:

You might set zil_disable to 1 (_then_ mount the fs to be
shared). But you're still exposed to OS crashes; those would still 
corrupt your nfs clients.


-r



hello Roch,

I've few questions

1)
from:
   Shenanigans with ZFS flushing and intelligent arrays...
   http://blogs.digitar.com/jjww/?itemid=44
I read :
  Disable the ZIL. The ZIL is the way ZFS maintains _consistency_ until 
it can get the blocks written to their final place on the disk.


This is wrong. The on-disk format is always consistent.
The author of this blog is misinformed and is probably getting
confused with traditional journalling.


That's why the ZIL flushes the cache.


The ZIL flushes its blocks to ensure that if a power failure/panic occurs
then the data the system guarantees to be on stable storage (due, say, to an fsync
or O_DSYNC) is actually on stable storage.

If you don't have the ZIL and a power 
outage occurs, your blocks may go poof in your server's RAM...'cause 
they never made it to the disk Kemosabe.


True, but not blocks, rather system call transactions - as this is what the
ZIL handles.



from :
   Eric Kustarz's Weblog
   http://blogs.sun.com/erickustarz/entry/zil_disable
I read :
   Note: disabling the ZIL does _NOT_ compromise filesystem integrity. 
Disabling the ZIL does NOT cause corruption in ZFS.


then :
   I don't understand: In one they tell that:
- we can lose _consistency_
   and in the other one they say that :
- does not compromise filesystem integrity
   so .. which one is right ?


Eric's, who works on ZFS!




2)
from :
   Eric Kustarz's Weblog
   http://blogs.sun.com/erickustarz/entry/zil_disable
I read:
  Disabling the ZIL is definitely frowned upon and can cause your 
applications much confusion. Disabling the ZIL can cause corruption for 
NFS clients in the case where a reply to the client is done before the 
server crashes, and the server crashes before the data is commited to 
stable storage. If you can't live with this, then don't turn off the ZIL.


then:
   The service that we export with zfs  NFS is not such things as 
databases or some really stress full system, but just exporting home. So 
it feels to me that we can juste disable this ZIL.


3)
from:
   NFS and ZFS, a fine combination
   http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
I read:
   NFS service with risk of corruption of client's side view :

nfs/ufs :  7 sec (write cache enable)
nfs/zfs :  4.2   sec (write cache enable,zil_disable=1)
nfs/zfs :  4.7   sec (write cache disable,zil_disable=1)

Semantically correct NFS service :

nfs/ufs : 17 sec (write cache disable)
nfs/zfs : 12 sec (write cache disable,zil_disable=0)
nfs/zfs :  7 sec (write cache enable,zil_disable=0)

then :
   Does this mean that when you just create an UFS FS, and that you just 
export it with NFS, you are doing an not semantically correct NFS 
service. And that you have to disable the write cache to have an correct 
NFS server ???


Yes. UFS requires the write cache to be disabled to maintain consistency.



4)
so can we say that people used to have an NFS with risk of corruption of 
client's side view can just take ZFS and disable the ZIL ?


I suppose but we aim to strive for better than expected corruption.
We (ZFS) recommend not disabling the ZIL.
We also recommend not disabling the disk write cache flushing unless they are
backed by nvram or UPS.



thanks in advance for your clarifications

Ced.
P.-S. Does some of you know the best way to send an email containing 
many questions inside it ? Should I create a thread for each of them, 
the next time


This works.

- Good questions.

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovered state after system crash

2007-05-04 Thread Neil Perrin

kyusun Chang wrote On 05/04/07 19:34,:

If system crashes some time after last commit of transaction group (TxG), what
happens to the file system transactions since the last commit of TxG


They are lost, unless they were synchronous (see below).


(I presume last commit of TxG represents the last on-disk consistency)?


Correct.


Does ZFS recover all file system transactions which it returned with success
since the last commit of TxG, which implis that ZIL must flush log records for
 each successful file system transaction before it returns to caller so that 
it can replay

the filesystem transactions?


Only synchronous transactions (those forced by O_DSYNC or fsync()) are
written to the intent log.


Blogs on ZIL states (I hope I read it right) that log records are maintained
in-memory and flushed to disk only when 
1) at synchronous write request (does that mean they free in-memory

log after that),


Yes they are then freed in memory


2) when TxG is committed (and free in-memory log).

Thank you for your time.
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] does every fsync() require O(log n) platter-writes?

2007-05-06 Thread Neil . Perrin

Adam Megacz wrote:

After reading through the ZFS slides, it appears to be the case that
if ZFS wants to modify a single data block, if must rewrite every
block between that modified block and the uberblock (root of the tree).

 Is this really the case?

That is true when committing the transaction group to the main pool
every 5 seconds. However, this isn't so bad, as a lot of transactions are
committed which likely have common roots, and writes are aggregated
and striped across the pool etc...


If so, does this mean that every commit
operation (ie every fsync()) in ZFS requires O(log n) platter writes?


The ZIL does not modify the main pool. It only writes system call
transactions related to the file being fsynced and any other transactions that
might relate to that file (eg mkdir, rename). Writes for these transactions are
also aggregated and written using a block size tailored to fit the data. Typically
for a single system call just one write occurs. On a system crash or power fail
those ZIL transactions are replayed.
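If you want to see which applications are forcing those log writes,
something like this (assuming zil_commit is traceable on your build) gives
a per-process count:

# dtrace -n 'fbt:zfs:zil_commit:entry { @[execname] = count(); }'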

See also:
http://blogs.sun.com/perrin

Neil.



Thanks,

  - a



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: does every fsync() require O(log n) platter-writes?

2007-05-06 Thread Neil . Perrin

Adam Megacz wrote:

Ah, okay.  The slides I read said that in ZFS there is no journal --
not needed (slide #9):

  http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf

I guess the slides are out of date in light of the ZFS Intent Log
journal?


Yes, I can understand your confusion. Technically the intent log is not a 
journal.
A journal has to be replayed to get meta data consistency of the fs.
UFS logging, EXT3 and VXFS all use journals. For perf reasons user data is
typically not logged leading to user data inconsistency.

On the other hand, the zfs pool is always consistent whether or not the
intent log is replayed.



Anyways, it all makes sense now.  Without a journal, you'd need to
perform the operation on slide #11 for every fsync(), which would be a
major performance problem.  With a journal, you don't need to do this.

Great work, guys...


- Thanks Adam.



  - a



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: How does ZFS write data to disks?

2007-05-11 Thread Neil . Perrin

lonny wrote:

On May 11, 2007, at 9:09 AM, Bob Netherton wrote:

>> On Fri, 2007-05-11 at 09:00 -0700, lonny wrote:
>> I've noticed a similar behavior in my writes. ZFS seems to write in bursts of
>> around 5 seconds. I assume it's just something to do with caching?

> Yep - the ZFS equivalent of fsflush.  Runs more often so the pipes don't
> get as clogged.   We've had lots of rain here recently, so I'm sort of
> sensitive to stories of clogged pipes.
>
>> Is this behavior ok? seems it would be better to have the disks writing
>> the whole time instead of in bursts.
>
> Perhaps - although not in all cases (probably not in most cases).
> Wouldn't it be cool to actually do some nice sequential writes to
> the sweet spot of the disk bandwidth curve, but not depend on it
> so much that a single random I/O here and there throws you for
> a loop ?
>
> Human analogy - it's often more wise to work smarter than harder :-)
>
> Directly to your question - are you seeing any anomalies in file
> system read or write performance (bandwidth or latency) ?

> Bob


No performance problems so far, the thumper and zfs seem to handle everything 
we throw at them. On the T2000 internal disks we were seeing a bottleneck when 
using a single disk for our apps but moving to a 3 disk raidz alleviated that.

The only issue is when using iostat commands the bursts make it a little harder 
to gauge performance. Is it safe to assume that if those bursts were to reach 
the upper performance limit that it would spread the writes out a bit more?


The burst of activity every 5 seconds is when the transaction group is 
committed.
Batching up the writes in this way can lead to a number of efficiencies (as Bob 
hinted).
With heavier activity the writes will not get spread out, but will just take
longer.
Another way to look at the gaps of IO inactivity is that they indicate 
underutilisation.
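One way to watch this at the pool level rather than per-disk (the pool name
is just an example):

# zpool iostat tank 1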


Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS and Tar/Star Performance

2007-06-12 Thread Neil . Perrin

eric kustarz wrote:


Over NFS to non-ZFS drive
-
tar xfvj linux-2.6.21.tar.bz2
real5m0.211s,user0m45.330s,sys 0m50.118s

star xfv linux-2.6.21.tar.bz2
real3m26.053s,user0m43.069s,sys 0m33.726s

star -no-fsync -x -v -f linux-2.6.21.tar.bz2
real3m55.522s,user0m42.749s,sys 0m35.294s

It looks like ZFS is the culprit here.  The untarring is much  faster 
to a single 80 GB UFS drive than a 6 disk raid-z array over  NFS.




Comparing a ZFS pool made out of a single disk to a single UFS  
filesystem would be a fair comparison.


Right, and to be fairer you need to ensure the disk write cache is disabled
(format -e) when testing ufs (as ufs does no flushing of the cache).
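From memory the sequence under format -e is something like the following
(menu names may differ slightly by release):

# format -e
(select the disk, then: cache -> write_cache -> disable)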
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Difference between add and attach a device?

2007-06-14 Thread Neil . Perrin

Rick Mann wrote:

Hi. I've been reading the ZFS admin guide, and I don't understand the distinction between 
adding a device and attaching a device to a pool?


attach is used to create or add a side to a mirror.
add is to add a new top level vdev where that can be a raidz, mirror
or single device. Writes are spread across top level vdevs.
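For example (pool and device names are just placeholders):

# zpool attach tank c0t0d0 c0t1d0      (attach: c0t1d0 becomes a mirror of c0t0d0)
# zpool add tank mirror c0t2d0 c0t3d0  (add: a new top-level mirror vdev is created)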

Hope that helps. Perhaps the zpool man page is clearer.

 Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on user specified devices?

2007-06-22 Thread Neil Perrin

Bryna,

Your timing is excellent! We've been working on this for a while now and
hopefully within the next day I'll be adding support for separate log
devices into Nevada.

I'll send out more details soon...

Neil.

Bryan Wagoner wrote:

Quick question,

Are there any tunables, or is there any way to specify devices in a pool to use for the ZIL specifically? I've been thinking through architectures to mitigate performance problems on SAN and various other storage technologies where disabling ZIL or cache flushes has been necessary to make up for performance and was  wondering if there would be a way to specify a specific device or set of devices for the ZIL to use separate of the data devices so I wouldn't have to disable it in those circumstances. 


Thanks in advance!
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Drive Failure w/o Redundancy

2007-06-27 Thread Neil Perrin



Darren Dunham wrote:

The problem I've come across with using mirror or raidz for this setup
is that (as far as I know) you can't add disks to mirror/raidz groups,
and if you just add the disk to the pool, you end up in the same
situation as above (with more space but no redundancy).


You can't add to an existing mirror, but you can add new mirrors (or
raidz) items to the pool.  If so, there's no loss of redundancy.


Maybe I'm missing some context, but you can add to an existing mirror
- see zpool attach.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log

2007-07-07 Thread Neil Perrin
Cyril,

I wrote this case and implemented the project. My problem was
that I didn't know what policy (if any) Sun has about publishing
ARC cases, and a mail log with a gazillion email addresses.

I did receive an answer to this this in the form:

http://www.opensolaris.org/os/community/arc/arc-faq/arc-publish-historical-checklist/

Never having done this it seems somewhat burdensome, and will take some time.

Sorry, for the slow response and lack of feedback. Are there
any particular questions you have about separate intent logs
that I can answer before I embark on the process?

Neil.

Cyril Plisko wrote:
 Hello,
 
 This is a third request to open the materials of the PSARC case
 2007/171 ZFS Separate Intent Log
 I am not sure why two previous requests were completely ignored
 (even when seconded by another community member).
 In any case that is absolutely unaccepted practice.
 
 
 
 On 6/30/07, Cyril Plisko [EMAIL PROTECTED] wrote:
 Hello !

 I am adding zfs-discuss as it directly relevant to this community.

 On 6/23/07, Cyril Plisko [EMAIL PROTECTED] wrote:
 Hi,

 can the materials of the above be open for the community ?

 --
 Regards,
 Cyril


 --
 Regards,
 Cyril

 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log

2007-07-09 Thread Neil Perrin


Cyril Plisko wrote:
 On 7/7/07, Neil Perrin [EMAIL PROTECTED] wrote:
 Cyril,

 I wrote this case and implemented the project. My problem was
 that I didn't know what policy (if any) Sun has about publishing
 ARC cases, and a mail log with a gazillion email addresses.

 I did receive an answer to this this in the form:

 http://www.opensolaris.org/os/community/arc/arc-faq/arc-publish-historical-checklist/
  


 Never having done this it seems somewhat burdensome, and will take 
 some time.
 
 Neil,
 
 I am glad the message finally got through.
 
 It seems to me that the URL above refers to the publishing
 materials of *historical* cases. Do you think the case in hand
 should be considered historical ?

Yes, this was what I was asked to do. Looking more closely it doesn't look
too bad. I'll start this process.

 
 Anyway, many ZFS related cases were openly reviewed from
 the moment zero of their life, why this one was an exception ?

There's no good reason. Certainly the ideas had been kicked around
on the alias, but I agree there was no specific proposal and
call for discussion.

 

 Sorry, for the slow response and lack of feedback. Are there
 any particular questions you have about separate intent logs
 that I can answer before I embark on the process?
 
 Well, that only question I have now is what is it all about ?
 It is hard to ask question without access to case materials,
 right ?

So I've attached the accepted proposal. There was (as expected) not
much discussion of this case as it was considered an obvious extension.
The actual psarc case materials when opened will not have much more info
than this.

Hope this helps: Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log

2007-07-09 Thread Neil Perrin

Er with attachment this time.



So I've attached the accepted proposal. There was (as expected) not
much discussion of this case as it was considered an obvious extension.
The actual psarc case materials when opened will not have much more info
than this.
PSARC CASE: 2007/171 ZFS Separate Intent Log 

SUMMARY:

This is a proposal to allow separate devices to be used
for the ZFS Intent Log (ZIL). The sole purpose of this is
performance. The devices can be disks, solid state drives,
nvram drives, or any device that presents a block interface.

PROBLEM:

The ZIL satisfies the synchronous requirements of POSIX.
For instance, databases often require their
transactions to be on stable storage on return from the system
call.  NFS and other applications can also use fsync() to ensure
data stability. The speed of the ZIL is therefore essential in
determining the latency of writes for these critical applications.

Currently the ZIL is allocated dynamically from the pool.
It consists of a chain of varying block sizes which are
anchored in fixed objects. Blocks are sized to fit the
demand and will come from different metaslabs and thus
different areas of the disk. This causes more head movement.

Furthermore, the log blocks are freed as soon as the intent
log transaction (system call) is committed. So a swiss cheesing
effect can occur leading to pool fragmentation.

PROPOSED SOLUTION:

This proposal takes advantage of the greatly faster media speeds
of nvram, solid state disks, or even dedicated disks.
To this end, additional extensions to the zpool command
are defined:

zpool create <pool> <pool devices> log <log devices>
Creates a pool with a separate log. If more than one
log device is specified then writes are load-balanced
between devices. It's also possible to mirror log
devices. For example a log consisting of
two sets of two mirrors could be created thus:

zpool create <pool> <pool devices> \
log mirror c1t8d0 c1t9d0 mirror c1t10d0 c1t11d0

A raidz/raidz2 log is not supported

zpool add <pool> log <log devices>
Creates a separate log if it doesn't exist, or 
adds extra devices if it does.

zpool remove <pool> <log devices>
Remove the log devices. If all log devices are removed
we revert to placing the log in the pool.  Evacuating a
log is easily handled by ensuring all txgs are committed.

zpool replace <pool> <old log device> <new log device>
Replace old log device with new log device.

zpool attach <pool> <log device> <new log device>
Attaches a new log device to an existing log device. If
the existing device is not a mirror then a 2 way mirror
is created. If device is part of a two-way log mirror,
attaching new_device creates a three-way log mirror,
and so on.

zpool detach <pool> <log device>
Detaches a log device from a mirror.

zpool status
Additionally displays the log devices

zpool iostat
Additionally shows IO statistics for log devices.

zpool export/import
Will export and import the log devices.

When a separate log that is not mirrored fails then
logging will start using chained logs within the main pool.

The name log will become a reserved word. Attempts to create
a pool with the name log will fail with:

cannot create 'log': name is reserved
 pool name may have been omitted

Hot spares cannot replace log devices.
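For example, to retrofit a log onto an existing pool (pool and device names
hypothetical), on a build with slog support:

# zpool add tank log c2t0d0
# zpool status tank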

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] separate intent log blog

2007-07-18 Thread Neil Perrin
I wrote up a blog on the separate intent log called slog blog
which describes the interface; some performance results; and
general status:

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] separate intent log blog

2007-07-18 Thread Neil Perrin


Albert Chin wrote:
 On Wed, Jul 18, 2007 at 01:29:51PM -0600, Neil Perrin wrote:
 I wrote up a blog on the separate intent log called slog blog
 which describes the interface; some performance results; and
 general status:

 http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
 
 So, how did you get a pci Micro Memory pci1332,5425 card :) I
 presume this is the PCI-X version.

I wasn't involved in the acquisition but was just sent one internally
for testing. Yes, it's PCI-X. I assume you're asking because they cannot
(or can no longer) be obtained?

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] separate intent log blog

2007-07-27 Thread Neil Perrin
Adolf,

Yes, there was a separate driver that I believe came from Micro
Memories. I installed it from a package, umem_Sol_Drv_Cust_i386_v01_10.pkg.
I just used pkgadd on it and it just worked. Sorry, I don't know if it's
publicly available or will even work for your device.

I gave details of that device for completeness. I was hoping
it would be representative of any NVRAM. I wasn't
intending to endorse its use, although it does seem fast.
Hardware availability and access to drivers is indeed
an issue.

256M is not a lot of NVRAM - the device I tested had 1GB.
If you have a lot of synchronous transactions then you could
exceed the 256MB and overflow into the slower main pool.

Neil.

Adolf Hohl wrote:
 Hi,
 
 what is necessary to get it working from the solaris side.
 Is a driver on board or is there no special one needed? 
 I just got a packed MM-5425CN with 256M.
 However i am lacking a pci-x 64bit connector and not sure
 if it is worth the whole effort for my personal purposes.
 
 Any comment are very appreciated
 
 -ah
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, ZIL, vq_max_pending and OSCON

2007-08-07 Thread Neil Perrin
Jay,

Slides look good, though I'm not sure what you say along
with "Filthy lying" on slide 22 related to the ZIL, or
slide 27 which has "Worst Feature - thinks hardware is stupid".

Anyway I have some comments on http://www.meangrape.com/2007/08/oscon-zfs
You say:
---
Records in the ZIL are discarded in a number of circumstances:

* a DMU transaction group completes and is committed to stable storage
* a write flagged O_DSYNC completes
* an fsync() call is completed
* a ZFS filesystem is successfully unmounted


Your first bullet is correct: in-memory and stable storage intent log
records are discarded when the dmu transaction group is committed
to stable storage. However, this is the only time they are discarded.

A O_DSYNC or fsync will cause in-memory records to be written to the
stable storage intent log. When unmounting, if there are any uncommitted
transactions we wait for that DMU transaction group to commit.

Most of this is explained in:

http://blogs.sun.com/perrin/entry/the_lumberjack

Hope that helps: Neil.

Jay Edwards wrote:
 The slides from my ZFS presentation at OSCON (as well as some additional 
 information) are available at _http://www.meangrape.com/2007/08/oscon-zfs/_
 
 Jay Edwards
 [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
 http://www.meangrape.com
 
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs iscsi storage for virtual machines

2007-08-08 Thread Neil Perrin

 How does ZFS handle snapshots of large files like VM images? Is 
 replication done on the bit/block level or by file? In otherwords, does 
 a snapshot of a changed VM image take up the same amount of space as the 
 image or only the amount of space of the bits that have changed within 
 the image?  

ZFS uses Copy On Write to implement snapshots.
No replication is done. When changes are made only the
blocks changed are different (the originals are kept by the
snapshot).
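A rough illustration (dataset names hypothetical):

# zfs snapshot tank/vm@before
(run the VM for a while so some blocks of the image are overwritten)
# zfs list tank/vm tank/vm@before

The snapshot's USED column then shows only the space held by the old copies
of the changed blocks, not another full copy of the image.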

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Serious ZFS problems

2007-09-06 Thread Neil Perrin


Tim Spriggs wrote:
 Hello,
 
 I think I have gained sufficient fool status for testing the 
 fool-proof-ness of zfs. I have a cluster of T1000 servers running 
 Solaris 10 and two x4100's running an OpenSolaris dist (Nexenta) which 
 is at b68. Each T1000 hosts several zones each of which has its own 
 zpool associated with it. Each zpool is a mirrored configuration between 
 and IBM N series Nas and another OSOL box serving iscsi from zvols. To 
 move zones around, I move the zone configuration and then move the zpool 
 from one T1000 to another and bring the zone up. Now for the problem.
 
 For sake of brevity:
 
 T1000-1: zpool export pool1
 T1000-2: zpool export pool2
 T1000-3: zpool import -f pool1
 T1000-4: zpool import -f pool2
 and other similar operations to move zone data around.
 
 Then I 'init 6'd all the T1000s. The reason for the init 6 was so that 
 all of the pools would completely let go of the iscsi luns so I can 
 remove static-configurations from each T1000.
 
 upon reboot, pool1 has the following problem:
 
 WARNING: can't process intent log for pool1

During pool startup (spa_load()) zil_claim() is called on
each dataset in the pool and the first thing it tries to do is
open the dataset (dmu_objset_open()). If this fails then the
"can't process intent log..." message is printed. So you have a pretty
serious pool consistency problem.

I guess more information is needed. Running zdb on the pool would
be useful, or zdb -l <device> to display the labels (on an exported pool).
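For example (the device path is just a placeholder):

# zdb pool1
# zdb -l /dev/rdsk/c1t0d0s0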

 
 and then attempts to export the pool fail with:
 
 cannot open 'pool1': I/O error
 
 
 pool2 can consistently make a T1000 (Sol1) kernel panic when imported. 
 It will also make an x4100 panic (osol)
 
 
 Any ideas?
 
 Thanks in advance.
 -Tim
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mixing SATA PATA Drives

2007-09-17 Thread Neil Perrin
Yes performance will suffer, but it's a bit difficult to say by how much.
Both pool transaction group writes and zil writes are spread across 
all devices. It depends on what applications you will run as to how much
use is made of the zil. Maybe you should experiment and see if performance
is good enough.

Neil.

Tim Spriggs wrote:
 I'm far from an expert but my understanding is that the zil is spread 
 across the whole pool by default so in theory the one drive could slow 
 everything down. I don't know what it would mean in this respect to keep 
 the PATA drive as a hot spare though.
 
 -Tim
 
 Christopher Gibbs wrote:
 Anyone?

 On 9/14/07, Christopher Gibbs [EMAIL PROTECTED] wrote:
   
 I suspect it's probably not a good idea but I was wondering if someone
 could clarify the details.

 I have 4 250G SATA(150) disks and 1 250G PATA(133) disk.  Would it
 cause problems if I created a raidz1 pool across all 5 drives?

 I know the PATA drive is slower so would it slow the access across the
 whole pool or just when accessing that disk?

 Thanks for your input.

 - Chris

 

   
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zfs log device (zil) ever coming to Sol10?

2007-09-18 Thread Neil Perrin
Separate log devices (slogs) didn't make it into S10U4 but will be in U5.

Andy Lubel wrote:
 I think we are very close to using zfs in our production environment..  Now
 that I have snv_72 installed and my pools set up with NVRAM log devices
 things are hauling butt.
 
 I've been digging to find out whether this capability would be put into
 Solaris 10, does anyone know?
 
 If not, then I guess we can probably be OK using SXCE (as Joyent did).
 
 Thanks,
 
 Andy Lubel
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zfs log device (zil) ever coming to Sol10?

2007-09-18 Thread Neil Perrin


Matty wrote:
 On 9/18/07, Neil Perrin [EMAIL PROTECTED] wrote:
 
 Separate log devices (slogs) didn't make it into S10U4 but will be in U5.
 
 This is awesome! Will the SYNC_NV support that was integrated this
 week be added to update 5 as well? That would be super useful,
 assuming the major arrays vendors support it.

I believe it will. So far we have just batched up all the
bug fixes and enhancements in ZFS and all of them are integrated
into the next update. It's easier for us that way as well.

Actually, the part of "we" is not usually played by me!

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] enlarge a mirrored pool

2007-10-12 Thread Neil Perrin


Erik Trimble wrote:
 Ivan Wang wrote:
 Hi all,

 Forgive me if this is a dumb question. Is it possible for a two-disk 
 mirrored zpool to be seamlessly enlarged by gradually replacing previous 
 disk with larger one?

 Say, in a constrained desktop, only space for two internal disks is 
 available, could I just begin with two 160G disks, then at some time, 
 replace one of the 160G with 250G, resilvering, then replace another 160G, 
 and finally get a two-disk 250G mirrored pool?

 Cheers,
 Ivan.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 Yes.
 
 After both drives are replaced, you will automatically see the 
 additional space.

I believe currently after the last replace an import/export sequence
is needed to force zfs to see the increased size.

Neil.
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] characterizing I/O on a per zvol basis.

2007-10-18 Thread Neil Perrin
I don't know of any way to observe IOPS per zvol and I believe
this would be tricky. Any writes/reads from individual datasets (filesystems
and zvols) will go through the pipeline and can fan out to multiple
mirrors or raidz or be striped across devices. Block writes will be
combined and pushed out in transaction groups, but if synchronous will also
have separate (and possibly multiple) intent log writes. Reads if not cached
can similarly come from multiple locations. The individual IOs are not tagged
with the dataset(s) they are servicing.

It would be easier to observe the byte count and read/write request count
for a zvol using dtrace.
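
For example, something along these lines gives request and byte counts
(a rough sketch: it assumes zvol_strategy() is the entry point for zvol
block I/O and that fbt argument types are available on your build; fbt
probes are unstable and can change between releases):

# count zvol I/O requests and sum the bytes moved, split by read/write
# (0x40 is B_READ from sys/buf.h; macros aren't visible to a D one-liner)
dtrace -n '
fbt::zvol_strategy:entry
{
        this->dir = (args[0]->b_flags & 0x40) ? "read" : "write";
        @ops[this->dir] = count();
        @bytes[this->dir] = sum(args[0]->b_bcount);
}'

Keying the aggregations on args[0]->b_edev as well would split the
counts per zvol device.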

Neil.


Nathan Kroenert wrote:
 Hey all -
 
 Time for my silly question of the day, and before I bust out vi and 
 dtrace...
 
 If there a simple, existing way I can observe the read / write / IOPS on 
 a per-zvol basis?
 
 If not, is there interest in having one?
 
 Cheers!
 
 Nathan.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL reliability/replication questions

2007-10-18 Thread Neil Perrin


Scott Laird wrote:
 I'm debating using an external intent log on a new box that I'm about
 to start working on, and I have a few questions.
 
 1.  If I use an external log initially and decide that it was a
 mistake, is there a way to move back to the internal log without
 rebuilding the entire pool?

It's not currently possible to remove a separate log.
This was working once, but was stripped out until the
more generic 'zpool remove' of devices is provided.
This is bug 6574286:

http://bugs.opensolaris.org/view_bug.do?bug_id=6574286 

 2.  What happens if the logging device fails completely?  Does this
 damage anything else in the pool, other then potentially losing
 in-flight transactions?

This should work. It shouldn't even lose the in-flight transactions.
ZFS reverts to using the main pool if a slog write fails or the
slog fills up.

 3.  What about corruption in the log?  Is it checksummed like the rest of ZFS?

Yes it's checksummed, but the checksumming is a bit different
from the pool blocks in the uberblock tree.

See also:
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

 
 Thanks.
 
 
 Scott
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL reliability/replication questions

2007-10-18 Thread Neil Perrin


Scott Laird wrote:
 On 10/18/07, Neil Perrin [EMAIL PROTECTED] wrote:

 Scott Laird wrote:
 I'm debating using an external intent log on a new box that I'm about
 to start working on, and I have a few questions.

 1.  If I use an external log initially and decide that it was a
 mistake, is there a way to move back to the internal log without
 rebuilding the entire pool?
 It's not currently possible to remove a separate log.
 This was working once, but was stripped out until the
 more generic zpool remove devices was provided.
 This is bug 6574286:

 http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
 
 Okay, so hopefully it'll work in a couple quarters?

It's not being worked on currently but hopefully will be fixed
in 6 months.
 
 2.  What happens if the logging device fails completely?  Does this
 damage anything else in the pool, other then potentially losing
 in-flight transactions?
 This should work. It shouldn't even lose the in-flight transactions.
 ZFS reverts to using the main pool if a slog write fails or the
 slog fills up.
 
 So, the only way to lose transactions would be a crash or power loss,
 leaving outstanding transactions in the log, followed by the log
 device failing to start up on reboot?  I assume that that would that
 be handled relatively cleanly (files have out of data data), as
 opposed to something nasty like the pool fails to start up.

I just checked on the behaviour of this. The log is treated as part
of the main pool. If it is not replicated and disappears then the pool
can't be opened - just like any unreplicated device in the main pool.
If the slog is found but can't be opened or is corrupted then the
pool will be opened but the slog isn't used.
This seems a bit inconsistent.

 
 3.  What about corruption in the log?  Is it checksummed like the rest of 
 ZFS?
 Yes it's checksummed, but the checksumming is a bit different
 from the pool blocks in the uberblock tree.

 See also:
 http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
 
 That started this whole mess :-).  I'd like to try out using one of
 the Gigabyte SATA ramdisk cards that are discussed in the comments.

A while ago there was a comment on this alias that these cards
weren't purchasable. Unfortunately, I don't know what is available.

 It supposedly has 18 hours of battery life, so a long-term power
 outage would kill the log.  I could reasonably expect one 18+ hour
 power outage over the life of the filesystem.  I'm fine with losing
 in-flight data (I'd expect the log to be replayed before the UPS shuts
 the system down anyway), but I'd rather not lose the whole pool or
 something extreme like that.
 
 I'm willing to trade the chance of some transaction losses during an
 exceptional event for more performance, but I'd rather not have to
 pull out the backups if I can ever avoid it.
 
 
 Scott
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL reliability/replication questions

2007-10-18 Thread Neil Perrin


Scott Laird wrote:
 On 10/18/07, Neil Perrin [EMAIL PROTECTED] wrote:
 So, the only way to lose transactions would be a crash or power loss,
 leaving outstanding transactions in the log, followed by the log
 device failing to start up on reboot?  I assume that that would that
 be handled relatively cleanly (files have out of data data), as
 opposed to something nasty like the pool fails to start up.
 I just checked on the behaviour of this. The log is treated as part
 of the main pool. If it is not replicated and disappears then the pool
 can't be opened - just like any unreplicated device in the main pool.
 If the slog is found but can't be opened or is corrupted then then the
 pool will be opened but the slog isn't used.
 This seems a bit inconsistent.
 
 Hmm, yeah.  What would happen if I mirrored the ramdisk with a hard
 drive?  Would ZFS block until the data's stable on both devices, or
 would it continue once the write is complete on the ramdisk?

ZFS ensures all mirror sides have the data before returning.
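
For example (device names made up), a pool can be created with the log
mirrored across the ramdisk card and an ordinary disk:

# log writes return only when both sides of the log mirror have the data
zpool create tank mirror c0t0d0 c0t1d0 log mirror c2t0d0 c3t0d0

Note the synchronous write latency is then bounded by the slower side of
the log mirror, so you give back much of the ramdisk's speed in exchange
for surviving its battery running out.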


 Failing that, would replacing the missing log with a blank device let
 me bring the pool back up, or would it be dead at that point?

Replacing the device would work:

: mull ; mkfile 100m /p1 /p2
: mull ; zpool create whirl /p1 log /p2
: mull ; echo abc > /whirl/f
: mull ; sync
: mull ; rm /p2
: mull ; sync
reset system
: mull ; zpool status
  pool: whirl
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
whirl   UNAVAIL  0 0 0  insufficient replicas
  /p1   ONLINE   0 0 0
logsUNAVAIL  0 0 0  insufficient replicas
  /p2   UNAVAIL  0 0 0  cannot open
: mull ; mkfile 100m /p2 /p3
: mull ; zpool online whirl /p2
warning: device '/p2' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
: mull ; zpool status
  pool: whirl
 state: ONLINE
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
whirl   ONLINE   0 0 0
  /p1   ONLINE   0 0 0
logsONLINE   0 0 0
  /p2   UNAVAIL  0 0 0  corrupted data

errors: No known data errors
: mull ; zpool replace whirl /p2 /p3
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

NAME STATE READ WRITE CKSUM
whirlONLINE   0 0 0
  /p1ONLINE   0 0 0
logs ONLINE   0 0 0
  replacing  ONLINE   0 0 0
/p2  UNAVAIL  0 0 0  corrupted data
/p3  ONLINE   0 0 0

errors: No known data errors
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

NAMESTATE READ WRITE CKSUM
whirl   ONLINE   0 0 0
  /p1   ONLINE   0 0 0
logsONLINE   0 0 0
  /p3   ONLINE   0 0 0

errors: No known data errors
: mull ; zfs mount
: mull ; zfs mount -a
: mull ; cat /whirl/f
abc
: mull ;

 
 3.  What about corruption in the log?  Is it checksummed like the rest of 
 ZFS?
 Yes it's checksummed, but the checksumming is a bit different
 from the pool blocks in the uberblock tree.

 See also:
 http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
 That started this whole mess :-).  I'd like to try out using one of
 the Gigabyte SATA ramdisk cards that are discussed in the comments.
 A while ago there was a comment on this alias that these cards
 weren't purchasable. Unfortunately, I don't know what is available.
 
 The umem one is unavailable, but the Gigabyte model is easy to find.
 I had Amazon overnight one to me, it's probably sitting at home right
 now.

Cool let us know how it goes.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-16 Thread Neil Perrin
Joe,

I don't think adding a slog helped in this case. In fact I
believe it made performance worse. Previously the ZIL would be 
spread out over all devices but now all synchronous traffic
is directed at one device (and everything is synchronous in NFS).
Mind you, 15MB/s seems a bit on the slow side - especially if
cache flushing is disabled.

It would be interesting to see what all the threads are waiting
on. I think the problem may be that everything is backed
up waiting to start a transaction because the txg train is
slow due to NFS requiring the ZIL to push everything synchronously.

Neil.

Joe Little wrote:
 I have historically noticed that in ZFS, when ever there is a heavy
 writer to a pool via NFS, the reads can held back (basically paused).
 An example is a RAID10 pool of 6 disks, whereby a directory of files
 including some large 100+MB in size being written can cause other
 clients over NFS to pause for seconds (5-30 or so). This on B70 bits.
 I've gotten used to this behavior over NFS, but didn't see it perform
 as such when on the server itself doing similar actions.
 
 To improve upon the situation, I thought perhaps I could dedicate a
 log device outside the pool, in the hopes that while heavy writes went
 to the log device, reads would merrily be allowed to coexist from the
 pool itself. My test case isn't ideal per se, but I added a local 9GB
 SCSI (80) drive for a log, and added to LUNs for the pool itself.
 You'll see from the below that while the log device is pegged at
 15MB/sec (sd5),  my directory list request on devices sd15 and sd16
 never are answered. I tried this with both no-cache-flush enabled and
 off, with negligible difference. Is there anyway to force a better
 balance of reads/writes during heavy writes?
 
  extended device statistics
 devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b
 fd0   0.00.00.00.0  0.0  0.00.0   0   0
 sd0   0.00.00.00.0  0.0  0.00.0   0   0
 sd1   0.00.00.00.0  0.0  0.00.0   0   0
 sd2   0.00.00.00.0  0.0  0.00.0   0   0
 sd3   0.00.00.00.0  0.0  0.00.0   0   0
 sd4   0.00.00.00.0  0.0  0.00.0   0   0
 sd5   0.0  118.00.0 15099.9  0.0 35.0  296.7   0 100
 sd6   0.00.00.00.0  0.0  0.00.0   0   0
 sd7   0.00.00.00.0  0.0  0.00.0   0   0
 sd8   0.00.00.00.0  0.0  0.00.0   0   0
 sd9   0.00.00.00.0  0.0  0.00.0   0   0
 sd10  0.00.00.00.0  0.0  0.00.0   0   0
 sd11  0.00.00.00.0  0.0  0.00.0   0   0
 sd12  0.00.00.00.0  0.0  0.00.0   0   0
 sd13  0.00.00.00.0  0.0  0.00.0   0   0
 sd14  0.00.00.00.0  0.0  0.00.0   0   0
 sd15  0.00.00.00.0  0.0  0.00.0   0   0
 sd16  0.00.00.00.0  0.0  0.00.0   0   0
...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-19 Thread Neil Perrin


Roch - PAE wrote:
 Neil Perrin writes:
   
   
   Joe Little wrote:
On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote:
Joe,
   
I don't think adding a slog helped in this case. In fact I
believe it made performance worse. Previously the ZIL would be
spread out over all devices but now all synchronous traffic
is directed at one device (and everything is synchronous in NFS).
Mind you 15MB/s seems a bit on the slow side - especially is
cache flushing is disabled.
   
It would be interesting to see what all the threads are waiting
on. I think the problem maybe that everything is backed
up waiting to start a transaction because the txg train is
slow due to NFS requiring the ZIL to push everything synchronously.
   

I agree completely. The log (even though slow) was an attempt to
isolate writes away from the pool. I guess the question is how to
provide for async access for NFS. We may have 16, 32 or whatever
threads, but if a single writer keeps the ZIL pegged and prohibiting
reads, its all for nought. Is there anyway to tune/configure the
ZFS/NFS combination to balance reads/writes to not starve one for the
other. Its either feast or famine or so tests have shown.
   
   No there's no way currently to give reads preference over writes.
   All transactions get equal priority to enter a transaction group.
   Three txgs can be outstanding as we use a 3 phase commit model:
   open; quiescing; and syncing.
 
 That makes me wonder if this is not just the lack of write
 throttling issue. If one txg is syncing and the other is
 quiesced out, I think it means we have let in too many
 writes. We do need a better balance.
 
 Neil is  it correct that  reads never hit txg_wait_open(), but
 they just need an I/O scheduler slot ?

Yes, they don't modify any meta data (except access time which is
handled separately). I'm less clear about what happens further
down in the DMU and SPA.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write frequency

2007-11-28 Thread Neil Perrin


Ajay Kumar wrote:
 IHAC who would like to understand following:
 
 We've upgraded a box to sol10-u4 and created a ZFS pool.  We notice that 
 running zfs iostat 1 or iostat -xnz 1, the data gets written to disk 
 every 5 seconds, even though the data is being copied to the filesystem 
 continuously.
 
   This behavior is different than UFS as UFS continuously writes. So, 
 what's with the 5 second pause?

ZFS creates transactions for system calls that modify the pool.
For efficiency it gathers together individual transactions into transaction
groups (txgs) which are committed every 5 seconds.

If you are seeing some constant background write activity then that
is probably due to synchronous writes which require data be stable on
return from the system call. These are written on demand to an intent
log. 
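
One way to check that (a rough sketch - fbt probes are unstable and can
change between builds) is to count intent log commits per process:

# count zil_commit() calls by process name over 10 seconds
dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }
           tick-10s { exit(0); }'

Anything that shows up there is issuing synchronous writes (fsync,
O_DSYNC, etc.).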

 
 Any clarification will be appreciated.
 
 Thank you
 Ajay
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bugid 6535160

2007-12-14 Thread Neil Perrin
Vincent Fox wrote:
 So does anyone have any insight on BugID 6535160?
 
 We have verified on a similar system, that ZFS shows big latency in filebench 
 varmail test.
 
 We formatted the same LUN with UFS and latency went down from 300 ms to 1-2 
 ms.

This is such a big difference it makes me think something else is going on.
I suspect one of two possible causes:

A) The disk write cache is enabled and volatile. UFS knows nothing of write caches
   and requires the write cache to be disabled, otherwise corruption can occur.
B) The write cache is non volatile, but ZFS hasn't been configured
   to stop flushing it (set zfs:zfs_nocacheflush = 1).
   Note, ZFS enables the write cache and will flush it as necessary.

 
 http://sunsolve.sun.com/search/document.do?assetkey=1-1-6535160-1
 
 We run Solaris 10u4 on our production systems, don't see any indication
 of a patch for this.
 
 I'll try downloading recent Nevada build and load it on same system and see
 if the problem has indeed vanished post snv_71.

Yes please try this. I think it will make a difference but the delta
will be small.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] copy on write related query

2008-01-06 Thread Neil Perrin


sudarshan sridhar wrote:
 I'm not quite sure what you're asking here. Data, whether newly written or
 copy-on-write, goes to a newly allocated block, which may reside on any
 vdev, and will be spread across devices if using RAID.

 My exact doubt is, if COW is default behavior of ZFS then does COWd data 
 written to the same physical drive where the filesystem resides?

Yes.

 If so the physical device capacity should be more that what the file 
 system size is.

Yes.

 I mean in normal filesystem sinario, a partition with 1Gb with some some 
 filesystem (say ext2fs) is created, then use can save upto 1Gb data 
 under that.

This is not true of any filesystem. There is always some overhead for
meta data like indirect blocks, journals, superblocks, space maps etc.
Some filesystems (eg UFS) have fixed areas for meta data, which limit
the number of files and possible data, whereas others dynamically allocate
the meta data (eg ZFS). The former is more predictable and the latter
more flexible.

 Is the same behavior with ZFS?. Because I feel since COW is default ZFS 
 require  1Gb for one fileystem inorder to store COWed data.
  
 Please correct me if i am wrong.
  
 -sridhar
 
 
 Never miss a thing. Make Yahoo your homepage. 
 http://us.rd.yahoo.com/evt=51438/*http://www.yahoo.com/r/hs
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-07 Thread Neil Perrin


parvez shaikh wrote:
 Hello,
 
 I am learning ZFS, its design and layout.
 
 I would like to understand how Intent logs are different from journal?
 
 Journal too are logs of updates to ensure consistency of file system 
 over crashes. Purpose of intent log also appear to be same.  I hope I am 
 not missing something important in these concepts.

There is a difference. A journal contains the necessary transactions to
make the on-disk fs consistent. The ZFS intent is not needed for consistency.
Here's an extract from http://blogs.sun.com/perrin/entry/the_lumberjack :


ZFS is always consistent on disk due to its transaction model. Unix system 
calls can be considered as transactions which are aggregated into a transaction 
group for performance and committed together periodically. Either everything 
commits or nothing does. That is, if a power goes out, then the transactions in 
the pool are never partial. This commitment happens fairly infrequently - 
typically a few seconds between each transaction group commit.

Some applications, such as databases, need assurance that say the data they 
wrote or mkdir they just executed is on stable storage, and so they request 
synchronous semantics such as O_DSYNC (when opening a file), or execute 
fsync(fd) after a series of changes to a file descriptor. Obviously waiting 
seconds for the transaction group to commit before returning from the system 
call is not a high performance solution. Thus the ZFS Intent Log (ZIL) was born.


 
 Also I read that Updates in ZFS are intrinsically atomic,  I cant 
 understand how they are intrinsically atomic 
 http://weblog.infoworld.com/yager/archives/2007/10/suns_zfs_is_clo.html
 
 I would be grateful if someone can address my query
 
 Thanks
 
 
 Explore your hobbies and interests. Click here to begin. 
 http://in.rd.yahoo.com/tagline_groups_6/*http://in.promos.yahoo.com/groups 
 
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS behavior with fsync() calls

2008-01-11 Thread Neil Perrin


Todd Moore wrote:
 My understanding is that the answers to the questions posed below are both 
 YES due the transactional design of ZFS.  However, I'm working with some 
 folks that need more details or documents describing the design/behavior 
 without having to look through all the source code.  
 
 [b]Scenario 1[/b]
 * Create file
 * Open and Write data to file
 * Issue fsync() call for file
 
 [b]Question:[/b]  Is it guaranteed that the write to the directory occurs 
 prior to the write to the file?  


Yes, this is guaranteed.

 
 [b]Scenario 2[/b]
 * Write an extended attribute (such as a file version number) for a file.
 * Open and Write data to file
 * Issue fsync() call for file
 
 [b]Question:[/b]  Is it guaranteed that the extended attribute write occurs 
 prior to the write to the file?  
  

Again, yes, this is guaranteed in ZFS. ZFS writes all transactions related
to the specified file, as well as any other transactions, not directly on
the file, that are needed for it to exist (such as the directory entry or
extended attribute writes in your scenarios).

 Additionally, is it possible that there are differences in this behavior as 
 relates to these scenarios between Solaris 10 U4 or a SXDE 01/08 
 implementation (snv_b79)?


No, the zfs code has always been this way.

The ZIL which handles this behaviour is described at

http://blogs.sun.com/perrin/entry/the_lumberjack

but this maybe insufficient detail for you.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance on ZFS vs UFS

2008-01-24 Thread Neil Perrin


Steve Hillman wrote:
 I realize that this topic has been fairly well beaten to death on this forum, 
 but I've also read numerous comments from ZFS developers that they'd like to 
 hear about significantly different performance numbers of ZFS vs UFS for 
 NFS-exported filesystems, so here's one more.
 
 The server is an x4500 with 44 drives configured in a RAID10 zpool, and two 
 drives mirrored and formatted with UFS for the boot device. It's running 
 Solaris 10u4, patched with the Recommended Patch Set from late Dec/07. The 
 client (if it matters) is an older V20z w/ Solaris 10 3/05. No tuning has 
 been done on either box
 
 The test involved copying lots of small files (2-10k) from an NFS client to a 
 mounted NFS volume. A simple 'cp' was done, both with 1 thread and 4 parallel 
 threads (to different directories) and then I monitored to see how fast the 
 files were accumulating on the server.
 
 ZFS:
 1 thread - 25 files/second; 4 threads - 25 files/second (~6 per thread)
 
 UFS: (same server, just exported /var from the boot volume)
 1 thread - 200 files/second; 4 threads - 520 files/second (~130/thread)

With this big a difference, I suspect the write cache is enabled on 
the disks. UFS requires this cache to be disabled or battery backed
otherwise corruption can occur.

 
 For comparison, the same test was done to a NetApp FAS270 that the x4500 was 
 bought to replace:
 1 thread - 70 files/second; 4 threads - ~250 files/second

I don't know enough about that system but perhaps it has NVRAM or an SSD
to service the synchronous demands of NFS. An equivalent setup could be
configured with a separate intent log on a similar fast device.

 
 I have been able to work around this performance hole by exporting multiple 
 ZFS filesystems, because the workload is spread across a hashed directory 
 structure. I then get 25 files per FS per second. Still, I thought I'd raise 
 it here anyway. If there's something I'm doing wrong, I'd love to hear about 
 it. 
 
 I'm also assuming that this ties into BugID 6535160  Lock contention on 
 zl_lock from zil_commit, so if that's the case, please add another vote for 
 making this fix available as a patch for S10u4 users

I believe this is a different problem than 6535160.

 
 Thanks,
 Steve Hillman
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-30 Thread Neil Perrin


Roch - PAE wrote:
 Jonathan Loran writes:
   
   Is it true that Solaris 10 u4 does not have any of the nice ZIL controls 
   that exist in the various recent Open Solaris flavors?  I would like to 
   move my ZIL to solid state storage, but I fear I can't do it until I 
   have another update.  Heck, I would be happy to just be able to turn the 
   ZIL off to see how my NFS on ZFS performance is effected before spending 
   the $'s.  Anyone know when will we see this in Solaris 10?
   
 
 You can certainly turn it off with any release (Jim's link).
 
 It's true that S10u4 does not have the Separate Intent Log 
 to allow using an SSD for ZIL blocks. I believe S10U5 will
 have that feature.

Unfortunately it will not. A lot of ZFS fixes and features
that had existed for a while will not be in U5 (for reasons I
can't go into here). They should be in S10U6...

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-30 Thread Neil Perrin


Jonathan Loran wrote:
 Vincent Fox wrote:
 Are you already running with zfs_nocacheflush=1?   We have SAN arrays with 
 dual battery-backed controllers for the cache, so we definitely have this 
 set on all our production systems.  It makes a big difference for us.

   
 No, we're not using the zfs_nocacheflush=1, but our SAN array's are set 
 to cache all writebacks, so it shouldn't be needed.  I may test this, if 
 I get the chance to reboot one of the servers, but I'll bet the storage 
 arrays' are working correctly.

I think there's some confusion. ZFS and the ZIL issue controller commands
to force the disk cache to be flushed to ensure data is on stable
storage. If the disk cache is battery backed then the costly flush
is unnecessary. As Vincent said, setting zfs_nocacheflush=1 can make a
huge difference.

Note that this is a system-wide variable, so every controller serving ZFS
devices must have a non-volatile cache before enabling it.
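
The setting goes in /etc/system and takes effect after a reboot:

* only set this if every device backing ZFS pools has a
* non-volatile (battery backed) write cache
set zfs:zfs_nocacheflush = 1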

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Neil Perrin
Marc Bevand wrote:
 William Fretts-Saxton william.fretts.saxton at sun.com writes:
   
 I disabled file prefetch and there was no effect.

 Here are some performance numbers.  Note that, when the application server
 used a ZFS file system to save its data, the transaction took TWICE as long.
 For some reason, though, iostat is showing 5x as much disk
 writing (to the physical disks) on the ZFS partition.  Can anyone see a
 problem here?
 

 Possible explanation: the Glassfish applications are using synchronous
 writes, causing the ZIL (ZFS Intent Log) to be intensively used, which
 leads to a lot of extra I/O.

The ZIL doesn't do a lot of extra IO. It usually just does one write per
synchronous request and will batch up multiple writes into the same log
block if possible. However, it does need to wait for the writes to be on
stable storage before returning to the application, which is what the
application has requested. It does this by waiting for the write to
complete and then flushing the disk write cache. If the write cache is
battery backed for all zpool devices then the global zfs_nocacheflush
can be set to give dramatically better performance.
  Try to disable it:

 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

 Since disabling it is not recommended, if you find out it is the cause of your
 perf problems, you should instead try to use a SLOG (separate intent log, see
 above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support
 SLOGs, they have only been added to OpenSolaris build snv_68:

 http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

 -marc

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes

2008-02-15 Thread Neil Perrin


Nathan Kroenert wrote:
 And something I was told only recently - It makes a difference if you 
 created the file *before* you set the recordsize property.
 
 If you created them after, then no worries, but if I understand 
 correctly, if the *file* was created with 128K recordsize, then it'll 
 keep that forever...
 
 Assuming I understand correctly.
 
 Hopefully someone else on the list will be able to confirm.

Yes, that is correct.
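
In practice that means setting the property before any files are written,
e.g. (dataset name made up):

zfs create tank/db
zfs set recordsize=8k tank/db    # before creating the files
# files written from here on use 8K records; files that already existed
# keep the record size they were created with (128K by default)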

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and multipath with iSCSI

2008-04-04 Thread Neil Perrin
ZFS will handle out of order writes due to it transactional
nature. Individual writes can be re-ordered safely. When the transaction
commits it will wait for all writes and flush them; then write a
new uberblock with the new transaction group number and flush that.

Chris Siebenmann wrote:
  We're currently designing a ZFS fileserver environment with iSCSI based
 storage (for failover, cost, ease of expansion, and so on). As part of
 this we would like to use multipathing for extra reliability, and I am
 not sure how we want to configure it.
 
  Our iSCSI backend only supports multiple sessions per target, not
 multiple connections per session (and my understanding is that the
 Solaris initiator doesn't currently support multiple connections
 anyways). However, we have been cautioned that there is nothing in
 the backend that imposes a global ordering for commands between the
 sessions, and so disk IO might get reordered if Solaris's multipath load
 balancing submits part of it to one session and part to another.
 
  So: does anyone know if Solaris's multipath and iSCSI systems already
 take care of this, or if ZFS already is paranoid enough to deal
 with this, or if we should configure Solaris multipathing to not
 load-balance?
 
 (A load-balanced multipath configuration is simpler for us to
 administer, at least until I figure out how to tell Solaris multipathing
 which is the preferrred network for any given iSCSI target so we can
 balance the overall network load by hand.)
 
  Thanks in advance.
 
   - cks
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] incorrect/conflicting suggestion in error message on a faulted pool

2008-04-09 Thread Neil Perrin
Haudy,

Thanks for reporting this bug and helping to improve ZFS.
I'm not sure either how you could have added a note to an
existing report. Anyway I've gone ahead and done that for you
in the Related Bugs field, though opensolaris doesn't reflect it yet.

Neil.


Haudy Kazemi wrote:
 I have reported this bug here: 
 http://bugs.opensolaris.org/view_bug.do?bug_id=6685676
 
 I think this bug may be related, but I do not see where to add a note to 
 an existing bug report: 
 http://bugs.opensolaris.org/view_bug.do?bug_id=6633592
 (both bugs refer to ZFS-8000-2Q however my report shows a FAULTED pool 
 instead of a DEGRADED pool.)
 
 Thanks,
 
 -hk
 
 Haudy Kazemi wrote:
 Hello,

 I'm writing to report what I think is an incorrect or conflicting 
 suggestion in the error message displayed on a faulted pool that does 
 not have redundancy (equiv to RAID0?).  I ran across this while testing 
 and learning about ZFS on a clean installation of NexentaCore 1.0.

 Here is how to recreate the scenario:

 [EMAIL PROTECTED]:~$ mkfile 200m testdisk1 testdisk2
 [EMAIL PROTECTED]:~$ sudo zpool create mybigpool $PWD/testdisk1 
 $PWD/testdisk2
 Password:
 [EMAIL PROTECTED]:~$ zpool status mybigpool
   pool: mybigpool
  state: ONLINE
  scrub: none requested
 config:

 NAME  STATE READ WRITE CKSUM
 mybigpool ONLINE   0 0 0
   /export/home/kaz/testdisk1  ONLINE   0 0 0
   /export/home/kaz/testdisk2  ONLINE   0 0 0

 errors: No known data errors
 [EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
 [EMAIL PROTECTED]:~$ zpool status mybigpool
   pool: mybigpool
  state: ONLINE
  scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:09:29 2008
 config:

 NAME  STATE READ WRITE CKSUM
 mybigpool ONLINE   0 0 0
   /export/home/kaz/testdisk1  ONLINE   0 0 0
   /export/home/kaz/testdisk2  ONLINE   0 0 0

 errors: No known data errors

 Up to here everything looks fine.  Now lets destroy one of the virtual 
 drives:

 [EMAIL PROTECTED]:~$ rm testdisk2
 [EMAIL PROTECTED]:~$ zpool status mybigpool
   pool: mybigpool
  state: ONLINE
  scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:09:29 2008
 config:

 NAME  STATE READ WRITE CKSUM
 mybigpool ONLINE   0 0 0
   /export/home/kaz/testdisk1  ONLINE   0 0 0
   /export/home/kaz/testdisk2  ONLINE   0 0 0

 errors: No known data errors

 Okay, still looks fine, but I haven't tried to read/write to it yet.  
 Try a scrub.

 [EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
 [EMAIL PROTECTED]:~$ zpool status mybigpool
   pool: mybigpool
  state: FAULTED
 status: One or more devices could not be opened.  Sufficient replicas 
 exist for
 the pool to continue functioning in a degraded state.
 action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
  scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:10:36 2008
 config:

 NAME  STATE READ WRITE CKSUM
 mybigpool FAULTED  0 0 0  
 insufficient replicas
   /export/home/kaz/testdisk1  ONLINE   0 0 0
   /export/home/kaz/testdisk2  UNAVAIL  0 0 0  cannot 
 open

 errors: No known data errors
 [EMAIL PROTECTED]:~$

 There we go.  The pool has faulted as I expected to happen because I 
 created it as a non-redundant pool.  I think it was the equivalent of a 
 RAID0 pool with checksumming, at least it behaves like one.  The key to 
 my reporting this is that the status message says One or more devices 
 could not be opened.  Sufficient replicas exist for the pool to continue 
 functioning in a degraded state. while the message further down to the 
 right of the pool name says insufficient replicas.

 The verbose status message is wrong in this case.  From other forum/list 
 posts looks like that status message is also used for degraded pools, 
 which isn't a problem, but here we have a faulted pool.  Here's an 
 example of the same status message used appropriately: 
 http://mail.opensolaris.org/pipermail/zfs-discuss/2006-April/031298.html

 Is anyone else able to reproduce this?  And if so, is there a ZFS bug 
 tracker to report this too? (I didn't see a public bug tracker when I 
 looked.)

 Thanks,

 Haudy Kazemi
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Pause Solaris with ZFS compression busy by doing a cp?

2008-05-22 Thread Neil Perrin

 I also noticed (perhaps by design) that a copy with compression off almost
 instantly returns, but the writes continue LONG after the cp process claims
 to be done. Is this normal?

Yes this is normal. Unless the application is doing synchronous writes
(eg DB) the file will be written to disk at the convenience of the FS.
Most fs operate this way. It's too expensive to synchronously write
out data, so it's batched up and written asynchronously.

 Wouldn't closing the file ensure it was written to disk?

No.

 Is that tunable somewhere?

No. For ZFS you can use sync(1M) which will force out all transactions
for all files in the pool. That is expensive though. 
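
A trivial example (path made up):

cp bigfile /tank/data/    # returns once the data is cached in memory
sync                      # forces the pool's pending transactions to disk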

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08

2008-05-24 Thread Neil Perrin


Hugh Saunders wrote:
 On Sat, May 24, 2008 at 4:00 PM,  [EMAIL PROTECTED] wrote:
   cache improve write performance or only reads?

 L2ARC cache device is for reads... for write you want
   Intent Log
 
 Thanks for answering my question, I had seen mention of intent log
 devices, but wasn't sure of their purpose.
 
 If only one significantly faster disk is available, would it make
 sense to slice it and use a slice for L2ARC and a slice for ZIL? or
 would that cause horrible thrashing?

I wouldn't recommend this configuration.
As you say, it would thrash the head. Log devices mainly need to write
fast, as they are only ever read once, on reboot, and then only if there
are uncommitted transactions. Cache devices, on the other hand, need fast
reads, as their writes can be done slowly and asynchronously. So a common
device sliced for both purposes wouldn't work well unless it was fast for
both reads and writes and had minimal seek times (nvram, solid state disk).
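
If two separate fast devices were available, the usual split would be one
as a log and one as a cache device, e.g. (device names made up; cache
devices need a build with L2ARC support):

zpool add tank log c2t0d0      # write-optimized device for the intent log
zpool add tank cache c3t0d0    # read-optimized device for the L2ARC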

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog devices don't resilver correctly

2008-05-27 Thread Neil Perrin
Joe Little wrote:
 On Tue, May 27, 2008 at 4:50 PM, Eric Schrock [EMAIL PROTECTED] wrote:
 Joe -

 We definitely don't do great accounting of the 'vdev_islog' state here,
 and it's possible to create a situation where the parent replacing vdev
 has the state set but the children do not, but I have been unable to
 reproduce the behavior you saw.  I have rebooted the system during
 resilver, manually detached the replacing vdev, and a variety of other
 things, but I've never seen the behavior you describe.  In all cases,
 the log state is kept with the replacing vdev and restored when the
 resilver completes.  I have also not observed the resilver failing with
 a bad log device.

 Can you provide more information about how to reproduce this problem?
 Perhaps without rebooting into B70 in the middle?

 
 Well, this happened live on a production system, and I'm still in the
 process of rebuilding said system (trying to save all the snapshots)
 
 I don't know what triggered it. It was trying to resilver in B85,
 rebooted into B70 where it did resilver (but it was now using cmdk
 device naming vs the full scsi device names). It was marked degraded
 still even though re-silvering finished. Since the resilver took so
 long, I suspect the splicing in of the device took place in the B70.
 Again, it would never work in B85 -- just kept resetting. I'm
 wondering if the device path changing from cxtxdx to cxdx could be the
 trigger point.

Joe,

We're sorry about your problems. My take on how this is best handled
is that it would be better to expedite (raise the priority of) fixing the bug

6574286 removing a slog doesn't work

rather than expend too much effort in understanding how it
failed on your system. You would not have had this problem
if you were able to remove a log device. Is that reasonable?

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE 4852783

2008-06-16 Thread Neil Perrin
This is actually quite a tricky fix, as obviously data and meta data have
to be relocated. Although there's been no visible activity on this bug,
there has been substantial design activity to allow the RFE to be easily
fixed.

Anyway, to answer your question, I would fully expect this RFE would
be fixed within a year, but can't guarantee it.

Neil.

Miles Nordin wrote:
 Is RFE 4852783 (need for an equivalent to LVM2's pvmove) likely to
 happen within the next year?
 
 My use-case is home user.  I have 16 disks spinning, two towers of
 eight disks each, exporting some of them as iSCSI targets.  Four disks
 are 1TB disks already in ZFS mirrors, and 12 disks are 180 - 320GB and
 contain 12 individual filesystems.
 
 If RFE 4852783 will happen in a year, I can move the smaller disks and
 their data into the ZFS mirror.  As they die I will replace them with
 pairs of ~1TB disks.
 
 I worry the RFE won't happen because it looks 5 years old with no
 posted ETA.  If it won't be closed within a year, some of those 12
 disks will start failing and need replacement.  We find we lose one or
 two each year.  If I added them to ZFS, I'd have to either waste
 money, space, power on buying undersized replacement disks, or else do
 silly and dangerously confusing things with slices.  Therefore in that
 case I will leave the smaller disks out of ZFS and add only 1TB
 devices to these immutable vdev's.
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Caching - write() syscall with O_SYNC

2008-07-07 Thread Neil Perrin


Patrick Pinchera wrote:
 IHAC using ZFS in production, and he's opening up some files with the 
 O_SYNC flag.  This affects subsequent write()'s by providing 
 synchronized I/O file integrity completion. That is, each write(2) will 
 wait for both the file data and file status to be physically updated.
 
 Because of this, he's seeing some delays on the file write()'s. This is 
 verified with dtrace.  He's got a storage array with a read/write cache 
 already.  What does ZFS introduce to this O_SYNC flag?  Is ZFS doing 
 some caching itself, too?

Yes, but not in the path of the synchronous request. The latency isn't
affected by other ZFS caching.

 Are there settings we got by default when we
 created the ZFS pools that already give us the equivalent of O_SYNC?

No. 

 Is there something we should consider turning on or off with regard to ZFS?

Yes, because your write cache is non-volatile you can disable the zfs write
cache flush. See:
 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes

Note this should only really be done if ZFS is the only user of the
storage array.

 
 My feeling is that in an effort to make these write()'s so that they go 
 completely to the disk, we may have gone overboard with one or more of 
 the following:
 
 * setting O_SYNC on the file open() to affect the write()'s
 * using ZFS
 * using a storage array with a battery backed up read/write cache
 
 Can we eliminate one or more of these and still get the file integrity 
 we want?
 
 PRD;IANOTA
 
 Regards,
 Pat
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-07 Thread Neil Perrin
Mertol,

Yes, dedup is certainly on our list and has been actively
discussed recently, so there's hope and some forward progress.
It would be interesting to see where it fits into our customers'
priorities for ZFS. We have a long laundry list of projects.
In addition there are bug fixes and performance changes that customers
are demanding.

Neil.

Mertol Ozyoney wrote:
 Hi All ;
 
  
 
 Is there any hope for deduplication on ZFS ?
 
  
 
 Mertol
 
  
 
  
 
 http://www.sun.com/emrkt/sigs/6g_top.gif http://www.sun.com/
 
   
 
 *Mertol Ozyoney *
 Storage Practice - Sales Manager
 
 *Sun Microsystems, TR*
 Istanbul TR
 Phone +902123352200
 Mobile +905339310752
 Fax +90212335
 Email [EMAIL PROTECTED]
 
  
 
  
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss Digest, Vol 33, Issue 19

2008-07-08 Thread Neil Perrin


Ross wrote:
 Hi Gilberto,
 
 I bought a Micro Memory card too, so I'm very likely going to end up in the 
 same boat. 
 I saw Neil Perrin's blog about the MM-5425 card, found that Vmetro don't seem 
 to want
 to sell them, but then then last week spotted five of those cards on e-bay so 
 snapped
 them up.
 
 I'm still waiting for the hardware for this server, but regarding the 
 drivers, if these
 cards don't work out of the box I was planning to pester Neil Perrin and see 
 if he still
 has some drivers for them :)

Unfortunately, there are a couple of problems:

1. It's been a while since I used that board and driver. I recently tried
   pkgadd-ing on the latest Nevada build and it hung. I'm not sure if the
   latest Nevada is somehow incompatible. I didn't have time to track down
   the cause.

2. I received the board and driver from another group within Sun.
   It would be better to contact Micro Memory (or whoever took them
   over) directly, as it's not my place to give out 3rd party drivers
   or provide support for them.

Sorry for the bad news: Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-30 Thread Neil Perrin


Peter Cudhea wrote:
 Your point is well taken that ZFS should not duplicate functionality 
 that is already or should be available at the device driver level.In 
 this case, I think it misses the point of what ZFS should be doing that 
 it is not.
 
 ZFS does its own periodic commits to the disk, and it knows if those 
 commit points have reached the disk or not, or whether they are getting 
 errors.In this particular case, those commits to disk are presumably 
 failing, because one of the disks they depend on has been removed from 
 the system.   (If the writes are not being marked as failures, that 
 would definitely be an error in the device driver, as you say.)  In this 
 case, however, the ZIL log has stopped being updated, but ZFS does 
 nothing to announce that this has happened, or to indicate that a remedy 
 is required.

I think you have some misconceptions about how the ZIL works.
It doesn't provide journalling like UFS. The following might help:

http://blogs.sun.com/perrin/entry/the_lumberjack

The ZIL isn't used at all unless there's fsync/O_DSYNC activity.
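
A quick way to see whether a workload generates that kind of activity is
to count fsync calls per process (a sketch using the syscall provider):

# count fsync(3C) calls by process name; stop after 10 seconds
dtrace -n 'syscall::fsync:entry { @[execname] = count(); }
           tick-10s { exit(0); }'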

 
 At the very least, it would be extremely helpful if  ZFS had a status to 
 report that indicates that the ZIL log is out of date, or that there are 
 troubles writing to the ZIL log, or something like that.

If the ZIL cannot be written then we force a transaction group (txg)
commit. That is the only recourse to force data to stable storage before
returning to the application. 

 
 An additional feature would be to have user-selectable behavior when the 
 ZIL log is significantly out of date.For example, if the ZIL log is 
 more than X seconds out of date, then new writes to the system should 
 pause, or give errors or continue to silently succeed.

Again this doesn't make sense given how the ZIL works.

 
 In an earlier phase of my career when I worked for a database company, I 
 was responsible for a similar bug.   It caused a major customer to lose 
 a major amount of data when a system rebooted when not all good data had 
 been successfully committed to disk.The resulting stink caused us to 
 add a feature to detect the cases when the writing-to-disk process had 
 fallen too far behind, and to pause new writes to the database until the 
 situation was resolved.
 
 Peter
 
 Bob Friesenhahn wrote:
 While I do believe that device drivers. or the fault system, should 
 notify ZFS when a device fails (and ZFS should appropriately react), I 
 don't think that ZFS should be responsible for fault monitoring.  ZFS 
 is in a rather poor position for device fault monitoring, and if it 
 attempts to do so then it will be slow and may misbehave in other 
 ways.  The software which communicates with the device (i.e. the 
 device driver) is in the best position to monitor the device.

 The primary goal of ZFS is to be able to correctly read data which was 
 successfully committed to disk.  There are programming interfaces 
 (e.g. fsync(), msync()) which may be used to ensure that data is 
 committed to disk, and which should return an error if there is a 
 problem.  If you were performing your tests over an NFS mount then the 
 results should be considerably different since NFS requests that its 
 data be committed to disk.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


  1   2   3   >