Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Andrey Kuzmin
On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling richard.ell...@gmail.com wrote:

 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

  Andrey Kuzmin wrote:
  Well, I'm more accustomed to sequential vs. random, but YMMV.
  As to 67000 512-byte writes (this sounds suspiciously close to 32 MB
 fitting into cache), did you have write-back enabled?
 
  It's a sustained number, so it shouldn't matter.

 That is only 34 MB/sec.  The disk can do better for sequential writes.

 Note: in ZFS, such writes will be coalesced into 128KB chunks.


So this is just 256 IOPS in the controller, not 64K.
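
(For reference, the arithmetic behind those two figures, assuming the 512-byte
writes really are coalesced into 128 KB chunks:

    67,000 writes/s x 512 bytes   ~= 34.3 MB/s sustained
    34.3 MB/s / 128 KB per chunk  ~= 260 chunk-sized writes/s at the controller,

i.e. on the order of the 256 IOPS figure above.)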

Regards,
Andrey


  -- richard

 --
 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
 http://nexenta-rotterdam.eventbrite.com/









Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread sensille
Andrey Kuzmin wrote:
 On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 
 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:
 
  Andrey Kuzmin wrote:
   Well, I'm more accustomed to sequential vs. random, but YMMV.
   As to 67000 512-byte writes (this sounds suspiciously close to
  32 MB fitting into cache), did you have write-back enabled?
 
  It's a sustained number, so it shouldn't matter.
 
 That is only 34 MB/sec.  The disk can do better for sequential writes.
 
 Note: in ZFS, such writes will be coalesced into 128KB chunks.
 
 
 So this is just 256 IOPS in the controller, not 64K.

No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed
the numbers.

--Arne

 
 Regards,
 Andrey
  
 
  -- richard
 
 --
 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
 http://nexenta-rotterdam.eventbrite.com/
 
 
 
 
 
 
 



Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Robert Milkowski

On 10/06/2010 20:43, Andrey Kuzmin wrote:
As to your results, it sounds almost too good to be true. As Bob has
pointed out, the h/w design targeted hundreds of IOPS, and it was hard to
believe it could scale 100x. Fantastic.


But it actually can do over 100k.
Also several thousand IOPS on a single FC port is nothing unusual and 
has been the case for at least several years.


--
Robert Milkowski
http://milek.blogspot.com


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Robert Milkowski

On 11/06/2010 09:22, sensille wrote:

Andrey Kuzmin wrote:
   

On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling
richard.ell...@gmail.com wrote:

 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

   Andrey Kuzmin wrote:
   Well, I'm more accustomed to sequential vs. random, but YMMV.
   As to 67000 512-byte writes (this sounds suspiciously close to
  32 MB fitting into cache), did you have write-back enabled?
 
   It's a sustained number, so it shouldn't matter.

 That is only 34 MB/sec.  The disk can do better for sequential writes.

 Note: in ZFS, such writes will be coalesced into 128KB chunks.


So this is just 256 IOPS in the controller, not 64K.
 

No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed
the numbers.


It's a really simple test everyone can do:

# dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

I did a test on my workstation a moment ago and got about 21k IOPS from
my SATA drive (iostat).
The trick here, of course, is that this is a sequential write with no other
workload going on, so the drive should be able to nicely coalesce these
IOs and do sequential writes with large blocks.
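
(A quick way to check that the OS really is issuing 512-byte requests rather
than merging them itself is to watch the average transfer size while the dd
runs; the device name is a placeholder, as above:

# dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512 &
# iostat -xnz 1 | egrep "device|cXtYdZ"

kw/s divided by w/s should stay around 0.5 KB, which would mean the merging
happens below the OS.)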



--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Andrey Kuzmin
On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 11/06/2010 09:22, sensille wrote:

 Andrey Kuzmin wrote:


 On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling
 richard.ell...@gmail.com wrote:

 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

   Andrey Kuzmin wrote:
   Well, I'm more accustomed to sequential vs. random, but YMMV.
   As to 67000 512-byte writes (this sounds suspiciously close to
  32 MB fitting into cache), did you have write-back enabled?
 
   It's a sustained number, so it shouldn't matter.

 That is only 34 MB/sec.  The disk can do better for sequential
 writes.

 Note: in ZFS, such writes will be coalesced into 128KB chunks.


 So this is just 256 IOPS in the controller, not 64K.


 No, it's 67k ops, it was a completely ZFS-free test setup. iostat also
 confirmed
 the numbers.


 It's a really simple test everyone can do:

 # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

 I did a test on my workstation a moment ago and got about 21k IOPS from my
 SATA drive (iostat).
 The trick here, of course, is that this is a sequential write with no other
 workload going on, so the drive should be able to nicely coalesce these IOs
 and do sequential writes with large blocks.


Exactly, though one might still wonder where the coalescing actually
happens, in the respective OS layer or in the controller. Nonetheless, this
is hardly a common use-case one would design h/w for.

Regards,
Andrey




 --
 Robert Milkowski
 http://milek.blogspot.com




Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Robert Milkowski

On 11/06/2010 10:58, Andrey Kuzmin wrote:
On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski mi...@task.gda.pl wrote:


On 11/06/2010 09:22, sensille wrote:

Andrey Kuzmin wrote:

On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling
richard.ell...@gmail.com wrote:

On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

  Andrey Kuzmin wrote:
  Well, I'm more accustomed to sequential vs. random, but YMMV.
  As to 67000 512-byte writes (this sounds suspiciously close to
32 MB fitting into cache), did you have write-back enabled?

  It's a sustained number, so it shouldn't matter.

That is only 34 MB/sec.  The disk can do better for
sequential writes.

Note: in ZFS, such writes will be coalesced into 128KB
chunks.


So this is just 256 IOPS in the controller, not 64K.

No, it's 67k ops, it was a completely ZFS-free test setup.
iostat also confirmed
the numbers.


It's a really simple test everyone can do:

# dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

I did a test on my workstation a moment ago and got about 21k IOPS
from my SATA drive (iostat).
The trick here, of course, is that this is a sequential write with no
other workload going on, so the drive should be able to nicely
coalesce these IOs and do sequential writes with large blocks.


Exactly, though one might still wonder where the coalescing actually 
happens, in the respective OS layer or in the controller. Nonetheless, 
this is hardly a common use-case one would design h/w for.





In the above example it happens inside the disk drive itself.

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Garrett D'Amore
On Fri, 2010-06-11 at 13:58 +0400, Andrey Kuzmin wrote:

 # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512
 
  I did a test on my workstation a moment ago and got about 21k
  IOPS from my SATA drive (iostat).
  The trick here, of course, is that this is a sequential write with
  no other workload going on, so the drive should be able to
  nicely coalesce these IOs and do sequential writes with
  large blocks.
 
 
 Exactly, though one might still wonder where the coalescing actually
 happens, in the respective OS layer or in the controller. Nonetheless,
 this is hardly a common use-case one would design h/w for.

No OS layer coalescing happens.  The most an OS will ever do is sort
the IOs to make them advantageous (e.g. avoid extra seeking), but the
I/Os are still delivered as individual requests to the HBA.

I'm not aware of any logic in an HBA to coalesce either, and I would
think such a thing would be highly risky.

That said, caching firmware on the drive itself may effectively
(probably!) cause these transfers to happen as a single transfer if they
are naturally contiguous, and if they are arriving at the drive firmware
faster than the firmware can flush them to media.
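
(Related aside: whether the drive's volatile write cache is enabled matters a
lot here.  On Solaris you can inspect or toggle it through format's expert
mode; the menu is interactive, roughly:

# format -e

then select the disk and walk cache -> write_cache -> display/enable/disable.)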

- Garrett





Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Robert Milkowski

On 21/10/2009 03:54, Bob Friesenhahn wrote:


I would be interested to know how many IOPS an OS like Solaris is able 
to push through a single device interface.  The normal driver stack is 
likely limited as to how many IOPS it can sustain for a given LUN 
since the driver stack is optimized for high latency devices like disk 
drives.  If you are creating a driver stack, the design decisions you 
make when requests will be satisfied in about 12ms would be much 
different than if requests are satisfied in 50us.  Limitations of 
existing software stacks are likely reasons why Sun is designing 
hardware with more device interfaces and more independent devices.



Open Solaris 2009.06, 1KB READ I/O:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# iostat -xnzCM 1 | egrep "device|c[0123]$"
[...]
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0


Now let's see how it looks for a single SAS connection, but with dd to 11x
SSDs:


# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0
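
Presumably the eleven dd's were launched concurrently; one way to do that,
assuming the same device naming (note c0t3 is skipped, as above), would be:

# for t in 0 1 2 4 5 6 7 8 9 10 11; do dd of=/dev/null bs=1k if=/dev/rdsk/c0t${t}d0p0 & done; wait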

# iostat -xnzCM 1 | egrep "device|c[0123]$"
[...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0


It looks like a single CPU core still hasn't been saturated and the
bottleneck is in the device rather than the OS/CPU. So the MPT driver in
Solaris 2009.06 can do at least 100,000 IOPS to a single SAS port.


It also scales well: I ran the above dd's over 4x SAS ports at the same
time and it scaled linearly, achieving well over 400k IOPS.



hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw. 
1.27.3.0), connected to F5100.



--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:


 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.



 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0


/dev/null is usually a poor choice for a test like this. Just to be on the
safe side, I'd rerun it with /dev/random.

Regards,
Andrey


 # iostat -xnzCM 1 | egrep "device|c[0123]$"
 [...]
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0


 Now let's see how it looks for a single SAS connection, but with dd to 11x
 SSDs:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0

 # iostat -xnzCM 1 | egrep "device|c[0123]$"
 [...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0


 It looks like a single CPU core still hasn't been saturated and the
 bottleneck is in the device rather than the OS/CPU. So the MPT driver in Solaris
 2009.06 can do at least 100,000 IOPS to a single SAS port.

 It also scales well - I did run above dd's over 4x SAS ports at the same
 time and it scaled linearly by achieving well over 400k IOPS.


 hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw.
 1.27.3.0), connected to F5100.


 --
 Robert Milkowski
 http://milek.blogspot.com




Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Robert Milkowski

On 10/06/2010 15:39, Andrey Kuzmin wrote:
On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:


On 21/10/2009 03:54, Bob Friesenhahn wrote:


I would be interested to know how many IOPS an OS like Solaris
is able to push through a single device interface.  The normal
driver stack is likely limited as to how many IOPS it can
sustain for a given LUN since the driver stack is optimized
for high latency devices like disk drives.  If you are
creating a driver stack, the design decisions you make when
requests will be satisfied in about 12ms would be much
different than if requests are satisfied in 50us.  Limitations
of existing software stacks are likely reasons why Sun is
designing hardware with more device interfaces and more
independent devices.



Open Solaris 2009.06, 1KB READ I/O:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0


/dev/null is usually a poor choice for a test like this. Just to be on
the safe side, I'd rerun it with /dev/random.




That wouldn't work, would it?
Please notice that I'm reading *from* an ssd and writing *to* /dev/null

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
Sorry, my bad. _Reading_ from /dev/null may be an issue, but not writing to
it, of course.

Regards,
Andrey



On Thu, Jun 10, 2010 at 6:46 PM, Robert Milkowski mi...@task.gda.pl wrote:

  On 10/06/2010 15:39, Andrey Kuzmin wrote:

 On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:


 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.



 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0


  /dev/null is usually a poor choice for a test like this. Just to be on the
 safe side, I'd rerun it with /dev/random.


 That wouldn't work, would it?
 Please notice that I'm reading *from* an ssd and writing *to* /dev/null


 --
 Robert Milkowski
 http://milek.blogspot.com




Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Mike Gerdts
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin
andrey.v.kuz...@gmail.com wrote:
 On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:

 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.


 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0

 /dev/null is usually a poor choice for a test like this. Just to be on the
 safe side, I'd rerun it with /dev/random.
 Regards,
 Andrey

(aside from other replies about read vs. write and /dev/random...)

Testing the performance of a disk by reading from /dev/random and writing to
the disk is misguided.  From random(7d):

   Applications retrieve random bytes by reading /dev/random
   or /dev/urandom. The /dev/random interface returns random
   bytes only when sufficient amount of entropy has been collected.

In other words, when the kernel doesn't think that it can give high
quality random numbers, it stops providing them until it has gathered
enough entropy.  It will pause your reads.

If instead you use /dev/urandom, the above problem doesn't exist, but
the generation of random numbers is CPU-intensive.  There is a
reasonable chance (particularly with slow CPU's and fast disk) that
you will be testing the speed of /dev/urandom rather than the speed of
the disk or other I/O components.
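
(A quick, rough way to gauge how fast /dev/urandom itself is on a given box,
so you know whether it rather than the disk would be the bottleneck:

# time dd if=/dev/urandom of=/dev/null bs=128k count=1000

then divide the amount read by the elapsed time.)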

If your goal is to provide data that is not all 0's, to prevent ZFS
compression from making the file sparse, or you want to be sure that
compression doesn't otherwise make the actual writes smaller, you
could try something like:

# create a file just over 100 MB
dd if=/dev/random of=/tmp/randomdata bs=513 count=204401
# repeatedly feed that file to dd
while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=...

The above should make it so that it will take a while before there are
two blocks that are identical, thus confounding deduplication as well.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Ross Walker
On Jun 10, 2010, at 5:54 PM, Richard Elling richard.ell...@gmail.com  
wrote:



On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:


Andrey Kuzmin wrote:

Well, I'm more accustomed to sequential vs. random, but YMMV.
As to 67000 512-byte writes (this sounds suspiciously close to
32 MB fitting into cache), did you have write-back enabled?


It's a sustained number, so it shouldn't matter.


That is only 34 MB/sec.  The disk can do better for sequential writes.


Not when doing sector-sized IO.

Besides, this was a max-IOPS number, not a max-throughput number. If it
were, the OP might have used a 1M bs or larger instead.


-Ross



Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
As to your results, it sounds almost too good to be true. As Bob has pointed
out, the h/w design targeted hundreds of IOPS, and it was hard to believe it
could scale 100x. Fantastic.

Regards,
Andrey



On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:


 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.



 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
 # iostat -xnzCM 1 | egrep "device|c[0123]$"
 [...]
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0


 Now lets see how it looks like for a single SAS connection but dd to 11x
 SSDs:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0

 # iostat -xnzCM 1 | egrep "device|c[0123]$"
 [...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0


 It looks like a single CPU core still hasn't been saturated and the
 bottleneck is in the device rather then OS/CPU. So the MPT driver in Solaris
 2009.06 can do at least 100,000 IOPS to a single SAS port.

 It also scales well - I did run above dd's over 4x SAS ports at the same
 time and it scaled linearly by achieving well over 400k IOPS.


 hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw.
 1.27.3.0), connected to F5100.


 --
 Robert Milkowski
 http://milek.blogspot.com




Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Garrett D'Amore


For the record, with my driver (which is not the same as the one shipped 
by the vendor), I was getting over 150K IOPS with a single DDRdrive X1.  
It is possible to get very high IOPS with Solaris.  However, it might be 
difficult to get such high numbers with systems based on SCSI/SCSA.  
(SCSA does have assumptions which make it overweight for typical 
simple flash based devices.)


My solution was based around the blkdev device driver that I 
integrated into ON a couple of builds ago.


-- Garrett

On 06/10/10 12:57, Andrey Kuzmin wrote:
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:


Andrey Kuzmin wrote:

As to your results, it sounds almost too good to be true. As
Bob has pointed out, h/w design targeted hundreds IOPS, and it
was hard to believe it can scale 100x. Fantastic.


Hundreds IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512 byte linear write/s,
cache


Linear? May be sequential?

Regards,
Andrey

enabled.

--Arne


Regards,
Andrey




On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

   On 21/10/2009 03:54, Bob Friesenhahn wrote:


   I would be interested to know how many IOPS an OS like
Solaris
   is able to push through a single device interface.  The
normal
   driver stack is likely limited as to how many IOPS it can
   sustain for a given LUN since the driver stack is
optimized for
   high latency devices like disk drives.  If you are
creating a
   driver stack, the design decisions you make when
requests will
   be satisfied in about 12ms would be much different than if
   requests are satisfied in 50us.  Limitations of existing
   software stacks are likely reasons why Sun is designing
hardware
   with more device interfaces and more independent devices.








Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:

 Andrey Kuzmin wrote:

 As to your results, it sounds almost too good to be true. As Bob has
 pointed out, h/w design targeted hundreds IOPS, and it was hard to believe
 it can scale 100x. Fantastic.


 Hundreds IOPS is not quite true, even with hard drives. I just tested
 a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache


Linear? May be sequential?

Regards,
Andrey


 enabled.

 --Arne


 Regards,
 Andrey




 On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

On 21/10/2009 03:54, Bob Friesenhahn wrote:


I would be interested to know how many IOPS an OS like Solaris
is able to push through a single device interface.  The normal
driver stack is likely limited as to how many IOPS it can
sustain for a given LUN since the driver stack is optimized for
high latency devices like disk drives.  If you are creating a
driver stack, the design decisions you make when requests will
be satisfied in about 12ms would be much different than if
requests are satisfied in 50us.  Limitations of existing
software stacks are likely reasons why Sun is designing hardware
with more device interfaces and more independent devices.





Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Arne Jansen

Andrey Kuzmin wrote:
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:


Andrey Kuzmin wrote:

As to your results, it sounds almost too good to be true. As Bob
has pointed out, h/w design targeted hundreds IOPS, and it was
hard to believe it can scale 100x. Fantastic.


Hundreds IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache


Linear? May be sequential?


Aren't these synonyms? linear as opposed to random.




Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Arne Jansen

Andrey Kuzmin wrote:

Well, I'm more accustomed to sequential vs. random, but YMMV.

As to 67000 512-byte writes (this sounds suspiciously close to 32 MB
fitting into cache), did you have write-back enabled?




It's a sustained number, so it shouldn't matter.


Regards,
Andrey



On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen sensi...@gmx.net wrote:


Andrey Kuzmin wrote:

On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:

   Andrey Kuzmin wrote:

   As to your results, it sounds almost too good to be true.
As Bob
   has pointed out, h/w design targeted hundreds IOPS, and
it was
   hard to believe it can scale 100x. Fantastic.


   Hundreds IOPS is not quite true, even with hard drives. I
just tested
   a Hitachi 15k drive and it handles 67000 512 byte linear
write/s, cache


Linear? May be sequential?


Aren't these synonyms? linear as opposed to random.







Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Arne Jansen

Andrey Kuzmin wrote:
As to your results, it sounds almost too good to be true. As Bob has 
pointed out, h/w design targeted hundreds IOPS, and it was hard to 
believe it can scale 100x. Fantastic.


Hundreds of IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512-byte linear writes/s, cache
enabled.

--Arne



Regards,
Andrey



On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:


On 21/10/2009 03:54, Bob Friesenhahn wrote:


I would be interested to know how many IOPS an OS like Solaris
is able to push through a single device interface.  The normal
driver stack is likely limited as to how many IOPS it can
sustain for a given LUN since the driver stack is optimized for
high latency devices like disk drives.  If you are creating a
driver stack, the design decisions you make when requests will
be satisfied in about 12ms would be much different than if
requests are satisfied in 50us.  Limitations of existing
software stacks are likely reasons why Sun is designing hardware
with more device interfaces and more independent devices.





Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
Well, I'm more accustomed to sequential vs. random, but YMMV.

As to 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting
into cache), did you have write-back enabled?

Regards,
Andrey



On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen sensi...@gmx.net wrote:

 Andrey Kuzmin wrote:

  On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:

Andrey Kuzmin wrote:

As to your results, it sounds almost too good to be true. As Bob
has pointed out, h/w design targeted hundreds IOPS, and it was
hard to believe it can scale 100x. Fantastic.


Hundreds IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache


 Linear? May be sequential?


 Aren't these synonyms? linear as opposed to random.





Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Richard Elling
On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

 Andrey Kuzmin wrote:
 Well, I'm more accustomed to sequential vs. random, but YMMV.
 As to 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting
 into cache), did you have write-back enabled?
 
 It's a sustained number, so it shouldn't matter.

That is only 34 MB/sec.  The disk can do better for sequential writes.

Note: in ZFS, such writes will be coalesced into 128KB chunks.
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/








Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Jeroen Roodhart
Hi list,

 If you're running solaris proper, you better mirror
 your
  ZIL log device.  
...
 I plan to get to test this as well, won't be until
 late next week though.

Running OSOL nv130. Powered off the machine, removed the F20 and powered back on.
The machine boots OK and comes up normally, with the following message in 'zpool
status':
...
pool: mypool
 state: FAULTED
status: An intent log record could not be read.
Waiting for adminstrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      FAULTED      0     0     0  bad intent log
...

Nice! Running a later version of ZFS seems to lessen the need for 
ZIL-mirroring...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jeroen Roodhart
 
  If you're running solaris proper, you better mirror
  your
   ZIL log device.
 ...
  I plan to get to test this as well, won't be until
  late next week though.
 
 Running OSOL nv130. Power off the machine, removed the F20 and power
 back on. Machines boots OK and comes up normally [...]
 
 Nice! Running a later version of ZFS seems to lessen the need for ZIL-
 mirroring...

Yes, since zpool 19, which is not available in any version of solaris yet,
and is not available in osol 2009.06 unless you update to developer
builds.  Since zpool 19, you have the ability to zpool remove log
devices.  And if a log device fails during operation, the system is supposed
to fall back and just start using ZIL blocks from the main pool instead.

So the recommendation for zpool <19 would be *strongly* recommended.  Mirror
your log device if you care about using your pool.
And the recommendation for zpool >=19 would be ... don't mirror your log
device.  If you have more than one, just add them both unmirrored.
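
In zpool-command terms the difference is roughly the following (pool and
device names are hypothetical):

# zpool add tank log mirror c1t0d0 c1t1d0    <- mirrored slog, the <19 advice
# zpool add tank log c1t0d0 c1t1d0           <- two independent slogs, >=19
# zpool remove tank c1t0d0                   <- log removal, possible from 19 on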

I edited the ZFS Best Practices yesterday to reflect these changes.

I always have a shade of doubt about things that are supposed to do
something.  Later this week, I am building an OSOL machine, updating it,
adding an unmirrored log device, starting a sync-write benchmark (to ensure
the log device is heavily in use) and then I'm going to yank out the log
device, and see what happens.
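
(While such a benchmark runs, one way to confirm the log device really is
taking the synchronous writes is to watch per-vdev activity, e.g.:

# zpool iostat -v tank 1

with "tank" standing in for the pool name.)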



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Ragnar Sundblad

On 7 apr 2010, at 14.28, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jeroen Roodhart
 
 If you're running solaris proper, you better mirror
 your
 ZIL log device.
 ...
 I plan to get to test this as well, won't be until
 late next week though.
 
 Running OSOL nv130. Power off the machine, removed the F20 and power
 back on. Machines boots OK and comes up normally [...]
 
 Nice! Running a later version of ZFS seems to lessen the need for ZIL-
 mirroring...
 
 Yes, since zpool 19, which is not available in any version of solaris yet,
 and is not available in osol 2009.06 unless you update to developer
 builds,  Since zpool 19, you have the ability to zpool remove log
 devices.  And if a log device fails during operation, the system is supposed
 to fall back and just start using ZIL blocks from the main pool instead.
 
 So the recommendation for zpool 19 would be *strongly* recommended.  Mirror
 your log device if you care about using your pool.
 And the recommendation for zpool =19 would be ... don't mirror your log
 device.  If you have more than one, just add them both unmirrored.

Rather: ... >=19 would be ... if you don't mind losing data written in
the ~30 seconds before the crash, you don't have to mirror your log
device.

For a file server, mail server, etc etc, where things are stored
and supposed to be available later, you almost certainly want
redundancy on your slog too. (There may be file servers where
this doesn't apply, but they are special cases that should not
be mentioned in the general documentation.)

 I edited the ZFS Best Practices yesterday to reflect these changes.

I'd say that "In zpool version 19 or greater, it is recommended not to
mirror log devices." is not very good advice and should be changed.

/ragge



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Robert Milkowski

On 07/04/2010 13:58, Ragnar Sundblad wrote:


Rather: ... >=19 would be ... if you don't mind losing data written in
the ~30 seconds before the crash, you don't have to mirror your log
device.

For a file server, mail server, etc etc, where things are stored
and supposed to be available later, you almost certainly want
redundancy on your slog too. (There may be file servers where
this doesn't apply, but they are special cases that should not
be mentioned in the general documentation.)

   


While I agree with you, I want to mention that it is all about
understanding the risk.
In this case not only does your server have to crash in such a way that data
has not been synced (sudden power loss, for example), but there would also
have to be some data committed to the slog device(s) which was not written to
the main pool, and when your server restarts your slog device would have to
have completely died as well.


Other than that you are fine even with an unmirrored slog device.

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Ragnar Sundblad wrote:


So the recommendation for zpool <19 would be *strongly* recommended.  Mirror
your log device if you care about using your pool.
And the recommendation for zpool >=19 would be ... don't mirror your log
device.  If you have more than one, just add them both unmirrored.


Rather: ... >=19 would be ... if you don't mind losing data written in
the ~30 seconds before the crash, you don't have to mirror your log
device.


It is also worth pointing out that in normal operation the slog is 
essentially a write-only device which is only read at boot time.  The 
writes are assumed to work if the device claims success.  If the log 
device fails to read (oops!), then a mirror would be quite useful.
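
(Turning an existing single slog into a mirror is cheap, by the way; with
hypothetical names:

# zpool attach tank c1t0d0 c1t1d0

where c1t0d0 is the current log device and c1t1d0 the new mirror half.)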


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Robert Milkowski

On 07/04/2010 15:35, Bob Friesenhahn wrote:

On Wed, 7 Apr 2010, Ragnar Sundblad wrote:


So the recommendation for zpool 19 would be *strongly* 
recommended.  Mirror

your log device if you care about using your pool.
And the recommendation for zpool =19 would be ... don't mirror your 
log

device.  If you have more than one, just add them both unmirrored.


Rather: ... =19 would be ... if you don't mind loosing data written
the ~30 seconds before the crash, you don't have to mirror your log
device.


It is also worth pointing out that in normal operation the slog is 
essentially a write-only device which is only read at boot time.  The 
writes are assumed to work if the device claims success.  If the log 
device fails to read (oops!), then a mirror would be quite useful.


It is only read at boot if there is uncommitted data on it - during
normal reboots ZFS won't read data from the slog.


--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Robert Milkowski wrote:


It is only read at boot if there is uncommitted data on it - during normal
reboots ZFS won't read data from the slog.


How does ZFS know if there is uncommitted data on the slog device
without reading it?  The minimal read would be quite small, but it
seems that a read is still required.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Neil Perrin

On 04/07/10 09:19, Bob Friesenhahn wrote:

On Wed, 7 Apr 2010, Robert Milkowski wrote:


it is only read at boot if there are uncomitted data on it - during 
normal reboots zfs won't read data from slog.


How does zfs know if there is uncomitted data on the slog device 
without reading it?  The minimal read would be quite small, but it 
seems that a read is still required.


Bob


If there's ever been synchronous activity then there's an empty tail block
(stubby) that will be read even after a clean shutdown.
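
(For the curious, zdb can show what is sitting in a pool's intent logs; exact
options vary by build, but something along the lines of

# zdb -iv mypool

should dump the ZIL headers and records per dataset.  "mypool" is a
placeholder.)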

Neil.


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Edward Ned Harvey
 From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
 
 Rather: ... >=19 would be ... if you don't mind losing data written in
 the ~30 seconds before the crash, you don't have to mirror your log
 device.

If you have a system crash *and* a failed log device at the same time, this
is an important consideration.  But if you have either a system crash or a
failed log device, and they don't happen at the same time, then your sync writes
are safe, right up to the nanosecond, using an unmirrored nonvolatile log
device on zpool >= 19.


 I'd say that "In zpool version 19 or greater, it is recommended not to
 mirror log devices." is not very good advice and should be changed.

See above.  Still disagree?

If desired, I could clarify the statement, by basically pasting what's
written above.



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 It is also worth pointing out that in normal operation the slog is
 essentially a write-only device which is only read at boot time.  The
 writes are assumed to work if the device claims success.  If the log
 device fails to read (oops!), then a mirror would be quite useful.

An excellent point.

BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect, in advance, failure of log devices that claim to write
correctly but which are really unreadable.



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Neil Perrin

On 04/07/10 10:18, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Bob Friesenhahn

It is also worth pointing out that in normal operation the slog is
essentially a write-only device which is only read at boot time.  The
writes are assumed to work if the device claims success.  If the log
device fails to read (oops!), then a mirror would be quite useful.



An excellent point.

BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect failure of log devices in advance, that are claiming to write
correctly, but which are really unreadable.


A scrub will read the log blocks, but only for unplayed logs.
Because of the transient nature of the log, and because it operates
outside of the transaction group model, it's hard to read the in-flight
log blocks to validate them.

There have previously been suggestions to read slogs periodically.
I don't know if  there's a CR raised for this though.

Neil.


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Mark J Musante

On Wed, 7 Apr 2010, Neil Perrin wrote:

There have previously been suggestions to read slogs periodically. I 
don't know if there's a CR raised for this though.


Roch wrote up CR 6938883 "Need to exercise read from slog dynamically"


Regards,
markm


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Edward Ned Harvey wrote:


From: Ragnar Sundblad [mailto:ra...@csc.kth.se]

Rather: ... >=19 would be ... if you don't mind losing data written in
the ~30 seconds before the crash, you don't have to mirror your log
device.


If you have a system crash, *and* a failed log device at the same time, this
is an important consideration.  But if you have either a system crash, or a
failed log device, that don't happen at the same time, then your sync writes
are safe, right up to the nanosecond.  Using unmirrored nonvolatile log
device on zpool >= 19.


The point is that the slog is a write-only device and a device which 
fails such that it acks each write, but fails to read the data that
it wrote, could silently fail at any time during the normal 
operation of the system.  It is not necessary for the slog device to 
fail at the exact same time that the system spontaneously reboots.  I 
don't know if Solaris implements a background scrub of the slog as a 
normal course of operation which would cause a device with this sort 
of failure to be exposed quickly.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Edward Ned Harvey wrote:


BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect failure of log devices in advance, that are claiming to write
correctly, but which are really unreadable.


To make matters worse, an SSD with a large cache might satisfy such
reads from its cache so a scrub of the (possibly) tiny bit of 
pending synchronous writes may not validate anything.  A lightly 
loaded slog should usually be empty.  We already know that some 
(many?) SSDs are not very good about persisting writes to FLASH, even 
after acking a cache flush request.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Richard Elling
On Apr 7, 2010, at 10:19 AM, Bob Friesenhahn wrote:
 On Wed, 7 Apr 2010, Edward Ned Harvey wrote:
 From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
 
 Rather: ... =19 would be ... if you don't mind loosing data written
 the ~30 seconds before the crash, you don't have to mirror your log
 device.
 
 If you have a system crash, *and* a failed log device at the same time, this
 is an important consideration.  But if you have either a system crash, or a
 failed log device, that don't happen at the same time, then your sync writes
 are safe, right up to the nanosecond.  Using unmirrored nonvolatile log
 device on zpool = 19.
 
 The point is that the slog is a write-only device and a device which fails 
 such that its acks each write, but fails to read the data that it wrote, 
 could silently fail at any time during the normal operation of the system.  
 It is not necessary for the slog device to fail at the exact same time that 
 the system spontaneously reboots.  I don't know if Solaris implements a 
 background scrub of the slog as a normal course of operation which would 
 cause a device with this sort of failure to be exposed quickly.

You are playing against marginal returns. An ephemeral storage requirement
is very different from a permanent storage requirement.  For permanent storage
services, scrubs work well -- you can have good assurance that if you read
the data once then you will likely be able to read the same data again with 
some probability based on the expected decay of the data. For ephemeral data,
you do not read the same data more than once, so there is no correlation
between reading once and reading again later.  In other words, testing the
readability of an ephemeral storage service is like a cat chasing its tail.  
IMHO,
this is particularly problematic for contemporary SSDs that implement wear 
leveling.

<sidebar>
For clusters the same sort of problem exists for path monitoring. If you think
about paths (networks, SANs, cups-n-strings) then there is no assurance 
that a failed transfer means all subsequent transfers will also fail. Some other
permanence test is required to predict future transfer failures.
s/fail/pass/g
</sidebar>

Bottom line: if you are more paranoid, mirror the separate log devices and
sleep through the night.  Pleasant dreams! :-)
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Miles Nordin
 jr == Jeroen Roodhart j.r.roodh...@uva.nl writes:

jr Running OSOL nv130. Power off the machine, removed the F20 and
jr power back on. Machines boots OK and comes up normally with
jr the following message in 'zpool status':

yeah, but try it again and this time put rpool on the F20 as well and
try to import the pool from a LiveCD: if you lose zpool.cache at this
stage, your pool is toast.  /end repeat mode




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Ragnar Sundblad

On 7 apr 2010, at 18.13, Edward Ned Harvey wrote:

 From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
 
 Rather: ... >=19 would be ... if you don't mind losing data written in
 the ~30 seconds before the crash, you don't have to mirror your log
 device.
 
 If you have a system crash, *and* a failed log device at the same time, this
 is an important consideration.  But if you have either a system crash, or a
 failed log device, that don't happen at the same time, then your sync writes
 are safe, right up to the nanosecond.  Using unmirrored nonvolatile log
 device on zpool >= 19.

Right, but if you have a power or a hardware problem, chances are
that more things really break at the same time, including the slog
device(s).

 I'd say, that "In zpool version 19 or greater, it is recommended not to
 mirror log devices." is not very good advice and should be changed.
 
 See above.  Still disagree?
 
 If desired, I could clarify the statement, by basically pasting what's
 written above.

I believe that for a mail server, NFS server (to be spec compliant),
general purpose file server and the like, where the last written data
is as important as older data (maybe even more), it would be wise to
have at least as good redundancy on the slog as on the data disks.

If one can stand the (pretty small) risk of losing the last
transaction group before a crash, at the moment typically up to the
last 30 seconds of changes, you may have less redundancy on the slog.

(And if you don't care at all, like on a web cache perhaps, you
could of course disable the zil altogether - that is kind of
the other end of the scale, which puts this in perspective.)

As Robert M so wisely and simply put it: "It is all about understanding
a risk." I think the documentation should help people take educated
decisions, though I am not right now sure how to put the words to
describe this in an easily understandable way.

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-06 Thread Jeroen Roodhart
Hi Roch,

 Can  you try 4 concurrent tar to four different ZFS
 filesystems (same pool). 

Hmmm, you're on to something here:

http://www.science.uva.nl/~jeroen/zil_compared_e1000_iostat_iops_svc_t_10sec_interval.pdf

In short: when using two exported file systems total time goes down to around 
4mins (IOPS maxes out at around 5500 when adding all four vmods together). When 
using four file systems total time goes down to around 3min30s (IOPS maxing out 
at about 9500).

I figured it is either NFS or a per file system data structure in the ZFS/ZIL 
interface. To rule out NFS I tried exporting two directories using default 
NFS shares (via /etc/dfs/dfstab entries). To my surprise this seems to bypass 
the ZIL altogether (dropping to 100 IOPS, which results from our RAIDZ2 
configuration). So clearly ZFS sharenfs is more than a nice front end for NFS 
configuration :).  

But back to your suggestion: You clearly had a hypothesis behind your question. 
Care to elaborate?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-06 Thread Edward Ned Harvey
  We ran into something similar with these drives in an X4170 that turned
  out to be an issue of the preconfigured logical volumes on the drives. Once
  we made sure all of our Sun PCI HBAs were running the exact same version of
  firmware and recreated the volumes on new drives arriving from Sun we got
  back into sync on the X25-E devices sizes.
 
 Can you elaborate?  Just today, we got the replacement drive that has
 precisely the right version of firmware and everything.  Still, when we
 plugged in that drive, and create simple volume in the storagetek
 raid utility, the new drive is 0.001 Gb smaller than the old drive.
 I'm still hosed.
 
 Are you saying I might benefit by sticking the SSD into some laptop,
 and zero'ing the disk?  And then attach to the sun server?
 
 Are you saying I might benefit by finding some other way to make the
 drive available, instead of using the storagetek raid utility?
 
 Thanks for the suggestions...

Sorry for the double post.  Since the wrong-sized drive was discussed in two
separate threads, I want to stick a link here to the other one, where the
question was answered.  Just incase anyone comes across this discussion by
search or whatever...

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039669.html

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-05 Thread Kyle McDonald
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.

 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.
 
 Actually, if there is a fdisk partition and/or disklabel on a drive when it
 arrives, I'm pretty sure that's irrelevant.  Because when I first connect a
 new drive to the HBA, of course the HBA has to sign and initialize the drive
 at a lower level than what the OS normally sees.  So unless I do some sort
 of special operation to tell the HBA to preserve/import a foreign disk, the
 HBA will make the disk blank before the OS sees it anyway.

   
That may be true. Though these days they may be spec'ing the drives to
the manufacturers at an even lower level.

So does your HBA have newer firmware now than it did when the first disk
was connected?
Maybe it's the HBA that is handling the new disks differently now, than
it did when the first one was plugged in?

Can you down-rev the HBA FW? Do you have another HBA that might still
have the older rev you could test it on?

  -Kyle


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-05 Thread Edward Ned Harvey
 From: Kyle McDonald [mailto:kmcdon...@egenera.com]

 So does your HBA have newer firmware now than it did when the first
 disk
 was connected?
 Maybe it's the HBA that is handling the new disks differently now, than
 it did when the first one was plugged in?
 
 Can you down-rev the HBA FW? Do you have another HBA that might still
 have the older rev you could test it on?

I'm planning to get the support guys more involved tomorrow, so ... things
have been pretty stagnant for several days now, I think it's time to start
putting more effort into this.

Long story short, I don't know yet.  But there is one glaring clue:  Prior
to OS installation, I don't know how to configure the HBA.  This means the
HBA must have been preconfigured with the factory installed disks, and I
followed a different process with my new disks, because I was using the GUI
within the OS.  My best hope right now is to find some other way to
configure the HBA, possibly through the ILOM, but I already searched there
and looked at everything.  Maybe I have to shut down (power cycle) the system
and attach keyboard & monitor.  I don't know yet...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Ragnar Sundblad

On 4 apr 2010, at 06.01, Richard Elling wrote:

Thank you for your reply! Just wanted to make sure.

 Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

Thanks, I have seen that mistake several times with other
(file)systems, and hope I'll never ever make it myself! :-)

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Edward Ned Harvey
 Hmm, when you did the write-back test was the ZIL SSD included in the
 write-back?

 What I was proposing was write-back only on the disks, and ZIL SSD
 with no write-back.

The tests I did were:
All disks write-through
All disks write-back
With/without SSD for ZIL

All the permutations of the above.

So, unfortunately, no, I didn't test with WriteBack enabled only for
spindles, and WriteThrough on SSD.  

It has been suggested, and this is actually what I now believe based on my
experience, that precisely the opposite would be the better configuration.
That is, spindles configured WriteThrough while the SSD is configured
WriteBack would, I believe, be optimal.

If I get the opportunity to test further, I'm interested and I will.  But
who knows when/if that will happen.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Edward Ned Harvey
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.
 
 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.

Actually, if there is a fdisk partition and/or disklabel on a drive when it
arrives, I'm pretty sure that's irrelevant.  Because when I first connect a
new drive to the HBA, of course the HBA has to sign and initialize the drive
at a lower level than what the OS normally sees.  So unless I do some sort
of special operation to tell the HBA to preserve/import a foreign disk, the
HBA will make the disk blank before the OS sees it anyway.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Casper . Dik


The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the zil synchronicity arc case:

   What I mean is that the barrier semantic is implicit even with no ZIL at 
all.
   In ZFS, if event A happens before event B, and you lose power, then
   what you'll see on disk is either nothing, A, or both A and B.  Never just B.
   It is impossible for us not to have at least barrier semantics.

So there's no chance that a *later* async write will overtake an earlier
sync *or* async write.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Neil Perrin

On 04/02/10 08:24, Edward Ned Harvey wrote:

The purpose of the ZIL is to act like a fast log for synchronous
writes.  It allows the system to quickly confirm a synchronous write
request with the minimum amount of work.  



Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Can claim "I can
answer this question, I wrote that code, or at least have read it"?
  


I'm one of the ZFS developers. I wrote most of the zil code.
Still I don't have all the answers. There's a lot of knowledgeable people
on this alias. I usually monitor this alias and sometimes chime in
when there's some misinformation being spread, but sometimes the volume 
is so high.

Since I started this reply there's been 20 new posts on this thread alone!


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls? 
  


- The intent log (separate device(s) or not) is only used by fsync, 
O_DSYNC, O_SYNC, O_RSYNC.

NFS commits are seen to ZFS as fsyncs.
Note sync(1m) and sync(2s) do not use the intent log. They force 
transaction group (txg)
commits on all pools. So zfs goes beyond the requirement for sync() 
which only requires

it schedules but does not necessarily complete the writing before returning.
The zfs interpretation is rather expensive but seemed broken so we fixed it.


Is it ever used to accelerate async writes?



The zil is not used to accelerate async writes.
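
A minimal C sketch of the distinction described above (the pool path and file
names are arbitrary examples, and error handling is pared down): only the
fsync() call and the O_DSYNC descriptor below would go through the intent log;
the plain write() is async and simply lands in a transaction group.

/* Which write paths engage the intent log, per the description above.
 * File names and sizes are made-up examples. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "example record\n";

    /* Plain write(): asynchronous as far as ZFS is concerned; it is
     * committed with some transaction group and never hits the ZIL. */
    int fd = open("/tank/async.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open async"); return 1; }
    if (write(fd, buf, strlen(buf)) < 0) perror("write async");

    /* fsync(): blocks until the file's data is on stable storage; this is
     * the case the ZIL (slog or in-pool) services. NFS commits reach ZFS
     * as fsyncs. */
    if (fsync(fd) != 0) perror("fsync");
    close(fd);

    /* O_DSYNC: every write() on this descriptor is synchronous, so each
     * one goes through the intent log as well. */
    int sfd = open("/tank/sync.dat",
                   O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (sfd < 0) { perror("open sync"); return 1; }
    if (write(sfd, buf, strlen(buf)) < 0) perror("write sync");
    close(sfd);

    return 0;
}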


Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.
  


Threads can be pre-empted in the OS at any time. So even though thread A 
issued
W1 before thread B issued W2, the order is not guaranteed to arrive at 
ZFS as W1, W2.

Multi-threaded applications have to handle this.

If this was a single thread issuing W1 then W2 then yes the order is 
guaranteed

regardless of whether W1 or W2 are synchronous or asynchronous.
Of course if the system crashes then the async operations might not be 
there.
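
A minimal C sketch of both points, assuming a POSIX-threads setup and made-up
file names: the two threads' writes may reach ZFS in either order, while the
single-threaded writes keep their issue order (though async data may still be
missing after a crash).

/* Two threads: W1 and W2 race, so their order of arrival at ZFS is
 * undefined unless the application synchronizes them itself.
 * One thread: W3 then W4 -- issue order is preserved, sync or async. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *writer(void *arg)
{
    const char *path = arg;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return NULL; }
    const char rec[] = "record\n";
    if (write(fd, rec, strlen(rec)) < 0)   /* may land before or after the */
        perror("write");                   /* other thread's write         */
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, writer, "/tank/file-a");  /* W1 */
    pthread_create(&b, NULL, writer, "/tank/file-b");  /* W2 */
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    int fd = open("/tank/file-c", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd >= 0) {
        if (write(fd, "W3\n", 3) < 0) perror("W3");
        if (write(fd, "W4\n", 3) < 0) perror("W4");
        close(fd);
    }
    return 0;
}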



I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?
  


- Kind of. The uberblock contains the root of the txg.



At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?
  


A txg is for the whole pool which can contain many filesystems.
The latest txg defines the current state of the pool and each individual fs.


My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.


Correct (except replace sync() with O_DSYNC, etc).
This also assumes hardware that, for example, correctly handles the
flushing of its caches.



  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.
  


The ZIL doesn't make such guarantees. It's the DMU that handles transactions
and their grouping into txgs. It ensures that writes are committed in order
by its transactional nature.

The function of the zil is to merely ensure that synchronous operations are
stable and replayed after a crash/power fail onto the latest txg.


Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.
  

No, disabling the ZIL does not disable the DMU.


Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you 

Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Jeroen Roodhart
Hi Al,

 Have you tried the DDRdrive from Christopher George
 cgeo...@ddrdrive.com?
 Looks to me like a much better fit for your application than the F20?
 
 It would not hurt to check it out.  Looks to me like
 you need a product with low *latency* - and a RAM based cache
 would be a much better performer than any solution based solely on
 flash.
 
 Let us know (on the list) how this works out for you.

Well, I did look at it but at that time there was no Solaris support yet. Right 
now it seems there is only a beta driver? I kind of remember that if you'd want 
reliable fallback to nvram, you'd need an UPS feeding the card. I could be very 
wrong there, but the product documentation isn't very clear on this (at least 
to me ;) ) 

Also, we'd kind of like to have a SnOracle supported option. 

But yeah, on paper it does seem it could be an attractive solution...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Christopher George
 Well, I did look at it but at that time there was no Solaris support yet. 
 Right now it 
 seems there is only a beta driver?

Correct, we just completed functional validation of the OpenSolaris driver.  
Our 
focus has now turned to performance tuning and benchmarking.  We expect to 
formally introduce the DDRdrive X1 to the ZFS community later this quarter.  It 
is our 
goal to focus exclusively on the dedicated ZIL device market going forward. 

  I kind of remember that if you'd want reliable fallback to nvram, you'd need 
 an 
  UPS feeding the card.

Currently, a dedicated external UPS is required for correct operation.  Based 
on 
community feedback, we will be offering automatic backup/restore prior to 
release.  
This guarantees the UPS will only be required for 60 secs to successfully 
back up 
the drive contents on a host power or hardware failure.  Dutifully on the next 
reboot 
the restore will occur prior to the OS loading for seamless non-volatile 
operation.

Also, we have heard loud and clear the requests for an internal power option.  It 
is our 
intention the X1 will be the first in a family of products all dedicated to ZIL 
acceleration for not only OpenSolaris but also Solaris 10 and FreeBSD.

  Also, we'd kind of like to have a SnOracle supported option.

Although a much smaller company, we believe our singular focus and absolute 
passion 
for ZFS and the potential of Hybrid Storage Pools will serve our customers well.

We are actively designing our soon to be available support plans.  Your voice 
will be 
heard, please email directly at cgeorge at ddrdrive dot com for requests, 
comments
and/or questions.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Ragnar Sundblad

On 1 apr 2010, at 06.15, Stuart Anderson wrote:

 Assuming you are also using a PCI LSI HBA from Sun that is managed with
 a utility called /opt/StorMan/arcconf and reports itself as the amazingly
 informative model number Sun STK RAID INT what worked for me was to run,
 arcconf delete (to delete the pre-configured volume shipped on the drive)
 arcconf create (to create a new volume)

Just to sort things out (or not? :-): 

I more than agree that this product is highly confusing, but I
don't think there is anything LSI in or about that card. I believe
it is an Adaptec card, developed, manufactured and supported by
Intel for Adaptec, licensed (or something) to StorageTek, and later
included in Sun machines (since Sun bought StorageTek, I suppose).
Now we could add Oracle to this name dropping inferno, if we would
want to.

I am not sure why they (Sun) put those in there, they don't seem
very fast or smart or anything.

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Ragnar Sundblad

On 2 apr 2010, at 22.47, Neil Perrin wrote:

 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.
   
 
 
 Threads can be pre-empted in the OS at any time. So even though thread A 
 issued
 W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as 
 W1, W2.
 Multi-threaded applications have to handle this.
 
 If this was a single thread issuing W1 then W2 then yes the order is 
 guaranteed
 regardless of whether W1 or W2 are synchronous or asynchronous.
 Of course if the system crashes then the async operations might not be there.

Could you please clarify this last paragraph a little:
Do you mean that this is in the case that you have ZIL enabled
and the txg for W1 and W2 hasn't been commited, so that upon reboot
the ZIL is replayed, and therefore only the sync writes are
eventually there?

If, let's say, W1 is an async small write, W2 is a sync small write,
W1 arrives to zfs before W2, and W2 arrives before the txg is
commited, will both writes always be in the txg on disk?
If so, it would mean that zfs itself never buffers up async writes to
larger blurbs to write at a later txg, correct?
I take it that ZIL enabled or not does not make any difference here
(we pretend the system did _not_ crash), correct?

Thanks!

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Richard Elling
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote:
 On 2 apr 2010, at 22.47, Neil Perrin wrote:
 
 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.
 
 
 
 Threads can be pre-empted in the OS at any time. So even though thread A 
 issued
 W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS 
 as W1, W2.
 Multi-threaded applications have to handle this.
 
 If this was a single thread issuing W1 then W2 then yes the order is 
 guaranteed
 regardless of whether W1 or W2 are synchronous or asynchronous.
 Of course if the system crashes then the async operations might not be there.
 
 Could you please clarify this last paragraph a little:
 Do you mean that this is in the case that you have ZIL enabled
 and the txg for W1 and W2 hasn't been commited, so that upon reboot
 the ZIL is replayed, and therefore only the sync writes are
 eventually there?

yes. The ZIL needs to be replayed on import after an unclean shutdown.

 If, let's say, W1 is an async small write, W2 is a sync small write,
 W1 arrives to zfs before W2, and W2 arrives before the txg is
 commited, will both writes always be in the txg on disk?

yes

 If so, it would mean that zfs itself never buffers up async writes to
 larger blurbs to write at a later txg, correct?

correct

 I take it that ZIL enabled or not does not make any difference here
 (we pretend the system did _not_ crash), correct?

For import following a clean shutdown, there are no transactions in 
the ZIL to apply.

For async-only workloads, there are no transactions in the ZIL to apply.

Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik

On 01/04/2010 20:58, Jeroen Roodhart wrote:

 I'm happy to see that it is now the default and I hope this will cause the
 Linux NFS client implementation to be faster for conforming NFS servers.
  
 Interesting thing is that apparently defaults on Solaris and Linux are chosen 
 such that one can't signal the desired behaviour to the other. At least we 
 didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS 
 backed) NFS export...


Which is to be expected as it is not an NFS client which requests the 
behavior but rather an NFS server.
Currently on Linux you can export a share as sync (default) or 
async, while on Solaris you can't really currently force an NFS 
server to start working in async mode.


The other part of the issue is that the Solaris clients have been 
developed with a sync server.  The client does more write-behind and
continues caching the non-acked data.  The Linux client has been developed 
with an async server and has some catching up to do.


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Roch

Robert Milkowski writes:
  On 01/04/2010 20:58, Jeroen Roodhart wrote:
  
   I'm happy to see that it is now the default and I hope this will cause the
   Linux NFS client implementation to be faster for conforming NFS servers.

   Interesting thing is that apparently defaults on Solaris and Linux are 
   chosen such that one can't signal the desired behaviour to the other. At 
   least we didn't manage to get a Linux client to asynchronously mount a 
   Solaris (ZFS backed) NFS export...
  
  
   Which is to be expected as it is not an NFS client which requests the 
   behavior but rather an NFS server.
   Currently on Linux you can export a share as sync (default) or 
   async, while on Solaris you can't really currently force an NFS 
   server to start working in async mode.
  

True, and there is an entrenched misconception (not you)
that this is a ZFS-specific problem, which it's not. 

It's really an NFS protocol feature which can be
circumvented using zil_disable, which therefore reinforces
the misconception.  It's further reinforced by testing an NFS
server on disk drives with WCE=1 and a filesystem other than ZFS.

All fast options cause the NFS client to become inconsistent
after a server reboot. Whatever was being done in the moments
prior to the server reboot will need to be wiped out by users if
they are told that the server did reboot. That's manageable
for home use, not for the enterprise.

-r





  -- 
  Robert Milkowski
  http://milek.blogspot.com
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
  Seriously, all disks configured WriteThrough (spindle and SSD disks
  alike)
  using the dedicated ZIL SSD device, very noticeably faster than
  enabling the
  WriteBack.
 
 What do you get with both SSD ZIL and WriteBack disks enabled?
 
 I mean if you have both why not use both? Then both async and sync IO
 benefits.

Interesting, but unfortunately false.  Soon I'll post the results here.  I
just need to package them in a way suitable to give the public, and stick it
on a website.  But I'm fighting IT fires for now and haven't had the time
yet.

Roughly speaking, the following are approximately representative.  Of course
it varies based on tweaks of the benchmark and stuff like that.
Stripe 3 mirrors write through:  450-780 IOPS
Stripe 3 mirrors write back:  1030-2130 IOPS
Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
ZIL is 3-4 times faster than naked disk.  And for some reason, having the
WriteBack enabled while you have SSD ZIL actually hurts performance by
approx 10%.  You're better off to use the SSD ZIL with disks in Write
Through mode.

That result is surprising to me.  But I have a theory to explain it.  When
you have WriteBack enabled, the OS issues a small write, and the HBA
immediately returns to the OS:  "Yes, it's on nonvolatile storage."  So the
OS quickly gives it another, and another, until the HBA write cache is full.
Now the HBA faces the task of writing all those tiny writes to disk, and the
HBA must simply follow orders, writing a tiny chunk to the sector it said it
would write, and so on.  The HBA cannot effectively consolidate the small
writes into a larger sequential block write.  But if you have the WriteBack
disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
SSD, and immediately return to the process:  "Yes, it's on nonvolatile
storage."  So the application can issue another, and another, and another.
ZFS is smart enough to aggregate all these tiny write operations into a
single larger sequential write before sending it to the spindle disks.  

Long story short, the evidence suggests if you have SSD ZIL, you're better
off without WriteBack on the HBA.  And I conjecture the reasoning behind it
is because ZFS can write buffer better than the HBA can.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 I know it is way after the fact, but I find it best to coerce each
 drive down to the whole GB boundary using format (create Solaris
 partition just up to the boundary). Then if you ever get a drive a
 little smaller it still should fit.

It seems like it should be unnecessary.  It seems like extra work.  But
based on my present experience, I reached the same conclusion.

If my new replacement SSD with identical part number and firmware is 0.001
Gb smaller than the original and hence unable to mirror, what's to prevent
the same thing from happening to one of my 1TB spindle disk mirrors?
Nothing.  That's what.

I take it back.  Me.  I am to prevent it from happening.  And the technique
to do so is precisely as you've said.  First slice every drive to be a
little smaller than actual.  Then later if I get a replacement device for
the mirror, that's slightly smaller than the others, I have no reason to
care.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Roch

  When we use one vmod, both machines are finished in about 6min45,
  zilstat maxes out at about 4200 IOPS.
  Using four vmods it takes about 6min55, zilstat maxes out at 2200
  IOPS. 

Can  you try 4 concurrent tar to four different ZFS filesystems (same
pool). 

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
  http://nfs.sourceforge.net/
 
 I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to
disk, in what way sync and async writes are handled by the OS, and what
happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async.  Not NFS sync and async.

I don't think anything relating to NFS is the answer to Casper's question,
or else, Casper was simply jumping context by asking it.  Don't get me
wrong, I have no objection to his question or anything, it's just that the
conversation has derailed and now people are talking about NFS sync/async
instead of what happens when a C/C++ application is doing sync/async writes
to a disabled ZIL.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
  I am envisioning a database, which issues a small sync write,
 followed by a
  larger async write.  Since the sync write is small, the OS would
 prefer to
  defer the write and aggregate into a larger block.  So the
 possibility of
  the later async write being committed to disk before the older sync
 write is
  a real risk.  The end result would be inconsistency in my database
 file.
 
 Zfs writes data in transaction groups and each bunch of data which
 gets written is bounded by a transaction group.  The current state of
 the data at the time the TXG starts will be the state of the data once
 the TXG completes.  If the system spontaneously reboots then it will
 restart at the last completed TXG so any residual writes which might
 have occured while a TXG write was in progress will be discarded.
 Based on this, I think that your ordering concerns (sync writes
 getting to disk faster than async writes) are unfounded for normal
 file I/O.

So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?

If that's true, if there's no increased risk of data corruption, then why
doesn't everybody just disable their ZIL all the time on every system?

The reason to have a sync() function in C/C++ is so you can ensure data is
written to disk before you move on.  It's a blocking call, that doesn't
return until the sync is completed.  The only reason you would ever do this
is if order matters.  If you cannot allow the next command to begin until
after the previous one was completed.  Such is the situation with databases
and sometimes virtual machines.  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 hello
 
 i have had this problem this week. our zil ssd died (apt slc ssd 16gb).
 because we had no spare drive in stock, we ignored it.
 
 then we decided to update our nexenta 3 alpha to beta, exported the
 pool and made a fresh install to have a clean system and tried to
 import the pool. we only got a error message about a missing drive.
 
 we googled about this and it seems there is no way to acces the pool
 !!!
 (hope this will be fixed in future)
 
 we had a backup and the data are not so important, but that could be a
 real problem.
 you have  a valid zfs3 pool and you cannot access your data due to
 missing zil.

If you have zpool less than version 19 (when ability to remove log device
was introduced) and you have a non-mirrored log device that failed, you had
better treat the situation as an emergency.  Normally you can find your
current zpool version by doing zpool upgrade, but you cannot now if you're
in this failure state.  Do not attempt "zfs send" or "zfs list" or any other
zpool or zfs command.  Instead, do "man zpool" and look for "zpool remove".
If it says "supports removing log devices" then you had better use it to
remove your log device.  If it says "only supports removing hotspares or
cache" then your zpool is lost permanently.

If you are running Solaris, take it as given, you do not have zpool version
19.  If you are running Opensolaris, I don't know at which point zpool 19
was introduced.  Your only hope is to zpool remove the log device.  Use
tar or cp or something, to try and salvage your data out of there.  Your
zpool is lost and if it's functional at all right now, it won't stay that
way for long.  Your system will soon hang, and then you will not be able to
import your pool.

Ask me how I know.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 ZFS recovers to a crash-consistent state, even without the slog,
 meaning it recovers to some state through which the filesystem passed
 in the seconds leading up to the crash.  This isn't what UFS or XFS
 do.
 
 The on-disk log (slog or otherwise), if I understand right, can
 actually make the filesystem recover to a crash-INconsistent state (a

You're speaking the opposite of common sense.  If disabling the ZIL makes
the system faster *and* less prone to data corruption, please explain why we
don't all disable the ZIL?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 If you have zpool less than version 19 (when ability to remove log
 device
 was introduced) and you have a non-mirrored log device that failed, you
 had
 better treat the situation as an emergency.  

 Instead, do man zpool and look for zpool
 remove.
 If it says supports removing log devices then you had better use it
 to
 remove your log device.  If it says only supports removing hotspares
 or
 cache then your zpool is lost permanently.

I take it back.  If you lost your log device on a zpool which is less than
version 19, then you *might* have a possible hope if you migrate your disks
to a later system.  You *might* be able to zpool import on a later version
of OS.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik

  http://nfs.sourceforge.net/
 
 I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to
disk, in what way sync and async writes are handled by the OS, and what
happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async.  Not NFS sync and async.

I don't think so.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html

(This discussion was started, I think, in the context of NFS performance)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik


So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier
write can be in a later txg.  A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar 
question to make sure I understand it correctly, and the answer was:

 = Casper, the answer is from Neil Perrin:

 Is there a partial order defined for all filesystem operations?
   

File system operations  will be written in order for all settings of 
the 
sync flag.

 Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a
 file,
   
(I assume by O_DATA you meant O_DSYNC).

 that later transactions will not be in an earlier transaction group?
 (Or is this already the case?)
  
This is already the case.


So what I assumed was true but what you made me doubt, was apparently still
true: later transactions cannot be committed in an earlier txg.
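
To make the scenario being debated concrete, a minimal single-threaded C
sketch under that guarantee (file names and the database framing are
hypothetical): a small sync write W1 followed by a large async write W2.
A crash can leave neither write, only W1, or both on disk, but never W2 alone.

/* W1: small synchronous write; W2: large asynchronous write issued later
 * by the same thread. Per the txg ordering above, W2 can never be
 * committed in an earlier txg than W1. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int log = open("/tank/db/journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int dat = open("/tank/db/table",   O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (log < 0 || dat < 0) { perror("open"); return 1; }

    /* W1: blocks until on stable storage (serviced via the ZIL when the
     * ZIL is enabled). */
    const char rec[] = "commit txn 42\n";
    if (write(log, rec, strlen(rec)) < 0 || fsync(log) != 0) {
        perror("W1");
        return 1;
    }

    /* W2: buffered and committed with some later transaction group. */
    size_t big = 1 << 20;              /* 1 MiB payload, arbitrary */
    char *blob = calloc(1, big);
    if (blob == NULL || write(dat, blob, big) < 0) perror("W2");

    free(blob);
    close(log);
    close(dat);
    return 0;
}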



If that's true, if there's no increased risk of data corruption, then why
doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference.
When the system panics you know that data might be lost.  The application 
also dies.  (The snapshot and the last valid uberblock are equally valid)

But for an application on an NFS client, without ZIL data will be lost 
while the NFS client believes the data is written and it will not try 
again.  With the ZIL, when the NFS server says that data is written then 
it is actually on stable storage.
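
A small C sketch of why that matters to an NFS-style client (the file name
and the printed ack are hypothetical stand-ins for the server's reply): the
server acknowledges only after fsync() returns, and the client then discards
its own copy of the data.

/* With the ZIL enabled, sending the ack after fsync() means the record is
 * on stable storage. With the ZIL disabled, fsync() no longer gives that
 * guarantee, so a crash can lose a record the client was told is safe --
 * the NFS-client inconsistency described above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int store_record(int fd, const char *rec)
{
    if (write(fd, rec, strlen(rec)) < 0)
        return -1;
    return fsync(fd);            /* only ack to the client after this */
}

int main(void)
{
    int fd = open("/export/mailspool", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (store_record(fd, "message 1001\n") == 0)
        puts("ACK to client");   /* client now drops its own copy */
    else
        puts("NACK to client");

    close(fd);
    return 0;
}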

The reason to have a sync() function in C/C++ is so you can ensure data is
written to disk before you move on.  It's a blocking call, that doesn't
return until the sync is completed.  The only reason you would ever do this
is if order matters.  If you cannot allow the next command to begin until
after the previous one was completed.  Such is the situation with databases
and sometimes virtual machines.  

So the question is: when will your data be invalid?

What happens with the data when the system dies before the fsync() call?
What happens with the data when the system dies after the fsync() call?
What happens with the data when the system dies after more I/O operations?

With the zil disabled, you call fsync() but you may encounter data from
before the call to fsync().  That state could also have occurred before the
call to fsync(), so I assume you can actually recover from that situation.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 Dude, don't be so arrogant.  Acting like you know what I'm talking
 about
 better than I do.  Face it that you have something to learn here.
 
 You may say that, but then you post this:

Acknowledged.  I read something arrogant, and I replied even more arrogant.
That was dumb of me.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 Only a broken application uses sync writes
 sometimes, and async writes at other times.

Suppose there is a virtual machine, with virtual processes inside it.  Some
virtual process issues a sync write to the virtual OS, meanwhile another
virtual process issues an async write.  Then the virtual OS will sometimes
issue sync writes and sometimes async writes to the host OS.

Are you saying this makes qemu, and vbox, and vmware broken applications?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 The purpose of the ZIL is to act like a fast log for synchronous
 writes.  It allows the system to quickly confirm a synchronous write
 request with the minimum amount of work.  

Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Can claim "I can
answer this question, I wrote that code, or at least have read it"?

Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?  Is it
ever used to accelerate async writes?

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?  Is it
ever used to accelerate async writes?

There are quite a few sync writes, specifically when you mix in the 
NFS server.

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

From what I quoted from the other discussion, it seems that later writes 
cannot be committed in an earlier TXG than your sync write or other earlier
writes.

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

The uberblock is the root of all the data.  All the data in a ZFS pool 
is referenced by it; after the txg is in stable storage then the uberblock 
is updated.

At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?

The current zpool and the filesystems such as referenced by the last
uberblock.

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

sync() is actually *async* and returning from sync() says nothing about 
stable storage.  After fsync() returns it signals that all the data is
in stable storage (except if you disable ZIL), or, apparently, in Linux
when the write caches for your disks are enabled (the default for PC
drives).  ZFS doesn't care about the writecache; it makes sure it is 
flushed.  (There's fsync() and open(..., O_DSYNC|O_SYNC).)

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.


I believe that the writes are still ordered so the consistency you want is 
actually delivered even without the ZIL enabled.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Kyle McDonald
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
 I know it is way after the fact, but I find it best to coerce each
 drive down to the whole GB boundary using format (create Solaris
 partition just up to the boundary). Then if you ever get a drive a
 little smaller it still should fit.
 
 It seems like it should be unnecessary.  It seems like extra work.  But
 based on my present experience, I reached the same conclusion.

 If my new replacement SSD with identical part number and firmware is 0.001
 Gb smaller than the original and hence unable to mirror, what's to prevent
 the same thing from happening to one of my 1TB spindle disk mirrors?
 Nothing.  That's what.

   
Actually, It's my experience that Sun (and other vendors) do exactly
that for you when you buy their parts - at least for rotating drives, I
have no experience with SSD's.

The Sun disk label shipped on all the drives is setup to make the drive
the standard size for that sun part number. They have to do this since
they (for many reasons) have many sources (diff. vendors, even diff.
parts from the same vendor) for the actual disks they use for a
particular Sun part number.

This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same
reasons.
I'm a little surprised that the engineers would suddenly stop doing it
only on SSD's. But who knows.

  -Kyle

 I take it back.  Me.  I am to prevent it from happening.  And the technique
 to do so is precisely as you've said.  First slice every drive to be a
 little smaller than actual.  Then later if I get a replacement device for
 the mirror, that's slightly smaller than the others, I have no reason to
 care.

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Mattias Pantzare
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey solar...@nedharvey.com wrote:
 The purpose of the ZIL is to act like a fast log for synchronous
 writes.  It allows the system to quickly confirm a synchronous write
 request with the minimum amount of work.

 Bob and Casper and some others clearly know a lot here.  But I'm hearing
 conflicting information, and don't know what to believe.  Does anyone here
 work on ZFS as an actual ZFS developer for Sun/Oracle?  Can claim "I can
 answer this question, I wrote that code, or at least have read it"?

 Questions to answer would be:

 Is a ZIL log device used only by sync() and fsync() system calls?  Is it
 ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will
not use ZIL, it will just start a new TXG, and could return before the
writes are done.

fsync() is what you are interested in.
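
A minimal C sketch of that difference (arbitrary file name): sync() only
schedules the flush, while fsync() blocks until this file's data is on
stable storage.

/* sync(2) vs fsync(): per the note above, sync() starts a txg commit but
 * may return before the data is on disk; fsync() does not return until
 * this file's data is on stable storage. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tank/demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "data that must survive a crash\n";
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    sync();                    /* schedules pool-wide writeback; no
                                  durability guarantee on return     */

    if (fsync(fd) != 0) {      /* durable once this returns 0 */
        perror("fsync");
        return 1;
    }

    close(fd);
    return 0;
}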


 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.


Writes from a TXG will not be used until the whole TXG is committed to disk.
Everything from a half-written TXG will be ignored after a crash.

This means that the order of writes within a TXG is not important.

The only way to do a sync write without ZIL is to start a new TXG
after the write. That costs a lot so we have the ZIL for sync writes.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Bob Friesenhahn

On Fri, 2 Apr 2010, Edward Ned Harvey wrote:


So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?


I am like a pool or tank of regurgitated zfs knowledge.  I simply 
pay attention when someone who really knows explains something (e.g. 
Neil Perrin, as Casper referred to) so I can regurgitate it later.  I 
try to do so faithfully.  If I had behaved this way in school, I would 
have been a good student.  Sometimes I am wrong or the design has 
somewhat changed since the original information was provided.


There are indeed popular filesystems (e.g. Linux EXT4) which write 
data to disk in a different order than chronologically requested, so it is 
good that you are paying attention to these issues.  While in the 
slog-based recovery scenario it is possible for a TXG to be generated 
which lacks async data, this only happens after a system crash; if 
all of the critical data is written as a sync request, it will be 
faithfully preserved.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Bob Friesenhahn

On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.


You seem to be assuming that Solaris is an incoherent operating 
system.  With ZFS, the filesystem in memory is coherent, and 
transaction groups are constructed in simple chronological order 
(capturing combined changes up to that point in time), without regard 
to SYNC options.  The only possible exception to the coherency is for 
memory mapped files, where the mapped memory is a copy of data 
(originally) from the ZFS ARC and needs to be reconciled with the ARC 
if an application has dirtied it.  This differs from UFS and the way 
Solaris worked prior to Solaris 10.


Synchronous writes are not faster than asynchronous writes.  If you 
drop heavy and light objects from the same height, they fall at the 
same rate.  This was proven long ago.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Stuart Anderson

On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote:

 I know it is way after the fact, but I find it best to coerce each
 drive down to the whole GB boundary using format (create Solaris
 partition just up to the boundary). Then if you ever get a drive a
 little smaller it still should fit.
 
 It seems like it should be unnecessary.  It seems like extra work.  But
 based on my present experience, I reached the same conclusion.
 
 If my new replacement SSD with identical part number and firmware is 0.001
 Gb smaller than the original and hence unable to mirror, what's to prevent
 the same thing from happening to one of my 1TB spindle disk mirrors?
 Nothing.  That's what.
 
 I take it back.  Me.  I am to prevent it from happening.  And the technique
 to do so is precisely as you've said.  First slice every drive to be a
 little smaller than actual.  Then later if I get a replacement device for
 the mirror, that's slightly smaller than the others, I have no reason to
 care.

However, I believe there are some downsides to letting ZFS manage just
a slice rather than an entire drive, but perhaps those do not apply as
significantly to SSD devices?

Thanks

--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Ross Walker
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey
solar...@nedharvey.com wrote:
  Seriously, all disks configured WriteThrough (spindle and SSD disks
  alike)
  using the dedicated ZIL SSD device, very noticeably faster than
  enabling the
  WriteBack.

 What do you get with both SSD ZIL and WriteBack disks enabled?

 I mean if you have both why not use both? Then both async and sync IO
 benefits.

 Interesting, but unfortunately false.  Soon I'll post the results here.  I
 just need to package them in a way suitable to give the public, and stick it
 on a website.  But I'm fighting IT fires for now and haven't had the time
 yet.

 Roughly speaking, the following are approximately representative.  Of course
 it varies based on tweaks of the benchmark and stuff like that.
        Stripe 3 mirrors write through:  450-780 IOPS
        Stripe 3 mirrors write back:  1030-2130 IOPS
        Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
        Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

 Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
 ZIL is 3-4 times faster than naked disk.  And for some reason, having the
 WriteBack enabled while you have SSD ZIL actually hurts performance by
 approx 10%.  You're better off to use the SSD ZIL with disks in Write
 Through mode.

 That result is surprising to me.  But I have a theory to explain it.  When
 you have WriteBack enabled, the OS issues a small write, and the HBA
 immediately returns to the OS:  Yes, it's on nonvolatile storage.  So the
 OS quickly gives it another, and another, until the HBA write cache is full.
 Now the HBA faces the task of writing all those tiny writes to disk, and the
 HBA must simply follow orders, writing a tiny chunk to the sector it said it
 would write, and so on.  The HBA cannot effectively consolidate the small
 writes into a larger sequential block write.  But if you have the WriteBack
 disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
 SSD, and immediately return to the process:  Yes, it's on nonvolatile
 storage.  So the application can issue another, and another, and another.
 ZFS is smart enough to aggregate all these tiny write operations into a
 single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test, was the ZIL SSD included in the
write-back?

What I was proposing was write-back only on the disks, and ZIL SSD
with no write-back.

Not all operations hit the ZIL, so it would still be nice to have the
non-ZIL operations return quickly.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Robert Milkowski

On 02/04/2010 16:04, casper@sun.com wrote:


sync() is actually *async* and returning from sync() says nothing about


to clarify - in case of ZFS sync() is actually synchronous.

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Tirso Alonso
 If my new replacement SSD with identical part number and firmware is 0.001
 Gb smaller than the original and hence unable to mirror, what's to prevent
 the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufatures use (IDEMA LBA1-02):

LBA count = (97696368) + (1953504 * (Desired Capacity in Gbytes – 50.0)) 

Sizes should match exactly if the manufacturer follows the standard.
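
As a quick illustration of the formula (a sketch only; the 100 GB figure is
just an example, and the formula only applies from 50 GB up):

    #include <stdio.h>

    /* IDEMA LBA1-02: LBA count for a nominal capacity in GB (>= 50 GB). */
    static long long idema_lba_count(double gbytes)
    {
        return 97696368LL + (long long)(1953504.0 * (gbytes - 50.0));
    }

    int main(void)
    {
        double gb = 100.0;              /* example nominal capacity */
        long long lbas = idema_lba_count(gb);
        /* 97,696,368 + 1,953,504 * 50 = 195,371,568 LBAs,
           i.e. 195,371,568 * 512 bytes = 100,030,242,816 bytes. */
        printf("%.0f GB -> %lld LBAs (%lld bytes)\n", gb, lbas, lbas * 512LL);
        return 0;
    }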

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Miles Nordin
 enh == Edward Ned Harvey solar...@nedharvey.com writes:

   enh If you have zpool less than version 19 (when ability to remove
   enh log device was introduced) and you have a non-mirrored log
   enh device that failed, you had better treat the situation as an
   enh emergency.

Ed, the log device removal support is only good for adding a slog to
try it out, then changing your mind and removing the slog (which was
not possible before).  It doesn't change the reliability situation one
bit: pools with dead slogs are not importable.  There've been threads
on this for a while.  It's well-discussed because it's an example of
IMHO broken process of ``obviously a critical requirement but not
technically part of the original RFE which is already late,'' as well
as a dangerous pitfall for ZFS admins.  I imagine the process works
well in other cases to keep stuff granular enough that it can be
prioritized effectively, but in this case it's made the slog feature
significantly incomplete for a couple years and put many production
systems in a precarious spot, and the whole mess was predicted before
the slog feature was integrated.

  The on-disk log (slog or otherwise), if I understand right, can
  actually make the filesystem recover to a crash-INconsistent
  state 

   enh You're speaking the opposite of common sense.  

Yeah, I'm doing it on purpose to suggest that just guessing how you
feel things ought to work based on vague notions of economy isn't a
good idea.

   enh If disabling the ZIL makes the system faster *and* less prone
   enh to data corruption, please explain why we don't all disable
   enh the ZIL?

I said complying with fsync can make the system recover to a state not
equal to one you might have hypothetically snapshotted in a moment
leading up to the crash.  Elsewhere I might've said disabling the ZIL
does not make the system more prone to data corruption, *iff* you are
not an NFS server.

If you are, disabling the ZIL can lead to lost writes if an NFS server
reboots and an NFS client does not, which can definitely cause
app-level data corruption.

Disabling the ZIL breaks the D requirement of ACID databases which
might screw up apps that replicate, or keep databases on several
separate servers in sync, and it might lead to lost mail on an MTA,
but because unlike non-COW filesystems it costs nothing extra for ZFS
to preserve write ordering even without fsync(), AIUI you will not get
corrupted application-level data by disabling the ZIL.  You just get
missing data that the app has a right to expect should be there.  The
dire warnings written by kernel developers in the wikis of ``don't
EVER disable the ZIL'' are totally ridiculous and inappropriate IMO.
I think they probably just worked really hard to write the ZIL piece
of ZFS, and don't want people telling their brilliant code to fuckoff
just because it makes things a little slower.  so we get all this
``enterprise'' snobbery and so on.

``crash consistent'' is a technical term not a common-sense term, and
I may have used it incorrectly:

 
http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html

With a system that loses power on which fsync() had been in use, the
files getting fsync()'ed will probably recover to more recent versions
than the rest of the files, which means the recovered state achieved
by yanking the cord couldn't have been emulated by cloning a snapshot
and not actually having lost power.  However, the app calling fsync()
will expect this, so it's not supposed to lead to application-level
inconsistency.  

If you test your app's recovery ability in just that way, by cloning
snapshots of filesystems on which the app is actively writing and then
seeing if the app can recover the clone, then you're unfortunately not
testing the app quite hard enough if fsync() is involved, so yeah I
guess disabling the ZIL might in theory make incorrectly-written apps
less prone to data corruption.  Likewise, no testing of the app on a
ZFS will be aggressive enough to make the app powerfail-proof on a
non-COW POSIX system because ZFS keeps more ordering than the API
actually guarantees to the app.

I'm repeating myself though.  I wish you'll just read my posts with at
least paragraph granularity instead of just picking out individual
sentences and discarding everything that seems too complicated or too
awkwardly stated.

I'm basing this all on the ``common sense'' that to do otherwise,
fsync() would have to completely ignore its filedescriptor
argument. It'd have to copy the entire in-memory ZIL to the slog and
behave the same as 'lockfs -fa', which I think would perform too badly
compared to non-ZFS filesystems' fsync()s, and would lead to emphatic
performance advice like ``segregate files that get lots of fsync()s
into separate ZFS datasets from files that get high write bandwidth,''
and we don't have advice like that in the blogs/lists/wikis which
makes me think it's not beneficial (the benefit would be 

Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Tim Cook
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald kmcdon...@egenera.comwrote:

 On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
  I know it is way after the fact, but I find it best to coerce each
  drive down to the whole GB boundary using format (create Solaris
  partition just up to the boundary). Then if you ever get a drive a
  little smaller it still should fit.
 
  It seems like it should be unnecessary.  It seems like extra work.  But
  based on my present experience, I reached the same conclusion.
 
  If my new replacement SSD with identical part number and firmware is
 0.001
  Gb smaller than the original and hence unable to mirror, what's to
 prevent
  the same thing from happening to one of my 1TB spindle disk mirrors?
  Nothing.  That's what.
 
 
 Actually, it's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives; I
 have no experience with SSD's.

 The Sun disk label shipped on all the drives is set up to make the drive
 the standard size for that Sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.

 This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same
 reasons.
 I'm a little surprised that the engineers would suddenly stop doing it
 only on SSD's. But who knows.

  -Kyle



If I were forced to ignorantly cast a stone, it would be into Intel's lap
(if the SSD's indeed came directly from Sun).  Sun's normal drive vendors
have been in this game for decades, and know the expectations.  Intel on the
other hand, may not have quite the same QC in place yet.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Eric D. Mudama

On Fri, Apr  2 at 11:14, Tirso Alonso wrote:

If my new replacement SSD with identical part number and firmware is 0.001
Gb smaller than the original and hence unable to mirror, what's to prevent
the same thing from happening to one of my 1TB spindle disk mirrors?


There is a standard for sizes that many manufatures use (IDEMA LBA1-02):

LBA count = (97696368) + (1953504 * (Desired Capacity in Gbytes – 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066


Problem is that it only applies to devices that are >= 50GB in size,
and the X25 in question is only 32GB.

That being said, I'd be skeptical of either the sourcing of the parts,
or else some other configuration feature on the drives (like HPA or
DCO) that is changing the capacity.  It's possible one of these is in
effect.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Al Hopper
Hi Jeroen,

Have you tried the DDRdrive from Christopher George cgeo...@ddrdrive.com?
Looks to me like a much better fit for your application than the F20?

It would not hurt to check it out.  Looks to me like you need a
product with low *latency* - and a RAM based cache would be a much
better performer than any solution based solely on flash.

Let us know (on the list) how this works out for you.

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX a...@logical-approach.com
   Voice: 214.233.5089 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

If you disable the ZIL, the filesystem still stays correct in RAM, and the
only way you lose any data such as you've described, is to have an
ungraceful power down or reboot.

The advice I would give is:  Do zfs autosnapshots frequently (say ... every
5 minutes, keeping the most recent 2 hours of snaps) and then run with no
ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
snapshot ... and rollback once more for good measure.  As long as you can
afford to risk 5-10 minutes of the most recent work after a crash, then you
can get a 10x performance boost most of the time, and no risk of the
aforementioned data corruption.

Why do you need the rollback? The current filesystems have correct and 
consistent data; not different from the last two snapshots.
(Snapshots can happen in the middle of untarring)

The difference between running with or without ZIL is whether the
client has lost data when the server reboots; not different from using 
Linux as an NFS server.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 If you disable the ZIL, the filesystem still stays correct in RAM, and
 the
 only way you lose any data such as you've described, is to have an
 ungraceful power down or reboot.
 
 The advice I would give is:  Do zfs autosnapshots frequently (say ...
 every
 5 minutes, keeping the most recent 2 hours of snaps) and then run with
 no
 ZIL.  If you have an ungraceful shutdown or reboot, rollback to the
 latest
 snapshot ... and rollback once more for good measure.  As long as you
 can
 afford to risk 5-10 minutes of the most recent work after a crash,
 then you
 can get a 10x performance boost most of the time, and no risk of the
 aforementioned data corruption.
 
 Why do you need the rollback? The current filesystems have correct and
 consistent data; not different from the last two snapshots.
 (Snapshots can happen in the middle of untarring)
 
 The difference between running with or without ZIL is whether the
 client has lost data when the server reboots; not different from using
 Linux as an NFS server.

If you have an ungraceful shutdown in the middle of writing stuff, while the
ZIL is disabled, then you have corrupt data.  Could be files that are
partially written.  Could be wrong permissions or attributes on files.
Could be missing files or directories.  Or some other problem.

Some changes from the last 1 second of operation before crash might be
written, while some changes from the last 4 seconds might be still
unwritten.  This is data corruption, which could be worse than losing a few
minutes of changes.  At least, if you rollback, you know the data is
consistent, and you know what you lost.  You won't continue having more
losses afterward caused by inconsistent data on disk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
  Can you elaborate?  Just today, we got the replacement drive that has
  precisely the right version of firmware and everything.  Still, when
 we
  plugged in that drive, and create simple volume in the storagetek
 raid
  utility, the new drive is 0.001 Gb smaller than the old drive.  I'm
 still
  hosed.
 
  Are you saying I might benefit by sticking the SSD into some laptop,
 and
  zero'ing the disk?  And then attach to the sun server?
 
  Are you saying I might benefit by finding some other way to make the
 drive
  available, instead of using the storagetek raid utility?
 
 Assuming you are also using a PCI LSI HBA from Sun that is managed with
 a utility called /opt/StorMan/arcconf and reports itself as the
 amazingly
 informative model number Sun STK RAID INT what worked for me was to
 run,
 arcconf delete (to delete the pre-configured volume shipped on the
 drive)
 arcconf create (to create a new volume)
 
 What I observed was that
 arcconf getconfig 1
 would show the same physical device size for our existing drives and
 new
 ones from Sun, but they reported a slightly different logical volume
 size.
 I am fairly sure that was due to the Sun factory creating the initial
 volume
 with a different version of the HBA controller firmware then we where
 using
 to create our own volumes.
 
 If I remember the sign correctly, the newer firmware creates larger
 logical
 volumes, and you really want to upgrade the firmware if you are going
 to
 be running multiple X25-E drives from the same controller.
 
 I hope that helps.

Uggh.  This is totally different than my system.  But thanks for writing.
I'll take this knowledge, and see if we can find some analogous situation
with the StorageTek controller.  It still may be helpful, so again, thanks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


If you have an ungraceful shutdown in the middle of writing stuff, while the
ZIL is disabled, then you have corrupt data.  Could be files that are
partially written.  Could be wrong permissions or attributes on files.
Could be missing files or directories.  Or some other problem.

Some changes from the last 1 second of operation before crash might be
written, while some changes from the last 4 seconds might be still
unwritten.  This is data corruption, which could be worse than losing a few
minutes of changes.  At least, if you rollback, you know the data is
consistent, and you know what you lost.  You won't continue having more
losses afterward caused by inconsistent data on disk.

How exactly is this different from rolling back to some other point of 
time?.

I think you don't quite understand how ZFS works; all operations are 
grouped in transaction groups; all the transactions in a particular group 
are committed in one operation.  I don't know what partial ordering ZFS uses 
when creating transaction groups, but a snapshot just picks one
transaction group as the last group included in the snapshot.

When the system reboots, ZFS picks the most recent valid uberblock,
so the data available is correct up to transaction group N1.

If you roll back to a snapshot, you get data
correct up to transaction group N2.

But N2 < N1, so you lose more data.

Why do you think that a Snapshot has a better quality than the last 
snapshot available?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 If you have an ungraceful shutdown in the middle of writing stuff,
 while the
 ZIL is disabled, then you have corrupt data.  Could be files that are
 partially written.  Could be wrong permissions or attributes on files.
 Could be missing files or directories.  Or some other problem.
 
 Some changes from the last 1 second of operation before crash might be
 written, while some changes from the last 4 seconds might be still
 unwritten.  This is data corruption, which could be worse than losing
 a few
 minutes of changes.  At least, if you rollback, you know the data is
 consistent, and you know what you lost.  You won't continue having
 more
 losses afterward caused by inconsistent data on disk.
 
 How exactly is this different from rolling back to some other point of
 time?.
 
 I think you don't quite understand how ZFS works; all operations are
 grouped in transaction groups; all the transactions in a particular
 group
 are commit in one operation.  I don't know what partial ordering ZFS

Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.

Yes, all the transactions in a transaction group are either committed
entirely to disk, or not at all.  But they're not necessarily committed to
disk in the same order that the user level applications requested.  Meaning:
If I have an application that writes to disk in sync mode intentionally
... perhaps because my internal file format consistency would be corrupt if
I wrote out-of-order ... If the sysadmin has disabled ZIL, my sync write
will not block, and I will happily issue more write operations.  As long as
the OS remains operational, no problem.  The OS keeps the filesystem
consistent in RAM, and correctly manages all the open file handles.  But if
the OS dies for some reason, some of my later writes may have been committed
to disk while some of my earlier writes could be lost, which were still
being buffered in system RAM for a later transaction group.

This is particularly likely to happen, if my application issues a very small
sync write, followed by a larger async write, followed by a very small sync
write, and so on.  Then the OS will buffer my small sync writes and attempt
to aggregate them into a larger sequential block for the sake of accelerated
performance.  The end result is:  My larger async writes are sometimes
committed to disk before my small sync writes.  But the only reason I would
ever know or care about that would be if the ZIL were disabled, and the OS
crashed.  Afterward, my file has internal inconsistency.

Perfect examples of applications behaving this way would be databases and
virtual machines.
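
A sketch of the write pattern being described, with hypothetical file names
(whether ZFS can actually let the second write overtake the first after a
crash is exactly the point under debate in this thread):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical database files on a ZFS dataset. */
        int log = open("/tank/db/wal.log",
                       O_WRONLY | O_CREAT | O_APPEND, 0644);
        int dat = open("/tank/db/table.dat", O_WRONLY | O_CREAT, 0644);
        if (log < 0 || dat < 0) { perror("open"); return 1; }

        const char rec[] = "begin txn 42\n";  /* small sync write   */
        static char page[128 * 1024];         /* larger async write */
        memset(page, 0xab, sizeof(page));

        write(log, rec, sizeof(rec) - 1);
        fsync(log);     /* the application assumes this record is durable
                           before the data page below                     */
        write(dat, page, sizeof(page));       /* async, buffered by the OS */

        close(log);
        close(dat);
        return 0;
    }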


 Why do you think that a Snapshot has a better quality than the last
 snapshot available?

If you rollback to a snapshot from several minutes ago, you can rest assured
all the transaction groups that belonged to that snapshot have been
committed.  So although you're losing the most recent few minutes of data,
you can rest assured you haven't got file corruption in any of the existing
files.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 This approach does not solve the problem.  When you do a snapshot,
 the txg is committed.  If you wish to reduce the exposure to loss of
 sync data and run with ZIL disabled, then you can change the txg commit
 interval -- however changing the txg commit interval will not eliminate
 the
 possibility of data loss.

The default commit interval is what, 30 seconds?  Doesn't that guarantee
that any snapshot taken more than 30 seconds ago will have been fully
committed to disk?

Therefore, any snapshot older than 30 seconds old is guaranteed to be
consistent on disk.  While anything less than 30 seconds old could possibly
have some later writes committed to disk before some older writes from a few
seconds before.

If I'm wrong about this, please explain.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.

If you rollback to a snapshot that's at least 30 seconds old, then all the
writes for that snapshot are guaranteed to be committed to disk already, and
in the right order.  You're acknowledging the loss of some known time worth
of data.  But you're gaining a guarantee of internal file consistency.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 Is that what sync means in Linux?  

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.
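
The same distinction at the API level, as a hedged sketch (POSIX C, paths
illustrative): a descriptor opened with O_DSYNC turns every write() into a
sync write, while a plain open() leaves writes async until an explicit
fsync().

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "record\n";

        /* Sync: each write() blocks until the data is on stable storage. */
        int sfd = open("/tank/demo/sync.dat",
                       O_WRONLY | O_CREAT | O_DSYNC, 0644);

        /* Async: write() returns once the OS has buffered the data. */
        int afd = open("/tank/demo/async.dat", O_WRONLY | O_CREAT, 0644);

        if (sfd < 0 || afd < 0) { perror("open"); return 1; }

        write(sfd, buf, sizeof(buf) - 1);  /* durable when this returns      */
        write(afd, buf, sizeof(buf) - 1);  /* durable only after fsync()/TXG */

        close(sfd);
        close(afd);
        return 0;
    }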

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.

You may say that, but then you post this:


 Why do you think that a Snapshot has a better quality than the last
 snapshot available?

If you rollback to a snapshot from several minutes ago, you can rest assured
all the transaction groups that belonged to that snapshot have been
committed.  So although you're losing the most recent few minutes of data,
you can rest assured you haven't got file corruption in any of the existing
files.


But the actual fact is that there is *NO* difference between the last
uberblock and an uberblock named as snapshot-such-and-so.  All changes 
made after the uberblock was written are discarded by rolling back.


All the transaction groups referenced by the last uberblock *are* written to 
disk.

Disabling the ZIL makes sure that fsync() and sync() no longer work;
whether you take a named snapshot or the uberblock is immaterial; your
strategy will cause more data to be lost.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

 Is that what sync means in Linux?  

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.


We're talking about the sync for NFS exports in Linux; what do they mean 
with sync NFS exports? 


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

 This approach does not solve the problem.  When you do a snapshot,
 the txg is committed.  If you wish to reduce the exposure to loss of
 sync data and run with ZIL disabled, then you can change the txg commit
 interval -- however changing the txg commit interval will not eliminate
 the
 possibility of data loss.

The default commit interval is what, 30 seconds?  Doesn't that guarantee
that any snapshot taken more than 30 seconds ago will have been fully
committed to disk?

When a system boots and it finds the snapshot, then all the data referred 
to by the snapshot is on disk.  But the snapshot doesn't guarantee more than 
the last valid uberblock does.

Therefore, any snapshot older than 30 seconds old is guaranteed to be
consistent on disk.  While anything less than 30 seconds old could possibly
have some later writes committed to disk before some older writes from a few
seconds before.

If I'm wrong about this, please explain.

When a pointer to data is committed to disk by ZFS, then the data is 
also on disk.  (If the pointer is reachable from the uberblock,
then the data is also on disk and reachable from the uberblock.)

You don't need to wait 30 seconds.  If it's there, it's there.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.

If you rollback to a snapshot that's at least 30 seconds old, then all the
writes for that snapshot are guaranteed to be committed to disk already, and
in the right order.  You're acknowledging the loss of some known time worth
of data.  But you're gaining a guarantee of internal file consistency.


I don't know what ZFS guarantees when you disable the ZIL; the one broken 
promise is that the data may not have been committed to stable storage 
when fsync() returns.

I'm not sure whether there is still a barrier at a sync()/fsync();
if that is the case, then ZFS is still safe for your application.



Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey  
solar...@nedharvey.com wrote:



A MegaRAID card with write-back cache? It should also be cheaper than
the F20.

I haven't posted results yet, but I just finished a few weeks of extensive
benchmarking various configurations.  I can say this:

WriteBack cache is much faster than naked disks, but if you can buy an SSD
or two for ZIL log device, the dedicated ZIL is yet again much faster than
WriteBack.

It doesn't have to be F20.  You could use the Intel X25 for example.  If
you're running solaris proper, you better mirror your ZIL log device.  If
you're running opensolaris ... I don't know if that's important.  I'll
probably test it, just to be sure, but I might never get around to it
because I don't have a justifiable business reason to build the opensolaris
machine just for this one little test.

Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
using the dedicated ZIL SSD device, very noticeably faster than enabling the
WriteBack.

What do you get with both SSD ZIL and WriteBack disks enabled?

I mean if you have both why not use both? Then both async and sync IO
benefits.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey  
solar...@nedharvey.com wrote:


We ran into something similar with these drives in an X4170 that turned
out to be an issue of the preconfigured logical volumes on the drives. Once
we made sure all of our Sun PCI HBAs were running the exact same version of
firmware and recreated the volumes on new drives arriving from Sun we got
back into sync on the X25-E devices sizes.

Can you elaborate?  Just today, we got the replacement drive that has
precisely the right version of firmware and everything.  Still, when we
plugged in that drive, and create simple volume in the storagetek raid
utility, the new drive is 0.001 Gb smaller than the old drive.  I'm still
hosed.

Are you saying I might benefit by sticking the SSD into some laptop, and
zero'ing the disk?  And then attach to the sun server?

Are you saying I might benefit by finding some other way to make the drive
available, instead of using the storagetek raid utility?

I know it is way after the fact, but I find it best to coerce each
drive down to the whole GB boundary using format (create Solaris
partition just up to the boundary). Then if you ever get a drive a
little smaller it still should fit.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker

On Apr 1, 2010, at 8:42 AM, casper@sun.com wrote:




Is that what sync means in Linux?

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.

We're talking about the sync for NFS exports in Linux; what do they mean
with sync NFS exports?

See section A1 in the FAQ:

http://nfs.sourceforge.net/

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Darren J Moffat

On 01/04/2010 14:49, Ross Walker wrote:

We're talking about the sync for NFS exports in Linux; what do they
mean
with sync NFS exports?


See section A1 in the FAQ:

http://nfs.sourceforge.net/


I think B4 is the answer to Casper's question:

 BEGIN QUOTE 
Linux servers (although not the Solaris reference implementation) allow 
this requirement to be relaxed by setting a per-export option in 
/etc/exports. The name of this export option is [a]sync (note that 
there is also a client-side mount option by the same name, but it has a 
different function, and does not defeat NFS protocol compliance).


When set to sync, Linux server behavior strictly conforms to the NFS 
protocol. This is default behavior in most other server implementations. 
When set to async, the Linux server replies to NFS clients before 
flushing data or metadata modifying operations to permanent storage, 
thus improving performance, but breaking all guarantees about server 
reboot recovery.

 END QUOTE 

For more info, see the whole of section B4 through B6.

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat
darr...@opensolaris.org wrote:
 On 01/04/2010 14:49, Ross Walker wrote:

 We're talking about the sync for NFS exports in Linux; what do they
 mean
 with sync NFS exports?

 See section A1 in the FAQ:

 http://nfs.sourceforge.net/

 I think B4 is the answer to Casper's question:

  BEGIN QUOTE 
 Linux servers (although not the Solaris reference implementation) allow this
 requirement to be relaxed by setting a per-export option in /etc/exports.
 The name of this export option is [a]sync (note that there is also a
 client-side mount option by the same name, but it has a different function,
 and does not defeat NFS protocol compliance).

 When set to sync, Linux server behavior strictly conforms to the NFS
 protocol. This is default behavior in most other server implementations.
 When set to async, the Linux server replies to NFS clients before flushing
 data or metadata modifying operations to permanent storage, thus improving
 performance, but breaking all guarantees about server reboot recovery.
  END QUOTE 

 For more info the whole of section B4 though B6.

True, I was thinking more of the protocol summary.

 Is that what sync means in Linux?  As NFS doesn't use close or
 fsync, what exactly are the semantics.

 (For NFSv2/v3 each *operation* is sync and the client needs to make sure
 it can continue; for NFSv4, some operations are async and the client
 needs to use COMMIT)

Actually the COMMIT command was introduced in NFSv3.

The full details:

NFS Version 3 introduces the concept of safe asynchronous writes. A
Version 3 client can specify that the server is allowed to reply
before it has saved the requested data to disk, permitting the server
to gather small NFS write operations into a single efficient disk
write operation. A Version 3 client can also specify that the data
must be written to disk before the server replies, just like a Version
2 write. The client specifies the type of write by setting the
stable_how field in the arguments of each write operation to UNSTABLE
to request a safe asynchronous write, and FILE_SYNC for an NFS Version
2 style write.

Servers indicate whether the requested data is permanently stored by
setting a corresponding field in the response to each NFS write
operation. A server can respond to an UNSTABLE write request with an
UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the
requested data resides on permanent storage yet. An NFS
protocol-compliant server must respond to a FILE_SYNC request only
with a FILE_SYNC reply.

Clients ensure that data that was written using a safe asynchronous
write has been written onto permanent storage using a new operation
available in Version 3 called a COMMIT. Servers do not send a response
to a COMMIT operation until all data specified in the request has been
written to permanent storage. NFS Version 3 clients must protect
buffered data that has been written using a safe asynchronous write
but not yet committed. If a server reboots before a client has sent an
appropriate COMMIT, the server can reply to the eventual COMMIT
request in a way that forces the client to resend the original write
operation. Version 3 clients use COMMIT operations when flushing safe
asynchronous writes to the server during a close(2) or fsync(2) system
call, or when encountering memory pressure.
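
A rough sketch of the protocol fields involved, paraphrasing RFC 1813 (this is
not a working client, just the stability levels and the client-side rule of
thumb described above):

    #include <stdbool.h>

    /* NFSv3 write stability levels (RFC 1813). */
    enum stable_how {
        UNSTABLE  = 0,  /* server may reply before the data is on stable storage */
        DATA_SYNC = 1,  /* data (plus enough metadata to retrieve it) is stable   */
        FILE_SYNC = 2   /* data and metadata are stable before the reply          */
    };

    /* What the server reported for a write request. */
    struct write_reply {
        enum stable_how    committed;  /* e.g. UNSTABLE or FILE_SYNC        */
        unsigned long long verifier;   /* changes when the server reboots   */
    };

    /* An UNSTABLE reply means the client must keep the buffered data until
       a later COMMIT succeeds with the same verifier; otherwise it resends. */
    static bool needs_commit(struct write_reply r)
    {
        return r.committed == UNSTABLE;
    }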
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Bob Friesenhahn

On Thu, 1 Apr 2010, Edward Ned Harvey wrote:


If I'm wrong about this, please explain.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.


Zfs writes data in transaction groups and each bunch of data which 
gets written is bounded by a transaction group.  The current state of 
the data at the time the TXG starts will be the state of the data once 
the TXG completes.  If the system spontaneously reboots then it will 
restart at the last completed TXG so any residual writes which might 
have occurred while a TXG write was in progress will be discarded. 
Based on this, I think that your ordering concerns (sync writes 
getting to disk faster than async writes) are unfounded for normal 
file I/O.


However, if file I/O is done via memory mapped files, then changed 
memory pages will not necessarily be written.  The changes will not be 
known to ZFS until the kernel decides that a dirty page should be 
written or there is a conflicting traditional I/O which would update 
the same file data.  Use of msync(3C) is necessary to assure that file 
data updated via mmap() will be seen by ZFS and committed to disk in an 
orderly fashion.
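
A minimal sketch of that msync() step, with an illustrative path and a single
page of data:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tank/demo/mapped.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Dirty the mapped page; per the explanation above, the change is
           not yet known to ZFS. */
        strcpy(p, "hello, mapped world");

        /* Force the dirty page to be reconciled and committed. */
        if (msync(p, 4096, MS_SYNC) < 0) { perror("msync"); return 1; }

        munmap(p, 4096);
        close(fd);
        return 0;
    }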


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Robert Milkowski

On 01/04/2010 13:01, Edward Ned Harvey wrote:

Is that what sync means in Linux?
 

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.

   

ROTFL!!!

I think you should explain it even further for Casper :) :) :) :) :) :) :)

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

On 01/04/2010 13:01, Edward Ned Harvey wrote:
 Is that what sync means in Linux?
  
 A sync write is one in which the application blocks until the OS acks that
 the write has been committed to disk.  An async write is given to the OS,
 and the OS is permitted to buffer the write to disk at its own discretion.
 Meaning the async write function call returns sooner, and the application is
 free to continue doing other stuff, including issuing more writes.

 Async writes are faster from the point of view of the application.  But sync
 writes are done by applications which need to satisfy a race condition for
 the sake of internal consistency.  Applications which need to know their
 next commands will not begin until after the previous sync write was
 committed to disk.


ROTFL!!!

I think you should explain it even further for Casper :) :) :) :) :) :) :)



:-)

So what I *really* wanted to know is what sync means for the NFS server
in the case of Linux.

Apparently it means implement the NFS protocol to the letter.

I'm happy to see that it is now the default and I hope this will cause the 
Linux NFS client implementation to be faster for conforming NFS servers.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Bob Friesenhahn

On Thu, 1 Apr 2010, Edward Ned Harvey wrote:


Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.


Geez!


Yes, all the transactions in a transaction group are either committed
entirely to disk, or not at all.  But they're not necessarily committed to
disk in the same order that the user level applications requested.  Meaning:
If I have an application that writes to disk in sync mode intentionally
... perhaps because my internal file format consistency would be corrupt if
I wrote out-of-order ... If the sysadmin has disabled ZIL, my sync write
will not block, and I will happily issue more write operations.  As long as
the OS remains operational, no problem.  The OS keeps the filesystem
consistent in RAM, and correctly manages all the open file handles.  But if
the OS dies for some reason, some of my later writes may have been committed
to disk while some of my earlier writes could be lost, which were still
being buffered in system RAM for a later transaction group.


The purpose of the ZIL is to act like a fast log for synchronous 
writes.  It allows the system to quickly confirm a synchronous write 
request with the minimum amount of work.  As you say, OS keeps the 
filesystem consistent in RAM.  There is no 1:1 ordering between 
application write requests and zfs writes and in fact, if the same 
portion of file is updated many times, or the file is created/deleted 
many times, zfs only writes the updated data which is current when the 
next TXG is written.  For a synchronous write, zfs advances its index 
in the slog once the corresponding data has been committed in a TXG. 
In other words, the sync and async write paths are the same when 
it comes to writing final data to disk.


There is however the recovery case where synchronous writes were 
affirmed which were not yet written in a TXG and the system 
spontaneously reboots.  In this case the synchronous writes will occur 
based on the slog, and uncommitted async writes will have been lost. 
Perhaps this is the case you are worried about.


It does seem like rollback to a snapshot does help here (to assure 
that sync & async data is consistent), but it certainly does not help 
any NFS clients.  Only a broken application uses sync writes 
sometimes, and async writes at other times.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

