Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Andrey Kuzmin
On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling richard.ell...@gmail.com wrote:

 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

  Andrey Kuzmin wrote:
  Well, I'm more accustomed to sequential vs. random, but YMMV.
  As to 67000 512-byte writes (this sounds suspiciously close to 32 MB
 fitting into cache), did you have write-back enabled?
 
  It's a sustained number, so it shouldn't matter.

 That is only 34 MB/sec.  The disk can do better for sequential writes.

 Note: in ZFS, such writes will be coalesced into 128KB chunks.


So this is just 256 IOPS in the controller, not 64K.
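
A back-of-the-envelope check of the numbers above (bc used purely for the
arithmetic; 128KB is the coalesced chunk size Richard mentions):

$ echo '67000 * 512 / 1000000' | bc -l    # ~34.3 MB/s sustained
$ echo '67000 * 512 / 131072' | bc -l     # ~262 coalesced 128KB writes/s at the device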

Regards,
Andrey


  -- richard

 --
 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
 http://nexenta-rotterdam.eventbrite.com/







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread Andrey Kuzmin
On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 11/06/2010 09:22, sensille wrote:

 Andrey Kuzmin wrote:


 On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling
 richard.ell...@gmail.com wrote:

 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:

   Andrey Kuzmin wrote:
   Well, I'm more accustomed to sequential vs. random, but YMMV.
   As to 67000 512-byte writes (this sounds suspiciously close to
 32 MB fitting into cache), did you have write-back enabled?
 
   It's a sustained number, so it shouldn't matter.

 That is only 34 MB/sec.  The disk can do better for sequential
 writes.

 Note: in ZFS, such writes will be coalesced into 128KB chunks.


 So this is just 256 IOPS in the controller, not 64K.


 No, it's 67k ops, it was a completely ZFS-free test setup. iostat also
 confirmed
 the numbers.


 It's a really simple test; everyone can do it.

 # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

 I did a test on my workstation a moment ago and got about 21k IOPS from my
 sata drive (iostat).
 The trick here of course is that this is a sequential write with no other
 workload going on, and a drive should be able to nicely coalesce these IOs
 and do sequential writes with large blocks.


Exactly, though one might still wonder where the coalescing actually
happens, in the respective OS layer or in the controller. Nonetheless, this
is hardly a common use-case one would design h/w for.
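
One way to see on which side of the OS any coalescing happens, assuming the
same raw-device test as above (device name is hypothetical, and writing
/dev/zero to it destroys whatever is on that disk):

# dd if=/dev/zero of=/dev/rdsk/c1t0d0s0 bs=512 &
# iostat -xnz 1 | egrep 'device|c1t0d0'

Mw/s divided by w/s gives the average write size actually leaving the OS; if
it is still about 512 bytes while the throughput matches the dd rate, the
coalescing (if any) is happening in the HBA or in the drive itself.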

Regards,
Andrey




 --
 Robert Milkowski
 http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:


 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.



 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0


/dev/null is usually a poor choice for a test like this. Just to be on the
safe side, I'd rerun it with /dev/random.

Regards,
Andrey


 # iostat -xnzCM 1 | egrep 'device|c[0123]$'
 [...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0


 Now let's see how it looks for a single SAS connection but with dd to 11x
 SSDs:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0

 # iostat -xnzCM 1 | egrep 'device|c[0123]$'
 [...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104243.3   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104249.2   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104208.1   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104245.8   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104221.9   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104212.2   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0


 It looks like a single CPU core still hasn't been saturated and the
 bottleneck is in the device rather than the OS/CPU. So the MPT driver in Solaris
 2009.06 can do at least 100,000 IOPS to a single SAS port.

 It also scales well - I ran the above dd's over 4x SAS ports at the same
 time and it scaled linearly, achieving well over 400k IOPS.


 hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw.
 1.27.3.0), connected to F5100.


 --
 Robert Milkowski
 http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
Sorry, my bad. _Reading_ from /dev/null may be an issue, but not writing to
it, of course.
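
For the record, the two directions of the test look like this (device names
hypothetical, and the write variant destroys data on the target disk):

# read test: data comes from the raw device, /dev/null merely discards it
dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# write test: /dev/zero is a cheap source; /dev/random would only throttle dd
dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=1k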

Regards,
Andrey



On Thu, Jun 10, 2010 at 6:46 PM, Robert Milkowski mi...@task.gda.pl wrote:

  On 10/06/2010 15:39, Andrey Kuzmin wrote:

 On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:


 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.



 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0


  /dev/null is usually a poor choice for a test like this. Just to be on the
 safe side, I'd rerun it with /dev/random.


 That wouldn't work, would it?
 Please notice that I'm reading *from* an ssd and writing *to* /dev/null


 --
 Robert Milkowski
 http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
As to your results, it sounds almost too good to be true. As Bob has pointed
out, h/w design targeted hundreds of IOPS, and it was hard to believe it could
scale 100x. Fantastic.

Regards,
Andrey



On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:


 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.



 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
 # iostat -xnzCM 1 | egrep 'device|c[0123]$'
 [...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0


 Now let's see how it looks for a single SAS connection but with dd to 11x
 SSDs:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0

 # iostat -xnzCM 1 | egrep 'device|c[0123]$'
 [...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104243.3   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104249.2   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104208.1   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104245.8   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104221.9   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 104212.2   0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0


 It looks like a single CPU core still hasn't been saturated and the
 bottleneck is in the device rather than the OS/CPU. So the MPT driver in Solaris
 2009.06 can do at least 100,000 IOPS to a single SAS port.

 It also scales well - I ran the above dd's over 4x SAS ports at the same
 time and it scaled linearly, achieving well over 400k IOPS.


 hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw.
 1.27.3.0), connected to F5100.


 --
 Robert Milkowski
 http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:

 Andrey Kuzmin wrote:

 As to your results, it sounds almost too good to be true. As Bob has
 pointed out, h/w design targeted hundreds IOPS, and it was hard to believe
 it can scale 100x. Fantastic.


 Hundreds IOPS is not quite true, even with hard drives. I just tested
 a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache


Linear? Maybe you mean sequential?

Regards,
Andrey


 enabled.

 --Arne


 Regards,
 Andrey




 On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

On 21/10/2009 03:54, Bob Friesenhahn wrote:


I would be interested to know how many IOPS an OS like Solaris
is able to push through a single device interface.  The normal
driver stack is likely limited as to how many IOPS it can
sustain for a given LUN since the driver stack is optimized for
high latency devices like disk drives.  If you are creating a
driver stack, the design decisions you make when requests will
be satisfied in about 12ms would be much different than if
requests are satisfied in 50us.  Limitations of existing
software stacks are likely reasons why Sun is designing hardware
with more device interfaces and more independent devices.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Andrey Kuzmin
Well, I'm more accustomed to sequential vs. random, but YMMV.

As to 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting
into cache), did you have write-back enabled?
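
One way to check that on Solaris, assuming a SCSI/SAS drive (menu names may
differ by release), is format's expert mode:

# format -e
(select the disk, then: cache -> write_cache -> display)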

Regards,
Andrey



On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen sensi...@gmx.net wrote:

 Andrey Kuzmin wrote:

 On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:

Andrey Kuzmin wrote:

As to your results, it sounds almost too good to be true. As Bob
has pointed out, h/w design targeted hundreds IOPS, and it was
hard to believe it can scale 100x. Fantastic.


Hundreds IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache


  Linear? Maybe you mean sequential?


 Aren't these synonyms? linear as opposed to random.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Compellant announces zNAS

2010-04-29 Thread Andrey Kuzmin
I believe the name is Compellent Technologies,
http://www.google.com/finance?q=NYSE:CML.
Regards,
Andrey




On Wed, Apr 28, 2010 at 5:54 AM, Richard Elling
richard.ell...@richardelling.com wrote:
 Today, Compellant announced their zNAS addition to their unified storage
 line. zNAS uses ZFS behind the scenes.
 http://www.compellent.com/Community/Blog/Posts/2010/4/Compellent-zNAS.aspx

 Congrats Compellant!
  -- richard

 ZFS storage and performance consulting at http://www.RichardElling.com
 ZFS training on deduplication, NexentaStor, and NAS performance
 Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Secure delete?

2010-04-10 Thread Andrey Kuzmin
No, until all snapshots referencing the file in question are removed.

The simplest way to understand snapshots is to consider them as
references. Any file-system object (say, a file or a block) is only
removed when its reference count drops to zero.
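
A minimal sketch of what that removal looks like in practice (pool and
snapshot names are hypothetical):

# zfs list -t snapshot -r tank/home     # snapshots still holding references
# zfs destroy tank/home@snap1           # repeat per snapshot; the blocks are
                                        # only freed once the last reference goes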

Regards,
Andrey




On Sat, Apr 10, 2010 at 10:20 PM, Roy Sigurd Karlsbakk
r...@karlsbakk.net wrote:
 Hi all

 Is it possible to securely delete a file from a zfs dataset/zpool once it's 
 been snapshotted, meaning delete (and perhaps overwrite) all copies of this 
 file?

 Best regards

 roy
 --
 Roy Sigurd Karlsbakk
 (+47) 97542685
 r...@karlsbakk.net
 http://blogg.karlsbakk.net/
 --
  In all pedagogy it is essential that the curriculum be presented intelligibly. It
  is an elementary imperative for all pedagogues to avoid excessive use of
  idioms of foreign origin. In most cases adequate and relevant synonyms
  exist in Norwegian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS caching of compressed data

2010-03-28 Thread Andrey Kuzmin
There was a discussion of this topic on this list about a month ago,
and I'd been told that similar ideas (compressed metadata/data in
ARC/L2ARC) are on the zfs dev agenda.

Regards,
Andrey




On Sun, Mar 28, 2010 at 2:42 AM, Stuart Anderson
ander...@ligo.caltech.edu wrote:

 On Oct 2, 2009, at 11:54 AM, Robert Milkowski wrote:

 Stuart Anderson wrote:

 On Oct 2, 2009, at 5:05 AM, Robert Milkowski wrote:

 Stuart Anderson wrote:
 I am wondering if the following idea makes any sense as a way to get ZFS 
 to cache compressed data in DRAM?

 In particular, given a 2-way zvol mirror of highly compressible data on 
 persistent storage devices, what would go wrong if I dynamically added a 
 ramdisk as a 3rd mirror device at boot time?

 Would ZFS route most (or all) of the reads to the lower latency DRAM 
 device?

 In the case of an un-clean shutdown where there was no opportunity to 
 actively remove the ramdisk from the pool before shutdown would there be 
 any problem at boot time when the ramdisk is still registered but 
 unavailable?

 Note, this Gedanken experiment is for highly compressible (~9x) metadata 
 for a non-ZFS filesystem.

 You would only get about 33% of IO's served from ram-disk.

 With SVM you are allowed to specify a read policy on sub-mirrors for just 
 this reason, e.g.,
 http://wikis.sun.com/display/BigAdmin/Using+a+SVM+submirror+on+a+ramdisk+to+increase+read+performance

 Is there no equivalent in ZFS?


 Nope, at least not right now.

 Curious if anyone knows of any other ideas/plans for ZFS caching compressed 
 data internally? or externally via a ramdisk mirror device that handles 
 most/all read requests?

 Thanks.

 --
 Stuart Anderson  ander...@ligo.caltech.edu
 http://www.ligo.caltech.edu/~anderson




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Andrey Kuzmin
On Wed, Feb 24, 2010 at 11:09 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Wed, 24 Feb 2010, Steve wrote:

 The overhead I was thinking of was more in the pointer structures...
 (bearing in mind this is a 128 bit file system), I would guess that memory
 requirements would be HUGE for all these files...otherwise arc is gonna
 struggle, and paging system is going mental?

 It is not reasonable to assume that zfs has to retain everything in memory.

At the same time 400M files in a single directory should lead to a lot
of contention on locks associated with look-ups. Spreading files
between a reasonable number of dirs could mitigate this.
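
A minimal sketch of such spreading, assuming Solaris digest(1) and
hypothetical paths, hashing each file name into one of 256 buckets:

d=$(printf '%s' "$f" | digest -a md5 | cut -c1-2)   # first two hex digits
mkdir -p /tank/fs/$d && mv "$f" /tank/fs/$d/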

Regards,
Andrey



 I have a directory here containing a million files and it has not caused any
 strain for zfs at all although it can cause considerable stress on
 applications.

 400 million tiny files is quite a lot and I would hate to use anything but
 mirrors with so many tiny files.

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Andrey Kuzmin
On Thu, Feb 25, 2010 at 12:26 AM, Steve steve.jack...@norman.com wrote:
 thats not the issue here, as they are spread out in a folder structure based 
 on an integer split into hex blocks...  00/00/00/01 etc...

 but the number of pointers involved with all these files, and directories 
 (which are files)
 must have an impact on a system with limited RAM?

 There is 4GB RAM in this system btw...

If any significant portion of these 400M files is accessed on a
regular basis, you'd be
(1) stressing ARC to the limits
(2) stressing spindles so that any concurrent sequential I/O would suffer.

Small files are always an issue; try moving them off HDDs onto
mirrored SSDs, not necessarily the most expensive ones. 400M 2K files is
just 400GB, within the reach of a few SSDs.


Regards,
Andrey


 --
 This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Andrey Kuzmin
On Thu, Feb 25, 2010 at 12:34 AM, Andrey Kuzmin
andrey.v.kuz...@gmail.com wrote:
 On Thu, Feb 25, 2010 at 12:26 AM, Steve steve.jack...@norman.com wrote:
 thats not the issue here, as they are spread out in a folder structure based 
 on an integer split into hex blocks...  00/00/00/01 etc...

 but the number of pointers involved with all these files, and directories 
 (which are files)
 must have an impact on a system with limited RAM?

 There is 4GB RAM in this system btw...

 If any significant portion of these 400M files is accessed on a
 regular basis, you'd be
 (1) stressing ARC to the limits
 (2) stressing spindles so that any concurrent sequential I/O would suffer.

 Small files are always an issue, try moving them off HDDs onto a
 mirrored SSDs, not necessarily most expensive ones. 400M 2K files is

I meant 1K; fat fingers.


Regards,
Andrey


 just 400GB, within the reach of a few SSDs.


 Regards,
 Andrey


 --
 This message posted from opensolaris.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Observations about compressability of metadata L2ARC

2010-02-21 Thread Andrey Kuzmin
I don't see why this couldn't be extended beyond metadata (+1 for the
idea): if a zvol is compressed, ARC/L2ARC could store compressed data.
The gain is apparent: if the user has compression enabled for the volume,
he/she expects the volume's data to be compressible at a good ratio,
yielding a significant reduction in ARC memory footprint and a boost in
usable L2ARC capacity.

Regards,
Andrey

On Sun, Feb 21, 2010 at 7:24 PM, Tomas Ögren st...@acc.umu.se wrote:
 Hello.

 I got an idea.. How about creating a ramdisk, making a pool out of it,
 then making compressed zvols and adding those as l2arc.. Instant compressed
 arc ;)
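
A minimal sketch of that setup, assuming the pool name (ftp) and zvol names
from the stats below, with sizes picked purely for illustration:

# ramdiskadm -a ramcache 512m
# zpool create ramcache /dev/ramdisk/ramcache
# zfs create -V 90m -o compression=lzjb ramcache/ramvol2
# zpool add ftp cache /dev/zvol/dsk/ramcache/ramvol2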

 So I did some tests with secondarycache=metadata...

               capacity     operations    bandwidth
 pool         used  avail   read  write   read  write
 --  -  -  -  -  -  -
 ftp         5.07T  1.78T    198     17  11.3M  1.51M
  raidz2    1.72T   571G     58      5  3.78M   514K
 ...
  raidz2    1.64T   656G     75      6  3.78M   524K
 ...
  raidz2    1.70T   592G     64      5  3.74M   512K
 ...
 cache           -      -      -      -      -      -
  /dev/zvol/dsk/ramcache/ramvol  84.4M  7.62M      4     17  45.4K 233K
  /dev/zvol/dsk/ramcache/ramvol2  84.3M  7.71M      4     17  41.5K 233K
  /dev/zvol/dsk/ramcache/ramvol3    84M     8M      4     18  42.0K 236K
  /dev/zvol/dsk/ramcache/ramvol4  84.8M  7.25M      3     17  39.1K 225K
  /dev/zvol/dsk/ramcache/ramvol5  84.9M  7.08M      3     14  38.0K 193K

 NAME              RATIO  COMPRESS
 ramcache/ramvol   1.00x       off
 ramcache/ramvol2  4.27x      lzjb
 ramcache/ramvol3  6.12x    gzip-1
 ramcache/ramvol4  6.77x      gzip
 ramcache/ramvol5  6.82x    gzip-9

 This was after 'find /ftp' had been running for about 1h, along with all
 the background noise of its regular nfs serving tasks.

 I took an image of the uncompressed one (ramvol) and ran that through
 regular gzip and got 12-14x compression, probably due to smaller block
 size (default 8k) in the zvols.. So I tried with both 8k and 64k..

 After not running that long (but at least filled), I got:

 NAME              RATIO  COMPRESS  VOLBLOCK
 ramcache/ramvol   1.00x       off        8K
 ramcache/ramvol2  5.57x      lzjb        8K
 ramcache/ramvol3  7.56x      lzjb       64K
 ramcache/ramvol4  7.35x    gzip-1        8K
 ramcache/ramvol5  11.68x    gzip-1       64K


 Not sure how to measure the cpu usage of the various compression levels
 for (de)compressing this data..  It does show that having metadata in
 ram compressed could be a big win though, if you have cpu cycles to
 spare..

 Thoughts?


 /Tomas
 --
 Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
 |- Student at Computing Science, University of Umeå
 `- Sysadmin at {cs,acc}.umu.se - 070-5858487

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] improve meta data performance

2010-02-18 Thread Andrey Kuzmin
Try an inexpensive MLC SSD (Intel/Micron) for L2ARC. It won't help
metadata updates, but should boost reads.
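
A minimal example of attaching such a device as L2ARC (pool and device names
hypothetical):

# zpool add tank cache c2t1d0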

Regards,
Andrey




On Thu, Feb 18, 2010 at 11:23 PM, Chris Banal cba...@gmail.com wrote:
 We have a SunFire X4500 running Solaris 10U5 which does about 5-8k nfs ops
 of which about 90% are meta data. In hind sight it would have been
 significantly better  to use a mirrored configuration but we opted for 4 x
 (9+2) raidz2 at the time. We can not take the downtime necessary to change
 the zpool configuration.

 We need to improve the meta data performance with little to no money. Does
 anyone have any suggestions? Is there such a thing as a Sun supported NVRAM
 PCI-X card compatible with the X4500 which can be used as an L2ARC?

 Thanks,
 Chris



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive : panic and reboot

2010-02-09 Thread Andrey Kuzmin
Just an observation: the panic occurs in avl_add when called from
find_ds_by_guid, which tries to add an existing snapshot id to the avl tree
(http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_send.c#find_ds_by_guid).

HTH,
Andrey




On Tue, Feb 9, 2010 at 1:37 AM, Bruno Damour br...@ruomad.net wrote:
 On 02/ 8/10 06:38 PM, Lori Alt wrote:

 Can you please send a complete list of the actions taken:  The commands you
 used to create the send stream, the commands used to receive the stream.
 Also the output of `zfs list -t all` on both the sending and receiving
 sides.  If you were able to collect a core dump (it should be in
 /var/crash/hostname), it would be good to upload it.

 The panic you're seeing is in the code that is specific to receiving a
 dedup'ed stream.  It's possible that you could do the migration if you
 turned off dedup (i.e. didn't specify -D) when creating the send stream..
 However, then we wouldn't be able to diagnose and fix what appears to be a
 bug.

 The best way to get us the crash dump is to upload it here:

 https://supportfiles.sun.com/upload

 We need either both vmcore.X and unix.X OR you can just send us vmdump.X.

 Sometimes big uploads have mixed results, so if there is a problem some
 helpful hints are
 on
 http://wikis.sun.com/display/supportfiles/Sun+Support+Files+-+Help+and+Users+Guide,
 specifically in section 7.

 It's best to include your name or your initials or something in the name of
 the file you upload.  As
 you might imagine we get a lot of files uploaded named vmcore.1

 You might also create a defect report at http://defect.opensolaris.org/bz/

 Lori


 On 02/08/10 09:41, Bruno Damour wrote:

 copied from opensolaris-discuss as this probably belongs here.

 I kept on trying to migrate my pool with children (see previous threads) and
 had the (bad) idea to try the -d option on the receive part.
 The system reboots immediately.

 Here is the log in /var/adm/messages

 Feb 8 16:07:09 amber unix: [ID 836849 kern.notice]
 Feb 8 16:07:09 amber ^Mpanic[cpu1]/thread=ff014ba86e40:
 Feb 8 16:07:09 amber genunix: [ID 169834 kern.notice] avl_find() succeeded
 inside avl_add()
 Feb 8 16:07:09 amber unix: [ID 10 kern.notice]
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4660
 genunix:avl_add+59 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46c0
 zfs:find_ds_by_guid+b9 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46f0
 zfs:findfunc+23 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c47d0
 zfs:dmu_objset_find_spa+38c ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4810
 zfs:dmu_objset_find+40 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4a70
 zfs:dmu_recv_stream+448 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4c40
 zfs:zfs_ioc_recv+41d ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4cc0
 zfs:zfsdev_ioctl+175 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d00
 genunix:cdev_ioctl+45 ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d40
 specfs:spec_ioctl+5a ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4dc0
 genunix:fop_ioctl+7b ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4ec0
 genunix:ioctl+18e ()
 Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4f10
 unix:brand_sys_syscall32+1ca ()
 Feb 8 16:07:09 amber unix: [ID 10 kern.notice]
 Feb 8 16:07:09 amber genunix: [ID 672855 kern.notice] syncing file
 systems...
 Feb 8 16:07:09 amber genunix: [ID 904073 kern.notice] done
 Feb 8 16:07:10 amber genunix: [ID 111219 kern.notice] dumping to
 /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
 Feb 8 16:07:10 amber ahci: [ID 405573 kern.info] NOTICE: ahci0:
 ahci_tran_reset_dport port 3 reset port
 Feb 8 16:07:35 amber genunix: [ID 10 kern.notice]
 Feb 8 16:07:35 amber genunix: [ID 665016 kern.notice] ^M100% done: 107693
 pages dumped,
 Feb 8 16:07:35 amber genunix: [ID 851671 kern.notice] dump succeeded


 Hello,
 I'll try to do my best.

 Here are the commands :

 amber ~ # zfs unmount data
 amber ~ # zfs snapshot -r d...@prededup
 amber ~ # zpool destroy ezdata
 amber ~ # zpool create ezdata c6t1d0
 amber ~ # zfs set dedup=on ezdata
 amber ~ # zfs set compress=on ezdata
 amber ~ # zfs send -RD d...@prededup |zfs receive ezdata/data
 cannot receive new filesystem stream: destination 'ezdata/data' exists
 must specify -F to overwrite it
 amber ~ # zpool destroy ezdata
 amber ~ # zpool create ezdata c6t1d0
 amber ~ # zfs set compression=on ezdata
 amber ~ # zfs set dedup=on ezdata
 amber ~ # zfs send -RD d...@prededup |zfs receive -F ezdata/data
 cannot receive new filesystem stream: destination has snapshots (eg.
 ezdata/d...@prededup)
 must destroy them to overwrite it

 Each time the send/receive command took some hours and transferred 151G 

Re: [zfs-discuss] Impact of an enterprise class SSD on ZIL performance

2010-02-05 Thread Andrey Kuzmin
On Fri, Feb 5, 2010 at 10:55 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 5 Feb 2010, Miles Nordin wrote:

   ls r...@nexenta:/volumes# hdadm write_cache off c3t5

   ls  c3t5 write_cache disabled

 You might want to repeat his test with X25-E.  If the X25-E is also
 dropping cache flush commands (it might!), you may be, compared to
 disabling the ZIL, slowing down your pool for no reason, and making it
 more fragile as well since an exported pool with a dead ZIL cannot be
 imported.

 Others have tested the X25-E and found that with its cache enabled, it does
 drop flushed writes, but is clearly not such a gaping chasm as the X25-M.
  Some time has passed so there is the possibility that X25-E firmware has
 (or will) improve.  If Sun offers an X25-E based device for use as an slog,
 you can be sure that its has been qualified for this purpose, and may
 contain modified firmware.

 The 'E' stands for Extreme and not Enterprise as some tend to believe.

Exactly. It would therefore be very interesting to hear about performance
from anyone using a (real) enterprise SSD (which now spells STEC) as a
slog.

Regards,
Andrey


 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-03 Thread Andrey Kuzmin
On Wed, Feb 3, 2010 at 6:11 PM, Ross Walker rswwal...@gmail.com wrote:
 On Feb 3, 2010, at 9:53 AM, Henu henrik.he...@tut.fi wrote:

 Okay, so first of all, it's true that send is always fast and 100%
 reliable because it uses blocks to see differences. Good, and thanks for
 this information. If everything else fails, I can parse the information I
 want from send stream :)

 But am I right, that there is no other methods to get the list of changed
 files other than the send command?

At the zfs_send level there are no files, just DMU objects (modified in
some txg, which is the basis for the changed/unchanged decision).


 And in my situation I do not need to create snapshots. They are already
 created. The only thing that I need to do, is to get list of all the changed
 files (and maybe the location of difference in them, but I can do this
 manually if needed) between two already created snapshots.

 Not a ZFS method, but you could use rsync with the dry run option to list
 all changed files between two file systems.
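
A sketch of that approach, comparing two snapshots through the .zfs
directory (dataset and snapshot names hypothetical; -n is the dry run,
-i itemizes the changes):

# rsync -n -a -i /tank/fs/.zfs/snapshot/snapA/ /tank/fs/.zfs/snapshot/snapB/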

That's painfully resource-intensive on both (sending and receiving)
ends, and it would be IMHO really beneficial to come up with an
interface that lets user space (including off-the-shelf backup tools)
iterate over objects changed between two given snapshots.


Regards,
Andrey


 -Ross


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup memory overhead

2010-01-21 Thread Andrey Kuzmin
On Thu, Jan 21, 2010 at 10:00 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Jan 21, 2010, at 8:04 AM, erik.ableson wrote:

 Hi all,

 I'm going to be trying out some tests using b130 for dedup on a server with 
 about 1,7Tb of useable storage (14x146 in two raidz vdevs of 7 disks).  What 
 I'm trying to get a handle on is how to estimate the memory overhead 
 required for dedup on that amount of storage.  From what I gather, the dedup 
 hash keys are held in ARC and L2ARC and as such are in competition for the 
 available memory.

 ... and written to disk, of course.

 For ARC sizing, more is always better.

 So the question is how much memory or L2ARC would be necessary to ensure 
 that I'm never going back to disk to read out the hash keys. Better yet 
 would be some kind of algorithm for calculating the overhead. eg - averaged 
 block size of 4K = a hash key for every 4k stored and a hash occupies 256 
 bits. An associated question is then how does the ARC handle competition 
 between hash keys and regular ARC functions?

 AFAIK, there is no special treatment given to the DDT. The DDT is stored like
 other metadata and (currently) not easily accounted for.

 Also the DDT keys are 320 bits. The key itself includes the logical and 
 physical
 block size and compression. The DDT entry is even larger.

Looking at the dedupe code, I noticed that on-disk DDT entries are
compressed less efficiently than possible: the key is not compressed at
all (I'd expect roughly a 2:1 compression ratio with sha256 data),
while the other entry data is currently passed through the zle compressor only
(I'd expect this one to be less efficient than off-the-shelf
compressors, feel free to correct me if I'm wrong). Is this a v1 limitation
that is going to be improved in the future?

Further, with the huge dedupe memory footprint and the heavy performance
impact when DDT entries need to be read from disk, it might be
worthwhile to consider compression of in-core DDT entries
(specifically for DDTs or, more generally, making ARC/L2ARC
compression-aware). Has this been considered?

Regards,
Andrey


 I think it is better to think of the ARC as caching the uncompressed DDT
 blocks which were written to disk.  The number of these will be data 
 dependent.
 zdb -S poolname will give you an idea of the number of blocks and how well
 dedup will work on your data, but that means you already have the data in a
 pool.
  -- richard


 Based on these estimations, I think that I should be able to calculate the 
 following:
  1,7              TB
  1740,8           GB
  1782579,2        MB
  1825361100,8     KB
  4                average block size (KB)
  456340275,2      blocks
  256              hash key size (bits)
  1,16823E+11      hash key overhead (bits)
  14602888806,4    hash key size (bytes)
  14260633,6       hash key size (KB)
  13926,4          hash key size (MB)
  13,6             hash key overhead (GB)

 Of course the big question on this will be the average block size - or 
 better yet - to be able to analyze an existing datastore to see just how 
 many blocks it uses and what is the current distribution of different block 
 sizes. I'm currently playing around with zdb with mixed success  on 
 extracting this kind of data. That's also a worst case scenario since it's 
 counting really small blocks and using 100% of available storage - highly 
 unlikely.

 # zdb -ddbb siovale/iphone
 Dataset siovale/iphone [ZPL], ID 2381, cr_txg 3764691, 44.6G, 99 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, 
 flags 0x0

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    7    16K    16K  57.0K    64K   77.34  DMU dnode
         1    1    16K     1K  1.50K     1K  100.00  ZFS master node
         2    1    16K    512  1.50K    512  100.00  ZFS delete queue
         3    2    16K    16K  18.0K    32K  100.00  ZFS directory
         4    3    16K   128K   408M   408M  100.00  ZFS plain file
         5    1    16K    16K  3.00K    16K  100.00  FUID table
         6    1    16K     4K  4.50K     4K  100.00  ZFS plain file
         7    1    16K  6.50K  6.50K  6.50K  100.00  ZFS plain file
         8    3    16K   128K   952M   952M  100.00  ZFS plain file
         9    3    16K   128K   912M   912M  100.00  ZFS plain file
        10    3    16K   128K   695M   695M  100.00  ZFS plain file
        11    3    16K   128K   914M   914M  100.00  ZFS plain file

 Now, if I'm understanding this output properly, object 4 is composed of 
 128KB blocks with a total size of 408MB, meaning that it uses 3264 blocks.  
 Can someone confirm (or correct) that assumption? Also, I note that each 
 object  (as far as my limited testing has shown) has a single block size 
 with no internal variation.

 Interestingly, all of my zvols seem to use fixed size blocks - that is, 
 there is no variation in the block sizes - they're all the size defined on 
 creation with no dynamic block sizes being used. I previously thought that 
 the -b option set the maximum size, rather than fixing all blocks.  Learned 
 something today :-)

 # zdb -ddbb 

Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-15 Thread Andrey Kuzmin
On Fri, Jan 15, 2010 at 2:07 AM, Christopher George
cgeo...@ddrdrive.com wrote:
 Why not enlighten EMC/NTAP on this then?

 On the basic chemistry and possible failure characteristics of Li-Ion
 batteries?

 I will agree, if I had system level control as in either example, one could
 definitely help mitigate said risks compared to selling a card based
 product where I have very little control over the thermal envelopes I am
 subjected.

 Could you please elaborate on the last statement, provided you meant
 anything beyond UPS is a power-backup standard?

 Although, I do think the discourse is healthy and relevant.  At this point, I
 am comfortable to agree to disagree.  I respect your point of view, and do

Same on my side. I don't object to your design decision, my objection
was to the negative advertisement wrt another design. Good luck with
beta and beyond.

Regards,
Andrey

 agree strongly that Li-Ion batteries play a critical and highly valued role in
 many industries.


 Thanks,

 Christopher George
 Founder/CTO
 www.ddrdrive.com
 --
 This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-14 Thread Andrey Kuzmin
On Thu, Jan 14, 2010 at 11:35 AM, Christopher George
cgeo...@ddrdrive.com wrote:
 I'm not sure about others on the list, but I have a dislike of AC power
 bricks in my racks.

 I definitely empathize with your position concerning AC power bricks, but
 until the perfect battery is created, and we are far from it, it comes down to
 tradeoffs.  I personally believe the ignition risk, thermal wear-out, and the
 inflexible proprietary nature of Li-Ion solutions simply outweigh the benefits
 of internal or all inclusive mounting for enterprise bound NVRAM.

That's kind of an overstatement. NVRAM backed by on-board Li-Ion
batteries has been used in the storage industry for years; I can easily
point out a company that has shipped tens of thousands of such boards
over the last 10 years.

Regards,
Andrey

 Is the state of the power input exposed to software in some way? In
 other terms, can I have a nagios check running on my server that
 triggers an alert if the power cable accidentally gets pulled out?

 Absolutely, the X1 monitors the external supply and can detect not only a
 disconnect but any loss of power.  In all cases, the card throws an interrupt
 so that the device driver (and ultimately user space) can be immediately
 notified.  The X1 does not rely on external power until the host power drops
 below a certain threshold, so attaching/detaching the external power cable
 has no effect on data integrity as long as the host is powered on.

 OK, which means that the UPS must be separate to the UPS powering
 the server then.

 Correct, a dedicated (in this case redundant) UPS is expected.

 Any plans on a pci-e multi-lane version then?

 Not at this time.  In addition to the reduced power and thermal output, the
 PCIe x1 connector has the added benefit of not competing with other HBA's
 which do require a x4 or x8 PCIe connection.

 Very appreciative of the feedback!

 Christopher George
 Founder/CTO
 www.ddrdrive.com
 --
 This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-14 Thread Andrey Kuzmin
On Thu, Jan 14, 2010 at 10:02 PM, Christopher George
cgeo...@ddrdrive.com wrote:
 That's kind of an overstatement. NVRAM backed by on-board LI-Ion
 batteries has been used in storage industry for years;

 Respectfully, I stand by my three points of Li-Ion batteries as they relate
 to enterprise class NVRAM: ignition risk, thermal wear-out, and
 proprietary design.  As a prior post stated, there is a dearth of published
 failure statistics of Li-Ion based BBUs.

Why not enlighten EMC/NTAP on this then?


 I can easily point out a company that has shipped tens of
 thousands of such boards over last 10 years.

 No argument here, I would venture the risks for consumer based Li-Ion
 based products did not become apparent or commonly accepted until
 the user base grew several orders of magnitude greater than tens of
 thousands.

 For the record, I agree there is a marked convenience with an integrated
 high energy Li-Ion battery solution - but at what cost?

Um, with a Li-Ion battery in each and every one of the billions of cell phones
out there ...


 We chose an external solution because it is a proven and industry
 standard method of enterprise class data backup.

Could you please elaborate on the last statement, assuming you meant
something beyond 'a UPS is a power-backup standard'?

Regards,
Andrey


 Thanks,

 Christopher George
 Founder/CTO
 www.ddrdrive.com
 --
 This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Andrey Kuzmin
600? I've heard 1.5 GB/s reported.

On 1/5/10, Eric D. Mudama edmud...@bounceswoosh.org wrote:
 On Mon, Jan  4 at 16:43, Wes Felter wrote:
Eric D. Mudama wrote:

I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.

Fusion-io would seem to be a counter-example, since it uses a fairly
simple controller (I guess the controller still performs ECC and
maybe XOR) and the driver eats a whole x86 core. The result is very
high performance.

Wes Felter

 I see what you're saying, but it isn't obvious (to me) how well
 they're using all the hardware at hand.  2GB/s of bandwidth over their
 PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
  x86 core...  resulting in 600MB/s or something like that?

 While the number is very good for NAND flash SSDs, it seems like a TON
 of horsepower going to waste, and they still have a large onboard
 controller/FPGA.  I guess enough CPU can make the units faster, but
 i'm just not sold.

 --
 Eric D. Mudama
 edmud...@mail.bounceswoosh.org




-- 
Regards,
Andrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] getting decent NFS performance

2009-12-23 Thread Andrey Kuzmin
And how do you expect the mirrored iSCSI volume to work after
failover, with the secondary (ex-primary) unreachable?

Regards,
Andrey




On Wed, Dec 23, 2009 at 9:40 AM, Erik Trimble erik.trim...@sun.com wrote:
 Charles Hedrick wrote:

 Is ISCSI reliable enough for this?


 YES.

 The original idea is a good one, and one that I'd not thought of.  The (old)
 iSCSI implementation is quite mature, if not anywhere as nice
 (feature/flexibility-wise) as the new COMSTAR stuff.

 I'm thinking that just putting in a straight-through cable between the two
 machine is the best idea here, rather than going through a switch.

 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation

2009-12-21 Thread Andrey Kuzmin
It might be helpful to contact the SSD vendor, report the issue and
ask whether wearing out in half a year is expected behavior for this
model. Further, if you have the option to replace one (or both) SSDs
with fresh ones, that could tell for sure whether they are the root cause.
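
A minimal sketch of such a swap (pool and device names hypothetical):

# zpool replace tank c2t0d0 c3t0d0    # substitute a fresh SSD for the suspect one
# zpool status tank                   # watch the resilver complete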

Regards,
Andrey




On Mon, Dec 21, 2009 at 1:18 PM, Erik Trimble erik.trim...@sun.com wrote:
 Mart van Santen wrote:

 Hi,

 We have a X4150 with a J4400 attached. Configured with 2x32GB SSD's, in
 mirror configuration (ZIL) and 12x 500GB SATA disks. We are running this
 setup for over a half year now in production for NFS and iSCSI for a bunch
 of virtual machines (currently about 100 VM's, Mostly Linux, some Windows)

 Since last week we have performance problems, cause IO Wait in the VM's.
 Of course we did a big search in networking issue's, hanging machines,
 filewall  traffic tests, but were unable to find any problems. So we had a
 look into the zpool and dropped one of the mirrored SSD's from the pool (we
 had some indication the ZIL was not working ok). No success. After adding
 the disk, we  discovered the IO wait during the resilvering process was
 OK, or at least much better, again. So last night we did the same handling,
 dropped  added the same disk, and yes, again, the IO wait looked better.
 This morning the same story.

 Because this machine is a production machine, we cannot tolerate to much
 experiments. We now know this operation saves us for about 4 to 6 hours
 (time to resilvering), but we didn't had the courage to detach/attach the
 other SSD yet. We will try only a resilver, without detach/attach, this
 night, to see what happens.

 Can anybody explain how the detach/attach and resilver process works, and
 especially if there is something different during the resilvering and the
 handling of the SSD's/slog disks?


 Regards,


 Mart



 Do the I/O problems go away when only one of the SSDs is attached?


 Frankly, I'm betting that your SSDs are wearing out.   Resilvering will
 essentially be one big streaming write, which is optimal for SSDs (even an
 SLC-based SSD, as you likely have, performs far better when writing large
 amounts of data at once).  NFS (and to a lesser extent iSCSI) is generally a
 whole lot of random small writes, which are hard on an SSD (especially
 MLC-based ones, but even SLC ones).   The resilvering process is likely
 turning many of the random writes coming in to the system into a large
 streaming write to the /resilvering/ drive.

 My guess is that the SSD you are having problems with has reached the end of
 it's useful lifespan, and the I/O problems you are seeing during normal
 operation are the result of that SSD's problems with committing data.
 There's no cure for this, other than replacing the SSD with a new one.

 SSDs are not hard drives. Even high-quality modern ones have /significantly/
 lower USE lifespans than an HD - that is, a heavily-used SSD will die well
 before a HD, but a very-lightly used SSD will likely outlast a HD.  And, in
 the case of SSDs, writes are far harder on the SSD than reads are.


 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-19 Thread Andrey Kuzmin
On Sat, Dec 19, 2009 at 7:20 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Sat, 19 Dec 2009, Colin Raven wrote:

 There is no original, there is no copy. There is one block with reference
 counters.

 - Fred can rm his file (because clearly it isn't a file, it's a filename
 and that's all)
 - result: the reference count is decremented by one - the data remains on
 disk.

 While the similarity to hard links is a good analogy, there really is a
 unique file in this case.  If Fred does a 'rm' on the file then the
 reference count on all the file blocks is reduced by one, and the block is
 freed if the reference count goes to zero.  Behavior is similar to the case
 where a snapshot references the file block.  If Janet updates a block in the
 file, then that updated block becomes unique to her copy of the file (and
 the reference count on the original is reduced by one) and it remains unique
 unless it happens to match a block in some other existing file (or snapshot
 of a file).

 When we are children, we are told that sharing is good.  In the case or
 references, sharing is usually good, but if there is a huge amount of
 sharing, then it can take longer to delete a set of files since the mutual
 references create a hot spot which must be updated sequentially.  Files
 are usually created slowly so we don't notice much impact from this sharing,
 but we expect (hope) that files will be deleted almost instantaneously.

I believe this has been taken care of in the space map design
(http://blogs.sun.com/bonwick/entry/space_maps provides a nice
overview).

Regards,
Andrey


 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Andrey Kuzmin
The downside you have described happens only when the same checksum is
used for data protection and duplicate detection. That implies sha256,
BTW, since fletcher-based dedupe has been dropped in recent builds.

On 12/17/09, Kjetil Torgrim Homme kjeti...@linpro.no wrote:
 Andrey Kuzmin andrey.v.kuz...@gmail.com writes:
 Darren J Moffat wrote:
 Andrey Kuzmin wrote:
 Resilvering has noting to do with sha256: one could resilver long
 before dedupe was introduced in zfs.

 SHA256 isn't just used for dedup it is available as one of the
 checksum algorithms right back to pool version 1 that integrated in
 build 27.

 'One of' is the key word. And thanks for code pointers, I'll take a
 look.

 I didn't mention sha256 at all :-).  the reasoning is the same no matter
 what hash algorithm you're using (fletcher2, fletcher4 or sha256.  dedup
 doesn't require sha256 either, you can use fletcher4.

 the question was: why does data have to be compressed before it can be
 recognised as a duplicate?  it does seem like a waste of CPU, no?  I
 attempted to show the downsides to identifying blocks by their
 uncompressed hash.  (BTW, it doesn't affect storage efficiency, the same
 duplicate blocks will be discovered either way.)

 --
 Kjetil T. Homme
 Redpill Linpro AS - Changing the game




-- 
Regards,
Andrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Andrey Kuzmin
On Thu, Dec 17, 2009 at 6:14 PM, Kjetil Torgrim Homme
kjeti...@linpro.no wrote:
 Darren J Moffat darr...@opensolaris.org writes:
 Kjetil Torgrim Homme wrote:
 Andrey Kuzmin andrey.v.kuz...@gmail.com writes:

 Downside you have described happens only when the same checksum is
 used for data protection and duplicate detection. This implies sha256,
 BTW, since fletcher-based dedupe has been dropped in recent builds.

 if the hash used for dedup is completely separate from the hash used
 for data protection, I don't see any downsides to computing the dedup
 hash from uncompressed data.  why isn't it?

 It isn't separate because that isn't how Jeff and Bill designed it.

 thanks for confirming that, Darren.

 I think the design the have is great.

 I don't disagree.

 Instead of trying to pick holes in the theory can you demonstrate a
 real performance problem with compression=on and dedup=on and show
 that it is because of the compression step ?

 compression requires CPU, actually quite a lot of it.  even with the
 lean and mean lzjb, you will get not much more than 150 MB/s per core or
 something like that.  so, if you're copying a 10 GB image file, it will
 take a minute or two, just to compress the data so that the hash can be
 computed so that the duplicate block can be identified.  if the dedup
 hash was based on uncompressed data, the copy would be limited by
 hashing efficiency (and dedup tree lookup)

This isn't exactly true. If, speculatively, one stores two hashes, one
over the uncompressed data in the DDT and another over the compressed
data, kept with the data block for self-healing, then one saves the
compression step for duplicates and pays with an extra hash computation
for singletons. So the more precise question is whether the set of cases
where the duplicate/singleton mix and the compression/hashing bandwidth
ratios are such that this wins is non-empty (or, rather, of practical
importance).
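
As a rough way to gauge those bandwidth ratios on a particular box, one
can time a cheap compressor against sha256 over the same file. A minimal
sketch, assuming a multi-gigabyte test file at /tank/scratch/10g.img
(gzip -1 only approximates lzjb, and single-threaded userland numbers
are a ballpark at best):

  # per-core ballpark: compression vs. hashing throughput
  ptime gzip -1 -c /tank/scratch/10g.img > /dev/null
  ptime digest -a sha256 /tank/scratch/10g.img > /dev/null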

Regards,
Andrey

 I don't know how tightly interwoven the dedup hash tree and the block
 pointer hash tree are, or if it is all possible to disentangle them.

 conceptually it doesn't seem impossible, but that's easy for me to
 say, with no knowledge of the zio pipeline...

 oh, how does encryption play into this?  just don't?  knowing that
 someone else has the same block as you is leaking information, but that
 may be acceptable -- just make different pools for people you don't
 trust.

 Otherwise if you want it changed code it up and show how what you have
 done is better in all cases.

 I wish I could :-)

 --
 Kjetil T. Homme
 Redpill Linpro AS - Changing the game

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Andrey Kuzmin
Yet again, I don't see how RAID-Z reconstruction is related to the
subject discussed (what data should be sha256'ed when both dedupe and
compression are enabled, raw or compressed). sha256 has nothing to do
with bad block detection (maybe it will when encryption is implemented,
but for now sha256 is used for duplicate-candidate look-up only).

Regards,
Andrey




On Wed, Dec 16, 2009 at 5:18 PM, Kjetil Torgrim Homme
kjeti...@linpro.no wrote:
 Andrey Kuzmin andrey.v.kuz...@gmail.com writes:

 Kjetil Torgrim Homme wrote:
 for some reason I, like Steve, thought the checksum was calculated on
 the uncompressed data, but a look in the source confirms you're right,
 of course.

 thinking about the consequences of changing it, RAID-Z recovery would be
 much more CPU intensive if hashing was done on uncompressed data --

 I don't quite see how dedupe (based on sha256) and parity (based on
 crc32) are related.

 I tried to hint at an explanation:

 every possible combination of the N-1 disks would have to be
 decompressed (and most combinations would fail), and *then* the
 remaining candidates would be hashed to see if the data is correct.

 the key is that you don't know which block is corrupt.  if everything is
 hunky-dory, the parity will match the data.  parity in RAID-Z1 is not a
 checksum like CRC32, it is simply XOR (like in RAID 5).  here's an
 example with four data disks and one parity disk:

  D1  D2  D3  D4  PP
  00  01  10  10  01

 this is a single stripe with 2-bit disk blocks for simplicity.  if you
 XOR together all the blocks, you get 00.  that's the simple premise for
 reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and
 so on.

 so what happens if a bit flips in D4 and it becomes 00?  the total XOR
 isn't 00 anymore, it is 10 -- something is wrong.  but unless you get a
 hardware signal from D4, you don't know which block is corrupt.  this is
 a major problem with RAID 5, the data is irrevocably corrupt.  the
 parity discovers the error, and can alert the user, but that's the best
 it can do.  in RAID-Z the hash saves the day: first *assume* D1 is bad
 and reconstruct it from parity.  if the hash for the block is OK, D1
 *was* bad.  otherwise, assume D2 is bad.  and so on.

 so, the parity calculation will indicate which stripes contain bad
 blocks.  but the hashing, the sanity check for which disk blocks are
 actually bad must be calculated over all the stripes a ZFS block
 (record) consists of.

 this would be done on a per recordsize basis, not per stripe, which
 means reconstruction would fail if two disk blocks (512 octets) on
 different disks and in different stripes go bad.  (doing an exhaustive
 search for all possible permutations to handle that case doesn't seem
 realistic.)

 actually this is the same for compression before/after hashing.  it's
 just that each permutation is more expensive to check.

 in addition, hashing becomes slightly more expensive since more data
 needs to be hashed.

 overall, my guess is that this choice (made before dedup!) will give
 worse performance in normal situations in the future, when dedup+lzjb
 will be very common, at a cost of faster and more reliable resilver.  in
 any case, there is not much to be done about it now.

 --
 Kjetil T. Homme
 Redpill Linpro AS - Changing the game

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Andrey Kuzmin
On Wed, Dec 16, 2009 at 7:25 PM, Kjetil Torgrim Homme
kjeti...@linpro.no wrote:
 Andrey Kuzmin andrey.v.kuz...@gmail.com writes:
 Yet again, I don't see how RAID-Z reconstruction is related to the
 subject discussed (what data should be sha256'ed when both dedupe and
 compression are enabled, raw or compressed ). sha256 has nothing to do
 with bad block detection (may be it will when encryption is
 implemented, but for now sha256 is used for duplicate candidates
 look-up only).

 how do you think RAID-Z resilvering works?  please correct me where I'm
 wrong.

Resilvering has nothing to do with sha256: one could resilver long
before dedupe was introduced in zfs.

Regards,
Andrey


 --
 Kjetil T. Homme
 Redpill Linpro AS - Changing the game

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Troubleshooting dedup performance

2009-12-16 Thread Andrey Kuzmin
On Wed, Dec 16, 2009 at 6:41 PM, Chris Murray chrismurra...@gmail.com wrote:
 Hi,

 I run a number of virtual machines on ESXi 4, which reside in ZFS file
 systems and are accessed over NFS. I've found that if I enable dedup,
 the virtual machines immediately become unusable, hang, and whole
 datastores disappear from ESXi's view. (See the attached screenshot from
 vSphere client at around the 21:54 mark for the drop in connectivity).
 I'm on OpenSolaris Preview, build 128a.

 I've set dedup to what I believe are the least resource-intensive
 settings - checksum=fletcher4 on the pool,  dedup=on rather than

I believe checksum=fletcher4 is acceptable in dedup=verify mode only.
What you're doing is, seemingly, deduplication with a weak checksum and
no verification.
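
If you want to keep dedup enabled while taking the weak checksum out of
the equation, something along these lines should do (dataset name is
hypothetical; 'verify' byte-compares candidate blocks, so fletcher4
collisions become harmless at the cost of extra reads):

  # hedged sketch for the reported setup
  zfs set dedup=verify tank/vmstore          # keep fletcher4, but verify every match
  zfs set dedup=sha256,verify tank/vmstore   # or use a strong hash and still verify
  zfs set dedup=sha256 tank/vmstore          # or trust sha256 alone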


Regards,
Andrey

 verify, but it is still the same.

 Where can I start troubleshooting? I get the feeling that my hardware
 isn't up to the job, but some numbers to verify that would be nice
 before I start investigating an upgrade.

 vmstat showed plenty of idle CPU cycles, and zpool iostat just
 showed slow throughput, as the ESXi graph does. As soon as I set
 dedup=off, the virtual machines leapt into action again (22:15 on the
 screenshot).

 Many thanks,
 Chris

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Andrey Kuzmin
On Wed, Dec 16, 2009 at 7:46 PM, Darren J Moffat
darr...@opensolaris.org wrote:
 Andrey Kuzmin wrote:

 On Wed, Dec 16, 2009 at 7:25 PM, Kjetil Torgrim Homme
 kjeti...@linpro.no wrote:

 Andrey Kuzmin andrey.v.kuz...@gmail.com writes:

 Yet again, I don't see how RAID-Z reconstruction is related to the
 subject discussed (what data should be sha256'ed when both dedupe and
 compression are enabled, raw or compressed ). sha256 has nothing to do
 with bad block detection (may be it will when encryption is
 implemented, but for now sha256 is used for duplicate candidates
 look-up only).

 how do you think RAID-Z resilvering works?  please correct me where I'm
 wrong.

 Resilvering has nothing to do with sha256: one could resilver long
 before dedupe was introduced in zfs.

 SHA256 isn't just used for dedup it is available as one of the checksum
 algorithms right back to pool version 1 that integrated in build 27.

'One of' is the key word. And thanks for code pointers, I'll take a look.

Regards,
Andrey

 SHA256 is also used to checksum the pool uberblock.

 This means that SHA256 is used during resilvering and especially so if you
 have checksum=sha256 for your datasets.

 If you still don't believe me check the source code history:

 http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c
 http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sha256.c

 Look at the date when that integrated 31st October 2005.

 In case you still doubt me look at the fix I just integrated today:

 http://mail.opensolaris.org/pipermail/onnv-notify/2009-December/011090.html


 --
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Troubleshooting dedup performance

2009-12-16 Thread Andrey Kuzmin
On Wed, Dec 16, 2009 at 8:09 PM, Cyril Plisko cyril.pli...@mountall.com wrote:
 I've set dedup to what I believe are the least resource-intensive
 settings - checksum=fletcher4 on the pool,  dedup=on rather than

 I believe checksum=fletcher4 is acceptable in dedup=verify mode only.
 What you're doing is seemingly deduplication with weak checksum w/o
 verification.

 I think fletcher4 use for deduplication purposes was disabled [1]
 altogether, right before the build 129 cut.


 [1] 
 http://hg.genunix.org/onnv-gate.hg/diff/93c7076216f6/usr/src/common/zfs/zfs_prop.c

Peculiar fix: it quotes the reason as checksum errors because we are
not computing the byteswapped checksum, but solves it by dropping
support for the checksum instead of adding the byte-swapped checksum
computation. Am I missing something?

Regards,
Andrey




 --
 Regards,
        Cyril

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-15 Thread Andrey Kuzmin
On Tue, Dec 15, 2009 at 3:06 PM, Kjetil Torgrim Homme
kjeti...@linpro.no wrote:
 Robert Milkowski mi...@task.gda.pl writes:
 On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote:
 Because if you can de-dup anyway why bother to compress THEN check?
 This SEEMS to be the behaviour - i.e. I would suspect many of the
 files I'm writing are dups - however I see high cpu use even though
 on some of the copies I see almost no disk writes.

 First, the checksum is calculated after compression happens.

 for some reason I, like Steve, thought the checksum was calculated on
 the uncompressed data, but a look in the source confirms you're right,
 of course.

 thinking about the consequences of changing it, RAID-Z recovery would be
 much more CPU intensive if hashing was done on uncompressed data --

I don't quite see how dedupe (based on sha256) and parity (based on
crc32) are related.

Regards,
Andrey

 every possible combination of the N-1 disks would have to be
 decompressed (and most combinations would fail), and *then* the
 remaining candidates would be hashed to see if the data is correct.

 this would be done on a per recordsize basis, not per stripe, which
 means reconstruction would fail if two disk blocks (512 octets) on
 different disks and in different stripes go bad.  (doing an exhaustive
 search for all possible permutations to handle that case doesn't seem
 realistic.)

 in addition, hashing becomes slightly more expensive since more data
 needs to be hashed.

 overall, my guess is that this choice (made before dedup!) will give
 worse performance in normal situations in the future, when dedup+lzjb
 will be very common, at a cost of faster and more reliable resilver.  in
 any case, there is not much to be done about it now.

 --
 Kjetil T. Homme
 Redpill Linpro AS - Changing the game

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 + SFA F20 PCIe?

2009-12-14 Thread Andrey Kuzmin
On Mon, Dec 14, 2009 at 4:04 AM, Jens Elkner
jel+...@cs.uni-magdeburg.de wrote:
 On Sat, Dec 12, 2009 at 04:23:21PM +, Andrey Kuzmin wrote:
 As to whether it makes sense (as opposed to two distinct physical
 devices), you would have read cache hits competing with log writes for
 bandwidth. I doubt both will be pleased :-)

 Hmm - good point. What I'm trying to accomplish:

 Actually our current prototype thumper setup is:
        root pool (1x 2-way mirror SATA)
        hotspare  (2x SATA shared)
        pool1 (12x 2-way mirror SATA)   ~25% used       user homes
        pool2 (10x 2-way mirror SATA)   ~25% used       mm files, archives, 
 ISOs

 So pool2 is not really a problem - delivers about 600MB/s uncached,
 about 1.8 GB/s cached (i.e. read a 2nd time, tested with a 3.8GB iso)
 and is not continuously stressed. However sync write is ~ 200 MB/s
 or 20 MB/s and mirror, only.

 Problem is pool1 - user homes! So GNOME/firefox/eclipse/subversion/soffice
 usually via NFS and a little bit via samba - a lot of more or less small
 files, probably widely spread over the platters. E.g. checkin' out a
 project from a svn|* repository into a home takes hours. Also having
 its workspace on NFS isn't fun (compared to linux xfs driven local soft
 2-way mirror).

Flash-based read cache should help here by minimizing (metadata) read
latency, and a flash-based log would bring down write latency. The only
drawback of using a single F20 is that you're trying to minimize both
with the same device.
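
For completeness, attaching the two roles to separate devices, or to two
slices of the same F20 LUN (with the bandwidth caveat above), is just a
couple of commands. Device names below are hypothetical:

  zpool add pool1 log c3t0d0       # dedicated slog for synchronous writes
  zpool add pool1 cache c3t1d0     # dedicated L2ARC for reads
  # or carve one LUN into two slices and share it:
  zpool add pool1 log c3t0d0s0
  zpool add pool1 cache c3t0d0s1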


 So, seems to be a really interesting thing and I expect at least wrt.
 user homes a real improvement, no matter, how the final configuration
 will look like.

 Maybe the experts at the source are able to do some 4x SSD vs. 1xF20
 benchmarks? I guess at least if they turn out to be good enough, it
 wouldn't hurt ;-)

Would be interesting indeed.

Regards,
Andrey


  Jens Elkner wrote:
 ...
  whether it is possible/supported/would make sense to use a Sun Flash
  Accelerator F20 PCIe Card in an X4540 instead of 2.5" SSDs?

 Regards,
 jel.
 --
 Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
 Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
 39106 Magdeburg, Germany         Tel: +49 391 67 12768
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-14 Thread Andrey Kuzmin
On Sun, Dec 13, 2009 at 11:51 PM, Steve Radich, BitShop, Inc.
ste...@bitshop.com wrote:
 I enabled compression on a zfs filesystem with compression=gzip9 - i.e. 
 fairly slow compression - this stores backups of databases (which compress 
 fairly well).

 The next question is:  Is the CRC on the disk based on the uncompressed data 
 (which seems more likely to be able to be recovered) or based on the zipped 
 data (which seems slightly less likely to be able to be recovered).

 Why?

 Because if you can de-dup anyway why bother to compress THEN check? This 
 SEEMS to be the behaviour - i.e. I

ZFS deduplication is block-level, so to deduplicate one needs the data
broken into the blocks that will be written. With compression enabled,
you don't have those blocks until the data is compressed. Looks like a
waste of cycles indeed, but ...

Regards,
Andrey

 would suspect many of the files I'm writing are dups - however I see high cpu 
 use even though
 on some of the copies I see almost no disk writes.

 If the dup check logic happens first AND it's a duplicate I shouldn't see 
 hardly any CPU use (because it won't need to compress the data).

 Steve Radich
 BitShop.com
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-14 Thread Andrey Kuzmin
On Mon, Dec 14, 2009 at 9:53 PM,  casper@sun.com wrote:

On Mon, Dec 14, 2009 at 09:30:29PM +0300, Andrey Kuzmin wrote:
 ZFS deduplication is block-level, so to deduplicate one needs data
 broken into blocks to be written. With compression enabled, you don't
 have these until data is compressed. Looks like cycles waste indeed,
 but ...

ZFS compression is also block-level.  Both are done on ZFS blocks.  ZFS
compression is not streamwise.


 And if you enable verify and you checksum the uncompressed data, you
 will need to uncompress before you can verify.

Right, but 'verify' seems to be an 'extreme safety' measure and thus a
rather rare use case. Saving the cycles spent compressing duplicates
looks likely to outweigh the 'uncompress before verify' overhead, imo.

Regards,
Andrey


 Casper

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-14 Thread Andrey Kuzmin
On 12/14/09, Cyril Plisko cyril.pli...@mountall.com wrote:
 On Mon, Dec 14, 2009 at 9:32 PM, Andrey Kuzmin
 andrey.v.kuz...@gmail.com wrote:

 Right, but 'verify' seems to be 'extreme safety' and thus rather rare
 use case.

 Hmm, dunno. I wouldn't set anything, but scratch file system to
 dedup=on. Anything of even slight significance is set to dedup=verify.

 Saving cycles lost to compress duplicates looks to outweigh
 'uncompress before verify' overhead, imo.

 Dedup doesn't come for free - it imposes additional load on CPU. just
 like a checksumming and compression. The more fancy things we want our
 file system to do for us, the stronger CPU it'll take.

 --
 Regards,
 Cyril

Verify mode actually looks compression/dedupe order-neutral. To do the
byte comparison, one can either compress the new block or decompress the
old one, and the latter is usually a bit easier. Pipeline design may
dictate the choice, for instance one could compress the new block while
the old one is being fetched from disk for comparison, but overall it
looks pretty close. And with dedup=on, reversing the order, if feasible,
saves quite some cycles.

Regards,
Andrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 + SFA F20 PCIe?

2009-12-12 Thread Andrey Kuzmin
As to whether it makes sense (as opposed to two distinct physical
devices), you would have read cache hits competing with log writes for
bandwidth. I doubt both will be pleased :-)

On 12/12/09, Robert Milkowski mi...@task.gda.pl wrote:
 Jens Elkner wrote:
 Hi,

 just got a quote from our campus reseller, that readzilla and logzilla
 are not available for the X4540 - hmm strange Anyway, wondering
 whether it is possible/supported/would make sense to use a Sun Flash
 Accelerator F20 PCIe Card in an X4540 instead of 2.5" SSDs?

 If so, is it possible to partition the F20, e.g. into 36 GB logzilla,
 60GB readzilla (also interesting for other X servers)?


 IIRC the card presents 4x LUNs so you could use each of them for
 a different purpose.
 You could also use different slices.
 me or not. Is this correct?



 It still does. The capacitor is not for flushing data to disk drives!
 The card has a small amount of DRAM memory on it which is being flushed
 to FLASH. Capacitor is to make sure it actually happens if the power is
 lost.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



-- 
Regards,
Andrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SMC for ZFS administration in OpenSolaris 2009.06?

2009-12-11 Thread Andrey Kuzmin
On Fri, Dec 11, 2009 at 11:43 PM, Nick nick.couch...@seakr.com wrote:
 No, it is not, for a couple of reasons.  First of all, rumor is that SMC is 
 being discontinued in favor
of a WBEM/CIM-based management system.

Any specific implementation meant? Are there any plans wrt OpenPegasus?


Regards,
Andrey


Second, the SMC code is not open-source, which means it cannot be
included in OpenSolaris.  It is included in Solaris Express Community
Edition (SXCE), and there are several posts and instructions available
for installing the packages from SXCE to Opensolaris.  Even so, some
issues do tend to pop up getting it working - for example, logging in
is still got me stumped, because I can't log in as root due to
Opensolaris' RBAC configuration, but I also can't log in as the
unprivileged user I've created.

 You can also check out EON - go to http://eonstorage.blogspot.com/.  
 Unfortunately because of a bug in the 128 version of the code, the latest 
 build you can get for EON is 125, which doesn't include deduplication (if 
 that's important to you).  I also don't believe that EON currently has a 
 web-based management interface - it's in the works - so that doesn't really 
 help you there.

 -Nick
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS dedup report tool

2009-12-09 Thread Andrey Kuzmin
On Wed, Dec 9, 2009 at 2:26 PM, Bruno Sousa bso...@epinfante.com wrote:
 Hi all,

 Is there any way to generate some report related to the de-duplication
 feature of ZFS within a zpool/zfs pool?
 I mean, it's nice to have the dedup ratio, but I think it would also be
 good to have a report where we could see which directories/files have
 been found to be duplicates and therefore were deduplicated.

Nice to have at first glance, but could you detail any specific
use-case you have in mind?
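
Pool-wide numbers are already exposed, for what it's worth; anything
per-directory would need bookkeeping beyond what the on-disk DDT
records. A hedged sketch (pool name hypothetical, and assuming your
build's zdb has the -D option):

  zpool list tank             # DEDUP column shows the overall ratio
  zpool get dedupratio tank   # the same number as a pool property
  zdb -DD tank                # DDT histogram: blocks referenced 1x, 2x, 4x, ...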

Regards,
Andrey


 Thanks for your time,
 Bruno

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS dedup report tool

2009-12-09 Thread Andrey Kuzmin
On Wed, Dec 9, 2009 at 2:47 PM, Bruno Sousa bso...@epinfante.com wrote:
 Hi Andrey,

 For instance, I talked about deduplication to my manager and he was
 happy because less data = less storage, and therefore lower costs.
 However, now the IT group of my company needs to provide to the management
 board a report of duplicated data found per share, and in our case one
 share means one specific company department/division.
 Bottom line, the mindset is something like:

    * one share equals to a specific department within the company
    * the department demands a X value of data storage
    * the data storage costs Y
    * making a report of the amount of data consumed by a department,
      before and after deduplication, means that data storage costs can
      be seen per department

Do you currently have tools that report storage usage per share? What
you ask for looks like a request to make these deduplication-aware.

    * if there's a cost reduction due to the usage of deduplication, part
      of that money can be used for business , either IT related
      subjects or general business
    * management board wants to see numbers related to costs, and not
      things like the racio of deduplication in SAN01 is 3x, because
      for management this is geek talk

Just divide storage costs by the deduplication factor, and there you
are (provided you can do it per department).
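
A minimal sketch of that arithmetic, assuming one dataset (= share) per
department, e.g. tank/dept_a:

  zfs get -Hp -o value used tank/dept_a   # bytes actually consumed by the share
  zpool get dedupratio tank               # pool-wide dedup factor, e.g. 1.53x
  # rough pre-dedup cost estimate: used_bytes * dedupratio * price_per_byte

The caveat is that dedupratio is pool-wide, so this only approximates a
per-department saving.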

Regards,
Andrey


 I hope i was somehow clear, but i can try to explain better if needed.

 Thanks,
 Bruno

 Andrey Kuzmin wrote:
 On Wed, Dec 9, 2009 at 2:26 PM, Bruno Sousa bso...@epinfante.com wrote:

 Hi all,

 Is there any way to generate some report related to the de-duplication
 feature of ZFS within a zpool/zfs pool?
 I mean, its nice to have the dedup ratio, but it think it would be also
 good to have a report where we could see what directories/files have
 been found as repeated and therefore they suffered deduplication.


 Nice to have at first glance, but could you detail on any specific
 use-case you see?

 Regards,
 Andrey


 Thanks for your time,
 Bruno

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS dedup report tool

2009-12-09 Thread Andrey Kuzmin
On Wed, Dec 9, 2009 at 10:43 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Wed, 9 Dec 2009, Bruno Sousa wrote:

 Despite the fact that I agree in general with your comments, in reality
 it all comes down to money.
 So in this case, if I could prove that ZFS was able to find X amount of
 duplicated data, and since that X amount of data has a price of Y per
 GB, IT could be seen as a business enabler instead of a cost centre.

 Most of the cost of storing business data is related to the cost of backing
 it up and administering it rather than the cost of the system on which it is
 stored.  In this case it is reasonable to know the total amount of user data
 (and charge for it), since it likely needs to be backed up and managed.
  Deduplication does not help much here.

Um, I thought deduplication had been invented to reduce the backup window :).

Regards,
Andrey

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] freeNAS moves to Linux from FreeBSD

2009-12-08 Thread Andrey Kuzmin
On Tue, Dec 8, 2009 at 7:02 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Mon, 7 Dec 2009, Michael DeMan (OA) wrote:

 Args for FreeBSD + ZFS:

 - Limited budget
 - We are familiar with managing FreeBSD.
 - We are familiar with tuning FreeBSD.
 - Licensing model

 Args against OpenSolaris + ZFS:
 - Hardware compatibility
 - Lack of knowledge for tuning and associated costs for training staff to
 learn 'yet one more operating system' they need to support.
 - Licensing model

 If you think about it a little bit, you will see that there is no
 significant difference in the licensing model between FreeBSD+ZFS and
 OpenSolaris+ZFS.  It is not possible to be a little bit pregnant. Either
 one is pregnant, or one is not.


Well, FreeBSD pretends it's possible, by shipping ZFS while bearing a
BSD license at the same time.

Regards,
Andrey

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Seagate announces enterprise SSD

2009-12-08 Thread Andrey Kuzmin
On Tue, Dec 8, 2009 at 9:32 PM, Richard Elling richard.ell...@gmail.com wrote:
 FYI,
 Seagate has announced a new enterprise SSD.  The specs appear
 to be competitive:
        + 2.5 form factor
        + 5 year warranty
        + power loss protection
        + 0.44% annual failure rate (AFR) (2M hours MTBF, IMHO too low :-)
        + UER 1e-16 (new), 1e-15 (5 years)
        + 30,000/25,000 4 KB read IOPS (peak/aligned zero offset)
        + 30,000/10,500 4 KB write IOPS (peak/aligned zero offset)

IIRC, the last figures are for the 200GB model, with write performance
degrading by a factor of two for the 100GB model and by another factor
of two for the 50GB one. Parallelization, or rather the lack of it.

Regards,
Andrey


 http://www.seagate.com/www/en-us/products/servers/pulsar/pulsar/
 http://storageeffect.media.seagate.com/2009/12/storage-effect/seagate-pulsar-the-first-enterprise-ready-ssd/
 http://www.seagate.com/docs/pdf/marketing/po_pulsar.pdf
  -- richard

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Transaction consistency of ZFS

2009-12-07 Thread Andrey Kuzmin
On Sun, Dec 6, 2009 at 8:11 PM, Anurag Agarwal anu...@kqinfotech.com wrote:
 Hi,

 My reading of the write code of ZFS (zfs_write in zfs_vnops.c) is that all the
 writes in ZFS are logged in the ZIL. And if that indeed is the case, then

IIRC, there is some upper limit (1MB?) on the size of writes that go
through the ZIL, with larger ones executed directly. Yet again, this is
an outsider's impression, not the architect's statement.
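
If someone wants to check the threshold on their build, something like
the following should print it (the tunable name zfs_immediate_write_sz
and its width are assumptions on my part; run against a 64-bit kernel
as root):

  # inspect the ZIL immediate-write threshold, in bytes
  echo "zfs_immediate_write_sz/E" | mdb -k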

Regards,
Andrey

 yes, ZFS does guarantee the sequential consistency, even when there are
 power outage or server crash. You might loose some writes if ZIL has not
 committed to disk. But that would not change the sequential consistency
 guarantee.

 There is no need to do a fsync or open the file with O_SYNC. It should work
 as it is.

 I have not done any experiments to verify this, so please take my
 observation with a pinch of salt.
 Any ZFS developers to verify or refute this.

 Regards,
 Anurag.

 On Sun, Dec 6, 2009 at 8:12 AM, nxyyt schumi@gmail.com wrote:

 This question is forwarded from ZFS-discussion. Hope any developer can
 throw some light on it.

 I'm a newbie to ZFS. I have a special question against the COW transaction
 of ZFS.

 Does ZFS keep the sequential consistency of the same file when it meets a
 power outage or server crash?

 Assume following scenario:

 My application has only a single thread and it appends the data to the
 file continuously. Suppose at time t1, it appends a buf named A to the file.
 At time t2, which is later than t1, it appends a buf named B to the file. If
 the server crashes after t2, is it possible the buf B is flushed back to the
 disk but buf A is not?

 My application appends to the file only, without truncation or overwrite. Does
 ZFS guarantee that data written to a file in sequential or causal order
 is flushed to disk in the same order?

  If the uncommitted write operations to a single file are always bound to
 the same open transaction group, and all transaction groups are committed in
 sequential order, I think the answer should be YES. In other words,
 [b]is there only one open transaction group at any time, and are the
 transaction groups committed in order for a single pool?[/b]


 Hope anybody can help me clarify it. Thank you very much!
 --
 This message posted from opensolaris.org
 ___
 zfs-code mailing list
 zfs-c...@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-code



 --
 Anurag Agarwal
 CEO, Founder
 KQ Infotech, Pune
 www.kqinfotech.com
 9881254401
 Coordinator Akshar Bharati
 www.aksharbharati.org
 Spreading joy through reading

 ___
 zfs-code mailing list
 zfs-c...@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-code


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss