Re: [zfs-discuss] Sun Flash Accelerator F20
On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling <richard.ell...@gmail.com> wrote:
> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:
>> Andrey Kuzmin wrote:
>>> Well, I'm more accustomed to "sequential vs. random", but YMMV. As to
>>> 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting
>>> into cache), did you have write-back enabled?
>> It's a sustained number, so it shouldn't matter.
> That is only 34 MB/sec. The disk can do better for sequential writes.
> Note: in ZFS, such writes will be coalesced into 128KB chunks, so this
> is just 256 IOPS in the controller, not 64K.

Regards, Andrey

> -- richard
> ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
> http://nexenta-rotterdam.eventbrite.com/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski <mi...@task.gda.pl> wrote:
> On 11/06/2010 09:22, sensille wrote:
>> Andrey Kuzmin wrote:
>>> On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling wrote:
>>>> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:
>>>>> Andrey Kuzmin wrote:
>>>>>> Well, I'm more accustomed to "sequential vs. random", but YMMV. As
>>>>>> to 67000 512-byte writes (this sounds suspiciously close to 32 MB
>>>>>> fitting into cache), did you have write-back enabled?
>>>>> It's a sustained number, so it shouldn't matter.
>>>> That is only 34 MB/sec. The disk can do better for sequential writes.
>>>> Note: in ZFS, such writes will be coalesced into 128KB chunks, so
>>>> this is just 256 IOPS in the controller, not 64K.
>> No, it's 67k ops; it was a completely ZFS-free test setup. iostat also
>> confirmed the numbers. It's a really simple test everyone can run:
>>
>> # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512
>
> I did a test on my workstation a moment ago and got about 21k IOPS from
> my SATA drive (iostat). The trick here, of course, is that this is a
> sequential write with no other workload going on, so the drive can
> nicely coalesce these I/Os and do sequential writes with large blocks.

Exactly, though one might still wonder where the coalescing actually happens: in the respective OS layer or in the controller. Nonetheless, this is hardly a common use case one would design h/w for.

Regards, Andrey

> -- Robert Milkowski
> http://milek.blogspot.com
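The arithmetic behind the coalescing point in the exchange above can be checked directly (a sketch; plain shell arithmetic, nothing ZFS-specific):

```shell
# 67000 512-byte writes/s is modest bandwidth once expressed in MB/s,
# and only a few hundred controller IOPS once coalesced into 128 KB chunks.
iops=67000
bs=512
bytes_per_sec=$((iops * bs))
echo "$((bytes_per_sec / 1000000)) MB/s"                 # -> 34 MB/s
echo "$((bytes_per_sec / (128 * 1024))) coalesced IOPS"  # -> 261, i.e. roughly the "just 256" above
```

So the 67k figure and the "34 MB/sec, ~256 IOPS after coalescing" figure describe the same workload from two vantage points.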
Re: [zfs-discuss] Sun Flash Accelerator F20
On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <mi...@task.gda.pl> wrote:
> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>> I would be interested to know how many IOPS an OS like Solaris is able
>> to push through a single device interface. The normal driver stack is
>> likely limited as to how many IOPS it can sustain for a given LUN,
>> since the driver stack is optimized for high-latency devices like disk
>> drives. If you are creating a driver stack, the design decisions you
>> make when requests will be satisfied in about 12ms would be much
>> different than if requests are satisfied in 50us. Limitations of
>> existing software stacks are likely reasons why Sun is designing
>> hardware with more device interfaces and more independent devices.
>
> Open Solaris 2009.06, 1KB READ I/O:
>
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0

/dev/null is usually a poor choice for a test like this. Just to be on the safe side, I'd rerun it with /dev/random.

Regards, Andrey

> # iostat -xnzCM 1 | egrep "device|c[0123]$"
> [...]
>     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
> 17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
>                     extended device statistics
>     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
> 17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
>                     extended device statistics
>     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
> 17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
>                     extended device statistics
>     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
> 17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
>                     extended device statistics
>     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
> 17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
>
> Now let's see what it looks like for a single SAS connection, but with
> dd to 11 SSDs:
>
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0
> # iostat -xnzCM 1 | egrep "device|c[0123]$"
> [...]
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  968 c0
>                     extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  968 c0
>                     extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  967 c0
>                     extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  966 c0
>                     extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  968 c0
>                     extended device statistics
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  967 c0
>
> It looks like a single CPU core still hasn't been saturated and the
> bottleneck is in the device rather than the OS/CPU. So the MPT driver in
> Solaris 2009.06 can do at least 100,000 IOPS to a single SAS port.
>
> It also scales well - I ran the above dd's over 4x SAS ports at the same
> time and it scaled linearly, achieving well over 400k IOPS.
>
> h/w used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z
> (fw. 1.27.3.0), connected to F5100.
>
> -- Robert Milkowski
> http://milek.blogspot.com
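As a cross-check on the benchmark above (plain shell arithmetic, a sketch): the aggregate rate divides out to roughly 9.5k IOPS per SSD, and 104k 1 KB reads/s matches the ~101.8 MB/s iostat reports in the Mr/s column:

```shell
# Aggregate r/s reported for the c0 controller across 11 concurrent dd's
total_iops=104243
ssds=11
echo "$((total_iops / ssds)) IOPS per SSD"    # per-device share of the SAS port
echo "$((total_iops / 1024)) MB/s aggregate"  # 1 KB reads, so r/s / 1024 ~ Mr/s
```

The second line landing on ~101 MB/s, right where iostat's Mr/s column sits, is a useful sanity check that the r/s figures are self-consistent.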
Re: [zfs-discuss] Sun Flash Accelerator F20
Sorry, my bad. _Reading_ from /dev/null may be an issue, but not writing to it, of course.

Regards, Andrey

On Thu, Jun 10, 2010 at 6:46 PM, Robert Milkowski <mi...@task.gda.pl> wrote:
> On 10/06/2010 15:39, Andrey Kuzmin wrote:
>> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski wrote:
>>> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>>>> I would be interested to know how many IOPS an OS like Solaris is
>>>> able to push through a single device interface. [...]
>>> Open Solaris 2009.06, 1KB READ I/O:
>>>
>>> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
>> /dev/null is usually a poor choice for a test like this. Just to be on
>> the safe side, I'd rerun it with /dev/random.
> That wouldn't work, would it? Please notice that I'm reading *from* an
> SSD and writing *to* /dev/null.
>
> -- Robert Milkowski
> http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20
As to your results, it sounds almost too good to be true. As Bob has pointed out, h/w design targeted hundreds of IOPS, and it was hard to believe it could scale 100x. Fantastic.

Regards, Andrey

On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <mi...@task.gda.pl> wrote:
> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>> I would be interested to know how many IOPS an OS like Solaris is able
>> to push through a single device interface. The normal driver stack is
>> likely limited as to how many IOPS it can sustain for a given LUN,
>> since the driver stack is optimized for high-latency devices like disk
>> drives. [...]
>
> Open Solaris 2009.06, 1KB READ I/O:
>
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
> # iostat -xnzCM 1 | egrep "device|c[0123]$"
> [...]
>     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
> 17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
> [...]
>
> Now let's see what it looks like for a single SAS connection, but with
> dd to 11 SSDs:
>
> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
> [... ten more dd's, c0t1d0 through c0t11d0 ...]
> # iostat -xnzCM 1 | egrep "device|c[0123]$"
> [...]
>      r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w   %b device
> 104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0  968 c0
> [...]
>
> It looks like a single CPU core still hasn't been saturated and the
> bottleneck is in the device rather than the OS/CPU. So the MPT driver in
> Solaris 2009.06 can do at least 100,000 IOPS to a single SAS port. It
> also scales well - I ran the above dd's over 4x SAS ports at the same
> time and it scaled linearly, achieving well over 400k IOPS.
>
> h/w used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z
> (fw. 1.27.3.0), connected to F5100.
>
> -- Robert Milkowski
> http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen <sensi...@gmx.net> wrote:
> Andrey Kuzmin wrote:
>> As to your results, it sounds almost too good to be true. As Bob has
>> pointed out, h/w design targeted hundreds of IOPS, and it was hard to
>> believe it could scale 100x. Fantastic.
> Hundreds of IOPS is not quite true, even with hard drives. I just
> tested a Hitachi 15k drive and it handles 67000 512-byte linear
> writes/s, cache enabled.
> --Arne

Linear? Maybe "sequential"?

Regards, Andrey

>> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski wrote:
>>> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>>>> I would be interested to know how many IOPS an OS like Solaris is
>>>> able to push through a single device interface. [...]
Re: [zfs-discuss] Sun Flash Accelerator F20
Well, I'm more accustomed to "sequential vs. random", but YMMV. As to 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting into cache), did you have write-back enabled?

Regards, Andrey

On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen <sensi...@gmx.net> wrote:
> Andrey Kuzmin wrote:
>> On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen wrote:
>>> Andrey Kuzmin wrote:
>>>> As to your results, it sounds almost too good to be true. [...]
>>> Hundreds of IOPS is not quite true, even with hard drives. I just
>>> tested a Hitachi 15k drive and it handles 67000 512-byte linear
>>> writes/s, cache enabled.
>> Linear? Maybe "sequential"?
> Aren't these synonyms? Linear as opposed to random.
Re: [zfs-discuss] Compellant announces zNAS
I believe the name is Compellent Technologies, http://www.google.com/finance?q=NYSE:CML.

Regards, Andrey

On Wed, Apr 28, 2010 at 5:54 AM, Richard Elling <richard.ell...@richardelling.com> wrote:
> Today, Compellant announced their zNAS addition to their unified
> storage line. zNAS uses ZFS behind the scenes.
> http://www.compellent.com/Community/Blog/Posts/2010/4/Compellent-zNAS.aspx
> Congrats Compellant!
> -- richard
> ZFS storage and performance consulting at http://www.RichardElling.com
> ZFS training on deduplication, NexentaStor, and NAS performance
> Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] Secure delete?
No, not until all snapshots referencing the file in question are removed. The simplest way to understand snapshots is to think of them as references: any file-system object (say, a file or a block) is only removed when its reference count drops to zero.

Regards, Andrey

On Sat, Apr 10, 2010 at 10:20 PM, Roy Sigurd Karlsbakk <r...@karlsbakk.net> wrote:
> Hi all
>
> Is it possible to securely delete a file from a zfs dataset/zpool once
> it's been snapshotted, meaning delete (and perhaps overwrite) all
> copies of this file?
>
> Best regards
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> In all pedagogy it is essential that the curriculum be presented
> intelligibly. It is an elementary imperative for every pedagogue to
> avoid excessive use of idioms of foreign origin. In most cases adequate
> and relevant synonyms exist in Norwegian.
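The reference-count model above can be illustrated with ordinary hard links, which behave the same way at the file level (a plain-POSIX sketch, not ZFS; the snapshot's role is played by a second directory entry):

```shell
d=$(mktemp -d)
echo secret > "$d/file"
ln "$d/file" "$d/snapshot-ref"   # second reference, standing in for a snapshot
rm "$d/file"                     # "delete" the file...
out=$(cat "$d/snapshot-ref")     # ...but the data is still reachable
echo "$out"
rm "$d/snapshot-ref"             # only now does the refcount hit zero
rmdir "$d"
```

The same logic explains why overwriting the live file does not help: a snapshot still references the old blocks, so a secure delete has to wait until every referencing snapshot is destroyed.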
Re: [zfs-discuss] ZFS caching of compressed data
There was a discussion of this topic on the list about a month ago, and I was told that similar ideas (compressed metadata/data in ARC/L2ARC) are on the ZFS dev agenda.

Regards, Andrey

On Sun, Mar 28, 2010 at 2:42 AM, Stuart Anderson <ander...@ligo.caltech.edu> wrote:
> On Oct 2, 2009, at 11:54 AM, Robert Milkowski wrote:
>> Stuart Anderson wrote:
>>> On Oct 2, 2009, at 5:05 AM, Robert Milkowski wrote:
>>>> Stuart Anderson wrote:
>>>>> I am wondering if the following idea makes any sense as a way to
>>>>> get ZFS to cache compressed data in DRAM. In particular, given a
>>>>> 2-way zvol mirror of highly compressible data on persistent storage
>>>>> devices, what would go wrong if I dynamically added a ramdisk as a
>>>>> 3rd mirror device at boot time? Would ZFS route most (or all) of
>>>>> the reads to the lower-latency DRAM device? In the case of an
>>>>> unclean shutdown where there was no opportunity to actively remove
>>>>> the ramdisk from the pool before shutdown, would there be any
>>>>> problem at boot time when the ramdisk is still registered but
>>>>> unavailable? Note, this Gedanken experiment is for highly
>>>>> compressible (~9x) metadata for a non-ZFS filesystem.
>>>> You would only get about 33% of I/Os served from the ramdisk.
>>> With SVM you are allowed to specify a read policy on sub-mirrors for
>>> just this reason, e.g.,
>>> http://wikis.sun.com/display/BigAdmin/Using+a+SVM+submirror+on+a+ramdisk+to+increase+read+performance
>>> Is there no equivalent in ZFS?
>> Nope, at least not right now.
> Curious if anyone knows of any other ideas/plans for ZFS caching
> compressed data internally, or externally via a ramdisk mirror device
> that handles most/all read requests?
>
> Thanks.
> --
> Stuart Anderson ander...@ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24, 2010 at 11:09 PM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:
> On Wed, 24 Feb 2010, Steve wrote:
>> The overhead I was thinking of was more in the pointer structures...
>> (bearing in mind this is a 128-bit file system). I would guess that
>> memory requirements would be HUGE for all these files... otherwise the
>> ARC is going to struggle, and the paging system goes mental?
> It is not reasonable to assume that zfs has to retain everything in
> memory.

At the same time, 400M files in a single directory should lead to a lot of contention on the locks associated with look-ups. Spreading the files across a reasonable number of directories could mitigate this.

Regards, Andrey

> I have a directory here containing a million files and it has not
> caused any strain for zfs at all, although it can cause considerable
> stress on applications. 400 million tiny files is quite a lot and I
> would hate to use anything but mirrors with so many tiny files.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Thu, Feb 25, 2010 at 12:26 AM, Steve <steve.jack...@norman.com> wrote:
> That's not the issue here, as they are spread out in a folder structure
> based on an integer split into hex blocks... 00/00/00/01 etc... but the
> number of pointers involved with all these files, and directories
> (which are files), must have an impact on a system with limited RAM?
> There is 4GB RAM in this system btw...

If any significant portion of these 400M files is accessed on a regular basis, you'd be (1) stressing the ARC to its limits and (2) stressing the spindles so that any concurrent sequential I/O would suffer. Small files are always an issue; try moving them off HDDs onto mirrored SSDs, not necessarily the most expensive ones. 400M 2K files is just 400GB, within the reach of a few SSDs.

Regards, Andrey

> --
> This message posted from opensolaris.org
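The hex fan-out scheme described above can be sketched as a tiny shell helper (illustrative only; the function name is made up):

```shell
# Map an integer file id to a 00/00/00/01-style path so that no single
# directory accumulates millions of entries.
id_to_path() {
    hex=$(printf '%08x' "$1")
    # insert a slash after every pair of hex digits, drop the trailing one
    printf '%s\n' "$hex" | sed 's/../&\//g; s/\/$//'
}

id_to_path 1       # -> 00/00/00/01
id_to_path 4096    # -> 00/00/10/00
```

With two hex digits per level, each directory holds at most 256 entries or subdirectories, which keeps per-directory lookup and lock contention bounded regardless of the total file count.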
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Thu, Feb 25, 2010 at 12:34 AM, Andrey Kuzmin <andrey.v.kuz...@gmail.com> wrote:
> On Thu, Feb 25, 2010 at 12:26 AM, Steve <steve.jack...@norman.com> wrote:
>> That's not the issue here, as they are spread out in a folder
>> structure based on an integer split into hex blocks... 00/00/00/01
>> etc... [...]
> If any significant portion of these 400M files is accessed on a regular
> basis, you'd be (1) stressing the ARC to its limits (2) stressing the
> spindles so that any concurrent sequential I/O would suffer. Small
> files are always an issue, try moving them off HDDs onto mirrored SSDs,
> not necessarily the most expensive ones. 400M 2K files is just 400GB,
> within the reach of a few SSDs.

1K meant, fat fingers.

Regards, Andrey
Re: [zfs-discuss] Observations about compressability of metadata L2ARC
I don't see why this couldn't be extended beyond metadata (+1 for the idea): if a zvol is compressed, ARC/L2ARC could store compressed data. The gain is apparent: if the user has compression enabled for a volume, he/she expects the volume's data to compress at a good ratio, yielding a significant reduction in ARC memory footprint and a usable-capacity boost for the L2ARC.

Regards, Andrey

On Sun, Feb 21, 2010 at 7:24 PM, Tomas Ögren <st...@acc.umu.se> wrote:
> Hello.
>
> I got an idea... how about creating a ramdisk, making a pool out of it,
> then making compressed zvols and adding those as L2ARC. Instant
> compressed ARC ;)
>
> So I did some tests with secondarycache=metadata...
>
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> ftp         5.07T  1.78T    198     17  11.3M  1.51M
>   raidz2    1.72T   571G     58      5  3.78M   514K
>     ...
>   raidz2    1.64T   656G     75      6  3.78M   524K
>     ...
>   raidz2    1.70T   592G     64      5  3.74M   512K
>     ...
> cache           -      -      -      -      -      -
>   /dev/zvol/dsk/ramcache/ramvol   84.4M  7.62M   4   17  45.4K  233K
>   /dev/zvol/dsk/ramcache/ramvol2  84.3M  7.71M   4   17  41.5K  233K
>   /dev/zvol/dsk/ramcache/ramvol3    84M     8M   4   18  42.0K  236K
>   /dev/zvol/dsk/ramcache/ramvol4  84.8M  7.25M   3   17  39.1K  225K
>   /dev/zvol/dsk/ramcache/ramvol5  84.9M  7.08M   3   14  38.0K  193K
>
> NAME              RATIO  COMPRESS
> ramcache/ramvol   1.00x  off
> ramcache/ramvol2  4.27x  lzjb
> ramcache/ramvol3  6.12x  gzip-1
> ramcache/ramvol4  6.77x  gzip
> ramcache/ramvol5  6.82x  gzip-9
>
> This was after 'find /ftp' had been running for about 1h, along with
> all the background noise of its regular nfs serving tasks.
>
> I took an image of the uncompressed one (ramvol) and ran that through
> regular gzip and got 12-14x compression, probably due to the smaller
> block size (default 8k) in the zvols, so I tried with both 8k and 64k.
> After not running that long (but at least filled), I got:
>
> NAME               RATIO  COMPRESS  VOLBLOCK
> ramcache/ramvol    1.00x  off             8K
> ramcache/ramvol2   5.57x  lzjb            8K
> ramcache/ramvol3   7.56x  lzjb           64K
> ramcache/ramvol4   7.35x  gzip-1          8K
> ramcache/ramvol5  11.68x  gzip-1         64K
>
> Not sure how to measure the cpu usage of the various compression levels
> for (de)compressing this data. It does show that having metadata in ram
> compressed could be a big win though, if you have cpu cycles to spare.
>
> Thoughts?
>
> /Tomas
> --
> Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
> |- Student at Computing Science, University of Umeå
> `- Sysadmin at {cs,acc}.umu.se - 070-5858487
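Tomas's setup can be reconstructed roughly as follows. This is a configuration sketch inferred from the description above, not a verified transcript: device names and sizes are illustrative, and the commands assume (Open)Solaris with ramdiskadm available.

```shell
# Carve out a ramdisk and build a throwaway pool on it
ramdiskadm -a rc0 512m
zpool create ramcache /dev/ramdisk/rc0

# Create compressed zvols, varying compression and volblocksize as in the tables
zfs create -V 100m -o compression=lzjb   -o volblocksize=8k  ramcache/ramvol2
zfs create -V 100m -o compression=gzip-1 -o volblocksize=64k ramcache/ramvol5

# Add the compressed zvols to the data pool as L2ARC ("instant compressed ARC")
zpool add ftp cache /dev/zvol/dsk/ramcache/ramvol2 /dev/zvol/dsk/ramcache/ramvol5

# Restrict the L2ARC to metadata, as in the test above
zfs set secondarycache=metadata ftp
```

Note the caveat implicit in the thread: the ramdisk contents do not survive a reboot, which is tolerable for L2ARC (a cache) in a way it would not be for a mirror device.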
Re: [zfs-discuss] improve meta data performance
Try an inexpensive MLC SSD (Intel/Micron) as L2ARC. It won't help metadata updates, but it should boost reads.

Regards, Andrey

On Thu, Feb 18, 2010 at 11:23 PM, Chris Banal <cba...@gmail.com> wrote:
> We have a SunFire X4500 running Solaris 10U5 which does about 5-8k nfs
> ops, of which about 90% are metadata. In hindsight it would have been
> significantly better to use a mirrored configuration, but we opted for
> 4 x (9+2) raidz2 at the time. We cannot take the downtime necessary to
> change the zpool configuration.
>
> We need to improve the metadata performance with little to no money.
> Does anyone have any suggestions? Is there such a thing as a
> Sun-supported NVRAM PCI-X card compatible with the X4500 which can be
> used as an L2ARC?
>
> Thanks,
> Chris
Re: [zfs-discuss] zfs send/receive : panic and reboot
Just an observation: the panic occurs in avl_add() when called from find_ds_by_guid(), which tries to add an already-existing snapshot id to the AVL tree (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_send.c#find_ds_by_guid).

HTH,
Andrey

On Tue, Feb 9, 2010 at 1:37 AM, Bruno Damour <br...@ruomad.net> wrote:
> On 02/ 8/10 06:38 PM, Lori Alt wrote:
>> Can you please send a complete list of the actions taken: the commands
>> you used to create the send stream, the commands used to receive the
>> stream. Also the output of `zfs list -t all` on both the sending and
>> receiving sides. If you were able to collect a core dump (it should be
>> in /var/crash/<hostname>), it would be good to upload it.
>>
>> The panic you're seeing is in the code that is specific to receiving a
>> dedup'ed stream. It's possible that you could do the migration if you
>> turned off dedup (i.e. didn't specify -D) when creating the send
>> stream. However, then we wouldn't be able to diagnose and fix what
>> appears to be a bug.
>>
>> The best way to get us the crash dump is to upload it here:
>> https://supportfiles.sun.com/upload
>> We need either both vmcore.X and unix.X, OR you can just send us
>> vmdump.X. Sometimes big uploads have mixed results, so if there is a
>> problem some helpful hints are on
>> http://wikis.sun.com/display/supportfiles/Sun+Support+Files+-+Help+and+Users+Guide,
>> specifically in section 7. It's best to include your name or your
>> initials or something in the name of the file you upload. As you might
>> imagine, we get a lot of files uploaded named vmcore.1.
>>
>> You might also create a defect report at http://defect.opensolaris.org/bz/
>>
>> Lori
>>
>> On 02/08/10 09:41, Bruno Damour wrote:
>>> Copied from opensolaris-discuss as this probably belongs here.
>>> I kept on trying to migrate my pool with children (see previous
>>> threads) and had the (bad) idea to try the -d option on the receive
>>> part. The system reboots immediately.
>>>
>>> Here is the log in /var/adm/messages:
>>>
>>> Feb 8 16:07:09 amber unix: [ID 836849 kern.notice]
>>> Feb 8 16:07:09 amber ^Mpanic[cpu1]/thread=ff014ba86e40:
>>> Feb 8 16:07:09 amber genunix: [ID 169834 kern.notice] avl_find() succeeded inside avl_add()
>>> Feb 8 16:07:09 amber unix: [ID 10 kern.notice]
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4660 genunix:avl_add+59 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46c0 zfs:find_ds_by_guid+b9 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46f0 zfs:findfunc+23 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c47d0 zfs:dmu_objset_find_spa+38c ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4810 zfs:dmu_objset_find+40 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4a70 zfs:dmu_recv_stream+448 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4c40 zfs:zfs_ioc_recv+41d ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4cc0 zfs:zfsdev_ioctl+175 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d00 genunix:cdev_ioctl+45 ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d40 specfs:spec_ioctl+5a ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4dc0 genunix:fop_ioctl+7b ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4ec0 genunix:ioctl+18e ()
>>> Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4f10 unix:brand_sys_syscall32+1ca ()
>>> Feb 8 16:07:09 amber unix: [ID 10 kern.notice]
>>> Feb 8 16:07:09 amber genunix: [ID 672855 kern.notice] syncing file systems...
>>> Feb 8 16:07:09 amber genunix: [ID 904073 kern.notice] done
>>> Feb 8 16:07:10 amber genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
>>> Feb 8 16:07:10 amber ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port
>>> Feb 8 16:07:35 amber genunix: [ID 10 kern.notice]
>>> Feb 8 16:07:35 amber genunix: [ID 665016 kern.notice] ^M100% done: 107693 pages dumped,
>>> Feb 8 16:07:35 amber genunix: [ID 851671 kern.notice] dump succeeded
>
> Hello,
> I'll try to do my best. Here are the commands:
>
> amber ~ # zfs unmount data
> amber ~ # zfs snapshot -r d...@prededup
> amber ~ # zpool destroy ezdata
> amber ~ # zpool create ezdata c6t1d0
> amber ~ # zfs set dedup=on ezdata
> amber ~ # zfs set compress=on ezdata
> amber ~ # zfs send -RD d...@prededup | zfs receive ezdata/data
> cannot receive new filesystem stream: destination 'ezdata/data' exists
> must specify -F to overwrite it
> amber ~ # zpool destroy ezdata
> amber ~ # zpool create ezdata c6t1d0
> amber ~ # zfs set compression=on ezdata
> amber ~ # zfs set dedup=on ezdata
> amber ~ # zfs send -RD d...@prededup | zfs receive -F ezdata/data
> cannot receive new filesystem stream: destination has snapshots (eg. ezdata/d...@prededup)
> must destroy them to overwrite it
>
> Each time the send/receive command took some hours and transferred 151G.
Re: [zfs-discuss] Impact of an enterprise class SSD on ZIL performance
On Fri, Feb 5, 2010 at 10:55 PM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:
> On Fri, 5 Feb 2010, Miles Nordin wrote:
>>> r...@nexenta:/volumes# hdadm write_cache off c3t5
>>> c3t5 write_cache disabled
>> You might want to repeat his test with an X25-E. If the X25-E is also
>> dropping cache flush commands (it might!), you may be, compared to
>> disabling the ZIL, slowing down your pool for no reason, and making it
>> more fragile as well, since an exported pool with a dead ZIL cannot be
>> imported.
> Others have tested the X25-E and found that with its cache enabled, it
> does drop flushed writes, but it is clearly not such a gaping chasm as
> the X25-M. Some time has passed, so there is the possibility that X25-E
> firmware has (or will) improve. If Sun offers an X25-E based device for
> use as an slog, you can be sure that it has been qualified for this
> purpose, and may contain modified firmware. The 'E' stands for Extreme,
> not Enterprise as some tend to believe.

Exactly. It would therefore be very interesting to hear about performance from anyone using a (real) enterprise SSD (which now spells STEC) as slog.

Regards, Andrey

> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] How to get a list of changed files between two snapshots?
On Wed, Feb 3, 2010 at 6:11 PM, Ross Walker <rswwal...@gmail.com> wrote:
> On Feb 3, 2010, at 9:53 AM, Henu <henrik.he...@tut.fi> wrote:
>> Okay, so first of all, it's true that send is always fast and 100%
>> reliable because it uses blocks to see differences. Good, and thanks
>> for this information. If everything else fails, I can parse the
>> information I want from the send stream :)
>>
>> But am I right that there are no methods other than the send command
>> to get the list of changed files?

At the zfs_send level there are no files, just DMU objects (modified in some txg, which is the basis for the changed/unchanged decision).

>> And in my situation I do not need to create snapshots. They are
>> already created. The only thing I need to do is get a list of all the
>> changed files (and maybe the location of the differences in them, but
>> I can do that manually if needed) between two already-created
>> snapshots.
> Not a ZFS method, but you could use rsync with the dry-run option to
> list all changed files between two file systems.

That's painfully resource-intensive on both the sending and receiving ends, and it would IMHO be really beneficial to come up with an interface that lets user-space (including off-the-shelf backup tools) iterate over the objects changed between two given snapshots.

Regards, Andrey

> -Ross
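For already-created snapshots, their contents are visible as read-only trees under the dataset's `.zfs/snapshot` directory, so any tree-diff tool works. A minimal sketch (the `snapdiff` helper name is made up; like the rsync approach, `diff -rq` reads file contents, so it is I/O-heavy rather than a ZFS-level diff):

```shell
# List files that differ between two directory trees, e.g.
#   snapdiff /tank/fs/.zfs/snapshot/monday /tank/fs/.zfs/snapshot/tuesday
snapdiff() {
    diff -rq "$1" "$2"
}

# Demo on two throwaway trees standing in for snapshots
d=$(mktemp -d)
mkdir "$d/snap1" "$d/snap2"
echo one  > "$d/snap1/changed";  echo two  > "$d/snap2/changed"
echo same > "$d/snap1/kept";     echo same > "$d/snap2/kept"
echo new  > "$d/snap2/added"
out=$(snapdiff "$d/snap1" "$d/snap2" || true)   # diff exits non-zero on differences
echo "$out"
rm -r "$d"
```

Expect one "differ" line for the modified file and an "Only in" line for the added one; identical files are not reported.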
Re: [zfs-discuss] Dedup memory overhead
On Thu, Jan 21, 2010 at 10:00 PM, Richard Elling richard.ell...@gmail.com wrote: On Jan 21, 2010, at 8:04 AM, erik.ableson wrote: Hi all, I'm going to be trying out some tests using b130 for dedup on a server with about 1,7Tb of useable storage (14x146 in two raidz vdevs of 7 disks). What I'm trying to get a handle on is how to estimate the memory overhead required for dedup on that amount of storage. From what I gather, the dedup hash keys are held in ARC and L2ARC and as such are in competition for the available memory. ... and written to disk, of course. For ARC sizing, more is always better. So the question is how much memory or L2ARC would be necessary to ensure that I'm never going back to disk to read out the hash keys. Better yet would be some kind of algorithm for calculating the overhead. eg - averaged block size of 4K = a hash key for every 4k stored and a hash occupies 256 bits. An associated question is then how does the ARC handle competition between hash keys and regular ARC functions? AFAIK, there is no special treatment given to the DDT. The DDT is stored like other metadata and (currently) not easily accounted for. Also the DDT keys are 320 bits. The key itself includes the logical and physical block size and compression. The DDT entry is even larger. Looking at dedupe code, I noticed that on-disk DDT entries are compressed less efficiently than possible: key is not compressed at all (I'd expect roughly 2:1 compression ration with sha256 data), while other entry data is currently passed through zle compressor only (I'd expect this one to be less efficient than off-the-shelf compressors, feel free to correct me if I'm wrong). Is this v1, going to be improved in the future? Further, with huge dedupe memory footprint and heavy performance impact when DDT entries need to be read from disk, it might be worthwhile to consider compression of in-core ddt entries (specifically for DDTs or, more generally, making ARC/L2ARC compression-aware). 
Has this been considered? Regards, Andrey I think it is better to think of the ARC as caching the uncompressed DDT blocks which were written to disk. The number of these will be data dependent. zdb -S poolname will give you an idea of the number of blocks and how well dedup will work on your data, but that means you already have the data in a pool. -- richard Based on these estimations, I think that I should be able to calculate the following:

  1,7 TB = 1740,8 GB = 1782579,2 MB = 1825361100,8 KB
  average block size: 4 KB -> 456340275,2 blocks
  hash key size: 256 bits -> 1,16823E+11 bits of hash key overhead
  = 14602888806,4 bytes = 14260633,6 KB = 13926,4 MB = 13,6 GB

Of course the big question on this will be the average block size - or better yet - to be able to analyze an existing datastore to see just how many blocks it uses and what is the current distribution of different block sizes. I'm currently playing around with zdb with mixed success on extracting this kind of data. That's also a worst case scenario since it's counting really small blocks and using 100% of available storage - highly unlikely.
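That worst-case arithmetic is easy to parameterize. A small helper (the 32-byte entry matches the 256-bit-key assumption above; as noted earlier in the thread, a real DDT entry with its 320-bit key plus block sizes and refcount is larger, so treat the result as a floor):

```python
def ddt_overhead_bytes(pool_bytes, avg_block_bytes=4096, entry_bytes=32):
    """Worst-case in-core dedup-table footprint: one entry per block,
    pool completely full of blocks of the given average size."""
    blocks = pool_bytes / avg_block_bytes
    return blocks * entry_bytes

pool = 1.7 * 2**40  # the 1.7 TB pool discussed above
print(round(ddt_overhead_bytes(pool) / 2**30, 1))  # -> 13.6 (GiB), matching the estimate
```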
# zdb -ddbb siovale/iphone Dataset siovale/iphone [ZPL], ID 2381, cr_txg 3764691, 44.6G, 99 objects ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0 Object lvl iblk dblk dsize lsize %full type 0 7 16K 16K 57.0K 64K 77.34 DMU dnode 1 1 16K 1K 1.50K 1K 100.00 ZFS master node 2 1 16K 512 1.50K 512 100.00 ZFS delete queue 3 2 16K 16K 18.0K 32K 100.00 ZFS directory 4 3 16K 128K 408M 408M 100.00 ZFS plain file 5 1 16K 16K 3.00K 16K 100.00 FUID table 6 1 16K 4K 4.50K 4K 100.00 ZFS plain file 7 1 16K 6.50K 6.50K 6.50K 100.00 ZFS plain file 8 3 16K 128K 952M 952M 100.00 ZFS plain file 9 3 16K 128K 912M 912M 100.00 ZFS plain file 10 3 16K 128K 695M 695M 100.00 ZFS plain file 11 3 16K 128K 914M 914M 100.00 ZFS plain file Now, if I'm understanding this output properly, object 4 is composed of 128KB blocks with a total size of 408MB, meaning that it uses 3264 blocks. Can someone confirm (or correct) that assumption? Also, I note that each object (as far as my limited testing has shown) has a single block size with no internal variation. Interestingly, all of my zvols seem to use fixed size blocks - that is, there is no variation in the block sizes - they're all the size defined on creation with no dynamic block sizes being used. I previously thought that the -b option set the maximum size, rather than fixing all blocks. Learned something today :-) # zdb -ddbb
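The object-4 reading can be confirmed with one line of arithmetic (dsize 408M at dblk 128K), assuming a fully-populated object with no holes:

```python
def zdb_block_count(dsize_bytes, dblk_bytes):
    """Blocks implied by a zdb object line: dsize divided by dblk, rounded up."""
    return -(-dsize_bytes // dblk_bytes)  # ceiling division

print(zdb_block_count(408 * 2**20, 128 * 2**10))  # -> 3264, confirming the assumption
```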
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
On Fri, Jan 15, 2010 at 2:07 AM, Christopher George cgeo...@ddrdrive.com wrote: Why not enlighten EMC/NTAP on this then? On the basic chemistry and possible failure characteristics of Li-Ion batteries? I will agree, if I had system level control as in either example, one could definitely help mitigate said risks compared to selling a card based product where I have very little control over the thermal envelopes I am subjected. Could you please elaborate on the last statement, provided you meant anything beyond UPS is a power-backup standard? Although, I do think the discourse is healthy and relevant. At this point, I am comfortable to agree to disagree. I respect your point of view, and do Same on my side. I don't object to your design decision, my objection was to the negative advertisement wrt another design. Good luck with beta and beyond. Regards, Andrey agree strongly that Li-Ion batteries play a critical and highly valued role in many industries. Thanks, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
On Thu, Jan 14, 2010 at 11:35 AM, Christopher George cgeo...@ddrdrive.com wrote: I'm not sure about others on the list, but I have a dislike of AC power bricks in my racks. I definitely empathize with your position concerning AC power bricks, but until the perfect battery is created, and we are far from it, it comes down to tradeoffs. I personally believe the ignition risk, thermal wear-out, and the inflexible proprietary nature of Li-Ion solutions simply outweigh the benefits of internal or all inclusive mounting for enterprise bound NVRAM. That's kind of an overstatement. NVRAM backed by on-board LI-Ion batteries has been used in storage industry for years; I can easily point out a company that has shipped tens of thousands of such boards over last 10 years. Regards, Andrey Is the state of the power input exposed to software in some way? In other terms, can I have a nagios check running on my server that triggers an alert if the power cable accidentally gets pulled out? Absolutely, the X1 monitors the external supply and can detect not only a disconnect but any loss of power. In all cases, the card throws an interrupt so that the device driver (and ultimately user space) can be immediately notified. The X1 does not rely on external power until the host power drops below a certain threshold, so attaching/detaching the external power cable has no effect on data integrity as long as the host is powered on. OK, which means that the UPS must be separate to the UPS powering the server then. Correct, a dedicated (in this case redundant) UPS is expected. Any plans on a pci-e multi-lane version then? Not at this time. In addition to the reduced power and thermal output, the PCIe x1 connector has the added benefit of not competing with other HBA's which do require a x4 or x8 PCIe connection. Very appreciative of the feedback! 
Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
On Thu, Jan 14, 2010 at 10:02 PM, Christopher George cgeo...@ddrdrive.com wrote: That's kind of an overstatement. NVRAM backed by on-board LI-Ion batteries has been used in storage industry for years; Respectfully, I stand by my three points of Li-Ion batteries as they relate to enterprise class NVRAM: ignition risk, thermal wear-out, and proprietary design. As a prior post stated, there is a dearth of published failure statistics of Li-Ion based BBUs. Why not enlighten EMC/NTAP on this then? I can easily point out a company that has shipped tens of thousands of such boards over last 10 years. No argument here, I would venture the risks for consumer based Li-Ion based products did not become apparent or commonly accepted until the user base grew several orders of magnitude greater than tens of thousands. For the record, I agree there is a marked convenience with an integrated high energy Li-Ion battery solution - but at what cost? Um, with Li-Ion battery in each and every of a billions of cell phones out there ... We chose an external solution because it is a proven and industry standard method of enterprise class data backup. Could you please elaborate on the last statement, provided you meant anything beyond UPS is a power-backup standard? Regards, Andrey Thanks, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] preview of new SSD based on SandForce controller
600 MB/s? I've heard 1.5 GB/s reported. On 1/5/10, Eric D. Mudama edmud...@bounceswoosh.org wrote: On Mon, Jan 4 at 16:43, Wes Felter wrote: Eric D. Mudama wrote: I am not convinced that a general purpose CPU, running other software in parallel, will be able to be timely and responsive enough to maximize bandwidth in an SSD controller without specialized hardware support. Fusion-io would seem to be a counter-example, since it uses a fairly simple controller (I guess the controller still performs ECC and maybe XOR) and the driver eats a whole x86 core. The result is very high performance. Wes Felter I see what you're saying, but it isn't obvious (to me) how well they're using all the hardware at hand. 2GB/s of bandwidth over their PCI-e link and what looks like a TON of NAND, with a nearly-dedicated x86 core... resulting in 600MB/s or something like that? While the number is very good for NAND flash SSDs, it seems like a TON of horsepower going to waste, and they still have a large onboard controller/FPGA. I guess enough CPU can make the units faster, but I'm just not sold. -- Eric D. Mudama edmud...@mail.bounceswoosh.org -- Regards, Andrey
Re: [zfs-discuss] getting decent NFS performance
And how do you expect the mirrored iSCSI volume to work after failover, with the secondary (ex-primary) unreachable? Regards, Andrey On Wed, Dec 23, 2009 at 9:40 AM, Erik Trimble erik.trim...@sun.com wrote: Charles Hedrick wrote: Is iSCSI reliable enough for this? YES. The original idea is a good one, and one that I'd not thought of. The (old) iSCSI implementation is quite mature, if not anywhere as nice (feature/flexibility-wise) as the new COMSTAR stuff. I'm thinking that just putting in a straight-through cable between the two machines is the best idea here, rather than going through a switch. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation
It might be helpful to contact SSD vendor, report the issue and inquire if half a year wearing out is expected behavior for this model. Further, if you have an option to replace one (or both) SSDs with fresh ones, this could tell for sure if they are the root cause. Regards, Andrey On Mon, Dec 21, 2009 at 1:18 PM, Erik Trimble erik.trim...@sun.com wrote: Mart van Santen wrote: Hi, We have a X4150 with a J4400 attached. Configured with 2x32GB SSD's, in mirror configuration (ZIL) and 12x 500GB SATA disks. We are running this setup for over a half year now in production for NFS and iSCSI for a bunch of virtual machines (currently about 100 VM's, Mostly Linux, some Windows) Since last week we have performance problems, cause IO Wait in the VM's. Of course we did a big search in networking issue's, hanging machines, filewall traffic tests, but were unable to find any problems. So we had a look into the zpool and dropped one of the mirrored SSD's from the pool (we had some indication the ZIL was not working ok). No success. After adding the disk, we discovered the IO wait during the resilvering process was OK, or at least much better, again. So last night we did the same handling, dropped added the same disk, and yes, again, the IO wait looked better. This morning the same story. Because this machine is a production machine, we cannot tolerate to much experiments. We now know this operation saves us for about 4 to 6 hours (time to resilvering), but we didn't had the courage to detach/attach the other SSD yet. We will try only a resilver, without detach/attach, this night, to see what happens. Can anybody explain how the detach/attach and resilver process works, and especially if there is something different during the resilvering and the handling of the SSD's/slog disks? Regards, Mart Do the I/O problems go away when only one of the SSDs is attached? Frankly, I'm betting that your SSDs are wearing out. 
Resilvering will essentially be one big streaming write, which is optimal for SSDs (even an SLC-based SSD, as you likely have, performs far better when writing large amounts of data at once). NFS (and to a lesser extent iSCSI) is generally a whole lot of random small writes, which are hard on an SSD (especially MLC-based ones, but even SLC ones). The resilvering process is likely turning many of the random writes coming in to the system into a large streaming write to the /resilvering/ drive. My guess is that the SSD you are having problems with has reached the end of it's useful lifespan, and the I/O problems you are seeing during normal operation are the result of that SSD's problems with committing data. There's no cure for this, other than replacing the SSD with a new one. SSDs are not hard drives. Even high-quality modern ones have /significantly/ lower USE lifespans than an HD - that is, a heavily-used SSD will die well before a HD, but a very-lightly used SSD will likely outlast a HD. And, in the case of SSDs, writes are far harder on the SSD than reads are. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How do I determine dedupe effectiveness?
On Sat, Dec 19, 2009 at 7:20 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Sat, 19 Dec 2009, Colin Raven wrote: There is no original, there is no copy. There is one block with reference counters. - Fred can rm his file (because clearly it isn't a file, it's a filename and that's all) - result: the reference count is decremented by one - the data remains on disk. While the similarity to hard links is a good analogy, there really is a unique file in this case. If Fred does a 'rm' on the file then the reference count on all the file blocks is reduced by one, and the block is freed if the reference count goes to zero. Behavior is similar to the case where a snapshot references the file block. If Janet updates a block in the file, then that updated block becomes unique to her copy of the file (and the reference count on the original is reduced by one) and it remains unique unless it happens to match a block in some other existing file (or snapshot of a file). When we are children, we are told that sharing is good. In the case or references, sharing is usually good, but if there is a huge amount of sharing, then it can take longer to delete a set of files since the mutual references create a hot spot which must be updated sequentially. Files are usually created slowly so we don't notice much impact from this sharing, but we expect (hope) that files will be deleted almost instantaneously. I believe this has been taken care of in space maps design (http://blogs.sun.com/bonwick/entry/space_maps provides a nice overview). Regards, Andrey Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
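The Fred-and-Janet bookkeeping above can be modeled in a few lines (a toy sketch of reference-counted block sharing, sha256-keyed like dedup itself; this is not ZFS code, and the space-map design is about how the real thing makes these updates scale):

```python
import hashlib

class DedupStore:
    """Toy reference-counted block store: duplicates share one block."""

    def __init__(self):
        self.blocks = {}  # sha256 hex digest -> [data, refcount]

    def write(self, data):
        h = hashlib.sha256(data).hexdigest()
        if h in self.blocks:
            self.blocks[h][1] += 1  # duplicate: bump refcount, store nothing new
        else:
            self.blocks[h] = [data, 1]  # first copy: store the block
        return h

    def free(self, h):
        self.blocks[h][1] -= 1
        if self.blocks[h][1] == 0:  # block freed only when the last reference goes
            del self.blocks[h]

store = DedupStore()
fred = store.write(b"quarterly report")
janet = store.write(b"quarterly report")  # same content: one block, refcount 2
store.free(fred)                          # Fred's rm: data remains for Janet
print(janet in store.blocks)              # -> True
```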
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
The downside you have described happens only when the same checksum is used for data protection and duplicate detection. This implies sha256, BTW, since fletcher-based dedupe has been dropped in recent builds. On 12/17/09, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Andrey Kuzmin andrey.v.kuz...@gmail.com writes: Darren J Moffat wrote: Andrey Kuzmin wrote: Resilvering has nothing to do with sha256: one could resilver long before dedupe was introduced in zfs. SHA256 isn't just used for dedup; it has been available as one of the checksum algorithms right back to pool version 1, which integrated in build 27. 'One of' is the key word. And thanks for the code pointers, I'll take a look. I didn't mention sha256 at all :-). the reasoning is the same no matter what hash algorithm you're using (fletcher2, fletcher4 or sha256). dedup doesn't require sha256 either, you can use fletcher4. the question was: why does data have to be compressed before it can be recognised as a duplicate? it does seem like a waste of CPU, no? I attempted to show the downsides to identifying blocks by their uncompressed hash. (BTW, it doesn't affect storage efficiency, the same duplicate blocks will be discovered either way.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game -- Regards, Andrey
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Thu, Dec 17, 2009 at 6:14 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Darren J Moffat darr...@opensolaris.org writes: Kjetil Torgrim Homme wrote: Andrey Kuzmin andrey.v.kuz...@gmail.com writes: The downside you have described happens only when the same checksum is used for data protection and duplicate detection. This implies sha256, BTW, since fletcher-based dedupe has been dropped in recent builds. if the hash used for dedup is completely separate from the hash used for data protection, I don't see any downsides to computing the dedup hash from uncompressed data. why isn't it? It isn't separate because that isn't how Jeff and Bill designed it. thanks for confirming that, Darren. I think the design they have is great. I don't disagree. Instead of trying to pick holes in the theory can you demonstrate a real performance problem with compression=on and dedup=on, and show that it is because of the compression step? compression requires CPU, actually quite a lot of it. even with the lean and mean lzjb, you will get not much more than 150 MB/s per core or something like that. so, if you're copying a 10 GB image file, it will take a minute or two just to compress the data so that the hash can be computed and the duplicate block identified. if the dedup hash was based on uncompressed data, the copy would be limited by hashing efficiency (and dedup tree lookup). This isn't exactly true. If, speculatively, one stores two hashes - one for uncompressed data in the DDT, and another one, for compressed data, kept with the data block for data healing - one wins compression for duplicates and pays with an extra hash computation for singletons. So a more correct question would be whether the set of cases where the duplicate/singleton and compression/hashing bandwidth ratios are such that one wins is non-empty (or, rather, of practical importance). Regards, Andrey
I don't know how tightly interwoven the dedup hash tree and the block pointer hash tree are, or if it is all possible to disentangle them. conceptually it doesn't seem impossible, but that's easy for me to say, with no knowledge of the zio pipeline... oh, how does encryption play into this? just don't? knowing that someone else has the same block as you is leaking information, but that may be acceptable -- just make different pools for people you don't trust. Otherwise if you want it changed code it up and show how what you have done is better in all cases. I wish I could :-) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
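The 10 GB example above is worth putting into numbers. A back-of-the-envelope helper (the 150 MB/s lzjb figure is from Kjetil's post; the 500 MB/s per-core SHA-256 rate is my assumption, so only the ratio matters):

```python
def dedup_identify_seconds(size_gib, compress_mbps=None, hash_mbps=500):
    """CPU seconds needed to identify duplicates in a copy: an optional
    compression pass followed by hashing of the (re)written data."""
    mib = size_gib * 1024
    seconds = mib / hash_mbps
    if compress_mbps:
        seconds += mib / compress_mbps
    return seconds

print(round(dedup_identify_seconds(10, compress_mbps=150)))  # compress-then-hash: ~89 s
print(round(dedup_identify_seconds(10)))                     # hash raw data only: ~20 s
```

Which also illustrates Andrey's trade-off: hashing uncompressed data makes the duplicate path cheap, while a second (compressed-data) hash for healing shifts cost onto singletons.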
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Yet again, I don't see how RAID-Z reconstruction is related to the subject discussed (what data should be sha256'ed when both dedupe and compression are enabled, raw or compressed ). sha256 has nothing to do with bad block detection (may be it will when encryption is implemented, but for now sha256 is used for duplicate candidates look-up only). Regards, Andrey On Wed, Dec 16, 2009 at 5:18 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Andrey Kuzmin andrey.v.kuz...@gmail.com writes: Kjetil Torgrim Homme wrote: for some reason I, like Steve, thought the checksum was calculated on the uncompressed data, but a look in the source confirms you're right, of course. thinking about the consequences of changing it, RAID-Z recovery would be much more CPU intensive if hashing was done on uncompressed data -- I don't quite see how dedupe (based on sha256) and parity (based on crc32) are related. I tried to hint at an explanation: every possible combination of the N-1 disks would have to be decompressed (and most combinations would fail), and *then* the remaining candidates would be hashed to see if the data is correct. the key is that you don't know which block is corrupt. if everything is hunky-dory, the parity will match the data. parity in RAID-Z1 is not a checksum like CRC32, it is simply XOR (like in RAID 5). here's an example with four data disks and one paritydisk: D1 D2 D3 D4 PP 00 01 10 10 01 this is a single stripe with 2-bit disk blocks for simplicity. if you XOR together all the blocks, you get 00. that's the simple premise for reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and so on. so what happens if a bit flips in D4 and it becomes 00? the total XOR isn't 00 anymore, it is 10 -- something is wrong. but unless you get a hardware signal from D4, you don't know which block is corrupt. this is a major problem with RAID 5, the data is irrevocably corrupt. the parity discovers the error, and can alert the user, but that's the best it can do. 
in RAID-Z the hash saves the day: first *assume* D1 is bad and reconstruct it from parity. if the hash for the block is OK, D1 *was* bad. otherwise, assume D2 is bad. and so on. so, the parity calculation will indicate which stripes contain bad blocks. but the hashing, the sanity check for which disk blocks are actually bad must be calculated over all the stripes a ZFS block (record) consists of. this would be done on a per recordsize basis, not per stripe, which means reconstruction would fail if two disk blocks (512 octets) on different disks and in different stripes go bad. (doing an exhaustive search for all possible permutations to handle that case doesn't seem realistic.) actually this is the same for compression before/after hashing. it's just that each permutation is more expensive to check. in addition, hashing becomes slightly more expensive since more data needs to be hashed. overall, my guess is that this choice (made before dedup!) will give worse performance in normal situations in the future, when dedup+lzjb will be very common, at a cost of faster and more reliable resilver. in any case, there is not much to be done about it now. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
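The assume-each-disk-bad-in-turn procedure is compact enough to sketch (a toy RAID-Z1 model: one-byte blocks, XOR parity as in the stripe example above, sha256 standing in for the record checksum; real reconstruction of course works on full records and stripes):

```python
import hashlib

def xor_blocks(blocks):
    """XOR equal-sized blocks together (RAID-5/RAID-Z1 parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def reconstruct(disks, parity, good_hash):
    """Assume each data disk bad in turn, rebuild it from parity, and
    accept the combination whose record checksum verifies."""
    for bad in range(len(disks)):
        survivors = [d for i, d in enumerate(disks) if i != bad]
        candidate = xor_blocks(survivors + [parity])
        trial = disks[:bad] + [candidate] + disks[bad + 1:]
        if hashlib.sha256(b"".join(trial)).digest() == good_hash:
            return trial
    return None  # more than one bad block in the record: unrecoverable

record = [b"\x00", b"\x01", b"\x02", b"\x02"]       # D1..D4
parity = xor_blocks(record)                          # PP
good = hashlib.sha256(b"".join(record)).digest()     # record checksum
damaged = record[:3] + [b"\x00"]                     # silent bit flip on D4
print(reconstruct(damaged, parity, good) == record)  # -> True
```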
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Wed, Dec 16, 2009 at 7:25 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Andrey Kuzmin andrey.v.kuz...@gmail.com writes: Yet again, I don't see how RAID-Z reconstruction is related to the subject discussed (what data should be sha256'ed when both dedupe and compression are enabled, raw or compressed). sha256 has nothing to do with bad block detection (maybe it will when encryption is implemented, but for now sha256 is used for duplicate candidate look-up only). how do you think RAID-Z resilvering works? please correct me where I'm wrong. Resilvering has nothing to do with sha256: one could resilver long before dedupe was introduced in zfs. Regards, Andrey -- Kjetil T. Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Troubleshooting dedup performance
On Wed, Dec 16, 2009 at 6:41 PM, Chris Murray chrismurra...@gmail.com wrote: Hi, I run a number of virtual machines on ESXi 4, which reside in ZFS file systems and are accessed over NFS. I've found that if I enable dedup, the virtual machines immediately become unusable, hang, and whole datastores disappear from ESXi's view. (See the attached screenshot from vSphere client at around the 21:54 mark for the drop in connectivity). I'm on OpenSolaris Preview, build 128a. I've set dedup to what I believe are the least resource-intensive settings - checksum=fletcher4 on the pool, dedup=on rather than I believe checksum=fletcher4 is acceptable in dedup=verify mode only. What you're doing is seemingly deduplication with weak checksum w/o verification. Regards, Andrey verify, but it is still the same. Where can I start troubleshooting? I get the feeling that my hardware isn't up to the job, but some numbers to verify that would be nice before I start investigating an upgrade. vmstat showed plenty of idle CPU cycles, and zpool iostat just showed slow throughput, as the ESXi graph does. As soon as I set dedup=off, the virtual machines leapt into action again (22:15 on the screenshot). Many thanks, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
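To make the verify point concrete, here is a toy of what dedup=verify buys with a weak checksum (crc32 stands in for fletcher4; nothing here is actual ZFS code, and a real DDT would keep colliding blocks rather than replace them):

```python
import zlib

def dedup_write(store, data):
    """Write with dedup on a weak checksum plus verification: a checksum
    hit is trusted only after a full byte compare of the stored block."""
    key = zlib.crc32(data)
    existing = store.get(key)
    if existing is not None and existing == data:
        return key, True   # verified duplicate: share the existing block
    store[key] = data      # new block (toy replaces on collision; see lead-in)
    return key, False

store = {}
print(dedup_write(store, b"vm image block")[1])  # -> False (first write, stored)
print(dedup_write(store, b"vm image block")[1])  # -> True  (verified duplicate)
```

Without the byte compare, a crc32/fletcher4 collision would silently alias two different blocks, which is why unverified dedup demands a collision-resistant hash like sha256.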
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Wed, Dec 16, 2009 at 7:46 PM, Darren J Moffat darr...@opensolaris.org wrote: Andrey Kuzmin wrote: On Wed, Dec 16, 2009 at 7:25 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Andrey Kuzmin andrey.v.kuz...@gmail.com writes: Yet again, I don't see how RAID-Z reconstruction is related to the subject discussed (what data should be sha256'ed when both dedupe and compression are enabled, raw or compressed ). sha256 has nothing to do with bad block detection (may be it will when encryption is implemented, but for now sha256 is used for duplicate candidates look-up only). how do you think RAID-Z resilvering works? please correct me where I'm wrong. Resilvering has noting to do with sha256: one could resilver long before dedupe was introduced in zfs. SHA256 isn't just used for dedup it is available as one of the checksum algorithms right back to pool version 1 that integrated in build 27. 'One of' is the key word. And thanks for code pointers, I'll take a look. Regards, Andrey SHA256 is also used to checksum the pool uberblock. This means that SHA256 is used during resilvering and especially so if you have checksum=sha256 for your datasets. If you still don't believe me check the source code history: http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sha256.c Look at the date when that integrated 31st October 2005. In case you still doubt me look at the fix I just integrated today: http://mail.opensolaris.org/pipermail/onnv-notify/2009-December/011090.html -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Troubleshooting dedup performance
On Wed, Dec 16, 2009 at 8:09 PM, Cyril Plisko cyril.pli...@mountall.com wrote: I've set dedup to what I believe are the least resource-intensive settings - checksum=fletcher4 on the pool, dedup=on rather than I believe checksum=fletcher4 is acceptable in dedup=verify mode only. What you're doing is seemingly deduplication with a weak checksum w/o verification. I think fletcher4 use for deduplication purposes was disabled entirely [1], right before the build 129 cut. [1] http://hg.genunix.org/onnv-gate.hg/diff/93c7076216f6/usr/src/common/zfs/zfs_prop.c Peculiar fix: it quotes the reason as checksum errors because we are not computing the byteswapped checksum, but solves it by dropping fletcher4 dedup support instead of adding the byte-swapped checksum computation. Am I missing something? Regards, Andrey -- Regards, Cyril
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Tue, Dec 15, 2009 at 3:06 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Robert Milkowski mi...@task.gda.pl writes: On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote: Because if you can de-dup anyway why bother to compress THEN check? This SEEMS to be the behaviour - i.e. I would suspect many of the files I'm writing are dups - however I see high cpu use even though on some of the copies I see almost no disk writes. First, the checksum is calculated after compression happens. for some reason I, like Steve, thought the checksum was calculated on the uncompressed data, but a look in the source confirms you're right, of course. thinking about the consequences of changing it, RAID-Z recovery would be much more CPU intensive if hashing was done on uncompressed data -- I don't quite see how dedupe (based on sha256) and parity (based on crc32) are related. Regards, Andrey every possible combination of the N-1 disks would have to be decompressed (and most combinations would fail), and *then* the remaining candidates would be hashed to see if the data is correct. this would be done on a per recordsize basis, not per stripe, which means reconstruction would fail if two disk blocks (512 octets) on different disks and in different stripes go bad. (doing an exhaustive search for all possible permutations to handle that case doesn't seem realistic.) in addition, hashing becomes slightly more expensive since more data needs to be hashed. overall, my guess is that this choice (made before dedup!) will give worse performance in normal situations in the future, when dedup+lzjb will be very common, at a cost of faster and more reliable resilver. in any case, there is not much to be done about it now. -- Kjetil T. 
Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 + SFA F20 PCIe?
On Mon, Dec 14, 2009 at 4:04 AM, Jens Elkner jel+...@cs.uni-magdeburg.de wrote: On Sat, Dec 12, 2009 at 04:23:21PM +, Andrey Kuzmin wrote: As to whether it makes sense (as opposed to two distinct physical devices), you would have read cache hits competing with log writes for bandwidth. I doubt both will be pleased :-) Hmm - good point. What I'm trying to accomplish: Actually our current prototype thumper setup is: root pool (1x 2-way mirror SATA) hotspare (2x SATA shared) pool1 (12x 2-way mirror SATA) ~25% used user homes pool2 (10x 2-way mirror SATA) ~25% used mm files, archives, ISOs So pool2 is not really a problem - delivers about 600MB/s uncached, about 1.8 GB/s cached (i.e. read a 2nd time, tested with a 3.8GB iso) and is not continuously stressed. However, sync write is ~200 MB/s, or 20 MB/s per mirror, only. Problem is pool1 - user homes! So GNOME/firefox/eclipse/subversion/soffice usually via NFS and a little bit via samba - a lot of more or less small files, probably widely spread over the platters. E.g. checking out a project from a svn|* repository into a home takes hours. Also having its workspace on NFS isn't fun (compared to a linux xfs driven local soft 2-way mirror). Flash-based read cache should help here by minimizing (metadata) read latency, and flash-based log would bring down write latency. The only drawback of using a single F20 is that you're trying to minimize both with the same device. So, seems to be a really interesting thing and I expect at least wrt. user homes a real improvement, no matter how the final configuration will look like. Maybe the experts at the source are able to do some 4x SSD vs. 1x F20 benchmarks? I guess at least if they turn out to be good enough, it wouldn't hurt ;-) Would be interesting indeed. Regards, Andrey Jens Elkner wrote: ... whether it is possible/supported/would make sense to use a Sun Flash Accelerator F20 PCIe Card in a X4540 instead of 2.5 SSDs? Regards, jel. 
-- Otto-von-Guericke University http://www.cs.uni-magdeburg.de/ Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2 39106 Magdeburg, Germany Tel: +49 391 67 12768 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Sun, Dec 13, 2009 at 11:51 PM, Steve Radich, BitShop, Inc. ste...@bitshop.com wrote: I enabled compression on a zfs filesystem with compression=gzip9 - i.e. fairly slow compression - this stores backups of databases (which compress fairly well). The next question is: Is the CRC on the disk based on the uncompressed data (which seems more likely to be able to be recovered) or based on the zipped data (which seems slightly less likely to be able to be recovered). Why? Because if you can de-dup anyway why bother to compress THEN check? This SEEMS to be the behaviour - i.e. I ZFS deduplication is block-level, so to deduplicate one needs the data broken into blocks to be written. With compression enabled, you don't have these until the data is compressed. Looks like a waste of cycles indeed, but ... Regards, Andrey would suspect many of the files I'm writing are dups - however I see high cpu use even though on some of the copies I see almost no disk writes. If the dup check logic happens first AND it's a duplicate I should see hardly any CPU use (because it won't need to compress the data). Steve Radich BitShop.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
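The order discussed here - compress first, then checksum (and dedup) on the compressed block - can be sketched roughly as follows (illustrative Python, not ZFS source; names and the in-memory "disk" are made up):

```python
import hashlib
import zlib

dedup_table = {}   # sha256(compressed block) -> compressed block "on disk"

def write_block(data: bytes) -> str:
    # 1. compress the logical block first (stand-in for gzip-9)
    compressed = zlib.compress(data, 9)
    # 2. checksum the *compressed* bytes
    key = hashlib.sha256(compressed).hexdigest()
    # 3. dedup on that checksum: only new blocks cost storage
    if key not in dedup_table:
        dedup_table[key] = compressed
    return key     # the block pointer carries the checksum

k1 = write_block(b"database page" * 512)
k2 = write_block(b"database page" * 512)   # duplicate: compressed anyway,
assert k1 == k2                            # but stored only once
assert len(dedup_table) == 1
```

Note that in this order the duplicate block still pays the full compression CPU cost before the dedup hit is discovered, which matches the high CPU usage with almost no disk writes that Steve observed.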
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Mon, Dec 14, 2009 at 9:53 PM, casper@sun.com wrote: On Mon, Dec 14, 2009 at 09:30:29PM +0300, Andrey Kuzmin wrote: ZFS deduplication is block-level, so to deduplicate one needs the data broken into blocks to be written. With compression enabled, you don't have these until the data is compressed. Looks like a waste of cycles indeed, but ... ZFS compression is also block-level. Both are done on ZFS blocks. ZFS compression is not stream-wise. And if you enable verify and you checksum the uncompressed data, you will need to uncompress before you can verify. Right, but 'verify' seems to be 'extreme safety' and thus a rather rare use case. Saving the cycles spent compressing duplicates looks to outweigh the 'uncompress before verify' overhead, imo. Regards, Andrey Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On 12/14/09, Cyril Plisko cyril.pli...@mountall.com wrote: On Mon, Dec 14, 2009 at 9:32 PM, Andrey Kuzmin andrey.v.kuz...@gmail.com wrote: Right, but 'verify' seems to be 'extreme safety' and thus a rather rare use case. Hmm, dunno. I wouldn't set anything but a scratch file system to dedup=on. Anything of even slight significance is set to dedup=verify. Saving the cycles spent compressing duplicates looks to outweigh the 'uncompress before verify' overhead, imo. Dedup doesn't come for free - it imposes additional load on the CPU, just like checksumming and compression. The more fancy things we want our file system to do for us, the stronger CPU it'll take. -- Regards, Cyril Verify mode actually looks compress/dedupe order-neutral. To do the byte-comparison, one can either compress the new block or decompress the old one, and the latter is usually a bit easier. Pipeline design may dictate a choice, for instance one could compress the new block while the old one is being fetched from disk for comparison, but overall it looks pretty close. And with dedup=on, reversing the order, if feasible, saves quite a few cycles. Regards, Andrey ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
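The order-neutrality argument can be sketched in a few lines (a toy using Python's zlib; comparing compressed images, as the second variant does, only works under the assumption that compression is deterministic with identical settings - true for this zlib call, but not guaranteed for every compressor):

```python
import zlib

# A previously written block, stored compressed "on disk".
stored_compressed = zlib.compress(b"old block", 6)

def verify_by_decompress(new_data: bytes) -> bool:
    # Variant 1: decompress the stored block, compare plaintext.
    return zlib.decompress(stored_compressed) == new_data

def verify_by_compress(new_data: bytes) -> bool:
    # Variant 2: compress the new block, compare compressed images.
    # Requires deterministic compression with identical settings.
    return zlib.compress(new_data, 6) == stored_compressed

# Both variants agree on a match and on a mismatch.
for candidate in (b"old block", b"new block"):
    assert verify_by_decompress(candidate) == verify_by_compress(candidate)
```

Which variant wins in practice depends on relative compress/decompress cost and on whether the compression can overlap the disk fetch, as noted above.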
Re: [zfs-discuss] X4540 + SFA F20 PCIe?
As to whether it makes sense (as opposed to two distinct physical devices), you would have read cache hits competing with log writes for bandwidth. I doubt both will be pleased :-) On 12/12/09, Robert Milkowski mi...@task.gda.pl wrote: Jens Elkner wrote: Hi, just got a quote from our campus reseller, that readzilla and logzilla are not available for the X4540 - hmm strange Anyway, wondering whether it is possible/supported/would make sense to use a Sun Flash Accelerator F20 PCIe Card in a X4540 instead of 2.5 SSDs? If so, is it possible to partition the F20, e.g. into a 36 GB logzilla and a 60 GB readzilla (also interesting for other X servers)? IIRC the card presents 4x LUNs, so you could use each of them for a different purpose. You could also use different slices. me or not. Is this correct? It still does. The capacitor is not for flushing data to disk drives! The card has a small amount of DRAM on it which is being flushed to FLASH. The capacitor is to make sure this actually happens if the power is lost. -- Regards, Andrey ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SMC for ZFS administration in OpenSolaris 2009.06?
On Fri, Dec 11, 2009 at 11:43 PM, Nick nick.couch...@seakr.com wrote: No, it is not, for a couple of reasons. First of all, rumor is that SMC is being discontinued in favor of a WBEM/CIM-based management system. Any specific implementation meant? Are there any plans wrt OpenPegasus? Regards, Andrey Second, the SMC code is not open-source, which means it cannot be included in OpenSolaris. It is included in Solaris Express Community Edition (SXCE), and there are several posts and instructions available for installing the packages from SXCE on OpenSolaris. Even so, some issues do tend to pop up getting it working - for example, logging in still has me stumped, because I can't log in as root due to OpenSolaris' RBAC configuration, but I also can't log in as the unprivileged user I've created. You can also check out EON - go to http://eonstorage.blogspot.com/. Unfortunately, because of a bug in the 128 version of the code, the latest build you can get for EON is 125, which doesn't include deduplication (if that's important to you). I also don't believe that EON currently has a web-based management interface - it's in the works - so that doesn't really help you there. -Nick -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup report tool
On Wed, Dec 9, 2009 at 2:26 PM, Bruno Sousa bso...@epinfante.com wrote: Hi all, Is there any way to generate some report related to the de-duplication feature of ZFS within a zpool/zfs pool? I mean, it's nice to have the dedup ratio, but I think it would also be good to have a report where we could see what directories/files have been found to be duplicates and were therefore deduplicated. Nice to have at first glance, but could you detail any specific use-case you see? Regards, Andrey Thanks for your time, Bruno ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup report tool
On Wed, Dec 9, 2009 at 2:47 PM, Bruno Sousa bso...@epinfante.com wrote: Hi Andrey, For instance, I talked about deduplication to my manager and he was happy because less data = less storage, and therefore lower costs. However, now the IT group of my company needs to provide the management board with a report of duplicated data found per share, and in our case one share means one specific company department/division. Bottom line, the mindset is something like: * one share equals a specific department within the company * the department demands X amount of data storage * that data storage costs Y * making a report of the amount of data consumed by a department, before and after deduplication, means that data storage costs can be seen per department Do you currently have tools that report storage usage per share? What you ask for looks like a request to make these deduplication-aware. * if there's a cost reduction due to the usage of deduplication, part of that money can be used for business, either IT-related subjects or general business * the management board wants to see numbers related to costs, and not things like the ratio of deduplication in SAN01 is 3x, because for management this is geek talk Just divide storage costs by the deduplication factor (>1), and here you are (provided you can do it by department). Regards, Andrey I hope I was somewhat clear, but I can try to explain better if needed. Thanks, Bruno Andrey Kuzmin wrote: On Wed, Dec 9, 2009 at 2:26 PM, Bruno Sousa bso...@epinfante.com wrote: Hi all, Is there any way to generate some report related to the de-duplication feature of ZFS within a zpool/zfs pool? I mean, its nice to have the dedup ratio, but it think it would be also good to have a report where we could see what directories/files have been found as repeated and therefore they suffered deduplication. Nice to have at first glance, but could you detail on any specific use-case you see? 
Regards, Andrey Thanks for your time, Bruno ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
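Andrey's divide-by-the-dedup-factor suggestion is simple chargeback arithmetic, sketched below on the assumption that per-share logical usage and dedup ratio are available (all names and figures are made up for illustration):

```python
# Hypothetical per-department storage chargeback.
PRICE_PER_GB = 0.50   # currency units per GB per month, illustrative

def monthly_cost(logical_gb: float, dedup_ratio: float) -> float:
    # Physical footprint = logical data divided by the dedup ratio.
    return logical_gb / dedup_ratio * PRICE_PER_GB

before = monthly_cost(1000, 1.0)   # department's share, no dedup
after = monthly_cost(1000, 2.5)    # same share, 2.5x dedup ratio
assert before == 500.0
assert after == 200.0
savings = before - after           # the number the management board wants
assert savings == 300.0
```

The catch, as the thread notes, is obtaining the per-share dedup ratio: ZFS reports dedup at pool level, so a per-department figure would need per-share accounting that is deduplication-aware.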
Re: [zfs-discuss] ZFS dedup report tool
On Wed, Dec 9, 2009 at 10:43 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Wed, 9 Dec 2009, Bruno Sousa wrote: Despite the fact that I agree in general with your comments, in reality it all comes down to money. So in this case, if I could prove that ZFS was able to find X amount of duplicated data, and since that X amount of data has a price of Y per GB, IT could be seen as a business enabler instead of a cost centre. Most of the cost of storing business data is related to the cost of backing it up and administering it rather than the cost of the system on which it is stored. In this case it is reasonable to know the total amount of user data (and charge for it), since it likely needs to be backed up and managed. Deduplication does not help much here. Um, I thought deduplication had been invented to reduce the backup window :). Regards, Andrey Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] freeNAS moves to Linux from FreeBSD
On Tue, Dec 8, 2009 at 7:02 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Mon, 7 Dec 2009, Michael DeMan (OA) wrote: Args for FreeBSD + ZFS: - Limited budget - We are familiar with managing FreeBSD. - We are familiar with tuning FreeBSD. - Licensing model Args against OpenSolaris + ZFS: - Hardware compatibility - Lack of knowledge for tuning and associated costs for training staff to learn 'yet one more operating system' they need to support. - Licensing model If you think about it a little bit, you will see that there is no significant difference in the licensing model between FreeBSD+ZFS and OpenSolaris+ZFS. It is not possible to be a little bit pregnant. Either one is pregnant, or one is not. Well, FreeBSD pretends it's possible, by shipping ZFS while bearing a BSD license at the same time. Regards, Andrey Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Seagate announces enterprise SSD
On Tue, Dec 8, 2009 at 9:32 PM, Richard Elling richard.ell...@gmail.com wrote: FYI, Seagate has announced a new enterprise SSD. The specs appear to be competitive: + 2.5 form factor + 5-year warranty + power loss protection + 0.44% annual failure rate (AFR) (2M hours MTBF, IMHO too low :-) + UER 1e-16 (new), 1e-15 (5 years) + 30,000/25,000 4 KB read IOPS (peak/aligned zero offset) + 30,000/10,500 4 KB write IOPS (peak/aligned zero offset) IIRC, the last figures are for the 200 GB model, with write performance degrading by a factor of two for the 100 GB model (and by another factor of two for the 50 GB one). Parallelization, or rather the lack of it. Regards, Andrey http://www.seagate.com/www/en-us/products/servers/pulsar/pulsar/ http://storageeffect.media.seagate.com/2009/12/storage-effect/seagate-pulsar-the-first-enterprise-ready-ssd/ http://www.seagate.com/docs/pdf/marketing/po_pulsar.pdf -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zfs-code] Transaction consistency of ZFS
On Sun, Dec 6, 2009 at 8:11 PM, Anurag Agarwal anu...@kqinfotech.com wrote: Hi, My reading of the write code of ZFS (zfs_write in zfs_vnops.c) is that all writes in ZFS are logged in the ZIL. And if that indeed is the case, then IIRC, there is some upper limit (1MB?) on writes that go to the ZIL, with larger ones executed directly. Yet again, this is an outsider's impression, not the architect's statement. Regards, Andrey yes, ZFS does guarantee sequential consistency, even when there is a power outage or server crash. You might lose some writes if the ZIL has not been committed to disk, but that would not change the sequential consistency guarantee. There is no need to do an fsync or open the file with O_SYNC. It should work as it is. I have not done any experiments to verify this, so please take my observation with a pinch of salt. Any ZFS developers to verify or refute this? Regards, Anurag. On Sun, Dec 6, 2009 at 8:12 AM, nxyyt schumi@gmail.com wrote: This question is forwarded from ZFS-discussion. Hope any developer can throw some light on it. I'm a newbie to ZFS. I have a specific question about the COW transactions of ZFS. Does ZFS keep sequential consistency for a single file across a power outage or server crash? Assume the following scenario: My application has only a single thread and it appends data to the file continuously. Suppose at time t1, it appends a buf named A to the file. At time t2, which is later than t1, it appends a buf named B to the file. If the server crashes after t2, is it possible that buf B is flushed back to the disk but buf A is not? My application only appends to the file, without truncation or overwrite. Does ZFS keep the consistency that data written to a file in sequential or causal order is flushed to disk in the same order? 
If uncommitted write operations to a single file are always bound to the same open transaction group, and all transaction groups are committed in sequential order, I think the answer should be YES. In other words, [b]is there only one open transaction group at any time, and are transaction groups committed in order for a single pool?[/b] Hope anybody can help me clarify it. Thank you very much! -- This message posted from opensolaris.org ___ zfs-code mailing list zfs-c...@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-code -- Anurag Agarwal CEO, Founder KQ Infotech, Pune www.kqinfotech.com 9881254401 Coordinator Akshar Bharati www.aksharbharati.org Spreading joy through reading ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
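The single-open-txg assumption, if it holds, makes the ordering guarantee easy to see: a crash can cut off a suffix of transaction groups, never punch a hole in the middle. A toy model (plain Python, not ZFS internals):

```python
# Toy model of in-order transaction-group commit. A single txg is open
# at a time; committing appends its writes to "disk" in order.

class Pool:
    def __init__(self):
        self.open_txg = []     # the single currently open txg
        self.disk = []         # committed writes, in txg order

    def append(self, buf):
        # Every write joins the currently open txg.
        self.open_txg.append(buf)

    def commit_txg(self):
        # Txgs commit strictly in sequence, atomically in this model.
        self.disk.extend(self.open_txg)
        self.open_txg = []

pool = Pool()
pool.append("A")
pool.commit_txg()
pool.append("B")
# Simulated crash here, before B's txg commits:
# B is lost, but A survives. B on disk without A is impossible,
# because B's txg cannot commit before A's.
assert pool.disk == ["A"]
```

This only models the txg pipeline, not the ZIL; sync semantics (fsync/O_SYNC) add the separate question of when a write is durable at all, as discussed earlier in the thread.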