Re: [zfs-discuss] [zfs] Petabyte pool?
On Sat, 16 Mar 2013, Kristoffer Sheather @ CloudCentral wrote:
> Well, off the top of my head:
> 2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs
> 8 x 60-Bay JBODs with 60 x 4TB SAS drives
> RAIDZ2 stripe over the 8 x JBODs
> That should fit within 1 rack comfortably and provide 1 PB of storage.

What does one do for power? What are the power requirements when the system is first powered on? Can drive spin-up be staggered between JBOD chassis? Does the server need to be powered up last so that it does not time out on the zfs import?

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun X4200 Question...
On Mon, 11 Mar 2013, Tiernan OToole wrote:
> I know this might be the wrong place to ask, but hopefully someone can point me in the right direction... I got my hands on a Sun X4200. It's the original one, not the M2, and has 2 single-core Opterons, 4GB RAM and 4 x 73GB SAS disks... But I don't know what to install on it... I was thinking of SmartOS, but the site mentions Intel support for VT, but nothing for AMD... The Opterons don't have VT, so I won't be using Xen, but the Zones may be useful...

OpenIndiana or OmniOS seem like the most likely candidates. You can run VirtualBox on OpenIndiana and it should be able to work without VT extensions.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Huge Numbers of Illegal Requests
On Tue, 5 Mar 2013, Ed Shipe wrote:
> On 2 different OpenIndiana 151a7 systems, I'm showing a huge number of Illegal Requests. There are no other apparent issues, performance is fine, etc. Everything works great, so what are these illegal requests? My Google-fu is failing me...

My system used to exhibit this problem, so I opened Illumos issue 2998 (https://www.illumos.org/issues/2998). The weird thing is that the problem went away and has not returned.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS Distro Advice
On Mon, 4 Mar 2013, Matthew Ahrens wrote:
>> Magic rsync options used: -a --inplace --no-whole-file --delete-excluded
>> This causes rsync to overwrite the file blocks in place rather than writing to a new temporary file first. As a result, zfs COW produces primitive deduplication of at least the unchanged blocks (by writing nothing) while writing new COW blocks for the changed blocks.
>
> If I understand your use case correctly (the application overwrites some blocks with the same exact contents), ZFS will ignore these no-op writes only on recent Open ZFS (illumos / FreeBSD / Linux) builds with checksum=sha256 and compression!=off. AFAIK, Solaris ZFS will COW the blocks even if their content is identical to what's already there, causing the snapshots to diverge.

With these rsync options, rsync will only overwrite a block if the contents of the block have changed. Rsync's notion of a block is different from zfs's, so there is not a perfect overlap. Rsync does need to read files on the destination filesystem to see if they have changed. If the system has sufficient RAM (and/or L2ARC) then files may still be cached from the previous day's run. In most cases only a small subset of the total files are updated (at least on my systems) so the caching requirements are small. Files updated on one day are more likely to be the ones updated on subsequent days.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
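For reference, a minimal sketch of how the dataset properties Matthew mentions might be set so that such no-op writes can be elided (the dataset name is hypothetical, and this behavior only applies on the recent Open ZFS builds he describes):

  # enable sha256 checksums and a non-off compression setting on the backup dataset
  zfs set checksum=sha256 tank/backups
  zfs set compression=lzjb tank/backups
  zfs get checksum,compression tank/backups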
Re: [zfs-discuss] ZFS Distro Advice
On Tue, 5 Mar 2013, David Magda wrote:
> It's also possible to reduce the amount that rsync has to walk the entire file tree. Most folks simply do a "rsync --options /my/source/ /the/dest/", but if you use zfs diff, and parse/feed the output of that to rsync, then the amount of thrashing can probably be minimized. Especially useful for file hierarchies that have very many individual files, so you don't have to stat() every single one.

Zfs diff only works for zfs filesystems. If one is using zfs filesystems then rsync may not be the best option. In the real world, data may be sourced from many types of systems and filesystems.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
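For illustration, a rough sketch of the zfs-diff-driven rsync David describes (dataset names and paths are hypothetical; renamed files show up as 'R' lines with two paths and would need extra handling):

  # list paths changed between two snapshots, strip the mountpoint prefix,
  # and feed them to rsync so it does not have to walk the whole tree
  zfs diff -H tank/src@yesterday tank/src@today | cut -f2 | \
      sed 's|^/tank/src/||' > /tmp/changed.list
  rsync -a --files-from=/tmp/changed.list /tank/src/ /backup/dest/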
Re: [zfs-discuss] ZFS Distro Advice
On Wed, 27 Feb 2013, Ian Collins wrote:
>> Magic rsync options used: -a --inplace --no-whole-file --delete-excluded
>> This causes rsync to overwrite the file blocks in place rather than writing to a new temporary file first. As a result, zfs COW produces primitive deduplication of at least the unchanged blocks (by writing nothing) while writing new COW blocks for the changed blocks.
>
> Do these options impact performance or reduce the incremental stream sizes?

I don't see any adverse impact on performance, and the incremental stream size is quite considerably reduced. The main risk is that if the disk fills up you may end up with a corrupted file rather than just an rsync error. However, the snapshots help because an earlier version of the file is likely available.

> I just use -a --delete and the snapshots don't take up much space (compared with the incremental stream sizes).

That is what I used to do before I learned better.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 26 Feb 2013, hagai wrote:
> For what it's worth, I had the same problem and found the answer here: http://forums.freebsd.org/showthread.php?t=27207

Given enough sequential I/O requests, zfs mirrors behave very much like RAID-0 for reads. Sequential prefetch is very important in order to avoid the latencies.

While this script may not work perfectly as-is for FreeBSD, it was very good at discovering a zfs performance bug (since corrected) and is still an interesting exercise to see how ZFS ARC caching helps for re-reads. See http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh. The script will exercise an initial uncached read from disks, and then a (hopefully) cached re-read from disks. I think that it serves as a useful benchmark.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS Distro Advice
On Tue, 26 Feb 2013, Gary Driggs wrote:
>> On Feb 26, 2013, at 12:44 AM, Sašo Kiselkov wrote:
>> I'd also recommend that you go and subscribe to z...@lists.illumos.org, since this list is going to get shut down by Oracle next month.
>
> Whose description still reads, "everything ZFS running on illumos-based distributions."

Even FreeBSD's zfs is now based on zfs from Illumos. FreeBSD and Linux zfs developers contribute fixes back to zfs in Illumos.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS Distro Advice
On Tue, 26 Feb 2013, Richard Elling wrote:
> Consider using different policies for different data. For traditional file systems, you had relatively few policy options: readonly, nosuid, quota, etc. With ZFS, dedup and compression are also policy options. In your case, dedup for your media is not likely to be a good policy, but dedup for your backups could be a win (unless you're using something that already doesn't back up duplicate data -- e.g. most backup utilities). A way to approach this is to think of your directory structure and create file systems to match the policies. For example:

I am finding that rsync with the right options (to directly block-overwrite) plus zfs snapshots is providing me with pretty amazing deduplication for backups without even enabling deduplication in zfs. Now backup storage goes a very long way.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS Distro Advice
On Wed, 27 Feb 2013, Ian Collins wrote:
>> I am finding that rsync with the right options (to directly block-overwrite) plus zfs snapshots is providing me with pretty amazing deduplication for backups without even enabling deduplication in zfs. Now backup storage goes a very long way.
>
> We do the same for all of our legacy operating system backups. Take a snapshot then do an rsync: an excellent way of maintaining incremental backups for those.

Magic rsync options used: -a --inplace --no-whole-file --delete-excluded

This causes rsync to overwrite the file blocks in place rather than writing to a new temporary file first. As a result, zfs COW produces primitive deduplication of at least the unchanged blocks (by writing nothing) while writing new COW blocks for the changed blocks.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
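For illustration, a minimal sketch of one such backup cycle (the host name, dataset names and paths are hypothetical):

  # preserve yesterday's state, then overwrite changed blocks in place
  zfs snapshot backup/hosts/legacy1@`date +%Y%m%d`
  rsync -a --inplace --no-whole-file --delete-excluded \
      root@legacy1:/export/home/ /backup/hosts/legacy1/home/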
Re: [zfs-discuss] Is there performance penalty when adding vdev to existing pool
On Thu, 21 Feb 2013, Sašo Kiselkov wrote:
>> On 02/21/2013 12:27 AM, Peter Wood wrote:
>> Will adding another vdev hurt the performance?
>
> In general, the answer is: no. ZFS will try to balance writes to top-level vdevs in a fashion that assures even data distribution. If your data is equally likely to be hit in all places, then you will not incur any performance penalties. If, OTOH, newer data is more likely to be hit than old data, then yes, newer data will be served from fewer spindles. In that case it is possible to do a send/receive of the affected datasets into new locations and then renaming them.

You have this reversed. The older data is served from fewer spindles than data written after the new vdev is added. Performance with the newer data should be improved.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL
On Fri, 15 Feb 2013, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> So, I hear, in a couple weeks' time, opensolaris.org is shutting down. What does that mean for this mailing list? Should we all be moving over to something at illumos or something?

There is an 'illumos-zfs' list for illumos. Please see http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists for the available lists. Most open discussion of zfs occurs on the illumos list, although there is also useful discussion on the freebsd-fs list at freebsd.org.

> I'm going to encourage somebody in an official capacity at opensolaris to respond... I'm going to discourage unofficial responses, like, illumos enthusiasts etc simply trying to get people to jump this list.

Good for you. I am sure that Larry will be contacting you soon.

Previously Oracle announced and invited people to join their discussion forums, which are web-based and virtually dead.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL
On Sat, 16 Feb 2013, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Tim Cook [mailto:t...@cook.ms]
>> That would be the logical decision, yes. Not to poke fun, but did you really expect an official response after YEARS of nothing from Oracle? This is the same company that refused to release any Java patches until the DHS issued a national warning suggesting that everyone uninstall Java.
>
> Well, yes. We do have Oracle employees who contribute to this mailing list. It is not accurate or fair to stereotype the whole company. Oracle by itself is as large as some cities or countries.

Yes, these remaining employees do so because they still can. Except for those employees brave enough to post to Illumos/OpenIndiana lists (there are some), there will be no more avenues remaining for unmoderated two-way communication with the outside world. There have been some cases where people said unfavorable things about Oracle on this list. Oracle needs to control its message, and the principal form of communication will be via private support calls authorized by service contracts and authorized corporate publications.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Mon, 21 Jan 2013, Jim Klimov wrote:
> Yes, maybe there were more cool new things per year popping up with Sun's concentrated engineering talent and financing, but now it seems that most players - wherever they work now - took a pause from the marathon, to refine what was done in the decade before. And this is just as important as churning out innovations faster than people can comprehend or audit or use them.

I am on most of the mailing lists where zfs is discussed and it is clear that significant issues/bugs are continually being discovered and fixed. Fixes come from both the Illumos community and from outside it (e.g. from FreeBSD).

Zfs is already quite feature rich. Many of us would lobby for bug fixes and performance improvements over features. Sašo Kiselkov's LZ4 compression additions may qualify as features, yet they also offer rather profound performance improvements.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On Sat, 19 Jan 2013, Stephan Budach wrote:
> Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs are resilvering (which they had gone through yesterday as well) and I never got any error while resilvering. I have been all over the setup to find any glitch or bad part, but I couldn't come up with anything significant. Doesn't this sound improbable; wouldn't one expect to encounter other chksum errors while resilvering is running?

I can't attest to chksum errors since I have yet to see one on my machines (I have seen several complete disk failures, or disks faulted by the system, though). Checksum errors are bad and not seeing them should be the normal case.

Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping.

Regarding the dire fiber channel issue: are you using fiber channel switches or direct connections to the storage array(s)? If you are using switches, are they stable or are they doing something terrible like resetting? Do you have duplex connectivity? Have you verified that your FC HBA's firmware is correct? Did you check for messages in /var/adm/messages which might indicate when and how FC connectivity has been lost?

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On Sat, 19 Jan 2013, Jim Klimov wrote:
>> On 2013-01-19 18:17, Bob Friesenhahn wrote:
>> Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping.
>
> Correction: that (verification) would be scrubbing ;)

I don't think that zfs would call it scrubbing unless the user requested scrubbing. Unplugging a USB drive which is part of a mirror for a short while results in considerable activity when it is plugged back in. It is as if zfs does not trust the device which was temporarily unplugged and does a full validation of it.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On Sat, 19 Jan 2013, Stephan Budach wrote:
> Just ignore the timestamp, as it seems that the time is not set correctly, but the dates match my two issues from today and Thursday, which accounts for three days. I didn't catch that before, but it seems to clearly indicate a problem with the FC connection… But what do I make of this information?

I don't know, but the issue/problem seems to be below the zfs level, so you need to fix that lower level before worrying about zfs.

>> Did you check for messages in /var/adm/messages which might indicate when and how FC connectivity has been lost?
>
> Well, this is the most scary part to me. Neither fmdump nor dmesg showed anything that would indicate a connectivity issue - at least not the last time.

Weird. I wonder if multipathing is working for you at all. With my direct-connect setup, if a path is lost, then there is quite a lot of messaging to /var/adm/messages. I also see a lot of messaging related to multipathing when the system boots and first starts using the array.

However, with the direct-connect setup, the HBA can report problems immediately if it sees a loss of signal. Your issues might be on the other side of the switch (on the storage array side) so the local HBA does not see the problem and timeouts are used. Make sure to check the logs in your storage array to see if it is encountering resets or flapping connectivity.

Do you have duplex switches so that there are fully-redundant paths, or is only one switch used?

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Wed, 16 Jan 2013, Thomas Nau wrote:
> Dear all. I've a question concerning possible performance tuning for both iSCSI access and replicating a ZVOL through zfs send/receive. We export ZVOLs with the default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI. The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL SSDs and 128G of main memory. The iSCSI access pattern (1 hour daytime average) looks like the following (thanks to Richard Elling for the dtrace script):

If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement.

[ stuff removed ]

> For disaster recovery we plan to sync the pool as often as possible to a remote location. Running send/receive after a day or so seems to take a significant amount of time wading through all the blocks and we hardly see network average traffic going over 45MB/s (almost idle 1G link). So here's the question: would increasing/decreasing the volblocksize improve the send/receive operation and what influence might show for the iSCSI side?

Matching the volume block size to what the clients are actually using (due to their filesystem configuration) should improve performance during normal operations and should reduce the number of blocks which need to be sent in the backup by reducing write amplification due to overlap blocks.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
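As an illustration only (the zvol name and size are hypothetical): volblocksize is fixed at creation time, so an existing zvol cannot simply be switched to 4K; a new zvol has to be created and the data copied over, for example:

  # create a replacement zvol with a 4K volume block size
  zfs create -V 200G -o volblocksize=4k tank/iscsi/vm01-4k
  # copy the old contents, then repoint the iSCSI target at the new device
  dd if=/dev/zvol/rdsk/tank/iscsi/vm01 of=/dev/zvol/rdsk/tank/iscsi/vm01-4k bs=1M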
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, 17 Jan 2013, Peter Wood wrote:
> Unless there is some other way to test what/where these write operations are applied.

You can install Brendan Gregg's DTraceToolkit and use it to find out who and what is doing all the writing. 1.2GB in an hour is quite a lot of writing. If this is going on continuously, then it may be causing more fragmentation in conjunction with your snapshots.

See http://www.brendangregg.com/dtrace.html.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
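For example, a quick sketch (a plain dtrace one-liner rather than a specific DTraceToolkit script) that tallies bytes passed to write(2) per process; run it for a while and stop it with Ctrl-C:

  # sum write() sizes by program name
  dtrace -n 'syscall::write:entry { @bytes[execname] = sum(arg2); }'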
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, 17 Jan 2013, Peter Wood wrote:
> Great points, Jim. I have requested more information on how the gallery share is being used and any temporary data will be moved out of there. About atime: it is set to on right now and I've considered turning it off, but I wasn't sure if this will affect incremental zfs send/receive. 'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?

Zfs send does not care about atime. The access time is useless other than as a way to see how long it has been since a file was accessed.

For local access (not true for NFS), zfs is lazy about updating atime on disk, so it may not be updated on disk until the next transaction group is written (e.g. up to 5 seconds), and so it does not represent much actual load. Without this behavior, the system could become unusable.

For NFS you should disable atime on the NFS client mounts.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, 17 Jan 2013, Bob Friesenhahn wrote:
> For NFS you should disable atime on the NFS client mounts.

This advice was wrong. It needs to be done on the server side.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
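For illustration, a minimal sketch of the server-side change (the dataset name is hypothetical; atime is inheritable, so it can also be set on the parent filesystem):

  zfs set atime=off tank/gallery
  zfs get atime tank/gallery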
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Wed, 16 Jan 2013, Peter Wood wrote:
> Running 'zpool iostat -v' (attachment zpool-IOStat.png) shows 1.22K write operations on the drives and 661 on the ZIL. Compared to the other server (which is in way heavier use than this one) these numbers are extremely high. Any idea how to debug any further?

Do some filesystems contain many snapshots? Do some filesystems use small zfs block sizes? Have the servers been used the same?

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)
On Wed, 12 Dec 2012, Jamie Krier wrote:
> I am thinking about switching to an Illumos distro, but wondering if this problem may be present there as well.

I believe that Illumos was forked before this new virtual memory sub-system was added to Solaris. There have not been such reports on Illumos or OpenIndiana mailing lists, and I don't recall seeing this issue in the bug trackers.

Illumos is not so good at dealing with huge memory systems, but perhaps it is also more stable.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1
On Wed, 12 Dec 2012, sol wrote:
> Hello. I've got a ZFS box running perfectly with an 8-port SATA card using the marvell88sx driver in opensolaris-2009. However when I try to run Solaris-11 it won't boot. If I unplug some of the hard disks it might boot, but then none of them show up in 'format' and none of them have configured status in 'cfgadm' (and there's an error or hang if I try to configure them). Does anyone have any suggestions how to solve the problem?

Since you were previously using opensolaris-2009, have you considered trying OpenIndiana oi_151a7 instead? You could experiment by booting from the live CD and seeing if your disks show up.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1
On Wed, 12 Dec 2012, sol wrote:
> Thanks for the reply. I've just tried OpenIndiana and it behaves identically - disks attached to the mv88sx6081 don't show up as disks (and "APIC error interrupt (status0=0, status1=40)" is emitted at boot). I've tried some changes to /etc/system with no success (sata_func_enable=0x5, ahci_msi_enabled=0, sata_max_queue_depth=1). Is there anything else I can try?

If the SATA card you are using is a JBOD-style card (i.e. disks are portable to a different controller), are you able/willing to swap it for one that Solaris is known to support well?

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] 6Tb Database with ZFS
On Sat, 1 Dec 2012, Fung Zheng wrote:
> Hello. Thanks for your reply. I forgot to mention that the doc "Configuring ZFS for an Oracle Database" was followed; this includes the primarycache, logbias, and recordsize properties. All the best practices were followed and my only doubt is the arc_max parameter. I want to know if 24GB is good enough for a 6TB database. Has someone implemented something similar? Which value was used for arc_max?

As I recall, you can tune zfs_arc_max while the system is running, so you can easily adjust this while your database is running and observe behavior, without rebooting. It is possible that my recollection is wrong though.

If my recollection is correct, then it is not so important to know what is good enough before starting to put your database in service.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
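For what it's worth, a hedged sketch of the kind of live adjustment being described, as it is usually done on illumos-derived kernels (the value is an example only, and whether the running ARC honors the change immediately varies by release; test before relying on it):

  # show current ARC sizes and targets
  echo "::arc" | mdb -k
  # set zfs_arc_max to 24 GB (0x600000000 bytes) in the live kernel
  echo "zfs_arc_max/Z 0x600000000" | mdb -kw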
Re: [zfs-discuss] Remove disk
On Sat, 1 Dec 2012, Jan Owoc wrote:
>> When I would like to change the disks, I also would like to change the disk enclosure; I don't want to use the old one.
>
> You didn't give much detail about the enclosure (how it's connected, how many disk bays it has, how it's used etc.), but are you able to power off the system and transfer all the disks at once?
>
>> And what happens if I have 24 or 36 disks to change? It would take months to do that.
>
> Those are the current limitations of zfs. Yes, with 12x2TB of data to copy it could take about a month.

You can create a brand new pool with the new chassis and use 'zfs send' to send a full snapshot of each filesystem to the new pool. After the bulk of the data has been transferred, take new snapshots and send the remainder. This expects that both pools can be available at once.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
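For illustration, a minimal sketch of that migration (pool and snapshot names are hypothetical; -R sends all descendant filesystems along with their snapshots):

  # bulk copy while the old pool stays in service
  zfs snapshot -r oldpool@migrate1
  zfs send -R oldpool@migrate1 | zfs receive -Fdu newpool
  # later, after quiescing writers, send only what changed since migrate1
  zfs snapshot -r oldpool@migrate2
  zfs send -R -i @migrate1 oldpool@migrate2 | zfs receive -Fdu newpool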
Re: [zfs-discuss] ZFS QoS and priorities
On Thu, 29 Nov 2012, Jim Klimov wrote:
> I've heard a claim that ZFS relies too much on RAM caching, but implements no sort of priorities (indeed, I've seen no knobs to tune those) - so that if the storage box receives many different types of IO requests with different administrative weights in the view of admins, it can not really throttle some IOs to boost others, when such IOs have to hit the pool's spindles. For example, I might want to have corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower priority zones and VMs, and at the bottom of the list - backups. AFAIK, now such requests would hit the ARC, then the disks if needed - in no particular order. Well, can the order be made particular with current ZFS architecture, i.e. by setting some datasets to have a certain NICEness or another priority mechanism?

QoS poses a problem. Zfs needs to write a transaction group at a time. During part of the TXG write cycle, zfs does not return any data. Zfs writes TXGs quite hard so they fill the I/O channel. Even if one orders the reads during the TXG write cycle, zfs will not return any data for part of the time.

There are really only a few solutions when resources might be limited:

1. Use fewer resources
2. Use resources more wisely
3. Add more resources until the problem goes away

I think that current zfs strives for #1 and QoS is option #2. Quite often, option #3 is effective because problems just go away once enough resources are available.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64
On Fri, 26 Oct 2012, Jerry Kemp wrote:
> Thanks for the SIIG pointer; most of the stuff I had archived from this list pointed to LSI products. I poked around on the site and reviewed SIIG's SATA and SAS HBAs. I also hit up their search engine. I'm not implying I did an all-inclusive search, but nothing I came across on their site indicated any type of Solaris or *Solaris distro support.

What is important is if Solaris supports the card. I have no idea if Solaris supports any of their cards.

> Did I miss something on the site? Or maybe one of their sales people let you know this stuff worked with Solaris? Or should it just work as long as it meets SAS or SATA standards?

They might not even know what Solaris is. Actually, they might, since this outfit previously made the USB/FireWire combo card used in SPARC and Intel Sun workstations.

It seems likely that SATA boards would work if they support the standard AHCI interface. I would not take any chance with unknown SAS.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64
On Thu, 25 Oct 2012, Sašo Kiselkov wrote:
> Look for Dell's 6Gbps SAS HBA cards. They can be had new for $100 and are essentially rebranded LSI 9200-8e cards. Always try to look for OEM cards with LSI, because buying directly from them is incredibly expensive.

Do these support eSATA? It seems unlikely.

I purchased an eSATA card (from SIIG, http://www.siig.com/) with the intention to try it with Solaris 10 to see if it would work, but have not tried plugging it in yet. It seems likely that a number of cheap eSATA cards may work.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64
On Thu, 25 Oct 2012, Sašo Kiselkov wrote:
>> On 10/25/2012 04:09 PM, Bob Friesenhahn wrote:
>>> Look for Dell's 6Gbps SAS HBA cards. They can be had new for $100 and are essentially rebranded LSI 9200-8e cards. Always try to look for OEM cards with LSI, because buying directly from them is incredibly expensive.
>>
>> Do these support eSATA? It seems unlikely.
>
> eSATA is just SATA with a different connector - all you need is a cheap conversion cable or appropriate eSATA-SATA bracket, e.g. http://www.satacables.com/html/sata-pci-brackets.html

While this can certainly work, according to Wikipedia (http://en.wikipedia.org/wiki/Esata#eSATA), eSATA is more than just SATA with a different connector. eSATA specifies a higher voltage range (minimum voltage) than SATA. It may be that a HBA already uses this range, or maybe not. Text I read says that maximum cable length is significantly reduced if an adaptor is used.

Also, I am curious to know how well hot-swap works with an enterprise-class SAS HBA and these cheap eSATA adaptors.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] all in one server
On Tue, 18 Sep 2012, Erik Ableson wrote:
> The bigger issue you'll run into will be data sizing, as a year's worth of snapshots basically means that you're keeping a journal of every single write that's occurred over the year. If you are running [...]

The above is not a correct statement. The snapshot only preserves the file-level differences between the points in time. A snapshot does not preserve every single write. Zfs does not even send every single write to underlying disk. In some usage models, the same file may be re-written 100 times between snapshots, or might not ever appear in any snapshot.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Zvol vs zfs send/zfs receive
On Sat, 15 Sep 2012, Dave Pooser wrote:
> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB. That... doesn't look right. (Comparing 'zfs list -t snapshot' and looking at the 5.34 ref for the snapshot vs 'zfs list' on the new system and looking at space used.) Is this a problem? Should I be panicking yet?

Does the old pool use 512 byte sectors while the new pool uses 4K sectors? Is there any change to compression settings?

With a volblocksize of 8k on disks with 4K sectors, one might expect very poor space utilization because metadata chunks will use/waste a minimum of 4k. There might be more space consumed by the metadata than the actual data.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
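As a rough sketch of how one might check the factors mentioned above (pool and dataset names are hypothetical; an ashift of 9 indicates 512-byte sectors and 12 indicates 4K sectors):

  # compare sector-size assumptions of the two pools
  zdb -C oldpool | grep ashift
  zdb -C newpool | grep ashift
  # compare the relevant dataset properties on both sides
  zfs get volblocksize,compression,copies oldpool/vol newpool/vol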
Re: [zfs-discuss] ZIL iops expectations
On Sat, 11 Aug 2012, Chris Nagele wrote:
> So far, running GNU dd with 512b and oflag=sync, the most we can get is 8k iops on the zil device. I even tried with some SSDs (Crucial M4, [...]

If this is one dd program running, then all you are measuring is sequential IOPS. That is, the next I/O will not start until the previous one has returned. What you want to test is threaded IOPS with some number of threads (each one represents a client) running. You can use iozone to effectively test that. This command runs with 16 threads and 8k blocks with a 2GB file:

  iozone -m -t 16 -T -O -r 8k -o -s 2G

If you 'dd' from /dev/zero then the test is meaningless since zfs is able to compress zeros. If you 'dd' from /dev/random then the test is meaningless since the random generator is slow.

> Is this the expected result? Should I be pushing for more? In IRC I was told that I should be able to get 12k no problem. We are running NFS in a heavily used environment with millions of very small files, so low latency counts.

Your test method is not valid.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Tue, 7 Aug 2012, Sašo Kiselkov wrote:
> MLC is so much cheaper that you can simply slap on twice as much and use the rest for ECC, mirroring or simply overprovisioning sectors. The common practice for extending the lifecycle of MLC is short-stroking it, i.e. using only a fraction of the capacity. E.g. a 40GB MLC unit with 5-10k cycles per cell can be turned into a 4GB unit (with the controller providing wear leveling) with effectively 50-100k cycles (that's SLC land) for about a hundred bucks. Also, since I'm mirroring it already with ZFS checksums to provide integrity checking, your argument simply doesn't hold up.

Remember he also said that the current product is based principally on an FPGA. This FPGA must be interfacing directly with the Flash device, so it would need to be substantially redesigned to deal with MLC Flash (probably at least an order of magnitude more complex), or else a microcontroller would need to be added to the design, and firmware would handle the substantial complexities.

If the Flash device writes slower, then the power has to stay up longer. If the Flash device reads slower, then it takes longer for the drive to come back on line.

Quite a lot of product would need to be sold in order to pay for both re-engineering and the cost of running a business. Regardless, continual product re-development is necessary or else it will surely die.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Mon, 6 Aug 2012, Christopher George wrote:
> Intel's brief also clears up a prior controversy of what types of data are actually cached, per the brief it's both user and system data!

I am glad to hear that both user AND system data is stored. That is rather reassuring. :-)

Is your DDRDrive product still supported and moving? Is it well supported for Illumos?

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Mon, 6 Aug 2012, Stefan Ring wrote:
>> Intel's brief also clears up a prior controversy of what types of data are actually cached, per the brief it's both user and system data!
>
> So you're saying that SSDs don't generally flush data to stable medium when instructed to? So data written before an fsync is not guaranteed to be seen after a power-down? If that -- ignoring cache flush requests -- is the whole reason why SSDs are so fast, I'm glad I haven't got one yet.

Testing has shown that many SSDs do not flush the data prior to claiming that they have done so. The flush request may hasten the time until the next actual cache flush.

As far as I am aware, Intel does not sell any enterprise-class SSDs even though they have sold some models with 'E' in the name. True enterprise SSDs can cost 5-10X the price of larger consumer models. A battery-backed RAM cache with Flash backup can be a whole lot faster and still satisfy many users.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Fri, 3 Aug 2012, Neil Perrin wrote:
>> For the slog, you should look for a SLC technology SSD which saves unwritten data on power failure. In Intel-speak, this is called "Enhanced Power Loss Data Protection". I am not running across any Intel SSDs which claim to match these requirements.
>
> That shouldn't be necessary. ZFS flushes the write cache for any device written before returning from the synchronous request to ensure data stability.

Yes, but the problem is that the write IOPS go way, way down (and device lifetime suffers) if the device is not able to perform write caching. A consumer-grade device advertising 70K write IOPS is definitely not going to offer anything like that if it actually flushes its cache when requested. A device with a reserve of energy sufficient to write its cache to backing FLASH on power fail will be able to defer cache flush requests.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Fri, 3 Aug 2012, Karl Rossing wrote:
> I'm looking at http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-ssd.html wondering what I should get. Are people getting Intel 330's for l2arc and 520's for slog?

For the slog, you should look for a SLC technology SSD which saves unwritten data on power failure. In Intel-speak, this is called "Enhanced Power Loss Data Protection". I am not running across any Intel SSDs which claim to match these requirements.

Extreme write IOPS claims in consumer SSDs are normally based on large write caches which can lose even more data if there is a power failure.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZIL devices and fragmentation
On Mon, 30 Jul 2012, Roy Sigurd Karlsbakk wrote:
> Should OI/Illumos be able to boot cleanly without manual action with the SLOG devices gone?

If this is allowed, then data may be unnecessarily lost. When the drives are not all in one chassis, then it is not uncommon for one chassis to not come up immediately, or be slow to come up when recovering from a power failure.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On Sun, 29 Jul 2012, Jim Klimov wrote:
> Would extra copies on larger disks actually provide the extra reliability, or only add overheads and complicate/degrade the situation?

My opinion is that complete hard drive failure and block-level media failure are two totally different things. Complete hard drive failure rates should not be directly related to total storage size, whereas the probability of media failure per drive is directly related to total storage size. Given this, and assuming that complete hard drive failure occurs much less often than partial media failure, using the copies feature should be pretty effective.

> Would the use of several copies cripple the write speeds?

It would reduce the write rate by 1/2, or by whatever number of copies you have requested.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
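For illustration, a minimal sketch of using the copies property (the dataset name is hypothetical; the setting only applies to blocks written after the change):

  # keep two copies of every block of this dataset, even on a single disk
  zfs set copies=2 tank/photos
  zfs get copies,used tank/photos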
Re: [zfs-discuss] ZIL devices and fragmentation
On Sun, 29 Jul 2012, Jim Klimov wrote:
> For several times now I've seen statements on this list implying that a dedicated ZIL/SLOG device catching sync writes for the log also allows for more streamlined writes to the pool during normal healthy TXG syncs, than is the case with the default ZIL located within the pool.

After reading what some others have posted, I should remind that zfs always has a ZIL (unless it is specifically disabled for testing). If it does not have a dedicated ZIL, then it uses the disks in the main pool to construct the ZIL. Dedicating a device to the ZIL should not improve the pool storage layout because the pool already had a ZIL.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] IO load questions
On Tue, 24 Jul 2012, matth...@flash.shanje.com wrote:
> ~50,000 IOPS 4k random read. 200MB/sec, 30% CPU utilization on Nexenta, ~90% utilization on guest OS. I'm guessing the guest OS is bottlenecking. Going to try physical hardware next week.
> ~25,000 IOPS 4k random write. 100MB/sec, ~70% CPU utilization on Nexenta, ~45% CPU utilization on guest OS. Feels like Nexenta CPU is the bottleneck. Load average of 2.5.
> A quick test with 128k recordsizes and 128k IO looked to be 400MB/sec performance, can't remember CPU utilization on either side. Will retest and report those numbers.
> It feels like something is adding more overhead here than I would expect on the 4k recordsize/IO workloads. Any thoughts where I should start on this? I'd really like to see closer to 10Gbit performance here, but it seems like the hardware isn't able to cope with it?

All systems have a bottleneck. You are highly unlikely to get close to 10Gbit performance with 4k random synchronous writes. 25K IOPS seems pretty good to me.

The 2.4GHz clock rate of the 4-core Xeon CPU you are using is not terribly high. Performance is likely better with a higher-clocked, more modern design with more cores.

Verify that the zfs checksum algorithm you are using is a low-cost one and that you have not enabled compression or deduplication. You did not tell us how your zfs pool is organized so it is impossible to comment more.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
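For reference, a quick sketch of how those settings and the pool layout might be checked (the pool name is hypothetical; fletcher4 is the inexpensive default checksum, while sha256 costs noticeably more CPU):

  # confirm checksum, compression and dedup settings
  zfs get checksum,compression,dedup tank
  # show how the vdevs are organized
  zpool status tank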
Re: [zfs-discuss] Question on 4k sectors
On Mon, 23 Jul 2012, Anonymous Remailer (austria) wrote:
> The question was relative to some older boxes running S10 and not planning to upgrade the OS, keeping them alive as long as possible...

Recent Solaris 10 kernel patches are addressing drives with 4k sectors. It seems that Solaris 10 will work with drives with 4k sectors, so Solaris 10 users will not be stuck.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Very poor small-block random write performance
On Sat, 21 Jul 2012, Jim Klimov wrote:
> During this quick test I did not manage to craft a test which would inflate a file in the middle without touching its other blocks (other than using a text editor which saves the whole file - so that is irrelevant), in order to see if ZFS can insert smaller blocks in the middle of an existing file, and whether it would reallocate other blocks to fit the set recordsizes.

The POSIX filesystem interface does not support such a thing ('insert'). Presumably the underlying zfs pool could support such a thing if there was a layer on top to request it. The closest equivalent in a POSIX filesystem would be if a previously-null block in a sparse file is updated to hold content.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Very poor small-block random write performance
On Wed, 18 Jul 2012, Michael Traffanstead wrote:
> I have an 8 drive ZFS array (RAIDZ2 - 1 spare) using 5900rpm 2TB SATA drives with an hpt27xx controller under FreeBSD 10 (but I've seen the same issue with FreeBSD 9). The system has 8 gigs and I'm letting FreeBSD auto-size the ARC. Running iozone (from ports), everything is fine for file sizes up to 8GB, but when it runs with a 16GB file the random write performance plummets using 64K record sizes.

This is normal. The problem is that with zfs 128k block sizes, zfs needs to re-read the original 128k block so that it can compose and write the new 128k block. With sufficient RAM, this is normally avoided because the original block is already cached in the ARC.

If you were to reduce the zfs blocksize to 64k then the performance dive at 64k would go away, but there would still be write performance loss at sizes other than a multiple of 64k.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
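For illustration, a sketch of the recordsize change being suggested (the dataset name is hypothetical; recordsize only affects files created or rewritten after the property is changed):

  zfs set recordsize=64k tank/testdata
  zfs get recordsize tank/testdata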
Re: [zfs-discuss] Any company willing to support a 7410 ?
On Thu, 19 Jul 2012, Gordon Ross wrote:
>> On Thu, Jul 19, 2012 at 5:38 AM, sol a...@yahoo.com wrote:
>> Other than Oracle, do you think any other companies would be willing to take over support for a clustered 7410 appliance with 6 JBODs? (Some non-Oracle names which popped out of google: Joyent/Coraid/Nexenta/Greenbytes/NAS/RackTop/EraStor/Illumos/???)
>
> I'm not sure, but I think there are people running NexentaStor on that h/w. If not, then on something pretty close. NS supports clustering, etc.

You would lose the fancy user interface and monitoring stuff that the Fishworks team developed for the product. It would no longer be an appliance. No doubt, Nexenta has developed new cool stuff for NexentaStor.

As others have said, only Oracle is capable of supporting the system as the original product. It could be re-installed to become something else.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Very poor small-block random write performance
On Fri, 20 Jul 2012, Jim Klimov wrote:
> I am not sure if I misunderstood the question or Bob's answer, but I have a gut feeling it is not fully correct: ZFS block sizes for files (filesystem datasets) are, at least by default, dynamically-sized depending on the contiguous write size as queued by the time a ZFS transaction is closed and flushed to disk. In case of RAIDZ layouts, this logical block is further [...]

Zfs data block sizes are fixed in size! Only tail blocks are shorter. The underlying representation (how the data block gets stored) depends on whether compression, raidz, deduplication, etc., are used.

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote:
>> If you were to add a second vdev (i.e. stripe) then you should see very close to 200% due to the default round-robin scheduling of the writes.
>
> My expectation would be 200%, as 4 disks are involved. It may not be the perfect 4x scaling, but imho it should be (and is for a scsi system) more than half of the theoretical throughput. This is solaris or a solaris derivative, not linux ;-)

Here are some results from my own machine based on the 'virgin mount' test approach. The results show less boost than is reported by a benchmark tool like 'iozone' which sees benefits from caching. I get an initial sequential read speed of 657 MB/s on my new pool which has 1200 MB/s of raw bandwidth (if mirrors could produce 100% boost). Reading the file a second time reports 6.9 GB/s.

The below is with a 2.6 GB test file, but with a 26 GB test file (just add another zero to 'count' and wait longer) I see an initial read rate of 618 MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.

% zpool status
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 0h10m with 0 errors on Mon Jul 16 04:30:48 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      ONLINE       0     0     0
          mirror-0                ONLINE       0     0     0
            c7t5393E8CA21FAd0p0   ONLINE       0     0     0
            c11t5393D8CA34B2d0p0  ONLINE       0     0     0
          mirror-1                ONLINE       0     0     0
            c8t5393E8CA2066d0p0   ONLINE       0     0     0
            c12t5393E8CA2196d0p0  ONLINE       0     0     0
          mirror-2                ONLINE       0     0     0
            c9t5393D8CA82A2d0p0   ONLINE       0     0     0
            c13t5393E8CA2116d0p0  ONLINE       0     0     0
          mirror-3                ONLINE       0     0     0
            c10t5393D8CA59C2d0p0  ONLINE       0     0     0
            c14t5393D8CA828Ed0p0  ONLINE       0     0     0

errors: No known data errors

% pfexec zfs create tank/zfstest
% pfexec zfs create tank/zfstest/defaults
% cd /tank/zfstest/defaults
% pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
% cd ..
% pfexec zfs umount tank/zfstest/defaults
% pfexec zfs mount tank/zfstest/defaults
% cd defaults
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
% pfexec dd if=/dev/rdsk/c7t5393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
% bc
scale=8
657/150
4.3800

It is very difficult to benchmark with a cache which works so well:

% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s

Bob
-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Problem: Disconnected command timeout for Target X
On Tue, 17 Jul 2012, Roberto Scudeller wrote:
> Hi all. I'm using OpenSolaris snv_134 with LSI controllers and a Supermicro motherboard, with 20 SATA disks, zfs in a raid-10 configuration. I mounted this zfs storage with NFS. I'm not an OpenSolaris specialist. What are the commands to show hardware information? Like 'lshw' in Linux, but for OpenSolaris.

cfgadm, prtconf, prtpicl, prtdiag
zpool status
fmadm faulty

It sounds like you may have a broken cable or power supply failure to some disks.

Bob

> The storage stopped working, but ping responds. SSH and NFS are out. When I open the console it shows these messages:
>
> Jul 2 13:00:27 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:00:27 storage Disconnected command timeout for Target 4
> Jul 2 13:01:28 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:01:28 storage Disconnected command timeout for Target 3
> Jul 2 13:02:28 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:02:28 storage Disconnected command timeout for Target 2
> Jul 2 13:03:29 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:03:29 storage Disconnected command timeout for Target 1
> Jul 2 13:04:29 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:04:29 storage Disconnected command timeout for Target 0
> Jul 2 13:05:40 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:05:40 storage Disconnected command timeout for Target 6
> Jul 2 13:06:40 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3140@0 (mpt2):
> Jul 2 13:06:40 storage Disconnected command timeout for Target 5
>
> Any ideas? Could you help me?
>
> -- Roberto Scudeller

-- Bob Friesenhahn, bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/, GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote: The below is with a 2.6 GB test file but with a 26 GB test file (just add another zero to 'count' and wait longer) I see an initial read rate of 618 MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s. To work around these caching effects just use a file 2 times the size of ram, iostat then shows the numbers really coming from disk. I always test like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but quite impressive ;-) Yes, in the past I have done benchmarking with file size 2X the size of memory. This does not necessarily erase all caching because the ARC is smart enough not to toss everything. At the moment I have an iozone benchmark running up from 8 GB to 256 GB file sizes. I see that it has started the 256 GB size now. It may be a while. Maybe a day. In the range of 600 MB/s other issues may show up (pcie bus contention, hba contention, cpu load). And performance at this level could be just good enough, not requiring any further tuning. Could you recheck with only 4 disks (2 mirror pairs)? If you just get some 350 MB/s it could be the same problem as with my boxes. All sata disks? Unfortunately, I already put my pool into use and cannot conveniently destroy it now. The disks I am using are SAS (7200 RPM, 1 TB) but return similar per-disk data rates as the SATA disks I use for the boot pool. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote: To work around these caching effects just use a file 2 times the size of ram, iostat then shows the numbers really coming from disk. I always test like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but quite impressive ;-) Ok, the iozone benchmark finally completed. The results do suggest that reading from mirrors substantially improves the throughput. This is interesting since the results differ from (are better than) my 'virgin mount' test approach:
Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 8G -g 256G
              KB  reclen    write   rewrite      read     reread
         8388608      64   572933   1008668   6945355   7509762
         8388608     128  2753805   2388803   6482464   7041942
         8388608     256  2508358   2331419   2969764   3045430
         8388608     512  2407497   2131829   3021579   3086763
        16777216      64   671365    879080   6323844   6608806
        16777216     128  1279401   2286287   6409733   6739226
        16777216     256  2382223   2211097   2957624   3021704
        16777216     512  2237742   2179611   3048039   3085978
        33554432      64   933712    699966   6418428   6604694
        33554432     128   459896    431640   6443848   6546043
        33554432     256       90    430989   2997615   3026246
        33554432     512   427158    430891   3042620   3100287
        67108864      64   426720    427167   6628750   6738623
        67108864     128   419328    422581       153   6743711
        67108864     256   419441    419129   3044352   3056615
        67108864     512   431053    417203   3090652   3112296
       134217728      64   417668     55434    759351    760994
       134217728     128   409383    400433    759161    765120
       134217728     256   408193    405868    763892    766184
       134217728     512   408114    403473    761683    766615
       268435456      64   418910     55239    768042    768498
       268435456     128   408990    399732    763279    766882
       268435456     256   413919    399386    760800    764468
       268435456     512   410246    403019    766627    768739
Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem: Disconnected command timeout for Target X
On Tue, 17 Jul 2012, Roberto Scudeller wrote: Hi Bob, Thanks for the answers. How do I test your theory? I would use 'dd' to see if it is possible to transfer data from one of the problem devices. Gain physical access to the system and check the signal and power cables to these devices closely. Use 'iostat -xe' to see what error counts have accumulated. Also 'iostat -E'. In this case, I use common disks SATA 2, not Nearline SAS (NL SATA) or SAS. Do you think the disks SATA are the problem? There have been reports of congestion leading to timeouts and resets when SATA disks are on expanders. There have also been reports that one failing disk can cause problems when on expanders. Regardless, if this system has been previously operating fine for some time, these errors would indicate a change in the hardware shared by all these devices. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
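A minimal sketch of the checks described above, assuming the suspect disk appears as c0t4d0 (a hypothetical name; substitute the real target from the warnings):
% pfexec dd if=/dev/rdsk/c0t4d0p0 of=/dev/null bs=128k count=10000   # raw sequential read from the suspect disk
% iostat -xe 5 3     # watch service times and the error columns while the dd runs
% iostat -E          # cumulative soft/hard/transport error counts per device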
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Stefan Ring wrote: I wouldn't expect mirrored read to be faster than single-disk read, because the individual disks would need to read small chunks of data with holes in-between. Regardless of the holes being read or not, the disk will spin at the same speed. It is normal for reads from mirrors to be faster than for a single disk because reads can be scheduled from either disk, with different I/Os being handled in parallel. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Stefan Ring wrote: It is normal for reads from mirrors to be faster than for a single disk because reads can be scheduled from either disk, with different I/Os being handled in parallel. That assumes that there *are* outstanding requests to be scheduled in parallel, which would only happen with multiple readers or a large read-ahead buffer. That is true. Zfs tries to detect the case of sequential reads and requests to read more data than the application has already requested. In this case the data may be prefetched from the other disk before the application has requested it. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Michael Hase wrote: This is my understanding of zfs: it should load balance read requests even for a single sequential reader. zfs_prefetch_disable is the default 0. And I can see exactly this scaling behaviour with sas disks and with scsi disks, just not on this sata pool. Is the BIOS configured to use AHCI mode or is it using IDE mode? Are the disks 512 byte/sector or 4K? Maybe it's a corner case which doesn't matter in real world applications? The random seek values in my bonnie output show the expected performance boost when going from one disk to a mirrored configuration. It's just the sequential read/write case, that's different for sata and sas disks. I don't have a whole lot of experience with SATA disks but it is my impression that you might see this sort of performance if the BIOS was configured so that the drives were used as IDE disks. If not that, then there must be a bottleneck in your hardware somewhere. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
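A hedged way to confirm the prefetch setting and the controller driver actually in use (the variable name is the common Solaris/illumos one and may differ by release):
% echo zfs_prefetch_disable/D | pfexec mdb -k    # 0 means file-level prefetch is enabled
% prtconf -D | grep -i -e ahci -e pci-ide        # shows whether the ports are bound to the ahci or legacy IDE driver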
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote: So only one thing left: mirror should read 2x I don't think that mirror should necessarily read 2x faster even though the potential is there to do so. Last I heard, zfs did not include a special read scheduler for sequential reads from a mirrored pair. As a result, 50% of the time, a read will be scheduled for a device which already has a read scheduled. If this is indeed true, the typical performance would be 150%. There may be some other scheduling factor (e.g. estimate of busyness) which might still allow zfs to select the right side and do better than that. If you were to add a second vdev (i.e. stripe) then you should see very close to 200% due to the default round-robin scheduling of the writes. It is really difficult to measure zfs read performance due to caching effects. One way to do it is to write a large file (containing random data such as returned from /dev/urandom) to a zfs filesystem, unmount the filesystem, remount the filesystem, and then time how long it takes to read the file once. The reason why this works is because remounting the filesystem restarts the filesystem cache. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
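A compact sketch of that write/remount/read procedure, with hypothetical pool and dataset names (the same approach as the transcript earlier in this digest):
% pfexec zfs create tank/readtest
% pfexec dd if=/dev/urandom of=/tank/readtest/big.dat bs=128k count=20000
% pfexec zfs umount tank/readtest
% pfexec zfs mount tank/readtest     # remounting restarts the filesystem cache for this dataset
% time dd if=/tank/readtest/big.dat of=/dev/null bs=128k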
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Tue, 10 Jul 2012, Edward Ned Harvey wrote: CPU's are not getting much faster. But IO is definitely getting faster. It's best to keep ahead of that curve. It seems that per-socket CPU performance is doubling every year. That seems faster to me. If server CPU chipsets offer acceleration for some type of standard encryption, then that needs to be considered. The CPU might not need to do the encryption the hard way. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, 11 Jul 2012, Sašo Kiselkov wrote: the hash isn't used for security purposes. We only need something that's fast and has a good pseudo-random output distribution. That's why I looked toward Edon-R. Even though it might have security problems in itself, it's by far the fastest algorithm in the entire competition. If an algorithm is not 'secure' and zfs is not set to verify, doesn't that mean that a knowledgeable user will be able to cause intentional data corruption if deduplication is enabled? A user with very little privilege might be able to cause intentional harm by writing the magic data block before some other known block (which produces the same hash) is written. This allows one block to substitute for another. It does seem that security is important because with a human element, data is not necessarily random. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
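If a fast but not collision-resistant hash is used, the usual mitigation is to have dedup verify matching blocks byte for byte before sharing them; a hedged example with a hypothetical pool name:
% pfexec zfs set dedup=sha256,verify tank    # 'verify' forces a full data compare whenever checksums match
% zfs get dedup,checksum tank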
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, 11 Jul 2012, Joerg Schilling wrote: Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 10 Jul 2012, Edward Ned Harvey wrote: CPU's are not getting much faster. But IO is definitely getting faster. It's best to keep ahead of that curve. It seems that per-socket CPU performance is doubling every year. That seems faster to me. This would only apply, if you implement a multi-threaded hash. While it is true that the per-block hash latency does not improve much with new CPUs (and may even regress), given multiple I/Os at once, hashes may be computed by different cores and so it seems that total system performance will scale with per-socket CPU performance. Even with a single stream of I/O, multiple zfs blocks will be read or written so multiple block hashes may be computed at once on different cores. Server OSs like Solaris have been focusing on improving total system throughput rather than single-threaded bandwidth. I don't mean to discount the importance of this effort though. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, 11 Jul 2012, Sašo Kiselkov wrote: The reason why I don't think this can be used to implement a practical attack is that in order to generate a collision, you first have to know the disk block that you want to create a collision on (or at least the checksum), i.e. the original block is already in the pool. At that point, you could write a colliding block which would get de-dup'd, but that doesn't mean you've corrupted the original data, only that you referenced it. So, in a sense, you haven't corrupted the original block, only your own collision block (since that's the copy that doesn't get written). This is not correct. If you know the well-known block to be written, then you can arrange to write your collision block prior to when the well-known block is written. Therefore, it is imperative that the hash algorithm make it clearly impractical to take a well-known block and compute a collision block. For example, the well-known block might be part of a Windows anti-virus package, or a Windows firewall configuration, and corrupting it might leave a Windows VM open to malware attack. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris derivate with the best long-term future
On Wed, 11 Jul 2012, Eugen Leitl wrote: It would be interesting to see when zpool versions 28 will be available in the open forks. Particularly encryption is a very useful functionality. Illumos advanced to zpool version 5000 and this is available in the latest OpenIndiana development release. Does that make you happy? As far as which Solaris derivate has the best future, it is clear that Illumos has a lot of development energy right now and there is little reason to believe that this energy will cease. Illumos-derived distributions may come and go but it looks like Illumos has a future, particularly once it frees itself from all Sun-derived binary components. Oracle continues with Solaris 11 and does seem to be funding necessary driver and platform support. User access to Solaris 11 may be arbitrarily limited. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, 11 Jul 2012, Sašo Kiselkov wrote: For example, the well-known block might be part of a Windows anti-virus package, or a Windows firewall configuration, and corrupting it might leave a Windows VM open to malware attack. True, but that may not be enough to produce a practical collision for the reason that while you know which bytes you want to attack, these might not line up with ZFS disk blocks (especially the case with Windows VMs which are stored in large opaque zvols) - such an attack would require physical access to the machine (at which point you can simply manipulate the blocks directly). I think that well-known blocks are much easier to predict than you say because operating systems, VMs, and application software behave in predictable patterns. However, deriving another useful block which hashes the same should be extremely difficult and any block hashing algorithm needs to assure that. Having an excellent random distribution property is not sufficient if it is relatively easy to compute some other block producing the same hash. It may be useful to compromise a known block even if the compromised result is complete garbage. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, 11 Jul 2012, Richard Elling wrote: The last studio release suitable for building OpenSolaris is available in the repo. See the instructions at http://wiki.illumos.org/display/illumos/How+To+Build+illumos Not correct as far as I can tell. You should re-read the page you referenced. Oracle rescinded (or lost) the special Studio releases needed to build the OpenSolaris kernel. The only way I can see to obtain these releases is illegally. However, Studio 12.3 (free download) produces user-space executables which run fine under Illumos. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, 11 Jul 2012, Hung-Sheng Tsao (LaoTsao) Ph.D wrote: Not correct as far as I can tell. You should re-read the page you referenced. Oracle rescinded (or lost) the special Studio releases needed to build the OpenSolaris kernel. you can still download 12, 12.1, 12.2, AFAIK through OTN That is true (and I have done so). Unfortunately the versions offered are not the correct ones to build the OpenSolaris kernel. Special patched versions with particular date stamps are required. The only way that I see to obtain these files any more is via distribution channels primarily designed to perform copyright violations. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, 4 Jul 2012, Nico Williams wrote: Oddly enough the manpages at the Open Group don't make this clear. So I think it may well be advisable to use msync(3C) before munmap() on MAP_SHARED mappings. However, I think all implementors should, and probably all do (Linux even documents that it does) have an implied msync(2) when doing a munmap(2). It really makes no sense at all to have munmap(2) not imply msync(3C). As long as the system has a way to track which dirty pages map to particular files (Solaris historically does), it should not be necessary to synchronize the mapping to the underlying store simply due to munmap. It may be more efficient not to do that. The same pages may be mapped and unmapped many times by applications. In fact, several applications may memory map the same file so they access the same pages and it seems wrong to flush to the underlying store simply because one of the applications no longer references the page. Since mmap() on zfs breaks the traditional coherent memory/filesystem that Solaris enjoyed prior to zfs, it may be that some rules should be different when zfs is involved because of its redundant use of memory (zfs ARC and VM page). Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, 4 Jul 2012, Stefan Ring wrote: It really makes no sense at all to have munmap(2) not imply msync(3C). Why not? munmap(2) does basically the equivalent of write(2). In the case of write, that is: a later read from the same location will see the written data, unless another write happens in-between. If power Actually, a write to memory for a memory mapped file is more similar to write(2). If two programs have the same file mapped then the effect on the memory they share is instantaneous because it is the same physical memory. A mmapped file becomes shared memory as soon as it is mapped at least twice. It is pretty common for a system of applications to implement shared memory via memory mapped files with the mapped memory used for read/write. This is a precursor to POSIX's shm_open(3RT) which produces similar functionality without a known file in the filesystem. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Tue, 3 Jul 2012, James Litchfield wrote: Agreed - msync/munmap is the only guarantee. I don't see that the munmap definition assures that anything is written to disk. The system is free to buffer the data in RAM as long as it likes without writing anything at all. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Mon, 2 Jul 2012, Iwan Aucamp wrote: I'm interested in some more detail on how ZFS intent log behaves for updates done via a memory mapped file - i.e. will the ZIL log updates done to an mmap'd file or not ? I would not expect these writes to go into the intent log unless msync(2) is used on the mapping with the MS_SYNC option. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recommendation for home NAS external JBOD
On Mon, 18 Jun 2012, Koopmann, Jan-Peter wrote: looks nice! The only thing coming to mind is that according to the specifications the enclosure is 3Gbits only. If I choose to put in a SSD with 6Gbits this would be not optimal. I looked at their site but failed to find 6GBit enclosures. But I will keep looking since sooner or later they will provide it. I browsed the site and saw many 6GBit enclosures. I also saw one with Nexenta (Solaris/zfs appliance) inside. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recommendation for home NAS external JBOD
On Mon, 18 Jun 2012, Koopmann, Jan-Peter wrote: I browsed the site and saw many 6GBit enclosures. I also saw one with Nexenta (Solaris/zfs appliance) inside. I found several high end enclosures. Or ones with bundled RAID cards. But the equivalent of the one originally suggested I was not able to find. However after looking at tons of sites for hours I might simply have missed it. If you found one, can you please forward a link? So you want high-end performance at a low-end price? It seems unlikely that you will notice the difference between 3Gbit and 6Gbit for a home application. FLASH-based SSDs seem to burn out pretty quickly if you don't use them carefully. The situation is getting worse rather than better over time as FLASH geometries get smaller and they try to store more bits in one cell. What was described as a bright new future is starting to look more like an end of the road to me. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recommendation for home NAS external JBOD
On Mon, 18 Jun 2012, Carson Gaspar wrote: What makes you think the Barracuda 7200.14 drives report 4k sectors? I gave up looking for 4kn drives, as everything I could find was 512e. I would _love_ to be wrong, as I have 8 4TB Hitachis on backorder that I would gladly replace with 4kn drives, even if I had to drop to 3TB density. Why would you want native 4k drives right now? Not much would work with such drives. Maybe in a dedicated chassis (e.g. the JBOD) they could be of some use. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs
On Tue, 29 May 2012, Iwan Aucamp wrote: - Is there a parameter similar to /proc/sys/vm/swappiness that can control how long unused pages in page cache stay in physical ram if there is no shortage of physical ram ? And if not how long will unused pages stay in page cache stay in physical ram given there is no shortage of physical ram ? Absent pressure for memory, no longer referenced pages will stay in memory forever. They can then be re-referenced in memory. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How does resilver/scrub work?
On Mon, 21 May 2012, Jim Klimov wrote: This is so far a relatively raw idea and I've probably missed something. Do you think it is worth pursuing and asking some zfs developers to make a POC? ;) I did read all of your text. :-) This is an interesting idea and could be of some use but it would be wise to test it first a few times before suggesting it as a general course. Zfs is still not totally foolproof. I still see postings from time to time regarding pools which panic/crash the system (probably due to memory corruption). Zfs will try to keep the data compacted at the beginning of the partition so if you have a way to know how far out it extends, then the initial 'dd' could be much faster when the pool is not close to full. Zfs scrub does need to do many more reads than a resilver since it reads all data and metadata copies. Triggering a resilver operation for the specific disk would likely hasten progress. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs_arc_max values
On Thu, 17 May 2012, Paul Kraus wrote: Why are you trying to tune the ARC as _low_ as possible? In my experience the ARC gives up memory readily for other uses. The only place I _had_ to tune the ARC in production was a couple systems running an app that checks for free memory _before_ trying to allocate it. If the ARC has all but 1 GB in use, the app (which is looking for On my system I adjusted the ARC down due to running user-space applications with very bursty short-term large memory usage. Reducing the ARC assured that there would be no contention between zfs ARC and the applications. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
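A sketch of such a cap, assuming 4 GB is an appropriate limit for the workload (the value is in bytes and takes effect at the next boot; the mdb line reads the live limit):
* in /etc/system (hypothetical 4 GB cap)
set zfs:zfs_arc_max = 0x100000000
% echo ::arc | pfexec mdb -k | grep -i c_max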
Re: [zfs-discuss] Migration of a Thumper to bigger HDDs
On Fri, 18 May 2012, Jim Klimov wrote: Would there be substantial issues if we start out making and filling the new raidz3 8+3 pool in SXCE snv_129 (with zpool v22) or snv_130, and later upgrade the big zpool along with the major OS migration, that can be avoided by a preemptive upgrade to oi_151a or later (oi_151a3?) Perhaps, some known pool corruption issues or poor data layouts in older ZFS software releases?.. I can't attest as to potential issues, but the newer software surely fixes many bugs and it is also likely that the data layout improves in newer software. Improved data layout would result in better performance. It seems safest to upgrade the OS before moving a lot of data. Leave a fallback path in case the OS upgrade does not work as expected. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migration of a Thumper to bigger HDDs
On Wed, 16 May 2012, Jim Klimov wrote: Your idea actually evolved for me into another (#7?), which is simple and apparent enough to be ingenious ;) DO use the partitions, but split the 2.73Tb drives into a roughly 2.5Tb partition followed by a 250Gb partition of the same size as vdevs of the original old pool. Then the new drives can replace a dozen of original small disks one by one, in a one-to-one fashion resilvering, with no worsening of the situation in regard of downtime or original/new pools' integrity tradeoffs (in fact, several untrustworthy old disks will be replaced by newer ones). I like this idea since it allows running two complete pools on the same disks without using files. Due to using partitions, the disk write cache will be disabled unless you specifically enable it. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migration of a Thumper to bigger HDDs
You forgot IDEA #6 where you take advantage of the fact that zfs can be told to use sparse files as partitions. This is rather like your IDEA #3 but does not require that disks be partitioned. This opens up many possibilities. Whole vdevs can be virtualized to files on (i.e. moved onto) remaining physical vdevs. Then the drives freed up can be replaced with larger drives and used to start a new pool. It might be easier to upgrade the existing drives in the pool first so that there is assured to be vast amounts of free space and the drives get some testing. There is not initially additional risk due to raidz1 in the pool since the drives will be about as full as before. I am not sure what additional risks are involved due to using files. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
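A hedged sketch of that idea with hypothetical names: back one old vdev with a sparse file kept on the space of the remaining vdevs, then free the physical disk for the new pool.
% pfexec mkfile -n 500g /tank/vdev-files/vdev0      # -n makes a sparse file; it must be at least as large as the disk it replaces
% pfexec zpool replace oldpool c2t3d0 /tank/vdev-files/vdev0
% zpool status oldpool                              # wait for the resilver to finish before reusing the disk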
Re: [zfs-discuss] Resilver restarting several times
On Fri, 11 May 2012, Jim Klimov wrote: Hello all, SHORT VERSION: What conditions can cause the reset of the resilvering process? My lost-and-found disk can't get back into the pool because of resilvers restarting... I recall that with sufficiently old vintage zfs, resilver would restart if a snapshot was taken. What sort of zfs is being used here? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] IOzone benchmarking
On Mon, 7 May 2012, Edward Ned Harvey wrote: Apparently I pulled it down at some point, so I don't have a URL for you anymore, but I did, and I posted. Long story short, both raidzN and mirror configurations behave approximately the way you would hope they do. That is... Approximately, as compared to a single disk: And I *mean* approximately, Yes, I remember your results. In a few weeks I should be setting up a new system with OpenIndiana and 8 SAS disks. This will give me an opportunity to test again. Last time I got to play was back in February 2008 and I did not bother to test raidz (http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf). Most common benchmarking is sequential read/write and rarely read-file/write-file where 'file' is a megabyte or two and the file is different for each iteration. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slow zfs send
On Mon, 7 May 2012, Karl Rossing wrote: On 12-05-07 12:18 PM, Jim Klimov wrote: During the send you can also monitor zpool iostat 1 and usual iostat -xnz 1 in order to see how busy the disks are and how many IO requests are issued. The snapshots are likely sent in the order of block age (TXG number), which for a busy pool may mean heavy fragmentation and lots of random small IOs.. I have been able to verify that I can get a zfs send at 135MB/sec for a striped pool with 2 internal drives on the same server. I see that there are a huge number of reads and hardly any writes. Are you SURE that deduplication was not enabled for this pool? This is the sort of behavior that one might expect if deduplication was enabled without enough RAM or L2 read cache. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
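Quick ways to check whether dedup is, or ever was, active on the sending pool (hypothetical pool name):
% zpool get dedupratio tank      # anything above 1.00x means deduplicated blocks exist
% zfs get -r dedup tank
% pfexec zdb -DD tank            # prints the dedup table (DDT) histogram if one exists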
Re: [zfs-discuss] IOzone benchmarking
On Fri, 4 May 2012, Erik Trimble wrote: predictable, and the backing store is still only giving 1 disk's IOPS. The RAIDZ* may, however, give you significantly more throughput (in MB/s) than a single disk if you do a lot of sequential read or write. Has someone done real-world measurements which indicate that raidz* actually provides better sequential read or write than simple mirroring with the same number of disks? While it seems that there should be an advantage, I don't recall seeing posted evidence of such. If there was a measurable advantage, it would be under conditions which are unlikely in the real world. The only thing totally clear to me is that raidz* provides better storage efficiency than mirroring and that raidz1 is dangerous with large disks. Provided that the media reliability is sufficiently high, there are still many performance and operational advantages obtained from simple mirroring (duplex mirroring) with zfs. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance on LSI 9240-8i?
On Fri, 4 May 2012, Rocky Shek wrote: If I were you, I will not use 9240-8I. I will use 9211-8I as pure HBA with IT FW for ZFS. Is there IT FW for the 9240-8i? They seem to use the same SAS chipset. My next system will have 9211-8i with IT FW. Playing it safe. Good enough for Nexenta is good enough for me. Bob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] IOzone benchmarking
On Mon, 30 Apr 2012, Ray Van Dolson wrote: I'm trying to run some IOzone benchmarking on a new system to get a feel for baseline performance. Unfortunately, benchmarking with IOzone is a very poor indicator of what performance will be like during normal use. Forcing the system to behave like it is short on memory only tests how the system will behave when it is short on memory. Testing multi-threaded synchronous writes with IOzone might actually mean something if it is representative of your work-load. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] IOzone benchmarking
On Tue, 1 May 2012, Ray Van Dolson wrote: Testing multi-threaded synchronous writes with IOzone might actually mean something if it is representative of your work-load. Sounds like IOzone may not be my best option here (though it does produce pretty graphs). bonnie++ actually gave me more realistic sounding numbers, and I've been reading good things about fio. None of these benchmarks is really useful other than to stress-test your hardware. Assuming that the hardware is working properly, when you intentionally break the cache, IOzone should produce numbers similar to what you could have estimated from hardware specification sheets and an understanding of the algorithms. Sun engineers used 'filebench' to do most of their performance testing because it allowed configuring the behavior to emulate various usage models. You can get it from https://sourceforge.net/projects/filebench/. Zfs is all about caching so the cache really does need to be included (and not intentionally broken) in any realistic measurement of how the system will behave. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, 25 Apr 2012, Rich Teer wrote: Perhaps I'm being overly simplistic, but in this scenario, what would prevent one from having, on a single file server, /exports/nodes/node[0-15], and then having each node NFS-mount /exports/nodes from the server? Much simpler than your example, and all data is available on all machines/nodes. This solution would limit bandwidth to that available from that single server. With the cluster approach, the objective is for each machine in the cluster to primarily access files which are stored locally. Whole files could be moved as necessary. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Two disks giving errors in a raidz pool, advice needed
On Mon, 23 Apr 2012, Manuel Ryan wrote: Do you guys also think I should change disk 5 first or am I missing something ? From your description, this sounds like the best course of action, but you should look at your system log files to see what sort of issues are being logged. Also consult the output of 'iostat -xe' to see what low-level errors are being logged. I'm not an expert with zfs so any insight to help me replace those disks without losing too much data would be much appreciated :) If this is really raidz1 then more data is definitely at risk if several disks seem to be failing at once. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris 11/ZFS historical reporting
On Mon, 16 Apr 2012, Tomas Forsman wrote: On 16 April, 2012 - Anh Quach sent me these 0,4K bytes: Are there any tools that ship w/ Solaris 11 for historical reporting on things like network activity, zpool iops/bandwidth, etc., or is it pretty much roll-your-own scripts and whatnot? zpool iostat 5 is the closest built-in.. Otherwise, switch from Solaris 11 to SmartOS or Illumos. Lots of good stuff going on there for monitoring and reporting. The dtrace.conf conference seemed like it was pretty interesting. See http://smartos.org/blog/. Lots more good stuff at http://www.youtube.com/user/deirdres and elsewhere on Youtube. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar
On Fri, 6 Apr 2012, Marion Hakanson wrote: The only caveat I've found is that the Nearline SAS Seagates go really slow with the Solaris default multipath load-balancing setting (round-robin). Set it to none or some large block value and they go fast. This issue doesn't appear when used with the PERC H800's. If the drives are exposed as individual LUNs, then it may be possible to arrange things so that 1/2 the drives are accessed (by default) down one path, and the other 1/2 down the other. That way you get the effect of load-balancing without the churn which might be caused by dynamic load-balancing. That is what I did for my storage here, but the preferences needed to be configured on the remote end. It is likely possible to configure everything on the host end but Solaris has special support for my drive array so it used the drive array's preferences. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
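The load-balance policy mentioned above is a scsi_vhci multipath property; a hedged sketch, to be checked against the scsi_vhci.conf(4) documentation for your release before use:
# in /kernel/drv/scsi_vhci.conf (reboot for the change to take effect)
load-balance="none";
% mpathadm list lu               # afterwards, list multipathed LUNs and their path counts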
Re: [zfs-discuss] test for holes in a file?
On Mon, 26 Mar 2012, Mike Gerdts wrote: If file space usage is less than file directory size then it must contain a hole. Even for compressed files, I am pretty sure that Solaris reports the uncompressed space usage. That's not the case. You are right. I should have tested this prior to posting. :-( Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] test for holes in a file?
On Mon, 26 Mar 2012, Andrew Gabriel wrote: I just played and knocked this up (note the stunning lack of comments, missing optarg processing, etc)... Give it a list of files to check... This is a cool program, but programmers were asking (and answering) this same question 20+ years ago before there was anything like SEEK_HOLE. If file space usage is less than file directory size then it must contain a hole. Even for compressed files, I am pretty sure that Solaris reports the uncompressed space usage. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
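The pre-SEEK_HOLE heuristic is easy to try from the shell; a sketch using a deliberately sparse file (note the correction elsewhere in this thread: compression also makes the allocated size smaller than the logical size, so the test is unreliable on compressed datasets):
% mkfile -n 100m sparse.dat      # -n sets the size without allocating blocks, so the file is one big hole
% ls -l sparse.dat               # logical size: 104857600 bytes
% du -k sparse.dat               # allocated size in KB: far smaller here, so the file must contain holes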
Re: [zfs-discuss] Good tower server for around 1,250 USD?
On Sat, 24 Mar 2012, Sandon Van Ness wrote: This is a very nice chassis IMHO for a desktop machine: http://www.supermicro.com/products/chassis/4U/743/SC743TQ-865-SQ.cfm I own the same chassis. However, when the system was delivered, it was quite loud. The problem was isolated to using the crummy fans that Intel provided with the CPUs. After replacing the Intel fans with better-quality fans, the system is now whisper quiet. My system has two 6-core Xeons (E5649) with 48GB of RAM. It is able to run OpenIndiana quite well but is being used to run Linux as a desktop system. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Good tower server for around 1,250 USD?
On Fri, 23 Mar 2012, The Honorable Senator and Mrs. John Blutarsky wrote: Obtaining an approved system seems very difficult. Because of the list being out of date and so the systems are no longer available, or because systems available now don't show up on the list? Sun was slow to update the list and it is not clear if Oracle updates the list at all. great. After reading the horror stories on the list I don't want to take a chance and buy the wrong machine and then have ZFS fail or Oracle tell me they don't support the machine. I can't answer for Oracle. There may be a chicken-and-egg problem since Oracle might not want to answer speculative questions but might be more concrete if you have a system in hand. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Good tower server for around 1,250 USD?
On Thu, 22 Mar 2012, The Honorable Senator and Mrs. John Blutarsky wrote: This will be a do-everything machine. I will use it for development, hosting various apps in zones (web, file server, mail server etc.) and running other systems (like a Solaris 11 test system) in VirtualBox. Ultimately I would like to put it under Solaris support so I am looking for something officially approved. The problem is there are so many systems on the HCL I don't know where to begin. One of the Supermicro super workstations looks Almost all of the systems listed on the HCL are defunct and no longer purchasable except on the used market. Obtaining an approved system seems very difficult. In spite of this, Solaris runs very well on many non-approved modern systems. I don't know what that means as far as the ability to purchase Solaris support. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Thu, 22 Mar 2012, Jim Klimov wrote: I think that a certain Bob F. would disagree, especially when larger native sectors and ashift=12 come into play. Namely, one scenario where this is important is automated storage of thumbnails for websites, or some similar small objects in vast amounts. I don't know about that Bob F. but this Bob F. just took a look and noticed that thumbnail files for full-color images are typically 4KB or a bit larger. Low-color thumbnails can be much smaller. For a very large photo site, it would make sense to replicate just the thumbnails across a number of front-end servers and put the larger files on fewer storage servers because they are requested much less often and stream out better. This would mean that those front-end thumbnail servers would primarily contain small files. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Server upgrade
On Wed, 15 Feb 2012, David Dyer-Bennet wrote: version fits my needs for example.) Upgrading might perhaps save me from changing all the user passwords (half a dozen, not a huge problem) and software packages I've added. (uname -a says SunOS fsfs 5.11 snv_134 i86pc i386 i86pc). Or should I just export my pool and do a from-scratch install of something? (Then recreate the users and install any missing software. I've got some cron jobs, too.) I have read (on the OpenIndiana site) that there is an upgrade path from what you have to OpenIndiana. They describe the procedure to use. OpenIndiana does not yet include encryption support in zfs since encryption support was never released into OpenSolaris. If I was you, I would try the upgrade to OpenIndiana first. The alternative is paid and supported Oracle Solaris 11, which would require a from-scratch install, and may or may not even be an option for you. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disk failing? High asvc_t and %b.
On Wed, 1 Feb 2012, Jan Hellevik wrote: The disk in question is c6t70d0 - it shows consistently higher %b and asvc_t than the other disks in the pool. The output is from a 'zfs receive' after about 3 hours. The two c5dx disks are the 'rpool' mirror, the others belong to the 'backup' pool. Are all of the disks the same make and model? What type of chassis are the disks mounted in? Is it possible that the environment that this disk experiences is somehow different than the others (e.g. due to vibration)? Should I be worried? And what other commands can I use to investigate further? It is difficult to say if you should be worried. Be sure to do 'iostat -xe' to see if there are any accumulating errors related to the disk. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disk failing? High asvc_t and %b.
On Wed, 1 Feb 2012, Jan Hellevik wrote: Are all of the disks the same make and model? They are different makes - I try to make pairs of different brands to minimise risk. Does your pairing maintain the same pattern of disk type across all the pairings? Some modern disks use 4k sectors while others still use 512 bytes. If the slow disk is a 4k sector model but the others are 512 byte models, then that would certainly explain a difference. Assuming that a couple of your disks are still unused, you could try replacing the suspect drive with an unused drive (via zfs command) to see if the slowness goes away. You could also make that vdev a triple-mirror since it is very easy to add/remove drives from a mirror vdev. Just make sure that your zfs syntax is correct so that you don't accidentally add a single-drive vdev to the pool (oops!). These sorts of things can be tested with zfs commands without physically moving/removing drives or endangering your data. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
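A hedged sketch of growing that vdev into a three-way mirror (device names are hypothetical; note that 'attach' grows an existing mirror, while 'add' would create a new top-level vdev):
% pfexec zpool attach backup c6t70d0 c6t99d0    # the new disk mirrors the vdev containing c6t70d0
% zpool status backup                           # wait for the resilver to complete
% pfexec zpool detach backup c6t70d0            # optionally drop the suspect disk afterwards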
Re: [zfs-discuss] need hint on pool setup
On Tue, 31 Jan 2012, Thomas Nau wrote: Dear all We have two JBODs with 20 or 21 drives available per JBOD hooked up to a server. We are considering the following setups: RAIDZ2 made of 4 drives RAIDZ2 made of 6 drives The first option wastes more disk space but can survive a JBOD failure whereas the second is more space effective but the system goes down when a JBOD goes down. Each of the JBODs comes with dual controllers, redundant fans and power supplies so do I need to be paranoid and use option #1? Of course it also gives us more IOPs but high end logging devices should take care of that. I think that the answer depends on the impact to your business if data is temporarily not available. If your business cannot survive data being temporarily not available (for hours or even a week) then the more conservative approach may be warranted. If you have a service contract which assures that a service tech will show up quickly with replacement hardware in hand, then this may also influence the decision which should be made. Another consideration is that since these JBODs connect to a server, the data will also be unavailable when the server is down. The server being down may in fact be a more significant factor than a JBOD being down. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
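To make the trade-off concrete, a hedged sketch of the two layouts with hypothetical device names (c1* in one JBOD, c2* in the other):
# option 1: 4-disk raidz2 vdevs split 2+2 across the JBODs; survives the loss of a whole JBOD
% pfexec zpool create tank raidz2 c1t0d0 c1t1d0 c2t0d0 c2t1d0 raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0
# option 2: 6-disk raidz2 vdevs contained within one JBOD; more usable space, but a JBOD failure takes the pool down
% pfexec zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0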
Re: [zfs-discuss] What is your data error rate?
On Wed, 25 Jan 2012, Anonymous Remailer (austria) wrote: I've been watching the heat control issue carefully since I had to take a job offshore (cough reverse H1B cough) in a place without adequate AC and I was able to get them to ship my servers and some other gear. Then I read Intel is guaranteeing their servers will work up to 100 degrees F ambient temps in the pricing wars to sell servers, he who goes green and saves data Most servers seem to be specified to run up to 95 degrees, with some particularly-dense ones specified to only handle 90. Network switching gear is usually specified to handle 105. My own equipment typically experiences up to 83 degrees during the peak of summer (but quite a lot more if the AC fails). Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss