Re: [zfs-discuss] Diagnosing Permanent Errors
Looks like it was RAM. I ran Memtest86+ 4.00 and it found no problems, but after I removed 2 of the 3 sticks of RAM and ran a backup, I had no errors. I'm running more extensive tests, but it looks like that was it. A new motherboard, CPU and ECC RAM are on the way to me now. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
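For anyone hitting similar permanent errors, one way to confirm the fix after the hardware swap is a full scrub; a minimal sketch (the pool name is illustrative):

    # re-verify the pool once the new hardware is in place
    zpool scrub tank
    zpool status -v tank    # should show 0 CKSUM errors and "No known data errors"
    zpool clear tank        # clear the stale error counters once the scrub comes back clean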
[zfs-discuss] Tuning the ARC towards LRU
Hello,

For desktop use, and presumably rapidly changing non-desktop uses, I find the ARC cache pretty annoying in its behavior. For example, this morning I had to hit my launch-terminal key perhaps 50 times (roughly) before it would start completing without disk I/O. There are plenty of other examples as well, such as /var/db/pkg not being pulled aggressively into cache, so that pkg_* operations (this is on FreeBSD) are slower than they should be (I have to run pkg_info some number of times before *it* will complete without disk I/O too).

I would be perfectly happy with pure LRU caching behavior, or an approximation thereof, and would therefore like to essentially turn off all MFU-like weighting. I have not investigated in great depth, so it's possible this represents an implementation problem rather than the actual intended policy of the ARC. If the former, can someone confirm/deny? If the latter, is there some way to tweak it? I have not found one (other than changing the code). Is there any particular reason why such knobs are not exposed? Am I missing something?

-- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
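For what it's worth, while there is no exposed policy knob, the ARC's current size, target, and MRU/MFU hit counts can at least be observed on FreeBSD; a minimal sketch (sysctl names as exposed by the FreeBSD port):

    # current ARC size, adaptive target (c), MRU target (p), and hit counters
    sysctl kstat.zfs.misc.arcstats.size \
           kstat.zfs.misc.arcstats.c \
           kstat.zfs.misc.arcstats.p \
           kstat.zfs.misc.arcstats.mru_hits \
           kstat.zfs.misc.arcstats.mfu_hits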
Re: [zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA
On 5 apr 2010, at 04.35, Edward Ned Harvey wrote:

When running the card in copyback write cache mode, I got horrible performance (with zfs), much worse than with copyback disabled (which I believe should mean it does write-through), when tested with filebench.

When I benchmark my disks, I also find that the system is slower with WriteBack enabled. I would not call it much worse, I'd estimate about 10% worse.

Yes, I oversimplified - I have been benchmarking with filebench, just running the tests shipped with the OS, trimmed a little according to http://www.solarisinternals.com/wiki/index.php/FileBench. For most tests, I typically get slightly worse performance with writeback enabled (or copyback, as they called it on this card); maybe about 10% worse on average could be about right for these tests too. The interesting part is that with these tests and writeback disabled, on a 4-way stripe of Sun stock 2.5" 146 GB 10K RPM drives, the test takes 2 hours and 18 minutes (138 minutes) to complete, but with writeback enabled it takes 16 hours 57 minutes (1017 minutes), or over 7.3 times as long! I can't (yet) explain the large difference in test time and the small difference in test results. Maybe a hardware - or driver - problem has its part in this. I have made a few simple tests with these cards before and was not really impressed; even with all the bells and whistles turned off they merely seemed to be an IOPS and maybe BW bottleneck, but the above just seems not right.

This, naturally, is counterintuitive. I do have an explanation, however, which is partly conjecture: With the WriteBack enabled, when the OS tells the HBA to write something, it seems to complete instantly. So the OS will issue another, and another, and another. The HBA has no knowledge of the underlying pool data structure, so it cannot consolidate the smaller writes into larger sequential ones. It will brainlessly (or less-brainfully) do as it was told, and write the blocks to precisely the addresses that it was instructed to write, even if those are many small writes scattered throughout the platters. ZFS is smarter than that. It's able to consolidate a zillion tiny writes, as well as some larger writes, all into a larger sequential transaction. ZFS has flexibility in choosing precisely how large a transaction it will create before sending it to disk. One of the variables used to decide how large the transaction should be is: is the disk busy writing right now? If the disks are still busy, I might as well wait a little longer and continue building up my next sequential block of data to write. If it appears to have completed the previous transaction already, no need to wait any longer. Don't let the disks sit idle. Just send another small write to the disk. Long story short, I think ZFS simply does a better job of write buffering than the HBA could possibly do. So you benefit by disabling the WriteBack, in order to allow ZFS to handle that instead.

You could think that ZIL transactions could get a speedup from the writeback cache, meaning more sync actions per second, and in some cases that seems to be true, and that the card should be designed to be able to handle intermittent load as the txg completions burst (typically every 30 seconds), but something strange obviously happens, at least on this setup.
(Actually I'd prefer if I could conclude that there is no use for writeback caching HBAs - I'd like these machines to be as stable as they possibly can be and therefore as plain and simple as possible, and for us to be able to just quickly move the disks if one machine should break - with some data stuck in some silly writeback cache inside an HBA that may or may not cooperate depending on its state of mind, mood and the moon phase, that can't be done, and I'd need a much more complicated (= error- and mistake-prone) setup. But my tests so far just seem not right and probably can't be used to conclude anything. I'd rather use slogs, and have a few Intel X25-Es to test with, but then I just recently read on this list that X25-Es aren't supported for slog anymore! Maybe because they always have their writeback cache turned on by default and ignore cache flush commands (and that is not a bug - is the design from outer space?), I don't know yet. (Don't know why I am stubbornly fooling around with this Intel junk - right now they manage to annoy me with a crappy (or broken) PCI-PCI bridge, a crappy HBA and crappy SSD drives...)) /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
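For anyone repeating these tests, one way to see the difference in write behaviour from the host side is to watch the pool while the benchmark runs; with the HBA cache disabled you would expect to see ZFS's periodic txg write bursts rather than a steady trickle. A minimal sketch (pool name and interval are illustrative):

    zpool iostat -v tank 1    # per-vdev operations and bandwidth, once per second
    iostat -xn 1              # per-device service times for comparison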
[zfs-discuss] Removing SSDs from pool
Hi all, while setting up our X4140 I have - following suggestions - added two SSDs as log devices as follows:

zpool add tank log c1t6d0 c1t7d0

I currently have:

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
        logs
          c1t6d0    ONLINE       0     0     0
          c1t7d0    ONLINE       0     0     0

errors: No known data errors

We have performance problems especially with FrontBase (relational database) running on this ZFS configuration and need to look for optimizations.

• I would like to remove the two SSDs as log devices from the pool and instead add them as a separate pool for sole use by the database, to see how this enhances performance. I could certainly do

zpool detach tank c1t7d0

to remove one disk from the log mirror. But how can I get back the second SSD?

Any experiences with running databases on ZFS pools? What can I do to tune the performance? A smaller block size, maybe?

Thanks a lot, Andreas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
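On the block-size question: the knob in ZFS is the dataset recordsize, which can be matched to the database page size. A hedged sketch (dataset name and page size are assumptions, and the change only affects files written afterwards):

    # create (or retune) a dataset for the database, e.g. for an 8 KB database page
    zfs create -o recordsize=8k tank/frontbase
    zfs set recordsize=8k tank/frontbase       # existing files keep their old record size
    zfs get recordsize,compression tank/frontbase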
Re: [zfs-discuss] ZFS getting slower over time
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marcus Wilhelmsson

I have a problem with my zfs system, it's getting slower and slower over time. When the OpenSolaris machine is rebooted and just started I get about 30-35MB/s in read and write but after 4-8 hours I'm down to maybe 10MB/s and it varies between 4-18MB/s. Now, if I reboot the machine it's all gone and I have perfect speed again. Does it have something to do with the cache? I use a separate SSD as a cache disk.

If it is somehow related to the cache, fortunately that's easy to test for. Just remove your log device, or cache device, and see if that helps. But I doubt it.

Anyways, here's my setup: OpenSolaris b134 dev, C2D with 4GB RAM, 4x 1.5TB WD SATA drives and 1x Corsair 32GB SSD as cache

Not knowing much, I'm going to suspect your RAM. 4G doesn't sound like much to me. How large is your filesystem? I think the number of files is probably more relevant than the total number of GB.

Doesn't seem to matter if I copy files locally on the computer or if I use CIFS, still getting the same degradation in speed. Last night I left my workstation copying files to/from the server for about 8 hours and you could see the performance dropping from about 28MB/s down to under 10MB/s after a couple of hours.

What are you using to measure the performance? Is it read, or write? Do you have compression or dedupe enabled? Please send your /etc/release file. At least the relevant parts. Also, please send your zpool status. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS getting slower over time
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marcus Wilhelmsson

I have a problem with my zfs system, it's getting slower and slower over time. When the OpenSolaris machine is rebooted and just started I get about 30-35MB/s in read and write but after 4-8 hours I'm down to maybe 10MB/s and it varies between 4-18MB/s. Now, if I reboot the machine it's all gone and I have perfect speed again. Does it have something to do with the cache? I use a separate SSD as a cache disk.

If it is somehow related to the cache, fortunately that's easy to test for. Just remove your log device, or cache device, and see if that helps. But I doubt it.

I doubt it as well, but it's worth a try.

Anyways, here's my setup: OpenSolaris b134 dev, C2D with 4GB RAM, 4x 1.5TB WD SATA drives and 1x Corsair 32GB SSD as cache

Not knowing much, I'm going to suspect your RAM. 4G doesn't sound like much to me. How large is your filesystem? I think the number of files is probably more relevant than the total number of GB.

Doesn't seem to matter if I copy files locally on the computer or if I use CIFS, still getting the same degradation in speed. Last night I left my workstation copying files to/from the server for about 8 hours and you could see the performance dropping from about 28MB/s down to under 10MB/s after a couple of hours.

What are you using to measure the performance?

Well, I'm comparing the transfer speed from my WinXP computer and my Mac and see how much it changes over five hours of constant copying (copying large iso-files of about 20GB). Since the system is used as a home NAS there won't be lots of random I/O, but rather me copying stuff over the network. Any suggestions on how to do a proper performance test are welcome.

Is it read, or write? Do you have compression or dedupe enabled?

Both read and write, but mostly read. No compression or dedup.

Please send your /etc/release file. At least the relevant parts.

OpenSolaris Development snv_134 X86, Assembled 01 March 2010

Also, please send your zpool status

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c3d0s0    ONLINE       0     0     0

errors: No known data errors

  pool: s1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        s1          ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
        cache
          c4t4d0    ONLINE       0     0     0

errors: No known data errors

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
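One low-effort way to narrow this down is to log ARC size and pool throughput over the hours while the slowdown develops; a sketch using the pool name from the status output above (intervals are illustrative):

    # snapshot the ARC state and per-pool throughput periodically, e.g. from cron or a loop
    echo "::arc" | mdb -k                       # size, target (c), p, c_max as the kernel sees them
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
    zpool iostat s1 60                           # one line per minute while the copy runs
    vmstat 60                                    # free memory and scan rate over the same window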
Re: [zfs-discuss] Removing SSDs from pool
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andreas Höschler

I would like to remove the two SSDs as log devices from the pool and instead add them as a separate pool for sole use by the database to see how this enhances performance. I could certainly do zpool detach tank c1t7d0 to remove one disk from the log mirror. But how can I get back the second SSD?

If you're running solaris, sorry, you can't remove the log device. You better keep your log mirrored until you can plan for destroying and recreating the pool. Actually, in your example, you don't have a mirror of logs. You have two separate logs. This is fine for opensolaris (zpool >= 19), but not solaris (presently up to zpool 15). If this is solaris, and *either* one of those SSD's fails, then you lose your pool. If you're running opensolaris, man zpool and look for zpool remove.

Is the database running locally on the machine? Or at the other end of something like nfs?

You should have better performance using your present config than just about any other config ... By enabling the log devices, such as you've done, you're dedicating the SSD's for sync writes. And that's what the database is probably doing. This config should be *better* than dedicating the SSD's as their own pool. Because with the dedicated log device on a stripe of mirrors, you're allowing the spindle disks to do what they're good at (sequential blocks) and allowing the SSD's to do what they're good at (low latency IOPS).

If you're running zpool 19 or greater (you can check with zpool upgrade) it should be safe to run with the log device un-mirrored. In which case, you might think about using one SSD as log, and one SSD as cache. That might help.

You can verify the behavior of your database: if you run zpool iostat and then do some database stuff, you should see the writes increasing on the log SSD's. If you don't see that, then your DB is not doing sync writes (I bet it is). You don't benefit from a dedicated log device unless you're doing sync writes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
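A concrete way to run the check described above: the -v flag breaks the numbers out per vdev, so the log devices get their own rows (interval is illustrative):

    # watch the log SSDs while the database is under load;
    # sync writes from the DB should show up against c1t6d0/c1t7d0
    zpool iostat -v tank 5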
Re: [zfs-discuss] ZFS getting slower over time
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marcus Wilhelmsson

  pool: s1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        s1          ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
        cache
          c4t4d0    ONLINE       0     0     0

With this configuration, you should be pretty good at reading large files, or repeat-reading random files that you've recently read. Your write performance could be lower. And you would have really poor sync write performance. If reading a large sequential file, you should be able to max out Gb Ethernet. But the system you're receiving the file to can only go as fast as a single disk, unless you've got something like a hardware raid controller. Still, you should be able to get approx 60 Mbytes/sec across CIFS, where the bottleneck is your laptop hard drive, where you're receiving the file.

The test I would recommend would be:

time cp /Volumes/somemount/somefile.iso .

on a mac, and in windows running cygwin,

time cp /cygdrive/someletter/somefile.iso .

That should be an apples-to-apples test, which would really give you a number you know is accurate. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
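To take the network out of the picture entirely, a rough local read test on the server can be run first; a sketch (the path is illustrative, and the file should be bigger than RAM so the ARC cannot satisfy the whole read):

    # run on the OpenSolaris box itself, before and after the slowdown sets in
    time dd if=/s1/isos/somefile.iso of=/dev/null bs=1024k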
[zfs-discuss] no hot spare activation?
While testing a zpool with a different storage adapter using my blkdev device, I did a test which made a disk unavailable -- all attempts to read from it report EIO. I expected my configuration (a 3 disk test, with 2 disks in a RAIDZ and a hot spare) to work such that the hot spare would automatically be activated. But I'm finding that ZFS does not behave this way -- if only some I/Os fail, the hot spare is activated, but if ZFS decides that the label is gone, it makes no attempt to recruit a hot spare.

I had added FMA notification to my blkdev driver - it will post device.no_response or device.invalid_state ereports (per the ddi_fm_ereport_post() man page) in certain failure scenarios. I *suspect* the problem is in the FMA notification for zfs-retire, where the event is not being interpreted in a way that ZFS retire can figure out that the drive is toasted. Of course, this is just an educated guess on my part. I'm no ZFS nor FMA expert here.

Am I missing something here? Under what conditions can I expect hot spares to be recruited? My zpool status showing the results is below. - Garrett

pfexec zpool status

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        testpool    DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t3d1  UNAVAIL      9   132     0  experienced I/O failures
        spares
          c2t3d2    AVAIL

errors: No known data errors ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS getting slower over time
Alright, I've made the benchmarks and there isn't a difference worth mentioning, except that I only get about 30MB/s (to my Mac, which has an SSD as system disk). I've also tried copying to a RAM disk with slightly better results. Well, now that I've restarted the server I probably won't see the 5 sec dips until later tonight. Thanks for the help, even though it doesn't seem to make a difference. I'll try to get a screen capture of the dips in performance next time they occur. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:

Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSD's. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.

Actually, if there is an fdisk partition and/or disklabel on a drive when it arrives, I'm pretty sure that's irrelevant. Because when I first connect a new drive to the HBA, of course the HBA has to sign and initialize the drive at a lower level than what the OS normally sees. So unless I do some sort of special operation to tell the HBA to preserve/import a foreign disk, the HBA will make the disk blank before the OS sees it anyway.

That may be true. Though these days they may be spec'ing the drives to the manufacturers at an even lower level. So does your HBA have newer firmware now than it did when the first disk was connected? Maybe it's the HBA that is handling the new disks differently now than it did when the first one was plugged in? Can you down-rev the HBA FW? Do you have another HBA that might still have the older rev you could test it on? -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Removing SSDs from pool
Hi Edward, thanks a lot for your detailed response!

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andreas Höschler

• I would like to remove the two SSDs as log devices from the pool and instead add them as a separate pool for sole use by the database to see how this enhances performance. I could certainly do zpool detach tank c1t7d0 to remove one disk from the log mirror. But how can I get back the second SSD?

If you're running solaris, sorry, you can't remove the log device. You better keep your log mirrored until you can plan for destroying and recreating the pool. Actually, in your example, you don't have a mirror of logs. You have two separate logs. This is fine for opensolaris (zpool >= 19), but not solaris (presently up to zpool 15). If this is solaris, and *either* one of those SSD's fails, then you lose your pool.

I run Solaris 10 (not OpenSolaris)! You say the log mirror

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
        ...
        logs
          c1t6d0    ONLINE       0     0     0
          c1t7d0    ONLINE       0     0     0

does not do me any good (redundancy-wise)!? Shouldn't I detach the second drive then and try to use it for something else, maybe another machine? I understand it is very dangerous to use SSDs for logs then (no redundancy)!?

If you're running opensolaris, man zpool and look for zpool remove. Is the database running locally on the machine?

Yes!

Or at the other end of something like nfs? You should have better performance using your present config than just about any other config ... By enabling the log devices, such as you've done, you're dedicating the SSD's for sync writes. And that's what the database is probably doing. This config should be *better* than dedicating the SSD's as their own pool. Because with the dedicated log device on a stripe of mirrors, you're allowing the spindle disks to do what they're good at (sequential blocks) and allowing the SSD's to do what they're good at (low latency IOPS).

OK! I actually have two machines here, one production machine (X4240 with 16 disks, no SSDs) with performance issues and another development machine, an X4140 with 6 disks and two SSDs configured as shown in my previous mail. The question for me is how to improve the performance of the production machine and whether buying SSDs for this machine is worth the investment. zpool iostat on the development machine with the SSDs gives me

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        114G   164G      0      4  13.5K  36.0K
tank         164G   392G      3    131   444K  10.8M
----------  -----  -----  -----  -----  -----  -----

When I do that on the production machine without SSDs I get

pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       98.3G  37.7G      0      7  32.5K  36.9K
tank         480G   336G     16     53  1.69M  2.05M
----------  -----  -----  -----  -----  -----  -----

It is interesting to note that the write bandwidth on the SSD machine is 5 times higher. I take this as an indicator that the SSDs have some effect. I am still wondering what your "if one SSD fails you lose your pool" means for me. Would you recommend detaching one of the SSDs in the development machine and adding it to the production machine with

zpool add tank log c1t15d0

?? And how safe (reliable) is it to use SSDs for this? I mean, when do I have to expect the SSD to fail and thus ruin the pool!?

Thanks a lot, Andreas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpxio load-balancing...it doesn't work??
Not true. There are different ways that a storage array and its controllers connect to the host-visible front end ports, which might be confusing the author, but I/O isn't duplicated as he suggests.

On 4/4/2010 9:55 PM, Brad wrote:

I had always thought that with mpxio, it load-balances IO requests across your storage ports, but this article http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ has got me thinking it's not true. The available bandwidth is 2 or 4Gb/s (200 or 400MB/s – FC frames are 10 bytes long -) per port. As load balancing software (Powerpath, MPXIO, DMP, etc.) are most of the times used both for redundancy and load balancing, I/Os coming from a host can take advantage of an aggregated bandwidth of two ports. However, reads can use only one path, but writes are duplicated, i.e. a host write ends up as one write on each host port. Is this true? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Why does ARC grow above hard limit?
I would appreciate it if somebody can clarify a few points. I am doing some random WRITES (100% writes, 100% random) testing and observe that the ARC grows way beyond the hard limit during the test. The hard limit is set to 512 MB via /etc/system and I see the size going up to 1 GB - how is this happening? mdb's ::memstat reports 1.5 GB used - does this include the ARC as well or is it separate? I see on the backend only reads (205 MB/s) and almost no writes (1.1 MB/s) - any ideas what is being read?

--- BEFORE TEST

# ~/bin/arc_summary.pl
System Memory:
        Physical RAM:  12270 MB
        Free Memory :  7108 MB
        LotsFree:      191 MB

ZFS Tunables (/etc/system):
        set zfs:zfs_prefetch_disable = 1
        set zfs:zfs_arc_max = 0x20000000
        set zfs:zfs_arc_min = 0x10000000

ARC Size:
        Current Size:             136 MB (arcsize)
        Target Size (Adaptive):   512 MB (c)
        Min Size (Hard Limit):    256 MB (zfs_arc_min)
        Max Size (Hard Limit):    512 MB (zfs_arc_max)
...

> ::memstat
Page Summary            Pages      MB    %Tot
Kernel                 800895    3128     25%
ZFS File Data          394450    1540     13%
Anon                   106813     417      3%
Exec and libs            4178      16      0%
Page cache              14333      55      0%
Free (cachelist)        22996      89      1%
Free (freelist)       1797511    7021     57%
Total                 3141176   12270
Physical              3141175   12270

--- DURING THE TEST

# ~/bin/arc_summary.pl
System Memory:
        Physical RAM:  12270 MB
        Free Memory :  6687 MB
        LotsFree:      191 MB

ZFS Tunables (/etc/system):
        set zfs:zfs_prefetch_disable = 1
        set zfs:zfs_arc_max = 0x20000000
        set zfs:zfs_arc_min = 0x10000000

ARC Size:
        Current Size:             1336 MB (arcsize)
        Target Size (Adaptive):   512 MB (c)
        Min Size (Hard Limit):    256 MB (zfs_arc_min)
        Max Size (Hard Limit):    512 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    87%  446 MB (p)
        Most Frequently Used Cache Size:  12%  65 MB (c-p)

ARC Efficiency:
        Cache Access Total:   51681761
        Cache Hit Ratio:      52%  27056475  [Defined State for buffer]
        Cache Miss Ratio:     47%  24625286  [Undefined State for Buffer]
        REAL Hit Ratio:       52%  27056475  [MRU/MFU Hits Only]
        Data Demand Efficiency:    35%
        Data Prefetch Efficiency:  DISABLED (zfs_prefetch_disable)

        CACHE HITS BY CACHE LIST:
          Anon:                        --%  Counter Rolled.
          Most Recently Used:          13%  3627289 (mru)         [ Return Customer ]
          Most Frequently Used:        86%  23429186 (mfu)        [ Frequent Customer ]
          Most Recently Used Ghost:    17%  4657584 (mru_ghost)   [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  32%  8712009 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]

        CACHE HITS BY DATA TYPE:
          Demand Data:        30%  8308866
          Prefetch Data:       0%  0
          Demand Metadata:    69%  18747609
          Prefetch Metadata:   0%  0

        CACHE MISSES BY DATA TYPE:
          Demand Data:        61%  15113029
          Prefetch Data:       0%  0
          Demand Metadata:    38%  9511898
          Prefetch Metadata:   0%  359

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
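For cross-checking arc_summary.pl, the live values can be read straight from the kernel; a minimal sketch (Solaris, run as root):

    echo "zfs_arc_max/E" | mdb -k      # the tunable as the kernel actually took it, in bytes
    echo "::arc" | mdb -k              # size, c, p, c_max as the ARC itself reports them
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max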
Re: [zfs-discuss] Removing SSDs from pool
Response below...

2010/4/5 Andreas Höschler ahoe...@smartsoft.de

Hi Edward, thanks a lot for your detailed response!

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andreas Höschler

• I would like to remove the two SSDs as log devices from the pool and instead add them as a separate pool for sole use by the database to see how this enhances performance. I could certainly do zpool detach tank c1t7d0 to remove one disk from the log mirror. But how can I get back the second SSD?

If you're running solaris, sorry, you can't remove the log device. You better keep your log mirrored until you can plan for destroying and recreating the pool. Actually, in your example, you don't have a mirror of logs. You have two separate logs. This is fine for opensolaris (zpool >= 19), but not solaris (presently up to zpool 15). If this is solaris, and *either* one of those SSD's fails, then you lose your pool.

I run Solaris 10 (not OpenSolaris)! You say the log mirror

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
        ...
        logs
          c1t6d0    ONLINE       0     0     0
          c1t7d0    ONLINE       0     0     0

does not do me any good (redundancy-wise)!? Shouldn't I detach the second drive then and try to use it for something else, maybe another machine?

No, he did *not* say that a mirrored SLOG has no benefit, redundancy-wise. He said that YOU do *not* have a mirrored SLOG. You have 2 SLOG devices which are striped. And if this machine is running Solaris 10, then you cannot remove a log device, because those updates have not made their way into Solaris 10 yet. You need pool version >= 19 to remove log devices, and S10 does not currently have patches to ZFS to get to a pool version >= 19.

If your SLOG above were mirrored, you'd have "mirror" under "logs". And you probably would have "log" not "logs" - notice the "s" at the end meaning plural, meaning multiple independent log devices, not a mirrored pair of logs which would effectively look like 1 device.

-- You can choose your friends, you can choose the deals. - Equity Private If Linux is faster, it's a Solaris bug. - Phil Harman Blog - http://whatderass.blogspot.com/ Twitter - @khyron4eva ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
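For what it's worth, the pool version and the versions supported by the installed bits can be confirmed before planning anything; a quick sketch (pool name from the thread):

    zpool upgrade            # lists pools not running the newest supported version
    zpool upgrade -v         # lists the versions this release supports (19 adds log device removal)
    zpool get version tank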
Re: [zfs-discuss] no hot spare activation?
On Apr 5, 2010, at 11:43 AM, Garrett D'Amore wrote:

I see ereport.fs.zfs.io_failure, and ereport.fs.zfs.probe_failure. Also, ereport.io.service.lost and ereport.io.device.inval_state. There is indeed a fault.fs.zfs.device in the list as well.

The ereports are not interesting, only the fault. In FMA, ereports contribute to diagnosis, but faults are the only thing that are presented to the user and retire agents.

Everything seems to be correct *except* that ZFS isn't automatically doing the replace operation with the hot spare. It feels to me like this is possibly a ZFS bug --- perhaps ZFS is expecting a specific set of FMA faults that only sd delivers? (Recall this is with a different target device.)

Yes, it may be a bug. You will have to step through the zfs retire agent to see what goes wrong when it receives the list.suspect event. This code path is tested many, many times every day, so it's not as simple as "this doesn't work". The ZFS retire agent subscribes only to ZFS faults. The underlying driver or other telemetry has no bearing on the diagnosis or associated action.

- Eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpxio load-balancing...it doesn't work??
On Sun, 4 Apr 2010, Brad wrote: I had always thought that with mpxio, it load-balances IO request across your storage ports but this article http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ has got me thinking its not true. The available bandwidth is 2 or 4Gb/s (200 or 400MB/s – FC frames are 10 bytes long -) per port. As load balancing software (Powerpath, MPXIO, DMP, etc.) are most of the times used both for redundancy and load balancing, I/Os coming from a host can take advantage of an aggregated bandwidth of two ports. However, reads can use only one path, but writes are duplicated, i.e. a host write ends up as one write on each host port. Is this true? This text seems strange and wrong since duplicating writes would result in duplicate writes to disks, which could cause corruption if the ordering was not perfectly preserved. Depending on the storage array capabilities, MPXIO could use different strategies. A common strategy is active/standby on a per-LUN level. Even with active/standby, effective load sharing is possible if the storage array can be told to assign preference between a LUN and a port. That is what I have done with my own setup. 1/2 the LUNs have a preference for each port so that with all paths functional, the FC traffic is similar for each FC link. -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
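On Solaris, the per-LUN path states and port preferences that Bob describes can be inspected directly; a sketch (the LUN path is a placeholder, substitute a real one from the list output):

    mpathadm list lu                          # one entry per multipathed LUN, with path counts
    mpathadm show lu /dev/rdsk/cXtYdZs2       # shows each path and whether it is active or standby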
Re: [zfs-discuss] Removing SSDs from pool
Hi Khyron,

No, he did *not* say that a mirrored SLOG has no benefit, redundancy-wise. He said that YOU do *not* have a mirrored SLOG. You have 2 SLOG devices which are striped. And if this machine is running Solaris 10, then you cannot remove a log device because those updates have not made their way into Solaris 10 yet. You need pool version >= 19 to remove log devices, and S10 does not currently have patches to ZFS to get to a pool version >= 19. If your SLOG above were mirrored, you'd have "mirror" under "logs". And you probably would have "log" not "logs" - notice the "s" at the end meaning plural, meaning multiple independent log devices, not a mirrored pair of logs which would effectively look like 1 device.

Thanks for the clarification! This is very annoying. My intent was to create a log mirror. I used

zpool add tank log c1t6d0 c1t7d0

and this was obviously wrong. Would

zpool add tank mirror log c1t6d0 c1t7d0

have done what I intended to do? If so, it seems I have to tear down the tank pool and recreate it from scratch!? Can I simply use

zpool destroy -f tank

to do so?

Thanks, Andreas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
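For reference, the usual ordering puts "mirror" after "log" rather than before it; a sketch of the rebuild, assuming the same device names as in the thread and that the data has been backed up and the downtime is acceptable:

    zpool destroy tank
    zpool create tank mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0 \
        log mirror c1t6d0 c1t7d0
    zpool status tank        # the logs section should now show a mirror vdev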
Re: [zfs-discuss] Tuning the ARC towards LRU
On Mon, 5 Apr 2010, Peter Schuller wrote: For desktop use, and presumably rapidly changing non-desktop uses, I find the ARC cache pretty annoying in its behavior. For example this morning I had to hit my launch-terminal key perhaps 50 times (roughly) before it would start completing without disk I/O. There are plenty of other examples as well, such as /var/db/pkg not being pulled aggressively into cache such that pkg_* operations (this is on FreeBSD) are slower than they should (I have to run pkg_info some number of times before *it* will complete without disk I/O too). It sounds like you are complaining about how FreeBSD has implemented zfs in the system rather than about zfs in general. These problems don't occur under Solaris. Zfs and the kernel need to agree on how to allocate/free memory, and it seems that Solaris is more advanced than FreeBSD in this area. It is my understanding that FreeBSD offers special zfs tunables to adjust zfs memory usage. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
It sounds like you are complaining about how FreeBSD has implemented zfs in the system rather than about zfs in general. These problems don't occur under Solaris. Zfs and the kernel need to agree on how to allocate/free memory, and it seems that Solaris is more advanced than FreeBSD in this area. It is my understanding that FreeBSD offers special zfs tunables to adjust zfs memory usage.

It may be FreeBSD specific, but note that I am not talking about the amount of memory dedicated to the ARC and how it balances with free memory on the system. I am talking about eviction policy. I could be wrong, but I didn't think the ZFS port made significant changes there.

And note that part of the *point* of the ARC (at least according to the original paper, though it was a while since I read it), as opposed to a pure LRU, is to do some weighting on frequency of access, which is exactly consistent with what I'm observing (very quick eviction and/or lack of insertion of data, particularly in the face of unrelated long-term I/O having happened in the background). It would likely also be the desired behavior for longer-running homogeneous disk access patterns where optimal use of cache over a long period may be more important than immediately reacting to a changing access pattern. So it's not like there is no reason to believe this can be about ARC policy.

Why would this *not* occur on Solaris? It seems to me that it would imply the ARC was broken on Solaris, since it is not *supposed* to be a pure LRU by design. Again, there may very well be a FreeBSD specific issue here that is altering the behavior, and maybe the extremity of it that I am reporting is not supposed to be happening, but I believe the issue is more involved than what you're implying in your response.

-- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] EON ZFS Storage 0.60.0 based on snv 130, Sun-set release!
Embedded Operating system/Networking (EON), a RAM based live ZFS NAS appliance, is released on Genunix! This release marks the end of SXCE releases and Sun Microsystems as we know it! It is dubbed the Sun-set release! Many thanks to Al at Genunix.org for download hosting and serving the Opensolaris community.

EON Deduplication ZFS storage is available in 32 and 64-bit, CIFS and Samba versions:

EON 64-bit x86 CIFS ISO image version 0.60.0 based on snv_130
* eon-0.600-130-64-cifs.iso
* MD5: 55c5837985f282f9272f5275163f7d7b
* Size: ~93Mb
* Released: Monday 05-April-2010

EON 64-bit x86 Samba ISO image version 0.60.0 based on snv_130
* eon-0.600-130-64-smb.iso
* MD5: bf095f2187c29fb543285b72266c0295
* Size: ~106Mb
* Released: Monday 05-April-2010

EON 32-bit x86 CIFS ISO image version 0.60.0 based on snv_130
* eon-0.600-130-32-cifs.iso
* MD5: e2b312feefbfb14792c0d190e7ff69cf
* Size: ~59Mb
* Released: Monday 05-April-2010

EON 32-bit x86 Samba ISO image version 0.60.0 based on snv_130
* eon-0.600-130-32-smb.iso
* MD5: bcf6dc76bc9a22cff1431da20a5c56e2
* Size: ~73Mb
* Released: Monday 05-April-2010

EON 64-bit x86 CIFS ISO image version 0.60.0 based on snv_130 (NO HTTPD)
* eon-0.600-130-64-cifs-min.iso
* MD5: 78b0bb116c0e32a48c473ce1b94e604f
* Size: ~87Mb
* Released: Monday 05-April-2010

EON 64-bit x86 Samba ISO image version 0.60.0 based on snv_130 (NO HTTPD)
* eon-0.600-130-64-smb-min.iso
* MD5: e74732c41e4b3a9a06f52779bc9f8352
* Size: ~101Mb
* Released: Monday 05-April-2010

New/Changes/Fixes:
- Active Directory integration problem resolved
- Hotplug errors at boot are being worked on and are safe to ignore.
- Updated /mnt/eon0/.exec with new service configuration additions (light, nginx, afpd, and more ...).
- Updated ZFS, NFS v3 performance tuning in /etc/system
- Added megasys driver.
- EON rebooting at grub (since snv_122) in ESXi, Fusion and various versions of VMware Workstation. This is related to bug 6820576. Workaround: at grub press "e" and add the following to the end of the kernel line: -B disable-pcieb=true

http://eonstorage.blogspot.com/
http://sites.google.com/site/eonstorage/

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
I've seen the Nexenta and EON webpages, but I'm not looking to build my own. Is there anything out there I can just buy? -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
Install Nexenta on a Dell PowerEdge? Or one of these: http://www.pogolinux.com/products/storage_director On Mon, Apr 5, 2010 at 9:48 PM, Kyle McDonald kmcdon...@egenera.com wrote: I've seen the Nexenta and EON webpages, but I'm not looking to build my own. Is there anything out there I can just buy? -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
Kyle McDonald writes: I've seen the Nexenta and EON webpages, but I'm not looking to build my own. Is there anything out there I can just buy? In Germany, someone sells preconfigured hardware based on Nexenta: http://www.thomas-krenn.com/de/storage-loesungen/storage-systeme/nexentastor/nexentastor-sc846-unified-storage.html I have no experience with them but I wish them success. :-) Regards -- Volker -- Volker A. Brandt Consulting and Support for Sun Solaris Brandt & Brandt Computer GmbH WWW: http://www.bb-c.de/ Am Wiesenpfad 6, 53340 Meckenheim Email: v...@bb-c.de Commercial register: Amtsgericht Bonn, HRB 10513 Shoe size: 45 Managing directors: Rainer J. H. Brandt and Volker A. Brandt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
On Mon, 5 Apr 2010, Peter Schuller wrote:

It may be FreeBSD specific, but note that I am not talking about the amount of memory dedicated to the ARC and how it balances with free memory on the system. I am talking about eviction policy. I could be wrong but I didn't think the ZFS port made significant changes there.

The ARC is designed to use as much memory as is available, up to a limit. If the kernel allocator needs memory and there is none available, then the allocator requests memory back from the zfs ARC. Note that some systems have multiple memory allocators. For example, there may be a memory allocator for the network stack, and/or for a filesystem. The FreeBSD kernel is not the same as Solaris. While Solaris uses a common allocator between most of the kernel and zfs, FreeBSD may use different allocators, which are not able to share memory. The space available for zfs might be pre-allocated.

I assume that you have already read the FreeBSD ZFS tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem section in the handbook (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made sure that your system is tuned appropriately.

Why would this *not* occur on Solaris? It seems to me that it would imply the ARC was broken on Solaris, since it is not *supposed* to be a pure LRU by design. Again, there may very well be a FreeBSD specific issue here that is altering the behavior, and maybe the extremity of it that I am reporting is not supposed to be happening, but I believe the issue is more involved than what you're implying in your response.

There have been a lot of eyeballs looking at how zfs does its caching, and a ton of benchmarks (mostly focusing on server throughput) to verify the design. While there can certainly be zfs shortcomings (I have found several), these are few and far between.

Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
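For completeness, these are the /boot/loader.conf tunables the FreeBSD guides linked above usually point at; the values are purely illustrative, not recommendations:

    # /boot/loader.conf
    vfs.zfs.arc_max="2048M"      # cap the ARC
    vfs.zfs.arc_min="512M"       # floor for the ARC
    vm.kmem_size="4096M"         # mainly relevant on older 7.x/8.x kernels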
Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
- Kyle McDonald kmcdon...@egenera.com wrote: I've seen the Nexenta and EON webpages, but I'm not looking to build my own. Is there anything out there I can just buy? I've set up a few systems with Supermicro hardware - works well and doesn't cost a whole lot. roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
The ARC is designed to use as much memory as is available, up to a limit. If the kernel allocator needs memory and there is none available, then the allocator requests memory back from the zfs ARC. Note that some systems have multiple memory allocators. For example, there may be a memory allocator for the network stack, and/or for a filesystem.

Yes, but again I am concerned with what the ARC chooses to cache and for how long, not how the ARC balances memory with other parts of the kernel. At least, none of my observations lead me to believe the latter is the problem here.

The space available for zfs might be pre-allocated. I assume that you have already read the FreeBSD ZFS tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem section in the handbook (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made sure that your system is tuned appropriately.

Yes, I have been tweaking and fiddling and reading off and on since ZFS was originally added to CURRENT. This is not about tuning in that sense. The fact that the little data necessary to start an 'urxvt' instance does not get cached for at least 1-2 seconds on an otherwise mostly idle system is either the result of cache policy, an implementation bug (FreeBSD or otherwise), or a matter of an *extremely* small cache size. I have observed this behavior for a very long time across versions of both ZFS and FreeBSD, and with different forms of arc sizing tweaks.

It's entirely possible there are FreeBSD issues preventing the ARC from sizing itself appropriately. What I am saying though is that all indications are that data is not being selected for caching at all, or else is evicted extremely quickly, unless sufficient frequency has been accumulated to, presumably, make the ARC decide to cache the data. This is entirely what I would expect from a caching policy that tries to adapt to long-term access patterns and avoid premature cache eviction by looking at frequency of access.

I don't see what it is that is so outlandish about my query. These are fundamental ways in which caches of different types behave, and there is a legitimate reason to not use the same cache eviction policy under all possible workloads. The behavior I am seeing is consistent with a caching policy that tries too hard (for my particular use case) to avoid eviction in the face of short-term changes in access pattern.

There have been a lot of eyeballs looking at how zfs does its caching, and a ton of benchmarks (mostly focusing on server throughput) to verify the design. While there can certainly be zfs shortcomings (I have found several), these are few and far between.

That's a very general statement. I am talking about specifics here. For example, you can have mountains of evidence that shows that a plain LRU is optimal (under some conditions). That doesn't change the fact that if I want to avoid a sequential scan of a huge data set completely evicting everything in the cache, I cannot use a plain LRU. In this case I'm looking for the reverse; i.e., increasing the importance of recency, because my workload is such that it would be more optimal than the behavior I am observing. Benchmarks are irrelevant except insofar as they show that my problem is not with the caching policy, since I am trying to address an empirically observed behavior. I *will* try to look at how the ARC sizes itself, as I'm unclear on several things in the way memory is being reported by FreeBSD, but as far as I can tell these are different issues.
Sure, a bigger ARC might hide the behavior I happen to see; but I want the cache to behave in a way where I do not need gigabytes of extra ARC size to lure it into caching the data necessary for 'urxvt' without having to start it 50 times in a row to accumulate statistics. -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
On Apr 5, 2010, at 2:23 PM, Peter Schuller wrote:

That's a very general statement. I am talking about specifics here. For example, you can have mountains of evidence that shows that a plain LRU is optimal (under some conditions). That doesn't change the fact that if I want to avoid a sequential scan of a huge data set to completely evict everything in the cache, I cannot use a plain LRU.

In simple terms, the ARC is divided into an MRU and MFU side.

    target size (c) = target MRU size (p) + target MFU size (c-p)

On Solaris, to get from the MRU to the MFU side, the block must be read at least once in 62.5 milliseconds. For pure read-once workloads, the data won't move to the MFU side and the ARC will behave exactly like an (adaptable) MRU cache.

In this case I'm looking for the reverse; i.e., increasing the importance of recency, because my workload is such that it would be more optimal than the behavior I am observing. Benchmarks are irrelevant except insofar as they show that my problem is not with the caching policy, since I am trying to address an empirically observed behavior. I *will* try to look at how the ARC sizes itself, as I'm unclear on several things in the way memory is being reported by FreeBSD, but as far as I can tell these are different issues. Sure, a bigger ARC might hide the behavior I happen to see; but I want the cache to behave in a way where I do not need gigabytes of extra ARC size to lure it into caching the data necessary for 'urxvt' without having to start it 50 times in a row to accumulate statistics.

I'm not convinced you have attributed the observation to the ARC behaviour. Do you have dtrace (or other) data to explain what process is causing the physical I/Os?

-- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
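The MRU/MFU split described above can be watched directly via the arcstats kstats (FreeBSD exposes the same counters under the kstat.zfs.misc.arcstats sysctl tree); a minimal sketch:

    kstat -p zfs:0:arcstats:p zfs:0:arcstats:c
    kstat -p zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits
    kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits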
Re: [zfs-discuss] Tuning the ARC towards LRU
In simple terms, the ARC is divided into an MRU and MFU side. target size (c) = target MRU size (p) + target MFU size (c-p). On Solaris, to get from the MRU to the MFU side, the block must be read at least once in 62.5 milliseconds. For pure read-once workloads, the data won't move to the MFU side and the ARC will behave exactly like an (adaptable) MRU cache.

Ok. That differs significantly from my understanding, though in retrospect I should have realized it given that the arc stats contain only references to mru and mfu... I previously was under the impression that the ZFS ARC had an LRU-ish side to complement the MFU side. MRU+MFU changes things. I will have to look into it in better detail to understand the consequences. Is there a paper that describes the ARC as it is implemented in ZFS (since it clearly diverges from the IBM ARC)?

I *will* try to look at how the ARC sizes itself, as I'm unclear on several things in the way memory is being reported by FreeBSD, but as far as I can tell these are different issues.

For what it's worth, I confirmed that the ARC was too small and that there are clearly remaining issues with the interaction between the ARC and the rest of the FreeBSD kernel. (I wasn't sure before, but I confirmed I was looking at the right number.) I'll try to monitor more carefully and see if I can figure out when the ARC shrinks and why it doesn't grow back. Informally my observations have always been that things behave great for a while after boot, but degenerate over time. In this case it was sitting at its minimum size, which was 214M. I realize this is far below what is recommended or even designed for, but it is clearly caching *something*, and I clearly *could* make it cache urxvt+deps by re-running it several tens of times in rapid succession.

I'm not convinced you have attributed the observation to the ARC behaviour. Do you have dtrace (or other) data to explain what process is causing the physical I/Os?

In the urxvt case, I am basing my claim on informal observations. I.e., hit terminal launch key, wait for disks to rattle, get my terminal. Repeat. Only by repeating it very many times in very rapid succession am I able to coerce it to be cached such that I can immediately get my terminal. And what I mean by that is that it keeps necessitating disk I/O for a long time, even on rapid successive invocations. But once I have repeated it enough times it seems to finally enter the cache. (No dtrace unfortunately. I confess to not having learned dtrace yet, in spite of thinking it's massively cool.)

However, I will of course accept that given the minimal ARC size at the time, I am moving completely away from the designed-for use case. And if that is responsible, it is of course my own fault. Given MRU+MFU I'll have to back off with my claims. Under the (incorrect) assumption of LRU+MFU I felt the behavior was unexpected, even with a small cache size. Given MRU+MFU, and without knowing further details right now, I accept that the ARC may fundamentally need a bigger cache size in relation to the working set in order to be effective in the way I am using it here. I was basing my expectations on LRU-style behavior.

Thanks! -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
On 04/05/10 15:24, Peter Schuller wrote:

In the urxvt case, I am basing my claim on informal observations. I.e., hit terminal launch key, wait for disks to rattle, get my terminal. Repeat. Only by repeating it very many times in very rapid succession am I able to coerce it to be cached such that I can immediately get my terminal. And what I mean by that is that it keeps necessitating disk I/O for a long time, even on rapid successive invocations. But once I have repeated it enough times it seems to finally enter the cache.

Are you sure you're not seeing unrelated disk update activity like atime updates, mtime updates on pseudo-terminals, etc.? I'd want to start looking more closely at I/O traces (dtrace can be very helpful here) before blaming any specific system component for the unexpected I/O.

- Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
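If atime updates do turn out to be part of the write traffic, they are cheap to rule out; a sketch (dataset name is illustrative):

    zfs get atime tank/home
    zfs set atime=off tank/home    # stops access-time updates on reads for that dataset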
Re: [zfs-discuss] Tuning the ARC towards LRU
On Apr 5, 2010, at 3:24 PM, Peter Schuller wrote: I will have to look into it in better detail to understand the consequences. Is there a paper that describes the ARC as it is implemented in ZFS (since it clearly diverges from the IBM ARC)? There are various blogs, but perhaps the best documentation is in the comments starting at http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#25 -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Removing SSDs from pool
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andreas Höschler

Thanks for the clarification! This is very annoying. My intent was to create a log mirror. I used zpool add tank log c1t6d0 c1t7d0 and this was obviously wrong. Would zpool add tank mirror log c1t6d0 c1t7d0 have done what I intended to do? If so, it seems I have to tear down the tank pool and recreate it from scratch!? Can I simply use zpool destroy -f tank to do so?

Yes. You're unfortunately in a bad place right now, due to a simple command line error. If you have the ability to destroy and recreate the pool, that's what you should do. If you can't afford the downtime, then you better buy a couple more SSD's, and attach them to the first ones, to mirror them.

On a related note, this creates additional work, but I think I'm going to start recommending this now ... If you create a slice on each drive which is slightly smaller than the drive, and use those slices instead of the full device, you might be happy you did some day in the future. I had a mirrored SSD, and one drive failed. The replacement disk is exactly the same, yet for no apparent reason it appears 0.001Gb smaller than the original, and hence I cannot un-degrade the mirror. Since you can't zpool remove a log device in solaris 10, that's a fatal problem. The only solution is to destroy and recreate the pool. Or buy a new SSD which is definitely larger... Say, 64G instead of the 32G I already have. But that's another thousand bucks, or more. Please see the thread http://opensolaris.org/jive/thread.jspa?threadID=127162&tstart=0

One more note. I heard, but I don't remember where, that the OS will refuse to use more than half of the system RAM for the ZIL log device anyway. So if you've got 32G of ram, the maximum useful ZIL log device would be 16G. Not sure if that makes any difference ... For performance reasons, suppose you've got 32G ram, and a 32G SSD, and you create a 16G or 17G slice to use for the log device ... For performance reasons, you should leave the rest of that SSD unused anyway. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
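A sketch of the slice approach, assuming a Solaris system where an s0 slice slightly smaller than the SSD has already been created with format(1M); device names are illustrative:

    # use the slices rather than the whole disks when adding the mirrored log
    zpool add tank log mirror c1t6d0s0 c1t7d0s0
    zpool status tank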
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
From: Kyle McDonald [mailto:kmcdon...@egenera.com]

So does your HBA have newer firmware now than it did when the first disk was connected? Maybe it's the HBA that is handling the new disks differently now than it did when the first one was plugged in? Can you down-rev the HBA FW? Do you have another HBA that might still have the older rev you could test it on?

I'm planning to get the support guys more involved tomorrow, so ... things have been pretty stagnant for several days now, and I think it's time to start putting more effort into this. Long story short, I don't know yet. But there is one glaring clue: prior to OS installation, I don't know how to configure the HBA. This means the HBA must have been preconfigured with the factory installed disks, and I followed a different process with my new disks, because I was using the GUI within the OS. My best hope right now is to find some other way to configure the HBA, possibly through the ILOM, but I already searched there and looked at everything. Maybe I have to shut down (power cycle) the system and attach a keyboard and monitor. I don't know yet...
Re: [zfs-discuss] no hot spare activation?
On Apr 5, 2010, at 3:38 AM, Garrett D'Amore wrote: Am I missing something here? Under what conditions can I expect hot spares to be recruited?

Hot spares are activated by the zfs-retire agent in response to a list.suspect event containing one of the following faults:

  fault.fs.zfs.vdev.io
  fault.fs.zfs.vdev.checksum
  fault.fs.zfs.device

The last of these (fault.fs.zfs.device) is what is diagnosed when a label is corrupted. What software are you running? Have you confirmed that you are getting one of these faults? What does 'fmdump -V' show? Does doing a 'zpool replace c2t3d1 c2t3d2' by hand succeed?

- Eric

-- Eric Schrock, Fishworks  http://blogs.sun.com/eschrock
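For reference, a sketch of how to check for those faults from the command line; fmdump and fmadm are the stock FMA tools, and the grep patterns are only illustrative:

  # list diagnosed faults (the list.suspect events the zfs-retire agent responds to)
  fmdump -V | grep fault.fs.zfs

  # list the raw error reports (ereports) that fed the diagnosis engine
  fmdump -eV | grep ereport.fs.zfs

  # summarize anything currently considered faulty
  fmadm faulty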
Re: [zfs-discuss] no hot spare activation?
On 04/ 5/10 05:28 AM, Eric Schrock wrote:

On Apr 5, 2010, at 3:38 AM, Garrett D'Amore wrote: Am I missing something here? Under what conditions can I expect hot spares to be recruited?

Hot spares are activated by the zfs-retire agent in response to a list.suspect event containing one of the following faults: fault.fs.zfs.vdev.io, fault.fs.zfs.vdev.checksum, fault.fs.zfs.device. The last of these (fault.fs.zfs.device) is what is diagnosed when a label is corrupted. What software are you running? Have you confirmed that you are getting one of these faults? What does 'fmdump -V' show? Does doing a 'zpool replace c2t3d1 c2t3d2' by hand succeed?

I see ereport.fs.zfs.io_failure and ereport.fs.zfs.probe_failure, and also ereport.io.service.lost and ereport.io.device.inval_state. There is indeed a fault.fs.zfs.device in the list as well. Clearly ZFS thinks the device is unavailable (which is accurate). And pfexec zpool replace testpool c2t3d1 c2t3d2 works fine, as shown here:

gdam...@tabasco{33} pfexec zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: testpool
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Mon Apr 5 08:39:57 2010
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      DEGRADED     0     0     0
          raidz1-0    DEGRADED     0     0     0
            c2t3d0    ONLINE       0     0     0
            spare-1   DEGRADED     0     0     0
              c2t3d1  UNAVAIL      9   132     0  cannot open
              c2t3d2  ONLINE       0     0     0  20.8M resilvered
        spares
          c2t3d2      INUSE     currently in use

errors: No known data errors
gdam...@tabasco{34}

Everything seems to be correct *except* that ZFS isn't automatically doing the replace operation with the hot spare. It feels to me like this is possibly a ZFS bug; perhaps ZFS is expecting a specific set of FMA faults that only sd delivers? (Recall this is with a different target device.)

- Garrett
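One further thing that may be worth checking in a situation like this is whether the relevant fmd agents are actually loaded; a small sketch (module names as they appear on OpenSolaris builds):

  # the zfs-diagnosis and zfs-retire modules should both be listed
  fmadm config | grep zfs

If zfs-retire is loaded but never receives the expected fault.fs.zfs.* suspect list for this device, that would be consistent with Garrett's theory that the diagnosis path differs for non-sd targets.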
Re: [zfs-discuss] Removing SSDs from pool
On 04/05/10 11:43, Andreas Höschler wrote:

Hi Khyron,

No, he did *not* say that a mirrored SLOG has no benefit, redundancy-wise. He said that YOU do *not* have a mirrored SLOG. You have 2 SLOG devices which are striped. And if this machine is running Solaris 10, then you cannot remove a log device because those updates have not made their way into Solaris 10 yet. You need pool version >= 19 to remove log devices, and S10 does not currently have patches to ZFS to get to a pool version >= 19. If your SLOG above were mirrored, you'd have mirror under logs. And you probably would have log, not logs - notice the s at the end, meaning plural, meaning multiple independent log devices, not a mirrored pair of logs which would effectively look like 1 device.

Thanks for the clarification! This is very annoying. My intent was to create a log mirror. I used zpool add tank log c1t6d0 c1t7d0 and this was obviously wrong. Would zpool add tank mirror log c1t6d0 c1t7d0 have done what I intended to do?

No, the syntax you want is:

  zpool add tank log mirror c1t6d0 c1t7d0

You can also do it on the create:

  zpool create tank <pool devices> log mirror c1t6d0 c1t7d0

If so it seems I have to tear down the tank pool and recreate it from scratch!? Can I simply use zpool destroy -f tank to do so?

Shouldn't need the -f.

Thanks, Andreas
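As a quick sanity check after recreating the pool (or after attaching a second device to an existing log), a sketch of what to look for; the device names are simply the ones from this thread:

  zpool status tank
  # the output should contain a "logs" section with a mirror vdev, roughly:
  #   logs
  #     mirror-1   ONLINE
  #       c1t6d0   ONLINE
  #       c1t7d0   ONLINE

If the two SSDs instead appear as separate top-level entries directly under "logs", they are independent (striped) log devices again rather than a mirror.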
Re: [zfs-discuss] Diagnosing Permanent Errors
On Sun, Apr 04, 2010 at 11:46:16PM -0700, Willard Korfhage wrote: Looks like it was RAM. I ran memtest+ 4.00, and it found no problems.

Then why do you suspect the RAM? Especially with 12 disks, another likely candidate could be an overloaded power supply. While there may be problems showing up in RAM, it may only be happening under the combined load of disks, cpu and memory activity that brings the system into marginal power conditions. Sometimes it may be just one rail that is out of bounds, while other devices are unaffected. If memtest didn't find any problems without the disk and cpu load, that tends to support this hypothesis. So the memory may not be bad per se, though it's still not ECC and therefore not good either :-) Perhaps you can still find a good use for it elsewhere.

I removed 2 of the 3 sticks of RAM, ran a backup, and had no errors. I'm running more extensive tests, but it looks like that was it. A new motherboard, CPU and ECC RAM are on the way to me now.

Switching to ECC is a good thing, but be prepared for possible continued issues (with different detection, thanks to ECC) if the root cause is the psu. In fact, ECC memory may draw marginally more power and make the problem worse (the new cpu and motherboard could go either way, depending on your choices).

-- Dan.
Re: [zfs-discuss] mpxio load-balancing...it doesn't work??
I'm wondering if the author is talking about cache mirroring, where the cache is mirrored between both controllers. If that is the case, is he saying that for every write to the active controller, a second write is issued on the passive controller to keep the cache mirrored?
Re: [zfs-discuss] ZFS: Raid and dedup
Hi Folks: I'm wondering what the correct flow is when both raid5 and de-dup are enabled on a storage volume. I think we should do de-dup first and then raid5... is that understanding correct? Thanks!
Re: [zfs-discuss] Removing SSDs from pool
On Mon, Apr 05, 2010 at 07:43:26AM -0400, Edward Ned Harvey wrote:

Is the database running locally on the machine? Or at the other end of something like nfs? You should have better performance using your present config than just about any other config ... By enabling the log devices, as you've done, you're dedicating the SSDs to sync writes. And that's what the database is probably doing. This config should be *better* than dedicating the SSDs as their own pool. Because with the dedicated log device on a stripe of mirrors, you're allowing the spindle disks to do what they're good at (sequential blocks) and allowing the SSDs to do what they're good at (low latency IOPS).

Others have addressed the rest of the issues well enough, but I thought I'd respond on this point. What you say is fair, if the db is bound by sync write latency. If it is bound by read latency, you will still suffer. You could add more SSDs as L2ARC (and incur more memory overhead), or you could put the whole pool on SSD (and lose its benefit for other pool uses). There are many factors here that will determine the best config, but the current one may well not be optimal.

-- Dan.
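For reference, dedicating an SSD to reads rather than to the log is a one-liner; this is a sketch, with c1t8d0 standing in for whatever spare SSD is available:

  # add a cache (L2ARC) device to absorb random read load
  zpool add tank cache c1t8d0

  # unlike log devices on older pool versions, cache devices can be removed again
  zpool remove tank c1t8d0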
Re: [zfs-discuss] ZFS: Raid and dedup
On Mon, Apr 05, 2010 at 06:32:13PM -0700, Learner Study wrote: I'm wondering what is the correct flow when both raid5 and de-dup are enabled on a storage volume. I think we should do de-dup first and then raid5 ... is that understanding correct?

Not really. Strictly speaking, ZFS doesn't do raid5 - assuming you mean one of the raidz levels, pools can be created by assembling disks into one or more raidz groups. Dedup is then performed within the pool, and enabled at a dataset (filesystem) granularity. If you have raid5 in a san or hw controller, you might build a pool on top of the LUNs it presents, and again apply dedup within that pool.

-- Dan.
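To make that concrete, a minimal sketch of the admin-command sequence Dan describes; the pool, disk and dataset names are placeholders:

  # build the pool from a raidz group (ZFS's rough analogue of raid5)
  zpool create tank raidz c0t1d0 c0t2d0 c0t3d0

  # dedup is then enabled per dataset within that pool
  zfs create tank/data
  zfs set dedup=on tank/data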
Re: [zfs-discuss] ZFS: Raid and dedup
Hi Jeff: I'm a bit confused... did you say "Correct" to my original email or to the reply from Daniel? Is there a doc that may explain it better? Thanks!

On Mon, Apr 5, 2010 at 6:54 PM, jeff.bonw...@oracle.com wrote: Correct. Jeff (Sent from my iPhone)

On Apr 5, 2010, at 6:32 PM, Learner Study learner.st...@gmail.com wrote: Hi Folks: I'm wondering what is the correct flow when both raid5 and de-dup are enabled on a storage volume. I think we should do de-dup first and then raid5 ... is that understanding correct? Thanks!
Re: [zfs-discuss] mpxio load-balancing...it doesn't work??
The author mentions multipathing software in the blog entry. Kind of hard to mix that up with cache mirroring, if you ask me.

On 4/5/2010 9:16 PM, Brad wrote: I'm wondering if the author is talking about cache mirroring, where the cache is mirrored between both controllers. If that is the case, is he saying that for every write to the active controller, a second write is issued on the passive controller to keep the cache mirrored?
Re: [zfs-discuss] mpxio load-balancing...it doesn't work??
On Mon, Apr 5, 2010 at 8:16 PM, Brad bene...@yahoo.com wrote: I'm wondering if the author is talking about cache mirroring, where the cache is mirrored between both controllers. If that is the case, is he saying that for every write to the active controller, a second write is issued on the passive controller to keep the cache mirrored?

He's talking about multipathing; he just has no clue what he's talking about. He specifically calls out applications that are used for multipathing.

--Tim
Re: [zfs-discuss] ZFS: Raid and dedup
On Mon, Apr 05, 2010 at 06:58:57PM -0700, Learner Study wrote: Hi Jeff: I'm a bit confused... did you say "Correct" to my orig email or the reply from Daniel...

Jeff is replying to your mail, not mine. It looks like he's read your question a little differently. By that reading, you are correct, because for a given write to a pool, the data will first be checked for dedup, and then the writes will be sent to the pool devices where raidz (or hw raid5) will be applied. I read the question as more about how to initially set up your pool (i.e., as a sequence of admin commands). Now you have answers for both, whichever you originally intended to ask. :)

Is there a doc that may explain it better?

Several; start with the ZFS FAQ and Best Practices Guide.

-- Dan.
Re: [zfs-discuss] Diagnosing Permanent Errors
It certainly has symptoms that match a marginal power supply, but I measured the power consumption some time ago and found it comfortably within the power supply's capacity. I've also wondered whether the RAM is actually fine and there is just some kind of flaky interaction between the RAM configuration I had and the motherboard.
Re: [zfs-discuss] ZFS: Raid and dedup
On Apr 5, 2010, at 6:32 PM, Learner Study wrote: Hi Folks: I'm wondering what is the correct flow when both raid5 and de-dup are enabled on a storage volume. I think we should do de-dup first and then raid5 ... is that understanding correct?

Yes. If you look at the (somewhat outdated) ZFS Source Tour, you will see that the ZIO layer feeds I/Os to the VDEV layer, which is where raidz is implemented. In ZIO, deduplication occurs after compression and checksumming but before space allocation. The checksum and physical size are used for the deduplication table key.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] Diagnosing Permanent Errors
On Mon, Apr 5, 2010 at 9:39 PM, Willard Korfhage opensola...@familyk.org wrote: It certainly has symptoms that match a marginal power supply, but I measured the power consumption some time ago and found it comfortably within the power supply's capacity. I've also wondered if the RAM is fine, but there is just some kind of flaky interaction of the ram configuration I had with the motherboard.

I think the confusion is that you said you ran memtest86+ and the memory tested just fine. Did you remove some memory before running memtest86+ and narrow it down to a certain stick being bad or something? Your post makes it sound as though you found that all of the RAM is working perfectly fine, i.e. it's not the problem.

Also, a low power draw doesn't mean much of anything. The power supply could just be dying. Load wouldn't really matter in that scenario (although a high load will generally help it out the door a bit quicker due to higher heat, etc.).

--Tim
Re: [zfs-discuss] Diagnosing Permanent Errors
On Mon, Apr 05, 2010 at 09:46:58PM -0500, Tim Cook wrote:

On Mon, Apr 5, 2010 at 9:39 PM, Willard Korfhage opensola...@familyk.org wrote: It certainly has symptoms that match a marginal power supply, but I measured the power consumption some time ago and found it comfortably within the power supply's capacity. I've also wondered if the RAM is fine, but there is just some kind of flaky interaction of the ram configuration I had with the motherboard.

I think the confusion is that you said you ran memtest86+ and the memory tested just fine. Did you remove some memory before running memtest86+ and narrow it down to a certain stick being bad or something? Your post makes it sound as though you found that all of the ram is working perfectly fine.

Exactly.

Also, a low power draw doesn't mean much of anything. The power supply could just be dying.

Or just one part of it could be overloaded (like a particular 5v or 12v rail that happens to be shared between too many drives and the m/b), even if the overall draw at the wall is less than the total rating. Sometimes, just moving plugs around can help - or at least show that a better psu is warranted.

-- Dan.
Re: [zfs-discuss] Diagnosing Permanent Errors
Memtest didn't show any errors, but between Frank, early in the thread, saying that he had found memory errors that memtest didn't catch, and the removal of DIMMs apparently fixing the problem, I jumped too soon to the conclusion that it was the memory. Certainly there are other explanations. I see that I have a spare Corsair 620W power supply that I could try; the one in there now is a Corsair supply of some wattage. If I recall properly, the steady state power draw is between 150 and 200 watts.

By the way, I see that now one of the disks is listed as degraded - too many errors. Is there a good way to identify exactly which of the disks it is?
Re: [zfs-discuss] Diagnosing Permanent Errors
On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote: By the way, I see that now one of the disks is listed as degraded - too many errors. Is there a good way to identify exactly which of the disks it is?

It's hidden in iostat -E, of all places.

-- Dan.
Re: [zfs-discuss] Diagnosing Permanent Errors
On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote:

On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote: By the way, I see that now one of the disks is listed as degraded - too many errors. Is there a good way to identify exactly which of the disks it is? It's hidden in iostat -E, of all places. -- Dan.

I think he wants to know how to identify which physical drive maps to the device ID in Solaris. The only way I can think of is to run something like dd against the drive to light up the activity LED.

--Tim
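A sketch of that approach; the device path is only an assumption based on the names in this thread, and slice 2 conventionally covers the whole disk on an SMI-labeled drive:

  # generate sustained reads against the suspect disk and watch which activity LED lights up
  dd if=/dev/rdsk/c2t3d1s2 of=/dev/null bs=1024k count=10000

If the degraded disk genuinely cannot be opened, it may be easier to run this against each of the healthy disks in turn and find the bad one by elimination.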
Re: [zfs-discuss] Diagnosing Permanent Errors
On Tue, Apr 06, 2010 at 12:29:35AM -0500, Tim Cook wrote:

On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote: On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote: By the way, I see that now one of the disks is listed as degraded - too many errors. Is there a good way to identify exactly which of the disks it is? It's hidden in iostat -E, of all places. -- Dan.

I think he wants to know how to identify which physical drive maps to the device ID in Solaris. The only way I can think of is to run something like dd against the drive to light up the activity LED.

or look at the serial numbers printed in iostat -E

-- Dan.
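A sketch of that; -E prints per-device error counters along with vendor, model and serial number, and -n shows the familiar cXtYdZ device names:

  # list every disk with its vendor, product and serial number
  iostat -En

Matching the serial number reported for the degraded device against the label printed on each physical drive avoids pulling the wrong disk.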