[zfs-discuss] scrub percentage complete decreasing, but without snaps.
I've seen the problems with bug 6343667, but I haven't seen the problem I have at the moment. I started a scrub of a b72 system that doesn't have any recent snapshots (none since the last scrub) and the % complete is cycling:

 scrub: scrub in progress, 69.08% done, 0h13m to go
 scrub: scrub in progress, 46.63% done, 0h28m to go
 scrub: scrub in progress, 6.36% done, 1h37m to go
 scrub: scrub in progress, 2.09% done, 1h11m to go
 scrub: scrub in progress, 0.02% done, 33h17m to go
 scrub: scrub in progress, 0.00% done, 44h39m to go
 scrub: scrub in progress, 0.00% done, 43h17m to go
 scrub: scrub in progress, 0.00% done, 35h6m to go
 scrub: scrub in progress, 1.97% done, 1h6m to go
 scrub: scrub in progress, 4.16% done, 1h21m to go
 scrub: scrub in progress, 3.91% done, 1h15m to go
 scrub: scrub in progress, 1.62% done, 1h10m to go
 scrub: scrub in progress, 0.41% done, 2h6m to go
 scrub: scrub in progress, 0.02% done, 31h18m to go

config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4d0    ONLINE       0     0     0
            c6d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7d0    ONLINE       0     0     0
            c8d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c9d0    ONLINE       0     0     0
            c10d0   ONLINE       0     0     0

errors: No known data errors
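A minimal sketch for logging this cycling over time (assuming the pool is named export, as above; runs under ksh or bash):

# Record the scrub progress line once a minute, with a timestamp
while :; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(zpool status export | grep 'scrub:')"
    sleep 60
done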
Re: [zfs-discuss] Nice chassis for ZFS server
On Dec 14, 2007 1:12 AM, can you guess? [EMAIL PROTECTED] wrote:

yes. far rarer and yet home users still see them.

I'd need to see evidence of that for current hardware.

What would constitute evidence? Do anecdotal tales from home users qualify? I have two disks (and one controller!) that generate several checksum errors per day each.

I assume that you're referring to ZFS checksum errors rather than to transfer errors caught by the CRC resulting in retries. If so, then the next obvious question is, what is causing the ZFS checksum errors? And (possibly of some help in answering that question) is the disk seeing CRC transfer errors (which show up in its SMART data)?

If the disk is not seeing CRC errors, then the likelihood that data is being 'silently' corrupted as it crosses the wire is negligible (1 in 65,536 if you're using ATA disks, given your correction below, else 1 in 4.3 billion for SATA). Controller or disk firmware bugs have been known to cause otherwise undetected errors (though I'm not familiar with any recent examples in normal desktop environments - e.g., the CERN study discussed earlier found a disk firmware bug that seemed to be activated only by the unusual demands placed on the disk by a RAID controller, and exacerbated by that controller's propensity just to ignore disk time-outs). So, for that matter, have buggy file systems. Flaky RAM can result in ZFS checksum errors (the CERN study found correlations there when it used its own checksum mechanisms).

I've also seen intermittent checksum fails that go away once all the cables are wiggled.

Once again, a significant question is whether the checksum errors are accompanied by a lot of CRC transfer errors. If not, that would strongly suggest that they're not coming from bad transfers (and while they could conceivably be the result of commands corrupted on the wire, so much more data is transferred compared to command bandwidth that you'd really expect to see data CRC errors too if commands were getting mangled). When you wiggle the cables, other things wiggle as well (I assume you've checked that your RAM is solidly seated).

On the other hand, if you're getting a whole bunch of CRC errors, then with only a 16-bit CRC it's entirely conceivable that a few are sneaking by unnoticed.

Unlikely, since transfers over those connections have been protected by 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has even stronger protection.)

The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC) [1].

Yup - my error: the CRC was indeed introduced in ATA-4 (the 33 MB/sec version), but was only 16 bits wide back then.

The serial ata protocol also specifies 32-bit CRCs beneath 8b/10b coding (1.0a p. 159) [2].

That's not much stronger at all. The extra strength comes more from its additional coverage (commands as well as data).

- bill
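To put rough numbers on the CRC-width point (back-of-the-envelope, for a well-behaved n-bit CRC where a random error burst escapes detection with probability about 2^-n):

 16-bit CRC: 1 chance in 2^16 = 65,536 that a corrupted transfer passes unnoticed
 32-bit CRC: 1 chance in 2^32, roughly 4.3 billion

which matches the figures traded above.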
Re: [zfs-discuss] Nice chassis for ZFS server
... though I'm not familiar with any recent examples in normal desktop environments

One example found during early use of zfs in Solaris engineering was a system with a flaky power supply. It seemed to work just fine with ufs, but when zfs was installed the sata drives started to show many ZFS checksum errors. After replacing the power supply, the system did not detect any more errors.

Flaky power supplies are an important contributor to PC unreliability; they also tend to fail a lot, in various ways.

Casper
[zfs-discuss] zfs snapshot leaking data ?
Hello ZFS gurus,

I've been using a ZFS server for about one year now (for rsync-based disk backup purposes). The process is quite simple: I back up each filesystem using rsync, and after each filesystem backup I take a zfs snapshot to freeze the saved data read-only. So I end up with a zfs snapshot for each backup set (one per day).

When I do a zfs list -r, I can see all the snapshots with the size occupied by each snapshot - something proportional to the number of disk blocks that changed since the previous snapshot. I'm surprised to see that the last snapshot is never empty when the snapshot is taken automatically by the backup script. But if I take a snapshot several hours after the backup script has run, the snapshot size is 0.

Is there some data missing in the snapshot if I take it right after writing to the filesystem? Should I wait for some time, so that the zfs buffer cache is written to disk? If so, how long? Has anyone experienced this kind of symptom?

Thank you for your help.
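One low-risk experiment (a sketch - the pool and filesystem names here are made up) is to force pending writes out with sync(1M) before snapshotting, then compare the reported sizes:

rsync -a /data/fs1/ /backup/fs1/
sync    # push buffered writes toward a committed transaction group
zfs snapshot backup/fs1@`date +%Y%m%d`
zfs list -r -t snapshot -o name,used backup/fs1

If the snapshot still shows a nonzero USED immediately after the sync, the space is presumably genuinely-referenced blocks rather than anything missing.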
Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.
On Dec 14, 2007, at 12:27 AM, Jorgen Lundman wrote:

Shawn Ferry wrote:

Jorgen, you may want to try running 'bootadm update-archive', assuming that your boot-archive problem is an out-of-date boot-archive message at boot, and/or doing a clean reboot to let the system try to write an up-to-date boot-archive.

Yeah, it is remembering to do so after something has changed that's hard. In this case, I had to break the mirror to install OpenSolaris. (Shame that the CD/DVD, and miniroot, don't have the md driver.) It would be tempting to add bootadm update-archive to the boot process, as I would rather have it come up half-assed than not come up at all.

It is part of the shutdown process, you just need to stop crashing :)
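For reference, regenerating the archive by hand is a one-liner; the alternate-root form (a sketch, with /a as a hypothetical mount point) helps when the affected root is mounted elsewhere:

# Rebuild the boot archive for the running root
bootadm update-archive
# ... or for an alternate root mounted at /a
bootadm update-archive -R /a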
[zfs-discuss] JBOD performance
Hi all, we are using the following setup as a file server:

---
# uname -a
SunOS troubadix 5.10 Generic_120011-14 sun4u sparc SUNW,Sun-Fire-280R
# prtconf -D
System Configuration: Sun Microsystems sun4u
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):

SUNW,Sun-Fire-280R (driver name: rootnex)
    scsi_vhci, instance #0 (driver name: scsi_vhci)
    packages
        SUNW,builtin-drivers
        deblocker
        disk-label
        terminal-emulator
        obp-tftp
        SUNW,debug
        dropins
        kbd-translator
        ufs-file-system
    chosen
    openprom
        client-services
    options, instance #0 (driver name: options)
    aliases
    memory
    virtual-memory
    SUNW,UltraSPARC-III+
    memory-controller, instance #0 (driver name: mc-us3)
    SUNW,UltraSPARC-III+
    memory-controller, instance #1 (driver name: mc-us3)
    pci, instance #0 (driver name: pcisch)
        ebus, instance #0 (driver name: ebus)
            flashprom
            bbc
            power, instance #0 (driver name: power)
            i2c, instance #0 (driver name: pcf8584)
                dimm-fru, instance #0 (driver name: seeprom)
                dimm-fru, instance #1 (driver name: seeprom)
                dimm-fru, instance #2 (driver name: seeprom)
                dimm-fru, instance #3 (driver name: seeprom)
                nvram, instance #4 (driver name: seeprom)
                idprom
            i2c, instance #1 (driver name: pcf8584)
                cpu-fru, instance #5 (driver name: seeprom)
                temperature, instance #0 (driver name: max1617)
                cpu-fru, instance #6 (driver name: seeprom)
                temperature, instance #1 (driver name: max1617)
                fan-control, instance #0 (driver name: tda8444)
                motherboard-fru, instance #7 (driver name: seeprom)
                ioexp, instance #0 (driver name: pcf8574)
                ioexp, instance #1 (driver name: pcf8574)
                ioexp, instance #2 (driver name: pcf8574)
                fcal-backplane, instance #8 (driver name: seeprom)
                remote-system-console, instance #9 (driver name: seeprom)
                power-distribution-board, instance #10 (driver name: seeprom)
                power-supply, instance #11 (driver name: seeprom)
                power-supply, instance #12 (driver name: seeprom)
                rscrtc
            beep, instance #0 (driver name: bbc_beep)
            rtc, instance #0 (driver name: todds1287)
            gpio, instance #0 (driver name: gpio_87317)
            pmc, instance #0 (driver name: pmc)
            parallel, instance #0 (driver name: ecpp)
            rsc-control, instance #0 (driver name: su)
            rsc-console, instance #1 (driver name: su)
            serial, instance #0 (driver name: se)
        network, instance #0 (driver name: eri)
        usb, instance #0 (driver name: ohci)
        scsi, instance #0 (driver name: glm)
            disk (driver name: sd)
            tape (driver name: st)
            sd, instance #12 (driver name: sd)
            ...
            ses, instance #29 (driver name: ses)
            ses, instance #30 (driver name: ses)
        scsi, instance #1 (driver name: glm)
            disk (driver name: sd)
            tape (driver name: st)
            sd, instance #31 (driver name: sd)
            sd, instance #32 (driver name: sd)
            ...
            ses, instance #46 (driver name: ses)
            ses, instance #47 (driver name: ses)
        network, instance #0 (driver name: ce)
    pci, instance #1 (driver name: pcisch)
        SUNW,qlc, instance #0 (driver name: qlc)
            fp (driver name: fp)
                disk (driver name: ssd)
            fp, instance #1 (driver name: fp)
                ssd, instance #1 (driver name: ssd)
                ssd, instance #0 (driver name: ssd)
        scsi, instance #0 (driver name: mpt)
            disk (driver name: sd)
            tape (driver name: st)
            sd, instance #0 (driver name: sd)
            sd, instance #1 (driver name: sd)
            ...
            ses, instance #14 (driver name: ses)
            ses, instance #31 (driver name: ses)
    os-io
    iscsi, instance #0 (driver name: iscsi)
    pseudo, instance #0 (driver name: pseudo)
---

The disks reside in a StorEdge 3320 expansion unit connected to the machine's SCSI controller card (LSI1030 U320).
We've created a raidz2 pool:

---
# zpool status
  pool: storage_array
 state: ONLINE
 scrub: scrub completed with 0 errors on Wed Dec 12 23:38:36 2007
config:

        NAME             STATE     READ WRITE CKSUM
        storage_array    ONLINE       0     0     0
          raidz2         ONLINE       0     0     0
            c2t8d0       ONLINE       0     0     0
            c2t9d0       ONLINE       0     0     0
            c2t10d0      ONLINE       0     0     0
            c2t11d0      ONLINE       0     0     0
            c2t12d0      ONLINE       0     0     0

errors: No known data errors
---

The throughput when
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve,

I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear. (I do realise that there are other possibilities, such as zfs send/recv, and that there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.

1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary. Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. Normally most replication systems do maintain write-ordering, *except* in one specific scenario. If the replication is interrupted - for example, the secondary site is down or unreachable due to a comms problem - the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.

I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.

For most implementations of resynchronization, not only are changes resilvered on a block-ordered basis, resynchronization is also done in a single pass over the volume(s). To address the fact that resynchronization happens while additional changes are also being replicated, the concept of a resynchronization point is kept. As this resynchronization point traverses the volume from beginning to end, I/Os occurring before or at this point are replicated inline, whereas I/Os occurring after this point are marked so that they will be replicated later, in block order. You are quite correct in that the data is not consistent.

If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than you know. During active resynchronization, the remote volume contains a mixture of prior write-order-consistent data, resilvered block-order data, plus newly replicated data.
Essentially the partially-resynchronized LUNs are totally inconsistent until such a time as the single pass over all data is 100% complete. For some, but not all, replication software, if the 'catch-up' resynchronization failed, read access to the LUNs should be prevented, or at least read access while the LUNs are configured as remote mirrors. Availability Suite's Remote Mirror software (SNDR) marks such volumes as "need synchronization" and fails all application read and write I/Os.

Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Since Availability Suite is both Remote Mirroring and
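For the snapshot fallback mentioned above, a ZFS-native sketch of the idea (assumptions: the secondary copy can be imported read-write as a pool named 'tank' while replication is suspended - in practice an array- or volume-level point-in-time copy is the more likely mechanism):

# Preserve the last write-order-consistent state before resync starts
zfs snapshot -r tank@pre-resync

# If the catch-up resync fails mid-stream, roll each dataset back
# (-r here destroys any snapshots newer than pre-resync)
for fs in `zfs list -H -o name -r tank`; do
    zfs rollback -r "$fs@pre-resync"
done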
[zfs-discuss] LUN configuration for disk-based backups
Hello,

We have a StorageTek FLX280 (very similar to a 6140) with 16 750 GB SATA drives that we would like to use for disk-based backups. I am trying to make an (educated) guess at what the best configuration for the LUNs on the FLX280 might be. I've read, or at least skimmed, most of the ZFS Best Practices Guide over at solarisinternals.com, which has some great information; however, I still do not feel like I have a good understanding of the interaction between ZFS and a disk array.

More specifically, I am concerned about the number of IOPS that each drive and/or LUN will be able to handle. Seagate lists an average seek time of 9 ms and an average rotational latency of 4.16 ms for these drives. By my math, each drive should be capable of 76 IOPS in a worst-case scenario, i.e. completely random I/O. These drives support native command queuing, and the controllers on the FLX280 have a battery-backed cache, so I would _assume_ that they are also capable of reordering I/O ops to improve throughput to the disks. So, the question is whether or not worst-case IOPS are even relevant.

If my assumptions about the controllers on the FLX280 are correct (documentation?), then it seems like we could use RAID-Z to get the throughput we're looking for. If not, we may have to go with RAID-1. Anyone have any thoughts on this?
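Checking that arithmetic: the worst-case service time for a fully random I/O is roughly one average seek plus one average rotational latency, so

 1 / (9 ms + 4.16 ms) = 1 / 13.16 ms ≈ 76 IOPS per drive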
Re: [zfs-discuss] JBOD performance
Frank Penczek wrote:

The performance is slightly disappointing. Does anyone have a similar setup and can anyone share some figures? Any pointers to possible improvements are greatly appreciated.

Use a faster processor or change to a mirrored configuration. raidz2 can become processor-bound in the Reed-Solomon calculations for the 2nd parity set. You should be able to see this in mpstat, and at a coarser grain in vmstat.

-- richard
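A quick way to check for that (a sketch) is to sample the CPUs while the write load runs:

# Per-CPU utilization every 5 seconds: look for idl pinned near 0
mpstat 5
# Coarser system-wide view: watch the us/sy/id columns
vmstat 5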
Re: [zfs-discuss] JBOD performance
The throughput when writing from a local disk to the zpool is around 30MB/s, when writing from a client

Err.. sorry, the internal storage would be good old 1Gbit FCAL disks @ 10K rpm. Still, not the fastest around ;)
[zfs-discuss] Bugid 6535160
So does anyone have any insight on BugID 6535160? We have verified on a similar system that ZFS shows big latency in the filebench varmail test. We formatted the same LUN with UFS and latency went down from 300 ms to 1-2 ms.

http://sunsolve.sun.com/search/document.do?assetkey=1-1-6535160-1

We run Solaris 10u4 on our production systems, and don't see any indication of a patch for this. I'll try downloading a recent Nevada build and loading it on the same system to see if the problem has indeed vanished post snv_71.
Re: [zfs-discuss] LUN configuration for disk-based backups
On Fri, 14 Dec 2007, Andrew Chace wrote:

We have a StorageTek FLX280 (very similar to a 6140) with 16 750 GB SATA drives that we would like to use for disk-based backups. [snip - the full question appears earlier in this thread]

Since ZFS makes it so quick/easy to create storage pools and filesystems, the simplest way to determine your optimum config is to conduct a set of experiments - using your data and your applications. Bear in mind that no one ZFS config is ideal for every data/application scenario - you may wish to consider 2 or more storage pools with different configurations that best fit your requirements, given that you may have several different data sets with different characteristics. Then there is ZFS compression - which might really help if your data is highly compressible. Also ensure that you have sufficient network bandwidth into the ZFS backup server.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from sugar-coating school? Sorry - I never attended! :)
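In that spirit, a sketch of the side-by-side experiment Al describes (device names are placeholders; on an empty array, pools are cheap to destroy and recreate):

# Candidate 1: one raidz group across five disks
zpool create backuppool raidz c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
zfs set compression=on backuppool
# ... run a representative backup job, record elapsed time and iostat output ...
zpool destroy backuppool

# Candidate 2: striped mirrors from the same disks
zpool create backuppool mirror c2t1d0 c2t2d0 mirror c2t3d0 c2t4d0
# ... rerun the same job and compare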
Re: [zfs-discuss] Bugid 6535160
Vincent Fox wrote:

So does anyone have any insight on BugID 6535160? We have verified on a similar system that ZFS shows big latency in the filebench varmail test. We formatted the same LUN with UFS and latency went down from 300 ms to 1-2 ms.

This is such a big difference it makes me think something else is going on. I suspect one of two possible causes:

A) The disk write cache is enabled and volatile. UFS knows nothing of write caches and requires the write cache to be disabled, otherwise corruption can occur.

B) The write cache is non-volatile, but ZFS hasn't been configured to stop flushing it (set zfs:zfs_nocacheflush = 1).

Note, ZFS enables the write cache and will flush it as necessary.

http://sunsolve.sun.com/search/document.do?assetkey=1-1-6535160-1

We run Solaris 10u4 on our production systems, don't see any indication of a patch for this. I'll try downloading a recent Nevada build and load it on the same system and see if the problem has indeed vanished post snv_71.

Yes, please try this. I think it will make a difference, but the delta will be small.

Neil.
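For case B, the tunable Neil quotes goes in /etc/system and takes effect on the next reboot (only safe when every device backing the pool really does have non-volatile cache):

# /etc/system
set zfs:zfs_nocacheflush = 1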
Re: [zfs-discuss] Nice chassis for ZFS server
... though I'm not familiar with any recent examples in normal desktop environments

One example found during early use of zfs in Solaris engineering was a system with a flaky power supply. It seemed to work just fine with ufs but when zfs was installed the sata drives started to show many ZFS checksum errors. After replacing the power supply, the system did not detect any more errors. Flaky power supplies are an important contributor to PC unreliability; they also tend to fail a lot in various ways.

Thanks - now that you mention it, I think I remember reading about that here somewhere.

But did anyone delve into these errors sufficiently to know that they were specifically due to controller or disk firmware bugs (since you seem to be suggesting by the construction of your response above that they were) rather than, say, to RAM errors (if the system in question didn't have ECC RAM, anyway) between checksum generation and disk access on either reads or writes? (The CERN study found a correlation, even with ECC RAM, between detected RAM errors and silent data corruption.)

Not that the generation of such otherwise undetected errors due to a flaky PSU isn't interesting in its own right, but this specific sub-thread was about whether poor connections were a significant source of such errors (my comment about controller and disk firmware bugs having been a suggested potential alternative source) - so identifying the underlying mechanisms is of interest as well.

- bill
Re: [zfs-discuss] Bugid 6535160
B) The write cache is non volatile, but ZFS hasn't been configured to stop flushing it (set zfs:zfs_nocacheflush = 1).

These are a pair of 2540s with dual controllers, definitely non-volatile cache. We set zfs_nocacheflush=1 and that improved things considerably.

ZFS filesystem (2540 arrays):

 fsyncfile3    434ops/s   0.0mb/s   17.3ms/op   977us/op-cpu
 fsyncfile2    434ops/s   0.0mb/s   17.8ms/op   981us/op-cpu

However, still not very good compared to UFS. We turned off the ZIL with zil_disable=1 and WOW!

ZFS, ZIL disabled:

 fsyncfile3   1148ops/s   0.0mb/s    0.0ms/op    18us/op-cpu
 fsyncfile2   1148ops/s   0.0mb/s    0.0ms/op    18us/op-cpu

Not a good setting to use in production, but useful data. Anyhow, it will take some time to get OpenSolaris onto the system; will report back then.
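For anyone reproducing the second test: zil_disable was a zfs module variable in that era and could be set the same way (a sketch - benchmarking only, since it discards synchronous-write semantics):

# /etc/system -- test systems only, never production
set zfs:zil_disable = 1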
Re: [zfs-discuss] Nice chassis for ZFS server
On Dec 14, 2007 4:23 AM, can you guess? [EMAIL PROTECTED] wrote:

I assume that you're referring to ZFS checksum errors rather than to transfer errors caught by the CRC resulting in retries.

Correct.

If so, then the next obvious question is, what is causing the ZFS checksum errors? And (possibly of some help in answering that question) is the disk seeing CRC transfer errors (which show up in its SMART data)?

The memory is ECC in this machine, and Memtest passed it for five days. The disk was indeed getting some pretty lousy SMART scores, but that doesn't explain the controller issue. This particular controller is a SIIG-branded Silicon Image 0680 chipset (which is, apparently, a piece of junk - if I'd done my homework I would've bought something else)... but the premise stands. I bought a piece of consumer-level hardware off the shelf, it had corruption issues, and ZFS told me about it when XFS had been silent.

Once again, a significant question is whether the checksum errors are accompanied by a lot of CRC transfer errors. If not, that would strongly suggest that they're not coming from bad transfers (and while they could conceivably be the result of commands corrupted on the wire, so much more data is transferred compared to command bandwidth that you'd really expect to see data CRC errors too if commands were getting mangled). When you wiggle the cables, other things wiggle as well (I assume you've checked that your RAM is solidly seated).

I don't remember offhand if I got CRC errors with the working controller and drive and bad cabling, sorry. RAM was solid, as mentioned earlier.

The extra strength comes more from its additional coverage (commands as well as data).

Ah, that explains it.

Will
[zfs-discuss] Is round-robin I/O correct for ZFS?
I'm testing an iSCSI multipath configuration on a T2000 with two disk devices provided by a NetApp filer. Both the T2000 and the NetApp have two ethernet interfaces for iSCSI, going to separate switches on separate private networks. The scsi_vhci devices look like this in `format':

 1. c4t60A98000433469764E4A413571444B63d0 NETAPP-LUN-0.2-50.00GB
    /scsi_vhci/[EMAIL PROTECTED]
 2. c4t60A98000433469764E4A41357149432Fd0 NETAPP-LUN-0.2-50.00GB
    /scsi_vhci/[EMAIL PROTECTED]

These are concatenated in the ZFS pool. There are two network paths to each of the two devices, managed by the scsi_vhci driver. The pool looks like this:

# zpool status
  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        space                                      ONLINE       0     0     0
          c4t60A98000433469764E4A413571444B63d0    ONLINE       0     0     0
          c4t60A98000433469764E4A41357149432Fd0    ONLINE       0     0     0

errors: No known data errors

The /kernel/drv/scsi_vhci.conf file, unchanged from the default, specifies:

load-balance=round-robin;

Indeed, when I generate I/O on a ZFS filesystem, I see TCP traffic with `snoop' on both of the iSCSI ethernet interfaces. It certainly appears to be doing round-robin. The I/Os are going to the same disk devices, of course, but by two different paths. Is this a correct configuration for ZFS? I assume it's safe, but I thought I should check.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
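One way to confirm that both paths stay attached and active under load (a sketch; mpathadm ships with the Solaris multipathing stack):

# Enumerate multipathed logical units
mpathadm list lu
# Path count, states, and load-balance policy for one LUN
mpathadm show lu /dev/rdsk/c4t60A98000433469764E4A413571444B63d0s2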
[zfs-discuss] Update: zpool kernel panics.
Hi Folks,

Begin forwarded message:

From: Edward Irvine [EMAIL PROTECTED]
Date: 12 December 2007 8:44:57 AM
To: [EMAIL PROTECTED]
Subject: Fwd: [zfs-discuss] zpool kernel panics.

FYI ...

Begin forwarded message:

From: James C. McPherson [EMAIL PROTECTED]
Date: 12 December 2007 8:06:51 AM
To: Edward Irvine [EMAIL PROTECTED]
Cc: ZFS Discussions zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] zpool kernel panics.
Reply-To: [EMAIL PROTECTED]

Hi Eddie,

Edward Irvine wrote:

Each time the system crashes, it crashes with the same error message. This suggests to me that it is zpool corruption, rather than faulty RAM, which is to blame. So - is this particular zpool a lost cause? :\

It's looking that way to me, but I'm definitely no expert. A number of folks have pointed out that this bug may have been fixed in a very recent version (nv-77?) of OpenSolaris. As a last-ditch approach, I'm thinking that I could put the current system disks (sol10u4) aside, do a quick install of the latest OpenSolaris, import the zpool, do a zpool scrub, export the zpool, shut down, swap in the sol10u4 disks, reboot, and import. Sigh. Does this approach sound plausible?

It's definitely worth a shot, as long as you don't have to zpool upgrade in order to do it.

OK - this appeared to work: I imported the zpool into OpenSolaris build 77 and did a zpool scrub - and no kernel panics. Cool!

But - after reimporting the zpool back into Solaris 10u4 (where it belongs), a zpool scrub still causes a kernel panic - although it seemed to take a bit longer to panic. Same error message as before:

panic[cpu1]/thread=2a1015c7cc0:
Dec 15 12:49:35 server unix: [ID 361072 kern.notice] zfs: freeing free segment (offset=423713792 size=1024)

Note that OpenSolaris build 77 and Solaris 10u4 are on the same physical hardware - I'm just booting off different system disks.

I pulled your crash dump inside Sun, thank you, but I haven't had a chance to analyze it, so I've passed the details on to more knowledgeable ZFS people.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp  http://www.jmcp.homeunix.com/blog

Sigh. This must definitely be a bug.

Eddie
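For the record, the workaround sequence described above as commands (a sketch; the pool name isn't given in the thread, so 'tank' is a placeholder):

# Booted from the OpenSolaris (nv77) system disk:
zpool import tank
zpool scrub tank
zpool status tank    # wait until the scrub reports completed
zpool export tank
# shut down, swap the sol10u4 system disks back in, boot, then:
zpool import tank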
Re: [zfs-discuss] Nice chassis for ZFS server
the next obvious question is, what is causing the ZFS checksum errors? And (possibly of some help in answering that question) is the disk seeing CRC transfer errors (which show up in its SMART data)?

The memory is ECC in this machine, and Memtest passed it for five days. The disk was indeed getting some pretty lousy SMART scores,

Seagate ATA disks (if that's what you were using) are notorious for this in a couple of specific metrics: they ship from the factory that way. This does not appear to be indicative of any actual problem but rather of error tabulation which they perform differently than other vendors do (e.g., I could imagine that they did something unusual in their burn-in exercising that generated nominal errors, but that's not even speculation, just a random guess).

but that doesn't explain the controller issue. This particular controller is a SIIG-branded Silicon Image 0680 chipset (which is, apparently, a piece of junk - if I'd done my homework I would've bought something else)... but the premise stands. I bought a piece of consumer-level hardware off the shelf, it had corruption issues, and ZFS told me about it when XFS had been silent.

Then we've been talking at cross-purposes. Your original response was to my request for evidence that *platter errors that escape detection by the disk's ECC mechanisms* occurred sufficiently frequently to be a cause for concern - and that's why I asked specifically what was causing the errors you saw (to see whether they were in fact the kind for which I had requested evidence).

Not that detecting silent errors due to buggy firmware is useless: it clearly saved you from continuing corruption in this case. My impression is that in conventional consumer installations (typical consumers never crack open their case at all, let alone to add a RAID card) controller and disk firmware is sufficiently stable (especially for the limited set of functions demanded of it) that ZFS's added integrity checks may not count for a great deal (save perhaps peace of mind, but typical consumers aren't sufficiently aware of potential dangers to suffer from deficits in that area) - but your experience indicates that when you stray from that mold, ZFS's added protection may sometimes be as significant as it was for Robert's mid-range array firmware bugs.

And since there was indeed a RAID card involved in the original hypothetical situation under discussion, the fact that I was specifically referring to undetectable *disk* errors was only implied by my subsequent discussion of disk error rates, rather than explicit.

The bottom line appears to be that introducing non-standard components into the path between RAM and disk has, at least for some specific subset of those components, the potential to introduce silent errors of the form that ZFS can catch - quite possibly in considerably greater numbers than the kinds of undetected disk errors that I was talking about ever would (that RAID card you were using has a relatively popular low-end chipset, and Robert's mid-range arrays were hardly fly-by-night). So while I'm still not convinced that ZFS offers significant features in the reliability area compared with other open-source *software* solutions, the evidence that it may do so in more sophisticated (but not quite high-end) hardware environments is becoming more persuasive.

- bill
Re: [zfs-discuss] Is round-robin I/O correct for ZFS?
This is the same configuration we use on 4 separate servers (a T2000, two X4100s, and a V215). We do use a different iSCSI solution, but we have the same multipath config setup with scsi_vhci: dual GigE switches on separate NICs, on both the server and iSCSI node side. We suffered from the e1000g interface-flapping bug on two of these systems, and one time a SAN interface went down to stay (until reboot). The vhci multipath performed flawlessly. I scrubbed the pools (one of them is 10TB) and no errors were found, even though we had heavy I/O at the time of the NIC failure. I think this configuration is a good one.

Jon

Gary Mills wrote:

I'm testing an iSCSI multipath configuration on a T2000 with two disk devices provided by a NetApp filer. [snip - the full message appears earlier in this thread]

-- 
Jon Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  [EMAIL PROTECTED]
AST:7731^29u18e3
Re: [zfs-discuss] JBOD performance
Use a faster processor or change to a mirrored configuration. raidz2 can become processor bound in the Reed-Solomon calculations for the 2nd parity set. You should be able to see this in mpstat, and to a coarser grain in vmstat.

Hmm. Is the OP's hardware *that* slow? (I don't know enough about the Sun hardware models.) I have a 5-disk raidz2 (cheap SATA) here on my workstation, which is an X2 3800+ (i.e., one of the earlier AMD dual-core offerings). Here's me dd:ing to a file on ZFS on FreeBSD running on that hardware:

promraid   741G   387G      0    380      0  47.2M
promraid   741G   387G      0    336      0  41.8M
promraid   741G   387G      0    424    510  51.0M
promraid   741G   387G      0    441      0  54.5M
promraid   741G   387G      0    514      0  19.2M
promraid   741G   387G     34    192  4.12M  24.1M
promraid   741G   387G      0    341      0  42.7M
promraid   741G   387G      0    361      0  45.2M
promraid   741G   387G      0    350      0  43.9M
promraid   741G   387G      0    370      0  46.3M
promraid   741G   387G      1    423   134K  51.7M
promraid   742G   386G     22    329  2.39M  10.3M
promraid   742G   386G     28    214  3.49M  26.8M
promraid   742G   386G      0    347      0  43.5M
promraid   742G   386G      0    349      0  43.7M
promraid   742G   386G      0    354      0  44.3M
promraid   742G   386G      0    365      0  45.7M
promraid   742G   386G      2    460  7.49K  55.5M

At this point the bottleneck looks architectural rather than CPU. None of the cores are saturated, and the CPU usage of the ZFS kernel threads is pretty low. I say architectural because writes to the underlying devices are not sustained; they drop to almost zero for certain periods (this is more visible in iostat -x than it is in the zpool statistics). What I think is happening is that ZFS is too late to evict data in the cache, thus blocking the writing process. Once a transaction group with a bunch of data gets committed, the application unblocks, but presumably ZFS waits for a little while before resuming writes.

Note that this is also being run on plain hardware; it's not even PCI Express. During throughput peaks, but not constantly, the bottleneck is probably the PCI bus.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
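To watch the bursty commit pattern Peter describes, sample both levels at once (a sketch using his pool name):

# Pool-level: one-second samples of the columns shown above
zpool iostat promraid 1
# Device-level: per-disk writes stalling to ~0 between txg commits
iostat -x 1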