Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log
Neil, many thanks for publishing this doc - it is exactly what I was looking for!

On 7/9/07, Neil Perrin [EMAIL PROTECTED] wrote:

Er, with attachment this time. So I've attached the accepted proposal. There was (as expected) not much discussion of this case, as it was considered an obvious extension. The actual PSARC case materials, when opened, will not have much more info than this.

PSARC CASE: 2007/171 ZFS Separate Intent Log

SUMMARY:

This is a proposal to allow separate devices to be used for the ZFS Intent Log (ZIL). The sole purpose of this is performance. The devices can be disks, solid state drives, NVRAM drives, or any device that presents a block interface.

PROBLEM:

The ZIL satisfies the synchronous requirements of POSIX. For instance, databases often require their transactions to be on stable storage on return from the system call. NFS and other applications can also use fsync() to ensure data stability. The speed of the ZIL is therefore essential in determining the latency of writes for these critical applications.

Currently the ZIL is allocated dynamically from the pool. It consists of a chain of varying block sizes which are anchored in fixed objects. Blocks are sized to fit the demand and will come from different metaslabs, and thus different areas of the disk, which causes more head movement. Furthermore, the log blocks are freed as soon as the intent log transaction (system call) is committed, so a swiss-cheesing effect can occur, leading to pool fragmentation.

PROPOSED SOLUTION:

This proposal takes advantage of the much faster media speeds of NVRAM, solid state disks, or even dedicated disks. To this end, additional extensions to the zpool command are defined:

zpool create <pool> <pool devices> log <log devices>

  Creates a pool with a separate log. If more than one log device is specified, writes are load-balanced between the devices. It is also possible to mirror log devices. For example, a log consisting of two sets of two-way mirrors could be created thus:

    zpool create <pool> <pool devices> \
      log mirror c1t8d0 c1t9d0 mirror c1t10d0 c1t11d0

  A raidz/raidz2 log is not supported.

zpool add <pool> log <log devices>

  Creates a separate log if one doesn't exist, or adds extra devices if it does.

zpool remove <pool> <log devices>

  Removes the log devices. If all log devices are removed, we revert to placing the log in the pool. Evacuating a log is easily handled by ensuring all txgs are committed.

zpool replace <pool> <old log device> <new log device>

  Replaces the old log device with the new log device.

zpool attach <pool> <log device> <new log device>

  Attaches a new log device to an existing log device. If the existing device is not a mirror, a two-way mirror is created. If the device is part of a two-way log mirror, attaching new_device creates a three-way log mirror, and so on.

zpool detach <pool> <log device>

  Detaches a log device from a mirror.

zpool status

  Additionally displays the log devices.

zpool iostat

  Additionally shows I/O statistics for log devices.

zpool export/import

  Will export and import the log devices.

When a separate log that is not mirrored fails, logging will start using chained logs within the main pool.

The name "log" will become a reserved word. Attempts to create a pool with the name "log" will fail with:

  cannot create 'log': name is reserved
  pool name may have been omitted

Hot spares cannot replace log devices.

--
Regards,
Cyril
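For concreteness, here is a hedged end-to-end sketch of the workflow the proposal enables, using the syntax defined above. The pool and device names are invented, and the arguments accepted by the shipped implementation may differ from the proposal text:

  # Create a pool whose intent log lives on a mirrored pair of fast devices
  zpool create tank c0t0d0 c0t1d0 \
      log mirror c1t8d0 c1t9d0

  # Later, add a second mirrored pair; writes are load-balanced across both
  zpool add tank log mirror c1t10d0 c1t11d0

  # Log devices show up in status and in per-device I/O statistics
  zpool status tank
  zpool iostat -v tank 5

  # Replace a failing log device, then attach a third side to that mirror
  zpool replace tank c1t8d0 c1t12d0
  zpool attach tank c1t12d0 c1t13d0

  # Remove all separate log devices; the ZIL reverts to chained log blocks
  # allocated inside the main pool
  zpool remove tank c1t9d0 c1t12d0 c1t13d0 c1t10d0 c1t11d0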
Re: [zfs-discuss] Pseudo file system access to snapshots?
Mike Gerdts wrote:
> Perhaps a better approach is to create a pseudo file system that looks like:
>
>   mntpt/pool
>          /@@
>          /@today
>          /@yesterday
>          /fs
>             /@@
>             /@2007-06-01
>          /otherfs
>             /@@

How is this different from cd mntpt/.zfs/snapshot/ ?
Re: [zfs-discuss] Pseudo file system access to snapshots?
On 7/11/07, Darren J Moffat [EMAIL PROTECTED] wrote:
> Mike Gerdts wrote:
>> Perhaps a better approach is to create a pseudo file system that looks like:
>>
>>   mntpt/pool
>>          /@@
>>          /@today
>>          /@yesterday
>>          /fs
>>             /@@
>>             /@2007-06-01
>>          /otherfs
>>             /@@
>
> How is this different from cd mntpt/.zfs/snapshot/ ?

mntpt/.zfs/snapshot provides file-level access to the contents of the snapshot. If you back those up and then restore every snapshot, you will potentially be using far more disk space.

What I am proposing is that "cat mntpt/pool/@snap1" delivers a data stream corresponding to the output of "zfs send", and that "cat mntpt/pool/@snap1@snap2" delivers a data stream corresponding to "zfs send -i snap1 snap2". This would allow existing backup tools to perform block-level incremental backups. Assuming that writing to the various files is the equivalent of the corresponding "zfs receive" commands, it provides for block-level restores that preserve space efficiency as well.

Why? Suppose I have a server with 50 full root zones on it. Each zone has a zonepath at /zones/zonename that is about 8 GB. This implies that I need 400 GB just for zone paths. Using ZFS clones, I can likely trim that down to far less than 100 GB, and probably less than 20 GB. But I can't trim it down that far if I don't have a way to restore the system. This restore problem is my key worry in deploying ZFS in the area where I see it as most beneficial. Another solution that would deal with the same problem is block-level deduplication. So far my queries in this area have been met with silence.

Hmmm... I just ran into another snag with this. I had been assuming that clones and snapshots were more closely related. But when I tried to send the differences between the source of a clone and a snapshot within that clone, I got this message:

  incremental source must be in same filesystem
  usage:
          send [-i <snapshot>] <snapshot>

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
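A rough sketch of how the proposed pseudo files would map onto today's command-line equivalents, under the layout Mike describes. The pool, dataset, and snapshot names are invented, and the @-file paths are the proposal, not an existing interface:

  # Proposed: cat mntpt/pool/@snap1 would emit a full send stream.
  # Today's equivalent:
  zfs send tank@snap1 > /backup/tank.snap1

  # Proposed: cat mntpt/pool/@snap1@snap2 would emit the block-level
  # incremental stream between the two snapshots.
  zfs send -i snap1 tank@snap2 > /backup/tank.snap1-snap2

  # Proposed: writing back into the @-file would act like a receive.
  # Today, restoring full then incremental into a fresh dataset rebuilds
  # both snapshots while keeping them space-efficient:
  zfs receive tank/restored < /backup/tank.snap1
  zfs receive tank/restored < /backup/tank.snap1-snap2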
Re: [zfs-discuss] Pseudo file system access to snapshots?
> This restore problem is my key worry in deploying ZFS in the area where I see it as most beneficial. Another solution that would deal with the same problem is block-level deduplication. So far my queries in this area have been met with silence.

I must have missed your messages on deduplication. But did you see the thread on it, "zfs space efficiency", 6/24 - 7/7? We've been thinking about ZFS dedup for some time, and want to do it, but have other priorities at the moment.

> Hmmm... I just ran into another snag with this. I had been assuming that clones and snapshots were more closely related. But when I tried to send the differences between the source of a clone and a snapshot within that clone I got this message:

I'm not sure what you mean by "more closely related". The only reason we don't support that is because we haven't gotten around to adding the special cases and error checking for it (and I think you're the first person to notice its omission). But it's actually in the works now, so stay tuned for an update in a few months.

--matt
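For concreteness, the snag Mike hit looks something like the following (dataset and snapshot names invented); this is the clone-origin incremental case Matt says is now in the works:

  # A clone's origin snapshot lives in a different filesystem, so an
  # incremental send from origin to clone snapshot was rejected:
  zfs snapshot tank/fs@base
  zfs clone tank/fs@base tank/clone
  zfs snapshot tank/clone@work
  zfs send -i tank/fs@base tank/clone@work > /backup/clone.incr
  # -> incremental source must be in same filesystem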
Re: [zfs-discuss] ZFS and IBM's TSM
Our main problem with TSM and ZFS is currently that there seems to be no efficient way to do a disaster restore when the backup resides on tape, due to the large number of filesystems/TSM filespaces. The graphical client (dsmj) does not work at all, and with dsmc one has to start a separate restore session for each filespace. This results in an impractically large number of tape mounts.

Hans
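A minimal sketch of the per-filespace restore loop Hans is describing, just to make the pain concrete. The filesystem names are invented and the dsmc options are from memory, so treat this as illustrative rather than exact TSM syntax; the point is that each iteration is a separate restore session and, with tape-resident data, typically its own set of tape mounts:

  # One ZFS filesystem == one TSM filespace, so a full-server restore
  # degenerates into one dsmc invocation per filespace:
  for fs in /tank /tank/home /tank/zones/z1 /tank/zones/z2; do
      dsmc restore "${fs}/*" -subdir=yes -replace=all
  done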
[zfs-discuss] pool analysis
Richard's blog analyzes MTTDL as a function of N+P+S: http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl

But to understand how to best utilize an array with a fixed number of drives, I add the following constraints:
 - N+P should follow the ZFS best-practice rule of N={2,4,8} and P={1,2}
 - all sets in an array should be configured similarly
 - the MTTDL for S sets is equal to (MTTDL for one set)/S

I got the following results by varying the NUM_BAYS parameter in the source code below:

4 bays w/ 300 GB drives having MTBF=4 years
 - can have 1 (2+1) w/ 1 spares providing 600 GB with MTTDL of 5840.00 years
 - can have 1 (2+2) w/ 0 spares providing 600 GB with MTTDL of 799350.00 years
 - can have 0 (4+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
 - can have 0 (4+2) w/ 4 spares providing 0 GB with MTTDL of Inf years
 - can have 0 (8+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
 - can have 0 (8+2) w/ 4 spares providing 0 GB with MTTDL of Inf years

8 bays w/ 300 GB drives having MTBF=4 years
 - can have 2 (2+1) w/ 2 spares providing 1200 GB with MTTDL of 2920.00 years
 - can have 2 (2+2) w/ 0 spares providing 1200 GB with MTTDL of 399675.00 years
 - can have 1 (4+1) w/ 3 spares providing 1200 GB with MTTDL of 1752.00 years
 - can have 1 (4+2) w/ 2 spares providing 1200 GB with MTTDL of 2557920.00 years
 - can have 0 (8+1) w/ 8 spares providing 0 GB with MTTDL of Inf years
 - can have 0 (8+2) w/ 8 spares providing 0 GB with MTTDL of Inf years

12 bays w/ 300 GB drives having MTBF=4 years
 - can have 4 (2+1) w/ 0 spares providing 2400 GB with MTTDL of 365.00 years
 - can have 3 (2+2) w/ 0 spares providing 1800 GB with MTTDL of 266450.00 years
 - can have 2 (4+1) w/ 2 spares providing 2400 GB with MTTDL of 876.00 years
 - can have 2 (4+2) w/ 0 spares providing 2400 GB with MTTDL of 79935.00 years
 - can have 1 (8+1) w/ 3 spares providing 2400 GB with MTTDL of 486.67 years
 - can have 1 (8+2) w/ 2 spares providing 2400 GB with MTTDL of 426320.00 years

16 bays w/ 300 GB drives having MTBF=4 years
 - can have 5 (2+1) w/ 1 spares providing 3000 GB with MTTDL of 1168.00 years
 - can have 4 (2+2) w/ 0 spares providing 2400 GB with MTTDL of 199837.50 years
 - can have 3 (4+1) w/ 1 spares providing 3600 GB with MTTDL of 584.00 years
 - can have 2 (4+2) w/ 4 spares providing 2400 GB with MTTDL of 1278960.00 years
 - can have 1 (8+1) w/ 7 spares providing 2400 GB with MTTDL of 486.67 years
 - can have 1 (8+2) w/ 6 spares providing 2400 GB with MTTDL of 426320.00 years

20 bays w/ 300 GB drives having MTBF=4 years
 - can have 6 (2+1) w/ 2 spares providing 3600 GB with MTTDL of 973.33 years
 - can have 5 (2+2) w/ 0 spares providing 3000 GB with MTTDL of 159870.00 years
 - can have 4 (4+1) w/ 0 spares providing 4800 GB with MTTDL of 109.50 years
 - can have 3 (4+2) w/ 2 spares providing 3600 GB with MTTDL of 852640.00 years
 - can have 2 (8+1) w/ 2 spares providing 4800 GB with MTTDL of 243.33 years
 - can have 2 (8+2) w/ 0 spares providing 4800 GB with MTTDL of 13322.50 years

24 bays w/ 300 GB drives having MTBF=4 years
 - can have 8 (2+1) w/ 0 spares providing 4800 GB with MTTDL of 182.50 years
 - can have 6 (2+2) w/ 0 spares providing 3600 GB with MTTDL of 133225.00 years
 - can have 4 (4+1) w/ 4 spares providing 4800 GB with MTTDL of 438.00 years
 - can have 4 (4+2) w/ 0 spares providing 4800 GB with MTTDL of 39967.50 years
 - can have 2 (8+1) w/ 6 spares providing 4800 GB with MTTDL of 243.33 years
 - can have 2 (8+2) w/ 4 spares providing 4800 GB with MTTDL of 213160.00 years

While it's true that RAIDZ2 is much safer than RAIDZ, it seems that any RAIDZ configuration will outlive me, and so I conclude that RAIDZ2 is unnecessary in a practical sense... This conclusion surprises me given the amount of attention people give to double-parity solutions - what am I overlooking?

Thanks,
Kent

Source Code (compile with: cc -std=c99 -lm <filename>) [it's more than 80 columns - sorry!]

  #include <stdio.h>
  #include <math.h>

  #define NUM_BAYS            24
  #define DRIVE_SIZE_GB       300
  #define MTBF_YEARS          4
  #define MTTR_HOURS_NO_SPARE 16
  #define MTTR_HOURS_SPARE    4

  int main()
  {
      printf("\n");
      printf("%u bays w/ %u GB drives having MTBF=%u years\n",
             NUM_BAYS, DRIVE_SIZE_GB, MTBF_YEARS);
      for (int num_drives = 2; num_drives <= 8; num_drives *= 2) {
          for (int num_parity = 1; num_parity <= 2; num_parity++) {
              double mttdl;
              int mtbf_hours       = MTBF_YEARS * 365 * 24;
              int total_num_drives = num_drives + num_parity;
              int num_instances    = NUM_BAYS / total_num_drives;
              int num_spares       = NUM_BAYS % total_num_drives;
              double mttr          = num_spares == 0 ? MTTR_HOURS_NO_SPARE
                                                     : MTTR_HOURS_SPARE;
              /* The original posting was truncated here; the remainder is
                 reconstructed from the standard single- and double-parity
                 MTTDL formulas (per Richard's blog) so that it reproduces
                 the tables above. */
              if (num_parity == 1)
                  mttdl = (double)mtbf_hours * mtbf_hours /
                          (total_num_drives * (total_num_drives - 1) * mttr);
              else
                  mttdl = (double)mtbf_hours * mtbf_hours * mtbf_hours /
                          (total_num_drives * (total_num_drives - 1) *
                           (total_num_drives - 2) * mttr * mttr);
              mttdl /= num_instances;  /* MTTDL for S sets = (MTTDL of one)/S */
              mttdl /= 365 * 24;       /* hours -> years */
              printf("  - can have %d (%d+%d) w/ %d spares providing %d GB "
                     "with MTTDL of %.2f years\n",
                     num_instances, num_drives, num_parity, num_spares,
                     num_instances * num_drives * DRIVE_SIZE_GB, mttdl);
          }
      }
      return 0;
  }
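Assuming the reconstructed program above is saved as pool_mttdl.c (file name invented), building and running it looks like this; with NUM_BAYS left at 24 the output should match Kent's 24-bay table:

  cc -std=c99 -lm -o pool_mttdl pool_mttdl.c
  ./pool_mttdl
  # 24 bays w/ 300 GB drives having MTBF=4 years
  #  - can have 8 (2+1) w/ 0 spares providing 4800 GB with MTTDL of 182.50 years
  #  ...

  # To reproduce the other tables, change NUM_BAYS to 4, 8, 12, 16, or 20
  # and rebuild.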
Re: [zfs-discuss] pool analysis
> While it's true that RAIDZ2 is much safer than RAIDZ, it seems that any RAIDZ configuration will outlive me and so I conclude that RAIDZ2 is unnecessary in a practical sense... This conclusion surprises me given the amount of attention people give to double-parity solutions - what am I overlooking?

When talking to Netapp, some of their folks have mentioned that their DP solution wasn't necessarily so useful for handling near-simultaneous disk loss (although it does handle that), but rather that when a disk failed, it would not be uncommon for reconstruction to be unable to read some data off the remaining disks (perhaps a bad sector, or bad data that fails a checksum). With 1P, you have to shut down the volume or leave a hole in the filesystem. With 2P, you reconstruct that one read and continue.

--
Darren Dunham    [EMAIL PROTECTED]
Senior Technical Consultant    TAOS    http://www.taos.com/
Got some Dr Pepper?    San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Re: [zfs-discuss] pool analysis
> Are Netapp using some kind of block checksumming?
>
> They provide an option for it, I'm not sure how often it's used.
>
> If Netapp doesn't do something like [ZFS checksums], that would explain why there's frequently trouble reconstructing, and point up a major ZFS advantage.

Actually, the real problem is uncorrectable errors on drives. On a 1 TB SATA drive, there's a good chance (over 1%) that at least one block will be unreadable once written. Scrubbing tries to catch these, but if an error develops between the last scrub and the need to read the data as part of reconstruction, you're out of luck.

This is the big advantage of RAID-6 / RAIDZ2: the combination of a drive failure and a single-block failure on a second drive won't lead to data loss.
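A rough sanity check on that "over 1%" figure (my arithmetic, with an assumed error rate, not the poster's numbers): consumer SATA drives of that era were commonly specified at one unrecoverable read error per 10^14 bits read. Reading a full 1 TB drive is about 8 x 10^12 bits, so the chance of hitting at least one unreadable sector during a rebuild-sized read is roughly

  P(at least one URE) = 1 - (1 - 10^-14)^(8 x 10^12) ~ 1 - e^-0.08 ~ 7.7%

Under that spec, "over 1%" is if anything conservative; drives rated at one error per 10^15 bits come out nearer 0.8%. Either way, one dead drive plus one unreadable block elsewhere is exactly the combination that double parity covers.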
Re: [zfs-discuss] pool analysis
>> But to understand how to best utilize an array with a fixed number of drives, I add the following constraints:
>>  - N+P should follow the ZFS best-practice rule of N={2,4,8} and P={1,2}
>>  - all sets in an array should be configured similarly
>>  - the MTTDL for S sets is equal to (MTTDL for one set)/S
>
> Yes, these are reasonable and will reduce the problem space, somewhat.

Actually, I wish I could get more insight into why N can only be 2, 4, or 8. In contemplating a 16-bay array, I often think that 3 (3+2) + 1 spare would be perfect, but I have no understanding of what N=3 implies...

>> While it's true that RAIDZ2 is much safer than RAIDZ, it seems that any RAIDZ configuration will outlive me and so I conclude that RAIDZ2 is unnecessary in a practical sense... This conclusion surprises me given the amount of attention people give to double-parity solutions - what am I overlooking?
>
> You are overlooking statistics :-). As I discuss in http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent the MTBF (F == death) of children aged 5-14 in the US is 4,807 years, but clearly no child will live anywhere close to 4,807 years.

Thanks - I hadn't seen that blog entry yet...

>> #define MTTR_HOURS_NO_SPARE 16
>
> I think this is optimistic :-)

Not really for me, as the array is in my basement - so I assume that I'll swap in a drive when I get home from work ;)

> There are many more facets of looking at these sorts of analysis, which is why I wrote RAIDoptimizer.

Is RAIDoptimizer the name of a spreadsheet you developed - is it publicly available?

Thanks,
Kent
Re: [zfs-discuss] pool analysis
Kent Watsen wrote:
>>> #define MTTR_HOURS_NO_SPARE 16
>>
>> I think this is optimistic :-)
>
> Not really for me, as the array is in my basement - so I assume that I'll swap in a drive when I get home from work ;)

Yes, it's interesting how the parameters for home setups differ from professional ones (not meaning to denigrate the professionalism of anybody's home network, of course). We can run to the store and buy something rather quicker than lots of professional outfits seem to be able to get spares in hand. But what if you're away on business that week?

--
David Dyer-Bennet, [EMAIL PROTECTED]; http://dd-b.net/dd-b
Pics: http://dd-b.net/dd-b/SnapshotAlbum, http://dd-b.net/photography/gallery
Dragaera: http://dragaera.info
Re: [zfs-discuss] pool analysis
Kent Watsen wrote:
>>> But to understand how to best utilize an array with a fixed number of drives, I add the following constraints:
>>>  - N+P should follow the ZFS best-practice rule of N={2,4,8} and P={1,2}
>>>  - all sets in an array should be configured similarly
>>>  - the MTTDL for S sets is equal to (MTTDL for one set)/S
>>
>> Yes, these are reasonable and will reduce the problem space, somewhat.
>
> Actually, I wish I could get more insight into why N can only be 2, 4, or 8. In contemplating a 16-bay array, I often think that 3 (3+2) + 1 spare would be perfect, but I have no understanding of what N=3 implies...

There was a discussion a while back which centered around this topic. I don't recall the details, and I think it needs to be revisited, but the consensus was that, for the time being, this gave the best performance. I'd like to revisit this, since I think best performance is more difficult to predict, due to the dynamic nature of ZFS making it particularly sensitive to the workload.

>>> While it's true that RAIDZ2 is much safer than RAIDZ, it seems that any RAIDZ configuration will outlive me and so I conclude that RAIDZ2 is unnecessary in a practical sense... This conclusion surprises me given the amount of attention people give to double-parity solutions - what am I overlooking?
>>
>> You are overlooking statistics :-). As I discuss in http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent the MTBF (F == death) of children aged 5-14 in the US is 4,807 years, but clearly no child will live anywhere close to 4,807 years.
>
> Thanks - I hadn't seen that blog entry yet...

>>>> #define MTTR_HOURS_NO_SPARE 16
>>>
>>> I think this is optimistic :-)
>
> Not really for me, as the array is in my basement - so I assume that I'll swap in a drive when I get home from work ;)

It is an average value, so you have some leeway there. I work from home, so in theory I should have a fast response time :-)

>> There are many more facets of looking at these sorts of analysis, which is why I wrote RAIDoptimizer.
>
> Is RAIDoptimizer the name of a spreadsheet you developed - is it publicly available?

It is a Java application. I plan to open-source it, but that may take a while to get through the process. I'll check to see if there is a way to make it available as a webstart client (which is how I deploy it).
 -- richard
Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal
Adam Leventhal wrote:
> This is a great idea. I'd like to add a couple of suggestions:
>
> It might be interesting to focus on compression algorithms which are optimized for particular workloads and data types, an Oracle database for example.

NB: Oracle 11g has built-in compression. In general, for such problems, solving them closer to the application is better.

> It might be worthwhile to have some sort of adaptive compression whereby ZFS could choose a compression algorithm based on its detection of the type of data being stored.

I think there is fertile ground in this area. As CPU threads approach $0, it might be a good idea to use more of them :-)
 -- richard
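ZFS doesn't make this per-block choice itself today, but a crude approximation of the adaptive idea can be scripted at the dataset level with existing commands: sample how well a dataset actually compresses, then set the property accordingly. A hedged sketch - the dataset name and the policy threshold are invented, not a ZFS feature:

  # Enable lightweight compression, let representative data land, then
  # check the measured ratio to decide whether compression is paying off.
  zfs set compression=lzjb tank/oradata
  # ... write some representative data ...
  ratio=$(zfs get -H -o value compressratio tank/oradata)
  echo "tank/oradata compresses at ${ratio}"
  # If the ratio is poor (say below ~1.2x), turn compression back off;
  # newly written blocks will then be stored uncompressed.
  # zfs set compression=off tank/oradata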
[zfs-discuss] [AVS] Question concerning reverse synchronization of a zpool
Hi,

I'm struggling to get stable ZFS replication using Solaris 10 11/06 (with current patches) and AVS 4.0 for several weeks now. We tried it on VMware first and ended up in kernel panics en masse (yes, we read Jim Dunham's blog articles :-). Now we try on the real thing, two X4500 servers. Well, I have no trouble reproducing our kernel panics there, too ... but I think I learned some important things as well. One problem is still remaining, though.

I have a zpool on host A. Replication to host B works fine.

 * zpool export tank on the primary - works.
 * sndradm -d on both servers - works (paranoia mode)
 * zpool import <id> on the secondary - works.

So far, so good. I change the contents of the file system, add some files, delete some others ... no problems. The secondary is in production use now, everything is fine.

Okay, let's imagine I switched to the secondary host because I had a problem with the primary. Now it's repaired, and I want my redundancy back.

 * sndradm -E -f on both hosts - works.
 * sndradm -u -r on the primary, to refresh the primary - works. `nicstat` shows me a bit of traffic.

Good, let's switch back to the primary. Actual status: the zpool is imported on the secondary and NOT imported on the primary.

 * zpool export tank on the secondary - *kernel panic*

Sadly, the machine dies fast, so I don't see the kernel panic with `dmesg`. And disabling the replication again later and mounting the zpool on the primary again shows me that the update sync didn't take place; the file system changes I made on the secondary weren't replicated. Exporting the zpool on the secondary works *after* the system rebooted.

I use slices for the zpool, not LUNs, because I think many of my problems were caused by exclusive locking, but it doesn't help with this one.

Questions:

a) I don't understand why the kernel panics at the moment. The zpool isn't mounted on both systems, and the zpool itself seems to be fine after a reboot ... and switching the primary and secondary hosts just for resyncing seems to force a full sync, which isn't an option.

b) I'll try a sndradm -m -r next time ... but I'm not sure if I like that thought. I would accept this if I replaced the primary host with another server, but having to do a 24 TB full sync just because the replication itself had been disabled for a few minutes would be hard to swallow. Or did I do something wrong?

c) What performance can I expect from an X4500 with a 40-disk zpool when using slices, compared to LUNs? Any experiences?

And another thing: I did some experiments with zvols, because I wanted to make disasters and the AVS configuration itself easier to handle - there won't be a full sync after replacing a disk, because AVS doesn't see that a hot spare is being used, and hot spares won't be replicated to the secondary host even though the original drive on the secondary never failed. I used the zvol with UFS, and this kind of hardware-RAID-controller emulation by ZFS works pretty well, but the performance went off a cliff. Sunsolve told me that this is a flushing problem and there's a workaround in Nevada build 53 and higher. Has somebody done a comparison, can you share some experiences? I only have a few days left and I don't want to waste time installing Nevada for nothing ...

Thanks,

Ralf

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
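For readers trying to follow the failover/failback sequence, here is the workflow Ralf describes, condensed into one place. Host roles, the pool name, and the volume-set file are invented placeholders; the sndradm flags (-d, -E, -f, -u, -r, -m) are the ones from the post, and the exact volume-set arguments will depend on the actual SNDR configuration:

  ## Failover: move the pool to the secondary (host B)
  # on host A (primary)
  zpool export tank
  sndradm -d                    # disable the replication sets ("paranoia mode")
  # on host B (secondary)
  sndradm -d
  zpool import tank             # secondary is now in production

  ## Failback: resync the repaired primary, then move the pool home
  # on both hosts
  sndradm -E -f <volset-file>   # re-enable the sets without a full sync
  # on host A (primary)
  sndradm -u -r                 # reverse update sync: pull B's changes back to A
  # on host B (secondary)       <-- the step that panics in Ralf's setup
  zpool export tank
  # on host A (primary)
  zpool import tank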