Re: [zfs-discuss] ZFS and SAN
Christophe Rolland wrote:
> Hi all, we are considering using ZFS for various storage (DB, etc.). Most features are great, especially the ease of use. Nevertheless, a few questions:
> - We are using SAN disks, so most JBOD recommendations don't apply, but I did not find many reports of zpools of a few terabytes on LUNs... anybody?

Many X4500 customers have many TBytes of storage under ZFS (JBOD).

> - We cannot remove a device from a pool, so there is no way of correcting the attachment of a 200 GB LUN to a 6 TB pool on which Oracle runs... am I the only one worrying?

There is no way to prevent you from running rm -rf / either. In a real production environment, using best practices, you would never type such commands -- you would always script them and test on the test environment prior to rolling into production.

> - On a Sun Cluster, LUNs are seen on both nodes. Can we prevent mistakes like creating a pool on already assigned LUNs? For example, Veritas wants a force flag. With ZFS I can do:
>   node1: zpool create X lun1 lun2
>   node2: zpool create Y lun1 lun2
> and then the results are unexpected, but pool X will never switch again ;-) resource and zone are dead.

We've had some informal discussions on how to do this. Currently, zpool and other commands which manage disk partitioning (e.g. format) use libdiskmgt calls to determine whether the slices or partitions are in use on the local machine. For shared storage, we won't know whether another machine might be using a slice. For Solaris Cluster, we could write an extended protocol for checking with the other nodes in the cluster. However, even this does not work for the general case, such as a SAN or a SAN with heterogeneous nodes.
 -- richard

> - What could be some interesting tools to test I/O performance? Did someone run iozone and publish a baseline, modifications, and the corresponding results?
> Well, anyway, thanks to the zfs team :D
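Picking up Richard's point that in-use checking stops at the local machine: until a cross-node protocol exists, the steps below are a hedged sketch of what an administrator can do by hand before creating a pool on shared LUNs. The LUN names are examples only.

# Ask ZFS whether any of the shared LUNs already carry pool labels from
# another host; an importable pool showing up here means they are not free.
zpool import

# libdiskmgt-based checks only cover this machine, so also confirm the
# intended LUNs are not part of a local pool, and list what is visible.
zpool status
format </dev/null          # non-interactive listing of visible disks

# Only after the devices are confirmed free on every node:
zpool create X c4t600A0B8000112233d0 c4t600A0B8000112244d0   # example LUNs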
Re: [zfs-discuss] ZFS Performance Issue
It does. The file size is limited to the original creation size, which is 65k for files with 1 data sample. Unfortunately, I have zero experience with dtrace and only a little with truss. I'm relying on the dtrace scripts from people on this thread to get by for now!
Re: [zfs-discuss] ZFS and SAN
Hi Robert, thanks for the answer.

> You are not the only one. It's somewhere on the ZFS developers list...

Yes, I checked this on the whole list. So, let's wait for the feature.

> Actually it should complain, and -f (force) will have to be used.

On the active node, yes. But if we want to reuse the LUNs on the other node, there is no warning.

> CR> - what could be some interesting tools to test I/O performance?
> Check out filebench (included with recent SXCE).

I'll try it. Thanks for your answer,
christophe
Re: [zfs-discuss] ZFS and SAN
Hello Christophe,

Friday, February 1, 2008, 7:55:31 PM, you wrote:

CR> Hi all, we are considering using ZFS for various storage (DB, etc.). Most features are great, especially the ease of use. Nevertheless, a few questions:
CR> - We are using SAN disks, so most JBOD recommendations don't apply, but I did not find many reports of zpools of a few terabytes on LUNs... anybody?

Just works.

CR> - We cannot remove a device from a pool, so there is no way of correcting the attachment of a 200 GB LUN to a 6 TB pool on which Oracle runs... am I the only one worrying?

You are not the only one. It's somewhere on the ZFS developers list...

CR> - On a Sun Cluster, LUNs are seen on both nodes. Can we prevent mistakes like creating a pool on already assigned LUNs? For example, Veritas wants a force flag. With ZFS I can do:
CR>   node1: zpool create X lun1 lun2
CR>   node2: zpool create Y lun1 lun2
CR> and then the results are unexpected, but pool X will never switch again ;-) resource and zone are dead.

Actually it should complain, and the -f (force) option will have to be used to do something like the above.

CR> - What could be some interesting tools to test I/O performance? Did someone run iozone and publish a baseline, modifications, and the corresponding results?

Check out filebench (included with recent SXCE). Also check the list archives and various blogs (including mine) for some ZFS benchmarks.

--
Best regards,
Robert Milkowski
mailto:[EMAIL PROTECTED]
http://milek.blogspot.com
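As a starting point for the iozone question, here is a hedged sketch of a simple baseline run; the mount point and sizes are examples, and filebench (as Robert suggests) is the more complete option.

# Sequential write (test 0) and read (test 1) baseline on a ZFS filesystem.
# Use a file size large enough to get past the ARC; path and sizes are
# examples only.
cd /tank/benchtest
iozone -i 0 -i 1 -r 128k -s 8g -f iozone.tmp

# Repeat the same command after each change (recordsize, vdev layout, ...)
# and keep the outputs so the baseline and modified runs can be compared.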
Re: [zfs-discuss] UFS on zvol Cache Questions...
Priming the cache for ZFS should work, at least right after boot when freemem is large; any read block will make it into the cache. Post boot, when memory is already primed with something else (what?), it gets more difficult for both UFS and ZFS to guess what to keep in the caches. Did you try priming ZFS after boot?

Next, you seem to suffer because your sequential writes to log files appear to displace the more useful DB files from the ARC (I'd be interested to see if this still occurs after you've primed the ZFS cache after boot). Note that if your logfile write rate is huge (dd-like), then ZFS cache management will suffer, but that is well on its way to being fixed. For DS, I would think that the log rate would be more reasonable and that your storage is able to keep up. That gives ZFS cache management a fighting chance to keep the reused data over the sequential writes.

If the default behavior is not working for you, we'll need to consider the ARC behavior in this case. I don't see why it should not work out of the box. But manual control will come also, in the form of this DIO-like feature:

  6429855 Need way to tell ZFS that caching is a lost cause

While we try to solve your problem out of the box, you might also run a background process that keeps priming the cache at a low I/O rate. Not a great workaround, but it should be effective.

-r

Brad Diggs writes:

> Hello Darren, please find responses inline below...
>
> On Fri, 2008-02-08 at 10:52 +0000, Darren J Moffat wrote:
>> Brad Diggs wrote:
>>> I would like to use ZFS, but with ZFS I cannot prime the cache and I don't have the ability to control what is in the cache (e.g. like with the directio UFS option).
>>
>> Why do you believe you need that at all?
>
> My application is directory server. The #1 resource that directory needs to make maximum utilization of is RAM. In order to do that, I want to control every aspect of RAM utilization, both to safely use as much RAM as possible AND to avoid contention among things trying to use RAM.
>
> Let's consider the following example. A customer has a 50M entry directory. The sum of the data (db3 files) is approximately 60GB. However, there is another 2GB for the root filesystem, 30GB for the changelog, 1GB for the transaction logs, and 10GB for the informational logs. The system on which directory server will run has only 64GB of RAM. The system is configured with the following partitions:
>
>   FS      Used(GB)  Description
>   /       2         root
>   /db     60        directory data
>   /logs   41        changelog, txn logs, and info logs
>   swap    10        system swap
>
> I prefer to keep the directory db cache and entry caches relatively small, so the db cache is 2GB and the entry cache is 100M. This leaves roughly 63GB of RAM for my 60GB of directory data and Solaris. The only way to ensure that the directory data (/db) is the only thing in the filesystem cache is to set directio on / (root) and /logs.
>
>> What do you do to prime the cache with UFS
>
>   cd ds_instance_dir/db
>   for i in `find . -name '*.db3'`
>   do
>     dd if=${i} of=/dev/null
>   done
>
>> and what benefit do you think it is giving you?
>
> Priming the directory server data into the filesystem cache reduces LDAP response time for directory data in the filesystem cache. This could mean the difference between a sub-ms response time and a response time on the order of tens or hundreds of ms, depending on the underlying storage speed. For telcos in particular, minimal response time is paramount. Another common scenario is when we do benchmark bakeoffs with another vendor's product.
>
> If the data isn't pre-primed, then LDAP response time and throughput will be artificially degraded until the data is primed into either the filesystem or directory (db or entry) cache. Priming via LDAP operations can take many hours or even days depending on the number of entries in the directory server. However, priming the same data via dd takes minutes to hours depending on the size of the files. As you know, in benchmarking scenarios time is the most limited resource that we typically have. Thus, priming via dd is much preferred.
>
> Lastly, in order to achieve optimal use of available RAM, we use directio for the root (/) and other non-data filesystems. This makes certain that the only data in the filesystem cache is the directory data.
>
>> Have you tried just using ZFS and found it doesn't perform as you need, or are you assuming it won't because it doesn't have directio?
>
> We have done extensive testing with ZFS and love it. The three areas lacking for our use cases are as follows:
> * No ability to control what is in cache, e.g. no directio.
> * No absolute ability to apply an upper boundary to the amount of RAM consumed by ZFS. I know that the arc cache has a control that
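Along the lines of Roch's suggestion of a background process that keeps priming the cache at a low I/O rate, here is a hedged sketch; the instance path, interval, and block size are assumptions to adapt to the real deployment.

#!/bin/sh
# Hypothetical low-rate cache primer. Paths and delays are examples only.
DB_DIR=/db/ds_instance_dir/db

while true
do
    for f in `find "$DB_DIR" -name '*.db3'`
    do
        # Read each database file to pull its blocks (back) into the ARC.
        dd if="$f" of=/dev/null bs=128k 2>/dev/null
        # Throttle so the priming I/O does not compete with live traffic.
        sleep 5
    done
    # Re-prime periodically in case other activity evicted the data.
    sleep 600
done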
Re: [zfs-discuss] Hardware RAID vs. ZFS RAID
>> With my (COTS) LSI 1068 and 1078 based controllers I get consistently better performance when I export all disks as JBOD (MegaCli -CfgEachDskRaid0).
>
> Is that really 'all disks as JBOD', or is it 'each disk as a single-drive RAID0'?

Single-disk RAID0: ./MegaCli -CfgEachDskRaid0 Direct -a0

> It may not sound different on the surface, but I asked in another thread and others confirmed that if your RAID card has a battery-backed cache, giving ZFS many single-drive RAID0s is much better than JBOD (using the 'nocacheflush' option may even improve it more). My understanding is that it's kind of like the best of both worlds: you get the higher number of spindles and vdevs for ZFS to manage, ZFS gets to do the redundancy, and the HW RAID cache gives virtually instant acknowledgement of writes, so that ZFS can be on its way. So I think many RAID0s is not always the same as JBOD. That's not to say that even true JBOD doesn't still have an advantage over HW RAID; I don't know that for sure.

I have tried mixing hardware and ZFS RAID, but it just doesn't make sense from a performance or redundancy standpoint; why would we add those layers of complexity? In this case I'm building nearline storage, so there isn't even a battery attached and I have disabled any caching on the controller. I have a Sun SAS HBA on the way, which is what I would ultimately use for my JBOD attachment. But I think there is a use for HW RAID in ZFS configs, which wasn't always the theory I've heard.

>> I have really learned not to do it this way with raidz and raidz2:
>> # zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0
>
> Why? I know creating raidz's with more than 9-12 devices isn't recommended, but that doesn't cross that threshold. Is there a reason you'd split 8 disks up into 2 groups of 4? What experience led you to this? (Just so I don't have to repeat it. ;) )

I don't know why, but with most setups I have tested (8- and 16-drive configs), dividing into 4 disks per raidz vdev and 5 per raidz2 vdev performs better. Take a look at my simple dd test below (filebench results as soon as I can figure out how to get it working properly with SOL10).
= 8 SATA 500gb disk system with LSI 1068 (megaRAID ELP) - no BBU =

bash-3.00# zpool history
History for 'pool0-raidz':
2008-02-11.16:38:13 zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

bash-3.00# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
pool0-raidz   117K  3.10T  42.6K  /pool0-raidz

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo0 bs=8192 count=131072; time sync
131072+0 records in
131072+0 records out
real 0m1.768s   user 0m0.080s   sys 0m1.688s
real 0m3.495s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/pool0-raidz/rw-test.lo0 bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m6.994s   user 0m0.097s   sys 0m2.827s
real 0m1.043s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo1 bs=8192 count=655360; time sync
655360+0 records in
655360+0 records out
real 0m24.064s  user 0m0.402s   sys 0m8.974s
real 0m1.629s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/pool0-raidz/rw-test.lo1 bs=8192; time sync
655360+0 records in
655360+0 records out
real 0m40.542s  user 0m0.476s   sys 0m16.077s
real 0m0.617s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/dev/null bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m3.443s   user 0m0.084s   sys 0m1.327s
real 0m0.013s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/dev/null bs=8192; time sync
655360+0 records in
655360+0 records out
real 0m15.972s  user 0m0.413s   sys 0m6.589s
real 0m0.013s   user 0m0.001s   sys 0m0.012s

---

bash-3.00# zpool history
History for 'pool0-raidz':
2008-02-11.17:02:16 zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
2008-02-11.17:02:51 zpool add pool0-raidz raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0

bash-3.00# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
pool0-raidz   110K  2.67T  36.7K  /pool0-raidz

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo0 bs=8192 count=131072; time sync
131072+0 records in
131072+0 records out
real 0m1.835s   user 0m0.079s   sys 0m1.687s
real 0m2.521s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/pool0-raidz/rw-test.lo0 bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m2.376s   user 0m0.084s   sys 0m2.291s
real 0m2.578s   user 0m0.001s   sys 0m0.013s

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo1 bs=8192 count=655360; time sync
655360+0 records in
655360+0 records out
real 0m19.531s  user 0m0.404s   sys 0m8.731s
real 0m2.255s   user 0m0.001s   sys
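For readers who want to repeat the comparison, here is a hedged sketch of the two layouts from the zpool history above, plus the 'nocacheflush' setting mentioned earlier in the thread. Device names follow the post; the /etc/system tunable name is an assumption to verify on your build, and it is only safe when every device sits behind a battery-backed (non-volatile) write cache.

# Layout A: one 8-disk raidz vdev.
zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

# Layout B: two 4-disk raidz vdevs in the same pool; more vdevs usually
# means more concurrent I/O streams at the cost of some usable capacity.
zpool destroy pool0-raidz
zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
zpool add    pool0-raidz raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0

# Only with a BBU-protected controller cache: tell ZFS to skip cache
# flushes (the 'nocacheflush' option referenced above; name assumed).
echo 'set zfs:zfs_nocacheflush = 1' >> /etc/system    # then reboot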
[zfs-discuss] 3ware support
Good morning all,

Can anyone confirm that 3ware RAID controllers are indeed not working under Solaris/OpenSolaris? I can't seem to find them in the HCL. We're now using a 3ware 9550SX as a SATA RAID controller. The original plan was to disable all its RAID functions and use just the SATA controller functionality for ZFS deployment. If 3ware indeed isn't supported, I have to buy a new controller. Any specific controller/brand you can recommend for Solaris?

--
Met vriendelijke groeten / With kind regards,

Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 76 201 1179
E [EMAIL PROTECTED]
Re: [zfs-discuss] scrub halts
The latest changes to the sata and marvell88sx modules have been put back to Solaris Nevada and should be available in the next build (build 84). Hopefully, those of you who use it will find the changes helpful.
Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]
Enrico,

> Is there any forecast to improve the efficiency of the replication mechanisms of ZFS?

Fishworks - the new NAS release.

I would take some time to talk with the customer and understand exactly what their expectations are for replication. I would not base the decision on the cost of replicating 10 bytes, regardless of how inefficient it may be. These two documents should help:

http://www.sun.com/storagetek/white-papers/data_replication_strategies.pdf
http://www.sun.com/storagetek/white-papers/enterprise_continuity.pdf

Two key metrics of replication are:

Recovery Point Objective (RPO) is the amount of data lost (or less), measured as a unit of time. Once-a-day backups yield a 24-hour RPO, once-an-hour snapshots yield roughly a 1-hour RPO, asynchronous replication yields an RPO of zero seconds to a few minutes, and synchronous replication means a zero-second RPO.

Recovery Time Objective (RTO) is the amount of time after a failure until normal operations are restored. Tape backups could take minutes to hours; local snapshots could be nearly instantaneous, assuming the local site survived the failure; remote snapshots or replicas could take minutes, hours or days, depending on the amount of data to resynchronize, impacted by network bandwidth and latency.

Availability Suite has a unique feature in this last area, called on-demand pull. Assuming that the primary site's volumes are lost, after they have been re-provisioned a reverse update can be initiated. Besides the background resilvering in the reverse direction being active, eventually restoring all lost data, on-demand pull performs synchronous replication of data blocks on demand, as needed by the filesystem, database or application. Although the performance will be less than synchronous replication, the RTO is quite low. This type of recovery is analogous to losing one's entire email account and having recovery initiated, but selected email can also be opened as needed before the entire volume is restored, using on-demand requests to satisfy data blocks for the relevant email requests.

Jim

> Considering the solution we are offering to our customer (5 remote sites replicating to one central data-center) with ZFS (the cheapest solution), should I count on 3 times the network load of a solution based on SNDR/AVS, and 3 times the storage space too... correct? Is there any documentation on that? Thanks
>
> Richard Elling wrote:
>> Enrico Rampazzo wrote:
>>> Hello, I'm offering a solution based on our disks where replication and storage management should be done using only ZFS... The test changes a few bytes in one file (10 bytes) and checks how many bytes the source sends to the target. The customer tried replication between 2 volumes... They compared ZFS replication with TrueCopy replication and came to the following conclusions:
>>> 1. ZFS uses a bigger block than HDS TrueCopy
>>
>> ZFS uses dynamic block sizes. Depending on the configuration and workload, just a few disk blocks will change, or a bunch of redundant metadata might change. In either case, changing the ZFS recordsize will make little, if any, difference.
>>
>>> 2. TrueCopy sends 32 Kbytes and ZFS 100K or more when changing only 10 file bytes. Can we configure ZFS to improve replication efficiency?
>>
>> By default, ZFS writes two copies of metadata. I would not recommend reducing this because it will increase your exposure to faults. What may be happening here is that a 10-byte write may cause a metadata change resulting in a minimum of three 512-byte physical blocks being changed. The metadata copies are spatially diverse, so you may see these three blocks starting at non-contiguous boundaries. If TrueCopy sends only 32-kByte blocks (speculation), then the remote transfer will be 96 kBytes for 3 local, physical block writes. OTOH, ZFS will coalesce writes, so you may be able to update a number of files yet still only replicate 96 kBytes through TrueCopy. YMMV. Since the customer is performing replication, I'll assume they are very interested in data protection, so keeping the redundant metadata is a good idea. The customer should also be aware that replication at the application level is *always* more efficient than replicating somewhere down the software stack where you lose data context.
>> -- richard
>>
>>> The solution should consider 5 remote sites replicating to one central data-center. Considering the ZFS block overhead, the customer is thinking of buying a solution based on traditional storage arrays like HDS entry-level arrays (our 2530/2540). If so, with ZFS the network traffic and storage space become big problems for the customer's infrastructure. Is there any documentation explaining the internal ZFS replication mechanism to address the customer's doubts? Do we need AVS in our solution to solve the problem? Thanks
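As an aside (not from the thread itself): ZFS-level replication is normally built from snapshots plus incremental zfs send/receive, which only ships changed blocks after the initial copy. A minimal sketch, with hypothetical dataset names and remote host:

# Initial full copy of the dataset to the remote site (names are examples).
zfs snapshot tank/data@base
zfs send tank/data@base | ssh central-dc zfs receive backup/site1/data

# Later: send only the blocks that changed since the previous snapshot.
zfs snapshot tank/data@hourly1
zfs send -i tank/data@base tank/data@hourly1 | \
    ssh central-dc zfs receive backup/site1/data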
Re: [zfs-discuss] Real time mirroring
Have you looked at AVS? (http://opensolaris.org/os/project/avs/)
Re: [zfs-discuss] ZFS Performance Issue
> Is deleting the old files/directories in the ZFS file system sufficient, or do I need to destroy/recreate the pool and/or file system itself? I've been doing the former.

The former should be sufficient; it's not necessary to destroy the pool.

-j
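A tiny sketch of the two options being contrasted, with made-up pool/dataset names; clearing or recreating the dataset is enough, and the pool can stay.

# Option discussed above: just clear out the old test data.
rm -rf /tank/bench/*

# Heavier alternative, rarely needed: recreate only the dataset,
# which is still far cheaper than destroying the whole pool.
zfs destroy tank/bench
zfs create tank/bench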
Re: [zfs-discuss] ZFS Performance Issue
I ran this dtrace script and got no output. Any ideas?
Re: [zfs-discuss] Real time mirroring
Well, 5 minutes after posting that the resilver completed. However, despite it saying that the resilver completed with 0 errors ten minutes ago, the device still shows as unavailable and my pool is still degraded.
Re: [zfs-discuss] Real time mirroring
Found my first problems with this today. The ZFS mirror appears to work fine, but if you disconnect one of the iSCSI targets it hangs for 5 minutes or more. I'm also seeing very concerning behaviour when attempting to re-attach the missing disk.

My test scenario is:
- Two 35GB iSCSI targets are being shared using ZFS shareiscsi=on
- They are imported on a 3rd Solaris box and used to create a mirrored ZFS pool
- I use that to mount an NFS share, and connect to that with VMware ESX server

My first test was to clone a virtual machine onto the new volume. That appeared to work fine, so I decided to test the mirroring. I started another clone operation, then powered down one of the iSCSI targets. The clone operation seemed to hang as soon as I did that, so I ran zpool status to see what was going on. The news wasn't good: that hung too. Nothing happened in either window for a good 5 minutes, then ESX popped up with an error saying the virtual disk is either corrupted or not a supported format, and at the exact same time the zpool status command completed, but showed that all the drives were still ONLINE. I immediately re-ran zpool status; now it reported that one iSCSI device was offline and the pool was running in a degraded state.

So, for some reason it took 5 minutes for the iSCSI device to go offline, it locked up ZFS for that entire time, and ZFS reported the wrong status the first time around too. The only good news is that now that ZFS is in a degraded state I can start the clone operation again, and it completes fine with just half of the mirror available.

Next, I powered on the missing server, checked format </dev/null to ensure the drives had re-connected, and used zpool online to re-attach the missing disk. So far it's taken over an hour to attempt to resilver files from a 10 minute copy, and the progress report is up and down like a yo-yo. The progress reporting from ZFS so far has been:
- 2.25% done, 0h13m to go
- 7.20% done, 0h12m to go
- 6.14% done, 0h8m to go (odd, how does it go down?)
...
- 78.50% done, 0h2m to go
- 41.67% done, 0h8m to go (huh?)
...
- 72.45% done, 0h3m to go
- 42.42% done, 0h9m to go

Getting concerned now; I'm actually wondering if this is ever going to complete, and I have no idea if these problems are ZFS or iSCSI related.
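For anyone wanting to reproduce a setup like this, here is a hedged sketch of the moving parts Ross describes. Target addresses, pool, dataset, and device names are all made up; real iSCSI LUNs show up with long cXt<GUID>d0 names, so substitute those.

# On each of the two target boxes: carve a 35GB zvol and export it over iSCSI.
zfs create -V 35g tank/iscsi-mirror
zfs set shareiscsi=on tank/iscsi-mirror

# On the third (initiator) box: discover both targets and build the mirror.
iscsiadm add discovery-address 192.168.1.11:3260
iscsiadm add discovery-address 192.168.1.12:3260
iscsiadm modify discovery --sendtargets enable
devfsadm -i iscsi                          # create device nodes for the new LUNs

zpool create vmpool mirror c4t0d0 c5t0d0   # substitute the real iSCSI device names
zfs create vmpool/vmfs
zfs set sharenfs=on vmpool/vmfs            # NFS datastore for the ESX host

# After the failed target comes back, re-attach its device and watch the resilver.
zpool online vmpool c5t0d0
zpool status vmpool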
Re: [zfs-discuss] iSCSI target using ZFS filesystem as backing
Ross wrote:
> Bleh, found out why they weren't appearing. I was just creating a regular ZFS filesystem and setting shareiscsi=on. If you create a volume it works fine...
> I wonder if that's something that could do with being added to the documentation for shareiscsi? I can see now that all the examples of how to use it are using the zfs create -V command, but I can't find anything that explicitly states that shareiscsi needs a fixed-size volume.
> Should ZFS generate an error if somebody tries to set shareiscsi=on for a filesystem that doesn't support that property?

My initial reaction was yes; however, there is a case where you want to set shareiscsi=on for a filesystem. Setting it on a filesystem allows it to be inherited by any volumes created below that point in the hierarchy. Let's take this fictional, but reasonable, dataset hierarchy:

tank/volumes/template/solaris
tank/volumes/template/linux
tank/volumes/template/windows
tank/volumes/archive/
tank/volumes/active/host-abc
tank/volumes/active/host-xyz

tank is the pool name. volumes is a dataset (with canmount=off if you like). template, archive and active are also datasets (again canmount=off). The actual volumes are: solaris, linux, windows, host-abc, host-xyz.

So where do we turn on iSCSI sharing? It could be done at the individual volume layer:

zfs set shareiscsi=on tank/volumes/template/solaris
zfs set shareiscsi=on tank/volumes/template/linux
zfs set shareiscsi=on tank/volumes/template/windows
...

or just do it up at the volumes dataset layer:

zfs set shareiscsi=on tank/volumes

Aside: having canmount=off on tank/volumes may or may not be a good idea, but that depends on the local deployment.

--
Darren J Moffat
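A minimal sketch of building that hierarchy and letting the property inherit, assuming a pool named tank; the device and volume sizes are made up.

# Build Darren's fictional hierarchy (example device, example sizes).
zpool create tank c0t1d0
zfs create -o canmount=off tank/volumes
zfs create -o canmount=off tank/volumes/template
zfs create -o canmount=off tank/volumes/archive
zfs create -o canmount=off tank/volumes/active

# One property set high in the tree is inherited by every zvol below it.
zfs set shareiscsi=on tank/volumes

# These volumes are exported as iSCSI targets automatically on creation.
zfs create -V 8g tank/volumes/template/solaris
zfs create -V 8g tank/volumes/active/host-abc

# Confirm where each dataset's shareiscsi value comes from.
zfs get -r shareiscsi tank/volumes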
Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]
Is there any forecast to improve the efficiency of the replication mechanisms of ZFS?

Fishworks - the new NAS release.

Considering the solution we are offering to our customer (5 remote sites replicating to one central data-center) with ZFS (the cheapest solution), should I count on 3 times the network load of a solution based on SNDR/AVS, and 3 times the storage space too... correct? Is there any documentation on that? Thanks

Richard Elling wrote:
> Enrico Rampazzo wrote:
>> Hello, I'm offering a solution based on our disks where replication and storage management should be done using only ZFS... The test changes a few bytes in one file (10 bytes) and checks how many bytes the source sends to the target. The customer tried replication between 2 volumes... They compared ZFS replication with TrueCopy replication and came to the following conclusions:
>> 1. ZFS uses a bigger block than HDS TrueCopy
>
> ZFS uses dynamic block sizes. Depending on the configuration and workload, just a few disk blocks will change, or a bunch of redundant metadata might change. In either case, changing the ZFS recordsize will make little, if any, difference.
>
>> 2. TrueCopy sends 32 Kbytes and ZFS 100K or more when changing only 10 file bytes. Can we configure ZFS to improve replication efficiency?
>
> By default, ZFS writes two copies of metadata. I would not recommend reducing this because it will increase your exposure to faults. What may be happening here is that a 10-byte write may cause a metadata change resulting in a minimum of three 512-byte physical blocks being changed. The metadata copies are spatially diverse, so you may see these three blocks starting at non-contiguous boundaries. If TrueCopy sends only 32-kByte blocks (speculation), then the remote transfer will be 96 kBytes for 3 local, physical block writes. OTOH, ZFS will coalesce writes, so you may be able to update a number of files yet still only replicate 96 kBytes through TrueCopy. YMMV. Since the customer is performing replication, I'll assume they are very interested in data protection, so keeping the redundant metadata is a good idea. The customer should also be aware that replication at the application level is *always* more efficient than replicating somewhere down the software stack where you lose data context.
> -- richard
>
>> The solution should consider 5 remote sites replicating to one central data-center. Considering the ZFS block overhead, the customer is thinking of buying a solution based on traditional storage arrays like HDS entry-level arrays (our 2530/2540). If so, with ZFS the network traffic and storage space become big problems for the customer's infrastructure. Is there any documentation explaining the internal ZFS replication mechanism to address the customer's doubts? Do we need AVS in our solution to solve the problem? Thanks
Re: [zfs-discuss] scrub halts
On Feb 12, 2008 4:45 AM, Lida Horn [EMAIL PROTECTED] wrote:
> The latest changes to the sata and marvell88sx modules have been put back to Solaris Nevada and should be available in the next build (build 84). Hopefully, those of you who use it will find the changes helpful.

I have indeed found it beneficial. I installed the new drivers on two machines, both of which were intermittently giving errors about device resets. One card did this so often that I believed the card was faulty and I would have to replace either the card or the motherboard. Since installing the new drivers I've had no issues whatsoever with drives on either box. I ran zpool scrubs continuously on the flaky box, replaced a disk with another one, and copied data about in an attempt to replicate the bus errors I had previously seen, to no avail. The other box has been similarly stable, as far as I can tell; I see no messages in the logs and the users haven't complained when I asked them.

Thank you for the work you've put into improving the state of these drivers; I meant to email you earlier this week and mention the great strides they have made, but other things took precedence. That, to my mind, is the primary improvement these drivers have made: I don't have to worry about my HBAs any more.

Thanks!
Will
Re: [zfs-discuss] Real time mirroring
Well, I got it working, but not in a tidy way. I'm running HA-ZFS here, so I moved the ZFS pool over to the other node in the cluster. That had exactly the same problems, however: the iSCSI disks were unavailable. Then I found an article from November 2006 (http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html) saying that the iSCSI initiator won't reconnect until you reboot. I rebooted one node of the cluster, then swapped ZFS back over to there and voila! Fully working mirrored storage again.

So I guess it's an iSCSI initiator problem, in that it doesn't reconnect properly to a rebooted target, but it's not a particularly stable solution at this stage.
Re: [zfs-discuss] iSCSI target using ZFS filesystem as backing
Bleh, found out why they weren't appearing. I was just creating a regular ZFS filesystem and setting shareiscsi=on. If you create a volume it works fine...

I wonder if that's something that could do with being added to the documentation for shareiscsi? I can see now that all the examples of how to use it are using the zfs create -V command, but I can't find anything that explicitly states that shareiscsi needs a fixed-size volume.

Should ZFS generate an error if somebody tries to set shareiscsi=on for a filesystem that doesn't support that property?
Re: [zfs-discuss] Is swap still needed on c0d0s1 to get crash dumps?
Thank you for your info. So with dumpadm I can manage crash dumps, and if ZFS is not capable of handling those dumps, who cares; I will just create an extra slice for that purpose. No problem.

Roman
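For reference, a hedged sketch of pointing crash dumps at a dedicated slice with dumpadm; the slice and savecore directory are examples only.

# Use a dedicated slice as the dump device instead of ZFS-backed swap.
dumpadm -d /dev/dsk/c0d0s1

# Keep savecore writing crash dumps under a per-host directory.
dumpadm -s /var/crash/`hostname`

# Verify the resulting crash dump configuration.
dumpadm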