Re: [zfs-discuss] ZFS monitoring
On Mon, Feb 11, 2013 at 05:39:27PM +0100, Jim Klimov wrote: On 2013-02-11 17:14, Borja Marcos wrote: On Feb 11, 2013, at 4:56 PM, Tim Cook wrote: The zpool iostat output has all sorts of statistics I think would be useful/interesting to record over time. Yes, thanks :) I think I will add them, I just started with the esoteric ones. Anyway, still there's no better way to read it than running zpool iostat and parsing the output, right? I believe, in this case you'd have to run it as a continuous process and parse the outputs after the first one (overall uptime stat, IIRC). Also note that on problems with ZFS engine itself, zpool may lock up and thus halt your program - so have it ready to abort an outstanding statistics read after a timeout and perhaps log an error. And if pools are imported-exported during work, the zpool iostat output changes dynamically, so you basically need to parse its text structure every time. The zpool iostat -v might be even more interesting though, as it lets you see per-vdev statistics and perhaps notice imbalances, etc... All that said, I don't know if this data isn't also available as some set of kstats - that would probably be a lot better for your cause. Inspect the zpool source to see where it gets its numbers from... and perhaps make and RTI relevant kstats, if they aren't yet there ;) On the other hand, I am not certain how Solaris-based kstats interact or correspond to structures in FreeBSD (or Linux for that matter)?.. I made kstat data available on FreeBSD via 'kstat' sysctl tree: # sysctl kstat -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! http://tupytaj.pl pgpyFGpZBBFM1.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
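Jim's advice about running zpool iostat as a continuous process can be sketched as below. This is a hedged sketch, not the thread's actual tooling: the pool name, interval, and the 7-column per-pool row layout are assumptions based on typical zpool iostat output, and the parsing is split into a function so it can be exercised on canned text.

```shell
# Sketch: consume "zpool iostat <pool> <interval>" continuously.
# Remember the first sample is the since-boot average, so a real
# consumer should drop it before recording statistics.
# parse_iostat keeps only per-pool data rows (7 fields, not the
# header row starting with "pool", not the dashed separator rows)
# and emits "name,read-ops,write-ops".
parse_iostat() {
    awk 'NF == 7 && $1 != "pool" && $1 !~ /^-/ {
        printf "%s,%s,%s\n", $1, $4, $5
    }'
}

# Live usage (commented out; needs a real pool).  timeout(1) guards
# against zpool hanging if the ZFS engine itself locks up, as Jim warns:
# timeout 30 zpool iostat tank 5 | parse_iostat
```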
Re: [zfs-discuss] Can I create a mirror for a root rpool?
On Mon, Dec 19, 2011 at 10:18:05AM +, Darren J Moffat wrote: On 12/18/11 11:52, Pawel Jakub Dawidek wrote: On Thu, Dec 15, 2011 at 04:39:07PM -0700, Cindy Swearingen wrote: Hi Anon, The disk that you attach to the root pool will need an SMI label and a slice 0. The syntax to attach a disk to create a mirrored root pool is like this, for example: # zpool attach rpool c1t0d0s0 c1t1d0s0 BTW. Can you, Cindy, or someone else reveal why one cannot boot from RAIDZ on Solaris? Is this because Solaris is using GRUB and the RAIDZ code would have to be licensed under GPL like the rest of the boot code? I'm asking because I see no technical problems with this functionality. Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev pools has worked just fine on FreeBSD for a long time now. Not being forced to have a dedicated pool just for the root if you happen to have more than two disks in your box is very convenient. For those of us not familiar with how FreeBSD is installed and boots, can you explain how boot works (i.e. do you use GRUB at all, and if so which version, and where the early boot ZFS code is)? We don't use GRUB, no. We use three stages for booting. Stage 0 is basically 512 bytes of a very simple MBR boot loader installed at the beginning of the disk that is used to launch the stage 1 boot loader. Stage 1 is where we interpret all ZFS (or UFS) structures and read real files. When you use GPT, there is a dedicated partition (of type freebsd-boot) where you install the gptzfsboot binary (stage 0 looks for a GPT partition of type freebsd-boot, loads it, and starts the code in there). This partition doesn't contain a file system, of course; boot0 is too simple to read any file system. gptzfsboot is where we handle all ZFS-related operations; it is mostly used to find the root dataset and load zfsloader from there. The zfsloader is the last stage in booting.
It shares the same ZFS-related code as gptzfsboot (but compiled into a separate binary); it loads the modules and the kernel and starts it. The zfsloader is stored in the /boot/ directory on the root dataset.
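The three-stage chain Pawel describes is installed with gpart(8) on FreeBSD. A minimal sketch, assuming a fresh GPT disk named ada0 (the device name and partition size here are made up for illustration):

```shell
# Assumed disk: ada0.  Run as root on FreeBSD.
gpart create -s gpt ada0                  # create a fresh GPT
gpart add -t freebsd-boot -s 512k ada0    # dedicated boot partition (index 1)
# Stage 0 (pmbr) goes into the protective MBR, gptzfsboot into the
# freebsd-boot partition, exactly as described above:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
```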
Re: [zfs-discuss] Can I create a mirror for a root rpool?
On Sun, Dec 18, 2011 at 07:24:27PM +0700, Fajar A. Nugraha wrote: On Sun, Dec 18, 2011 at 6:52 PM, Pawel Jakub Dawidek p...@freebsd.org wrote: BTW. Can you, Cindy, or someone else reveal why one cannot boot from RAIDZ on Solaris? Is this because Solaris is using GRUB and the RAIDZ code would have to be licensed under GPL like the rest of the boot code? I'm asking because I see no technical problems with this functionality. Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev pools has worked just fine on FreeBSD for a long time now. Really? How do they do that? Well, the boot code has access to all the disks, so it is just a matter of being able to interpret the data, which our boot code can do. In Linux, you can boot from disks with a GPT label with grub2, and have / on raidz, but only as long as /boot is on a grub2-compatible fs (e.g. single or mirrored zfs pool, ext4, etc). This is not the same. On FreeBSD everything, including the root file system and the boot directory, can be on RAIDZ.
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Mon, Dec 12, 2011 at 08:30:56PM +0400, Jim Klimov wrote: 2011-12-12 19:03, Pawel Jakub Dawidek wrote: As I said, the ZFS reading path involves no dedup code. None at all. I am not sure if we contradicted each other ;) What I meant was that the ZFS reading path involves reading logical data blocks at some point, consulting the cache(s) if the block is already cached (and up-to-date). These blocks are addressed by some unique ID, and now with dedup there are several pointers to the same block. So, basically, reading a file involves reading ZFS metadata, determining data block IDs, and fetching them from disk or cache. Indeed, this does not need to be dedup-aware; but if the other chain of metadata blocks points to the same data or metadata blocks which were already cached (for whatever reason, not limited to dedup) - this is where the read-speed boost appears. Likewise, if some blocks are not cached, such as the metadata needed to determine the second file's block IDs, this incurs disk IOs and may decrease overall speed. Ok, you are right, although in this test, I believe the metadata of the other file was already prefetched. I'm using this box for something else now, so I can't retest, but the procedure is so easy that everyone is welcome to try it :)
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Sun, Dec 11, 2011 at 04:04:37PM +0400, Jim Klimov wrote: I would not be surprised to see that there is some disk IO adding delays for the second case (read of a deduped file clone), because you still have to determine references to this second file's blocks, and another path of on-disk blocks might lead to it from a separate inode in a separate dataset (or I might be wrong). Reading this second path of pointers to the same cached data blocks might decrease speed a little. As I said, the ZFS reading path involves no dedup code. None at all. The proof would be being able to boot from ZFS with dedup turned on even though the ZFS boot code has zero dedup code in it. Another proof would be the ZFS source code.
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote: Unfortunately the answer is no. Neither the l1 nor the l2 cache is dedup aware. The only vendor I know that can do this is NetApp. And you really work at Oracle? :) The answer is definitely yes. The ARC caches on-disk blocks, and dedup just references those blocks. When you read, dedup code is not involved at all. Let me show it to you with a simple test: Create a file (dedup is on): # dd if=/dev/random of=/foo/a bs=1m count=1024 Copy this file so that it is deduped: # dd if=/foo/a of=/foo/b bs=1m Export the pool so all cache is removed and reimport it: # zpool export foo # zpool import foo Now let's read one file: # dd if=/foo/a of=/dev/null bs=1m 1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec) We read file 'a' and all its blocks are in cache now. The 'b' file shares all the same blocks, so if the ARC caches blocks only once, reading 'b' should be much faster: # dd if=/foo/b of=/dev/null bs=1m 1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec) Now look at it: 'b' was read 12.5 times faster than 'a' with no disk activity. Magic? :)
Re: [zfs-discuss] about btrfs and zfs
On Wed, Oct 19, 2011 at 08:40:59AM +1100, Peter Jeremy wrote: fsck verifies the logical consistency of a filesystem. For UFS, this includes: used data blocks are allocated to exactly one file, directory entries point to valid inodes, allocated inodes have at least one link, the number of links in an inode exactly matches the number of directory entries pointing to that inode, directories form a single tree without loops, file sizes are consistent with the number of allocated blocks, unallocated data/inode blocks are in the relevant free bitmaps, redundant superblock data is consistent. It can't verify data. Well said. I'd add that people who insist on ZFS having a fsck are missing the whole point of the ZFS transactional model and copy-on-write design. Fsck can only fix known file system inconsistencies in file system structures. Because there is no atomicity of operations in UFS and other file systems, it is possible that when you remove a file, your system can crash between removing the directory entry and freeing the inode or blocks. This is expected with UFS; that's why there is fsck, to verify that no such thing happened. In ZFS, on the other hand, there are no inconsistencies like that. If all blocks match their checksums and you find a directory loop or something like that, it is a bug in ZFS, not an expected inconsistency. It should be fixed in ZFS and not worked around with some fsck for ZFS.
Re: [zfs-discuss] about btrfs and zfs
On Wed, Oct 19, 2011 at 10:13:56AM -0400, David Magda wrote: On Wed, October 19, 2011 08:15, Pawel Jakub Dawidek wrote: Fsck can only fix known file system inconsistencies in file system structures. Because there is no atomicity of operations in UFS and other file systems, it is possible that when you remove a file, your system can crash between removing the directory entry and freeing the inode or blocks. This is expected with UFS; that's why there is fsck, to verify that no such thing happened. Slightly OT, but this non-atomic delay between meta-data updates and writes to the disk is exploited by soft updates with FreeBSD's UFS: http://www.freebsd.org/doc/en/books/handbook/configtuning-disk.html#SOFT-UPDATES It may be of some interest to the file system geeks on the list. Well, soft updates, thanks to careful ordering of operations, allow mounting the file system even in an inconsistent state and running fsck in the background, as the only inconsistencies are resource leaks - a directory entry will never point at an unallocated inode and an inode will never point at an unallocated block, etc. This is still not atomic. With recent versions of FreeBSD, soft updates were extended to journal those resource leaks, so background fsck is not needed anymore.
Re: [zfs-discuss] zfs send/receive and ashift
On Tue, Jul 26, 2011 at 03:28:10AM -0700, Fred Liu wrote: The ZFS send stream is at the DMU layer; at this layer the data is uncompressed and decrypted - ie exactly how the application wants it. Even the data compressed/encrypted by ZFS will be decrypted? If it is true, will there be any CPU overhead? And does ZFS send/receive tunneled by ssh become the only way to encrypt the data transmission? Even if zfs send/recv works with encrypted and compressed data, you still need some secure tunneling. Storage encryption is not the same as network traffic encryption.
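The ssh tunneling Fred asks about usually looks like the sketch below; the pool, dataset, snapshot, and host names here are made up for illustration:

```shell
# Stream a snapshot over ssh: ZFS-level compression/encryption (if any)
# protects the data at rest, while ssh protects it on the wire.
zfs snapshot tank/data@backup1
zfs send tank/data@backup1 | ssh backuphost zfs recv backup/data
```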
Re: [zfs-discuss] ZFS for Linux?
On Tue, Jun 14, 2011 at 04:15:17PM +0400, Jim Klimov wrote: Hello, A college friend of mine is using Debian Linux on his desktop, and wondered if he could tap into ZFS goodness without adding another server in his small quiet apartment or changing the desktop OS. According to his research, there are some kernel modules for Debian which implement ZFS, or a FUSE variant. Can anyone comment on how stable and functional these are? Performance is a secondary issue, as long as it does not lead to system crashes due to timeouts, etc. ;) If you would like to stay with Debian, you can try Debian GNU/kFreeBSD, which is the Debian userland with the FreeBSD kernel, and thus it should contain ZFS. http://www.debian.org/ports/kfreebsd-gnu/
Re: [zfs-discuss] Disk replacement need to scan full pool ?
On Tue, Jun 14, 2011 at 11:49:56AM -0700, Bill Sommerfeld wrote: On 06/14/11 04:15, Rasmus Fauske wrote: I want to replace some slow consumer drives with new WD RE4 ones, but when I do a replace it needs to scan the full pool and not only that disk set (or just the old drive). Is this normal? (The speed is always slow in the start, so that's not what I am wondering about, but that it needs to scan all of my 18.7T to replace one drive.) This is normal. The resilver is not reading all data blocks; it's reading all of the metadata blocks which contain one or more block pointers, which is the only way to find all the allocated data (and in the case of raidz, to know precisely how it's spread and encoded across the members of the vdev). And it's reading all the data blocks needed to reconstruct the disk to be replaced. Maybe it would be faster to just offline this one disk, use dd(1) to copy the entire disk content, disconnect the old disk and online the new one. Not sure how well this will work on Solaris, as the new disk serial number won't match the one in the metadata, but it will surely work on FreeBSD.
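Pawel's offline-and-dd idea could be spelled out as below on FreeBSD. The device names are made up, and as he notes, Solaris may reject the copy because of the serial number mismatch in the metadata:

```shell
# ada5 = old slow disk, ada9 = new disk (assumed device names).
zpool offline tank ada5
dd if=/dev/ada5 of=/dev/ada9 bs=1m   # raw copy of the whole disk
# swap cabling/slots if needed, then bring the copy online:
zpool online tank ada9
```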
Re: [zfs-discuss] best migration path from Solaris 10
On Sun, Mar 20, 2011 at 01:54:54PM +0700, Fajar A. Nugraha wrote: On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek p...@freebsd.org wrote: On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote: Newer versions of FreeBSD have newer ZFS code. Yes, we are at v28 at this point (the latest open-source version). That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...] That's actually not true. There are more FreeBSD committers working on ZFS than on UFS. How is the performance of ZFS under FreeBSD? Is it comparable to that in Solaris, or still slower due to some needed compatibility layer? This compatibility layer is just a bunch of ugly defines, etc. to allow for fewer code modifications. It introduces no overhead. I made a performance comparison between FreeBSD 9 with ZFSv28 and Solaris 11 Express, but I don't think the Solaris license allows me to publish the results. But believe me, the results were very surprising :)
Re: [zfs-discuss] best migration path from Solaris 10
On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote: Newer versions of FreeBSD have newer ZFS code. Yes, we are at v28 at this point (the latest open-source version). That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...] That's actually not true. There are more FreeBSD committers working on ZFS than on UFS. There are vendors who offer NexentaStor on hardware with full commercial support from a single vendor (granted they get backline support from Nexenta, but do you think iXsystems engineers personally fix bugs in FreeBSD?) [...] iXsystems works very closely with the FreeBSD project. They hire or contract quite a few FreeBSD committers (FYI, I'm not one of them), so yes, they are definitely in a position to fix bugs in FreeBSD, as well as to develop new stuff, and they do that. Just wanted to clarify a few points :)
Re: [zfs-discuss] ZFS and TRIM
On Sat, Jan 29, 2011 at 11:31:59AM -0500, Edward Ned Harvey wrote: What is the status of ZFS support for TRIM? [...] I've no idea, but because I have wanted to add such support to FreeBSD/ZFS for a while now, I'll share my thoughts. The problem is where to put those operations. ZFS internally has a ZIO_TYPE_FREE request, which represents exactly what we need - an offset and size to free. It would be best to just pass those requests directly to VDEVs, but we can't do that: there might be a transaction group that will never be committed because of a power failure, and we would have TRIMed blocks that we want to use after boot. Ok, maybe we could just make such an operation part of the transaction group? No, we can't do that either. If we start committing a transaction and execute TRIM operations, we may still have a power failure, and TRIM operations on old blocks cannot be undone, so we would get back to invalid data. So why not move TRIM operations to the next transaction group? That's doable, although we still need to be careful not to TRIM blocks that were freed in the previous transaction group but are reallocated in the current one (or if we TRIM, we TRIM first and then write). Unfortunately, we don't want to TRIM blocks immediately. Take into account disks that lie about the cache flush operation; because of that, ZFS tries to keep freed blocks from the last few transaction groups around, so you can forcibly rewind to one of the previous txgs if such corruption occurs. My initial idea was to implement 100% reliable TRIM, so that I could implement secure delete using it, e.g. if ZFS is placed on top of a disk encryption layer, I can implement TRIM in this layer as 'overwrite the given range with random data'. Making TRIM 100% reliable will be very hard, IMHO. But in most cases we don't need TRIM to be so perfect. My current idea is to delay the TRIM operation for some number of transaction groups.
For example, if a block is freed in txg=5, I'll send TRIM for it after txg=15 (if it wasn't reassigned in the meantime). This is ok if we crash before we get to txg=15, because the only side effect is that the next write to this range might be a little slower.
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Fri, Jan 14, 2011 at 11:32:58AM -0800, Peter Taps wrote: Ed, Thank you for sharing the calculations. In lay terms, for Sha256, how many blocks of data would be needed to have one collision? Assuming each block is 4K in size, we probably can calculate the final data size beyond which the collision may occur. This would enable us to make the following statement: With Sha256, you need verification to be turned on only if you are dealing with more than xxxT of data. Except that this is the wrong question to ask. The question you can ask is: How many blocks of data do I need so that the collision probability is X%? Also, another related question. Why were 256 bits chosen and not 128 bits or 512 bits? I guess Sha512 may be an overkill. In your formula, how many blocks of data would be needed to have one collision using Sha128? There is no SHA128, and SHA512's hash is too long. Currently the maximum hash ZFS can handle is 32 bytes (256 bits). Wasting another 32 bytes without improving anything in practice probably wasn't worth it. BTW. As for SHA512 being slower, it looks like it depends on the implementation, or SHA512 is simply faster to compute on a 64-bit CPU. On my laptop, OpenSSL computes SHA256 55% _slower_ than SHA512. If this is a general rule, maybe it would be worth considering using SHA512 truncated to 256 bits to get more speed.
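The SHA256-vs-SHA512 speed observation is easy to reproduce with OpenSSL. The sketch below first sanity-checks both digests against the well-known test vectors for the string "abc", then points at the built-in benchmark (exact timings vary by CPU, so no expected numbers are shown):

```shell
# Known-answer checks: SHA-256 and SHA-512 of "abc".
printf 'abc' | openssl dgst -sha256
printf 'abc' | openssl dgst -sha512
# Benchmark both (commented out because it runs for a while);
# on many 64-bit CPUs SHA-512 wins on large blocks because it
# operates on 64-bit words:
# openssl speed sha256 sha512
```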
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Sat, Jan 08, 2011 at 12:59:17PM -0500, Edward Ned Harvey wrote: Has anybody measured the cost of enabling or disabling verification? Of course there is no easy answer :) Let me explain how verification works exactly first. You try to write a block. You see that the block is already in the dedup table (it is already referenced). You read the block (maybe it is in the ARC or in the L2ARC). You compare the read block with what you want to write. Based on the above: 1. If you have dedup on, but your blocks are not deduplicable at all, you will pay no price for verification, as there will be no need to compare anything. 2. If your data is highly deduplicable, you will verify often. Now it depends on whether the data you need to read fits into your ARC/L2ARC or not. If it can be found in the ARC, the impact will be small. If your pool is very large and you can't count on ARC help, each write will be turned into a read. Also note an interesting property of dedup: if your data is highly deduplicable, you can actually improve performance by avoiding data writes (and just increasing the reference count). Let me show you three degenerate tests to compare the options.
I'm writing 64GB of zeros to a pool with dedup turned off, with dedup turned on, and with dedup+verification turned on (I use the SHA256 checksum everywhere): # zpool create -O checksum=sha256 tank ada{0,1,2,3} # time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank' 254,11 real 0,07 user 40,80 sys # zpool create -O checksum=sha256 -O dedup=on tank ada{0,1,2,3} # time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank' 154,60 real 0,05 user 37,10 sys # zpool create -O checksum=sha256 -O dedup=sha256,verify tank ada{0,1,2,3} # time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank' 173,43 real 0,02 user 38,41 sys As you can see, in the second and third tests the data is of course in the ARC, so the difference here is only because of data comparison (no extra reads are needed), and verification is 12% slower. This is of course a silly test, but as you can see, dedup (even with verification) is much faster than the no-dedup case - but then, this data is highly deduplicable :) # zpool list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT tank 149G 8,58M 149G 0% 524288.00x ONLINE -
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Fri, Jan 07, 2011 at 03:06:26PM -0800, Brandon High wrote: On Fri, Jan 7, 2011 at 11:33 AM, Robert Milkowski mi...@task.gda.pl wrote: end-up with the block A. Now if B is relatively common in your data set you have a relatively big impact on many files because of one corrupted block (additionally from a fs point of view this is a silent data corruption). Without dedup if you get a single block corrupted silently an impact usually will be relatively limited. A pool can be configured so that a dedup'd block will only be referenced a certain number of times. So if you write out 10,000 identical blocks, it may be written 10 times with each duplicate referenced 1,000 times. The exact number is controlled by the dedupditto property for your pool, and you should set it as your risk tolerance allows. Dedupditto doesn't work exactly that way. You can have at most 3 copies of your block. The minimal dedupditto value is 100. The first copy is created on the first write, the second copy is created at dedupditto references, and the third copy is created at 'dedupditto * dedupditto' references. So once you reach 10,000 references of your block (with the minimal dedupditto of 100), ZFS will create three physical copies, not earlier and never more than three.
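The thresholds Pawel describes follow directly from the pool property; a short sketch with a hypothetical pool name:

```shell
# With dedupditto=100 (the minimum), a second physical copy is written
# at 100 references and a third at 100*100 = 10000 references;
# ZFS never keeps more than three copies of a deduped block.
zpool set dedupditto=100 tank
zpool get dedupditto tank
```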
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Fri, Jan 07, 2011 at 07:33:53PM +, Robert Milkowski wrote: On 01/ 7/11 02:13 PM, David Magda wrote: Given the above: most people are content enough to trust Fletcher not to have data corruption, but are worried about SHA-256 giving 'data corruption' when it comes to de-dupe? The entire rest of the computing world is content to live with 10^-15 (for SAS disks), and yet one wouldn't be prepared to have 10^-30 (or better) for dedupe? I think you do not entirely understand the problem. Let's say two different blocks A and B have the same sha256 checksum, A is already stored in a pool, and B is being written. Without verify, and with dedup enabled, B won't be written. Next time you ask for block B you will actually end up with block A. Now if B is relatively common in your data set, you have a relatively big impact on many files because of one corrupted block (additionally, from a fs point of view this is a silent data corruption). [...] All true; that's why verification was mandatory for fletcher, which is not a cryptographically strong hash. As long as SHA256 is not broken, spending power on verification is just a waste of resources, which isn't green :) Once SHA256 is broken, verification can be turned on. [...] Without dedup, if you get a single block corrupted silently, the impact usually will be relatively limited. Except when corruption happens on write, not read, i.e. you write data, it is corrupted on the fly, but the corrupted version still matches the fletcher checksum (the default now). Now every read of this block will return silently corrupted data. Now what if block B is a meta-data block? Metadata is not deduplicated. The point is that the potential impact of a hash collision is much bigger than a single silent data corruption of a block, not to mention that, dedup or not, all the other possible cases of data corruption are there anyway; adding yet another one might or might not be acceptable.
I'm more of the opinion that it was a mistake that the verification feature wasn't removed along with the fletcher-for-dedup removal. It is good to be able to turn on verification once/if SHA256 is broken - that's the only reason I'd leave it, but I somehow feel that there are bigger chances you can corrupt your data because of the extra code complexity coming with verification than because of a SHA256 collision.
Re: [zfs-discuss] Recover data from detached ZFS mirror
On Thu, Nov 25, 2010 at 12:45:16AM -0800, Maciej Kaminski wrote: I've detached a disk from a mirrored zpool using the zpool detach (not zpool split) command. Is it possible to recover data from that disk? If yes, how? (And how to make it bootable?) Take a look at this thread: http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg15620.html Jeff Bonwick provided a tool to recover the ZFS label, which will allow importing such a detached vdev.
Re: [zfs-discuss] Pools inside pools
On Wed, Sep 22, 2010 at 02:06:27PM +, Markus Kovero wrote: Hi, I'm asking for opinions here on any possible disasters or performance issues related to the setup described below. The point is to create a large pool and smaller pools within it, where you can easily monitor iops and bandwidth usage without using dtrace or similar techniques. 1. Create a pool: # zpool create testpool mirror c1t1d0 c1t2d0 2. Create a volume inside the pool we just created: # zfs create -V 500g testpool/testvolume 3. Create a pool from the volume we just made: # zpool create anotherpool /dev/zvol/dsk/testpool/testvolume After this, anotherpool can be monitored nicely via zpool iostat, and compression can be used in testpool to save resources without having a compression effect in anotherpool. zpool export/import seems to work, although the -d flag needs to be used. Are there any caveats in this setup? How are writes handled? Is it safe to create a pool consisting of several SSDs and use volumes from it as log devices? Is it even supported? Such a configuration was known to cause deadlocks. Even if it works now (which I don't expect to be the case), it will make your data be cached twice. The CPU utilization will also be much higher, etc. All in all, I strongly recommend against such a setup.
Re: [zfs-discuss] Hang on zpool import (dedup related)
On Sun, Sep 12, 2010 at 11:24:06AM -0700, Chris Murray wrote: Absolutely spot on, George. The import with -N took seconds. Working on the assumption that esx_prod is the one with the problem, I bumped it to the bottom of the list. Each mount was done in a second:
# zfs mount zp
# zfs mount zp/nfs
# zfs mount zp/nfs/esx_dev
# zfs mount zp/nfs/esx_hedgehog
# zfs mount zp/nfs/esx_meerkat
# zfs mount zp/nfs/esx_meerkat_dedup
# zfs mount zp/nfs/esx_page
# zfs mount zp/nfs/esx_skunk
# zfs mount zp/nfs/esx_temp
# zfs mount zp/nfs/esx_template
And those directories have the content in them that I'd expect. Good! So now I try to mount esx_prod, and the influx of reads has started in zpool iostat zp 1. This is the filesystem with the issue, but what can I do now?

You could try to snapshot it (but keep it unmounted), then zfs send it and zfs recv it to e.g. zp/foo. Use the -u option for zfs recv too, then try to mount what you received.

-- Pawel Jakub Dawidek http://www.wheelsystems.com p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
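The suggestion above, written out as a transcript (the snapshot name "rescue" and the zp/foo target are only examples, not names from the original pool):

```
# zfs snapshot zp/nfs/esx_prod@rescue
# zfs send zp/nfs/esx_prod@rescue | zfs recv -u zp/foo
# zfs mount zp/foo
```

The -u flag keeps the received dataset unmounted, so the problematic data isn't touched until you explicitly try to mount the copy.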
Re: [zfs-discuss] zfs upgrade unmounts filesystems
On Thu, Jul 29, 2010 at 12:00:08PM -0600, Cindy Swearingen wrote: Hi Gary, I found a similar zfs upgrade failure with the device busy error, which I believe was caused by a file system mounted under another file system. If this is the cause, I will file a bug or find an existing one. The workaround is to unmount the nested file systems and upgrade them individually, like this:
# zfs upgrade space/direct
# zfs upgrade space/dcc

'zfs upgrade' unmounts the file system first, which makes it hard to upgrade, for example, the root file system. The only workaround I found is to clone the root file system (the clone is created with the most recent version), change the root file system to the newly created clone, reboot, upgrade the original root file system, change the root file system back, reboot, and destroy the clone.

-- Pawel Jakub Dawidek http://www.wheelsystems.com p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
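The clone workaround reads roughly like this. All pool and dataset names here are hypothetical, and switching the active root is shown via the bootfs pool property; your platform's boot setup may require additional steps:

```
# zfs snapshot rpool/ROOT/sol@pre
# zfs clone rpool/ROOT/sol@pre rpool/ROOT/tmproot
(the clone is created at the newest filesystem version)
# zpool set bootfs=rpool/ROOT/tmproot rpool
(reboot into the clone, then upgrade the original root)
# zfs upgrade rpool/ROOT/sol
# zpool set bootfs=rpool/ROOT/sol rpool
(reboot back, then clean up)
# zfs destroy rpool/ROOT/tmproot
# zfs destroy rpool/ROOT/sol@pre
```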
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On Thu, May 06, 2010 at 11:28:37AM +0100, Robert Milkowski wrote: With the putback of [PSARC/2010/108] zil synchronicity, zfs datasets now have a new 'sync' property to control synchronous behaviour. The zil_disable tunable to turn synchronous requests into asynchronous requests (disable the ZIL) has been removed. For systems that used that switch, on upgrade you will now see a message on booting:
sorry, variable 'zil_disable' is not defined in the 'zfs' module
Please update your system to use the new sync property. Here is a summary of the property:
---
The options and semantics for the zfs sync property:
sync=standard This is the default option. Synchronous file system transactions (fsync, O_DSYNC, O_SYNC, etc.) are written out (to the intent log) and then, secondly, all devices written are flushed to ensure the data is stable (not cached by device controllers).
sync=always For the ultra-cautious, every file system transaction is written and flushed to stable storage by system call return. This obviously has a big performance penalty.
sync=disabled Synchronous requests are disabled. File system transactions only commit to stable storage on the next DMU transaction group commit, which can be many seconds away. This option gives the highest performance, with no risk of corrupting the pool. However, it is very dangerous as ZFS is ignoring the synchronous transaction demands of applications such as databases or NFS. Setting sync=disabled on the currently active root or /var file system may result in out-of-spec behavior or application data loss and increased vulnerability to replay attacks. Administrators should only use this when these risks are understood.
The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command.
For example:
# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin

I read that this property is not inherited and I can't see why. If what I read is up to date, could you tell me why?

-- Pawel Jakub Dawidek http://www.wheelsystems.com p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On Thu, May 06, 2010 at 01:15:41PM +0100, Robert Milkowski wrote: On 06/05/2010 13:12, Robert Milkowski wrote: On 06/05/2010 12:24, Pawel Jakub Dawidek wrote: I read that this property is not inherited and I can't see why. If what I read is up to date, could you tell me why? It is inherited. Sorry for the confusion, but there was a discussion about whether it should or should not be inherited; we first proposed that it shouldn't, but that was changed again during a PSARC review so that it should. And I did a copy'n'paste here. Again, sorry for the confusion. Well, actually I did copy'n'paste the proper page, as it doesn't say anything about inheritance. Nevertheless, yes, it is inherited.

Yes, your e-mail didn't mention that, and I wanted to clarify whether what I read in the PSARC case had changed or not. Thanks :)

-- Pawel Jakub Dawidek http://www.wheelsystems.com p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] Status/priority of 6761786
On Thu, Aug 27, 2009 at 01:37:11PM -0600, Dave wrote: Can anyone from Sun comment on the status/priority of bug ID 6761786? Seems like this would be a very high priority bug, but it hasn't been updated since Oct 2008. Has anyone else with thousands of volume snapshots experienced the hours-long import process?

It might not be directly ZFS's fault. I tried to reproduce this on FreeBSD and I was able to import a pool with ~2000 ZVOLs and ~1 ZVOL snapshots in a few minutes. Those were empty ZVOLs and empty snapshots, so keep that in mind. All in all, creating /dev/ entries might be slow in Solaris, which is why you experience this behaviour when importing a ZFS pool with many ZVOLs and many ZVOL snapshots (note that every ZVOL snapshot is a device entry in /dev/zvol/, unlike file systems, where snapshots are mounted on .zfs/snapshot/name lookup and not at import time).

-- Pawel Jakub Dawidek http://www.wheel.pl p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] Need 1.5 TB drive size to use for array for testing
On Sat, Aug 22, 2009 at 12:00:42AM -0700, Jason Pfingstmann wrote: Thanks for the reply! The reason I'm not waiting until I have the disks is mostly because it will take me several months to get the funds together, and in the meantime I need the extra space 1 or 2 drives gets me. Since the sparse files will only take up the space in use, if I've migrated 2 of the sparse files to actual disks, I should have enough storage for about 2 TB of data without risking running out of space on the sparse-file drive. I know it'll be quirky and I'd need to monitor the sparse-file drive closely to ensure it doesn't run out of room (or risk unexpected results, possibly complete data loss, depending on how ZFS deals with that kind of problem).

It doesn't work exactly how you describe. ZFS cannot report back to the file that a given block is free. Because of the COW model, if you modify your pool a lot, blocks will be allocated in the sparse files but never released, so your sparse files will only grow. You can end up with a mostly empty pool and fully populated sparse files.

As for the idea itself, I did something similar in the past when I was changing pool layout - I created a raidz2 vdev with two sparse files, which I removed immediately, and the two disks I saved I used as temporary storage. Once I had copied the data to the raidz2 destination pool, I added those two disks into the holes and let ZFS resilver do its job.

-- Pawel Jakub Dawidek http://www.wheel.pl p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
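The trick from the last paragraph, as a hypothetical transcript (FreeBSD device names, sizes and the use of truncate(1) for the sparse files are my assumptions):

```
# truncate -s 750G /tmp/sparse0 /tmp/sparse1
# zpool create newpool raidz2 da0 da1 da2 /tmp/sparse0 /tmp/sparse1
# zpool offline newpool /tmp/sparse0
# zpool offline newpool /tmp/sparse1
(the pool is now degraded but usable; copy the data over, free the
 two old disks, then fill the holes and let ZFS resilver)
# zpool replace newpool /tmp/sparse0 da3
# zpool replace newpool /tmp/sparse1 da4
```

The raidz2 redundancy is what makes it safe to run with the two sparse-file columns missing during the copy.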
Re: [zfs-discuss] feature proposal
On Wed, Jul 29, 2009 at 05:34:53PM -0700, Roman V Shaposhnik wrote: On Wed, 2009-07-29 at 15:06 +0300, Andriy Gapon wrote: What do you think about the following feature? A "subdirectory is automatically a new filesystem" property - an administrator turns on this magic property of a filesystem; after that, every mkdir *in the root* of that filesystem creates a new filesystem. The new filesystems have default/inherited properties, except for the magic property, which is off. Right now I see this as being mostly useful for /home. The main benefit in this case is that various user administration tools can work unmodified and do the right thing when an administrator wants a policy of a separate fs per user. But I am sure there could be other interesting uses for this.

This feature request touches upon a very generic observation that my group made a long time ago: ZFS is a wonderful filesystem; the only trouble is that (almost) all the cool features have to be asked for using non-filesystem (POSIX) APIs. Basically, every time you have to do anything with ZFS you have to do it on a host where ZFS runs. The sole exception to this rule is the .zfs subdirectory, which lets you access snapshots without explicit calls to zfs(1M). Basically, the .zfs subdirectory is your POSIX FS way to request two bits of ZFS functionality. In general, however, we all want more. On the read-only front: wouldn't it be cool to *not* run zfs sends explicitly but have: .zfs/send/snap-name .zfs/sendr/from-snap-name-to-snap-name give you the same data automagically? On the read-write front: wouldn't it be cool to be able to snapshot things by: $ mkdir .zfs/snapshot/snap-name ?

Are you sure this doesn't work on Solaris/OpenSolaris? From looking at the code, you should be able to do exactly that, as well as destroy a snapshot by rmdir'ing the entry.

-- Pawel Jakub Dawidek http://www.wheel.pl p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
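What Pawel describes can be tried directly from any shell on the host (and, per the thread's wish list, over NFS); the dataset path here is hypothetical:

```
# mkdir /tank/home/.zfs/snapshot/mysnap
(creates the snapshot tank/home@mysnap)
# rmdir /tank/home/.zfs/snapshot/mysnap
(destroys it again)
```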
Re: [zfs-discuss] cleaning user properties
On Mon, Nov 03, 2008 at 11:47:19AM +0100, Luca Morettoni wrote: I have a little question about user properties. I have two filesystems: rpool/export/home/luca and rpool/export/home/luca/src. On these two I have one user property, set with:
# zfs set net.morettoni:test=xyz rpool/export/home/luca
# zfs set net.morettoni:test=123 rpool/export/home/luca/src
Now I need to *clear* (remove) the property from the rpool/export/home/luca/src filesystem, but if I use the inherit command I'll get the parent's property. Any hint on how to delete it?

You can't delete it; that's just how things work. I work around it by treating an empty property and the lack of a property the same.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
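A sketch of that workaround with the dataset name from the question (whether an empty value counts as "unset" is purely a convention of your own tooling, and setting an empty user property is an assumption worth testing on your release):

```
# zfs set net.morettoni:test= rpool/export/home/luca/src
# zfs get -H -o value net.morettoni:test rpool/export/home/luca/src
(prints an empty value, which your scripts then treat the same
 as the property not existing at all)
```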
[zfs-discuss] space_map.c 'ss == NULL' panic strikes back.
Hi. Someone recently reported an 'ss == NULL' panic in space_map.c/space_map_add() on FreeBSD's version of ZFS. I found that this problem was previously reported on Solaris and is already fixed. I verified it, and FreeBSD's version has this fix in place... http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/fs/zfs/space_map.c?r2=3761&r1=3713 I'd really like to help this guy get his data back, so please point me in the right direction. We have a crash dump of the panic, BTW. It happened after a spontaneous reboot. Now, the system panics on 'zpool import' immediately. He already tried two things:
1. Importing the pool with 'zpool import -o ro backup'. No luck, it crashes.
2. Importing the pool without mounting file systems (I sent him a patch to zpool to not mount file systems automatically on pool import). I hoped that maybe only one or more file systems were corrupted, but no, it panics immediately as well.
It's the biggest storage machine there, so there is no way to back up the raw disks before starting more experiments; that's why I'm writing here. I have two ideas:
1. Because it happened on a system crash or something, we can expect that this is caused by the last change. If so, we could try corrupting the most recent uberblock, so ZFS will pick up the previous uberblock.
2. Instead of panicking in space_map_add(), we could try to space_map_remove() the offending entry, e.g.:
-	VERIFY(ss == NULL);
+	if (ss != NULL) {
+		space_map_remove(sm, ss->ss_start, ss->ss_end);
+		goto again;
+	}
Both of those ideas can make things worse, so I want to know what damage can be done using those methods, or even better, what else (safer) we can try?

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] About bug 6486493 (ZFS boot incompatible with
On Fri, Oct 05, 2007 at 08:52:17AM +0100, Robert Milkowski wrote: Hello Eric, Thursday, October 4, 2007, 5:54:06 PM, you wrote: ES On Thu, Oct 04, 2007 at 05:22:58AM -0700, Ivan Wang wrote: This bug was rendered moot via 6528732 in build snv_68 (and s10_u5). We now store physical device paths with the vnodes, so even though the SATA framework doesn't correctly support open by devid in early boot, we But if I read it right, there is still a problem in the SATA framework (failing ldi_open_by_devid), right? If this problem is framework-wide, it might just bite back some time in the future. ES Yes, there is still a bug in the SATA framework, in that ES ldi_open_by_devid() doesn't work early in boot. Opening by device path ES works so long as you don't recable your boot devices. If we had open by ES devid working in early boot, then this wouldn't be a problem. Even if someone re-cables SATA disks, couldn't we fall back to reading the ZFS label from all available disks to find our pool and import it?

FreeBSD's GEOM storage framework implements a method called 'taste'. When a new disk arrives (or is closed after the last write), GEOM calls the taste methods of all storage subsystems, and each subsystem can try to read its metadata. This is basically how autoconfiguration happens in FreeBSD for things like software RAID1/RAID3/stripe/and others. It's much easier than what ZFS does:
1. read /etc/zfs/zpool.cache
2. open components by name
3. if there is no such disk, goto 5
4. verify diskid (not all disks have an ID)
5. if the diskid doesn't match, try to look the disk up by ID
If there are a few hundred disks, it may slow booting down, but it was never a real problem in FreeBSD.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
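To see what such a label-tasting fallback would read, you can dump the on-disk ZFS labels with zdb; the device name is just an example:

```
# zdb -l /dev/dsk/c1t0d0s0
(prints the vdev labels: pool name and guid, vdev tree, txg, ...)
```

Each vdev carries four copies of the label, so scanning every disk for a matching pool guid is essentially the fallback suggested above.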
Re: [zfs-discuss] replacing a device with itself doesn't work
On Wed, Oct 03, 2007 at 10:02:03PM +0200, Pawel Jakub Dawidek wrote: On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
-
# zpool scrub tank
# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Wed Oct 3 18:45:06 2007
config:
        NAME      STATE     READ WRITE CKSUM
        tank      ONLINE       0     0     0
          raidz1  ONLINE       0     0     0
            md0   UNAVAIL      0     0     0  corrupted data
            md1   ONLINE       0     0     0
            md2   ONLINE       0     0     0
errors: No known data errors
# zpool replace tank md0
invalid vdev specification
use '-f' to override the following errors:
md0 is in use (r1w1e1)
# zpool replace -f tank md0
invalid vdev specification
the following errors must be manually repaired:
md0 is in use (r1w1e1)
-
Well, the advice of 'zpool replace' doesn't work. At this point the user is stuck. There seems to be just no way to now use the existing device md0.
In Solaris NV b72, this works as you expect.
# zpool replace zwimming /dev/ramdisk/rd1
# zpool status -v zwimming
  pool: zwimming
 state: DEGRADED
 scrub: resilver completed with 0 errors on Wed Oct 3 11:55:36 2007
config:
        NAME                        STATE     READ WRITE CKSUM
        zwimming                    DEGRADED     0     0     0
          raidz1                    DEGRADED     0     0     0
            replacing               DEGRADED     0     0     0
              /dev/ramdisk/rd1/old  FAULTED      0     0     0  corrupted data
              /dev/ramdisk/rd1      ONLINE       0     0     0
            /dev/ramdisk/rd2        ONLINE       0     0     0
            /dev/ramdisk/rd3        ONLINE       0     0     0
errors: No known data errors
# zpool status -v zwimming
  pool: zwimming
 state: ONLINE
 scrub: resilver completed with 0 errors on Wed Oct 3 11:55:36 2007
config:
        NAME                  STATE     READ WRITE CKSUM
        zwimming              ONLINE       0     0     0
          raidz1              ONLINE       0     0     0
            /dev/ramdisk/rd1  ONLINE       0     0     0
            /dev/ramdisk/rd2  ONLINE       0     0     0
            /dev/ramdisk/rd3  ONLINE       0     0     0
errors: No known data errors

Good to know, but I think it's still a bit of a ZFS fault. The error message 'md0 is in use (r1w1e1)' means that something (I'm quite sure it's ZFS) keeps the device open. Why does it keep it open when it doesn't recognize it? Or maybe it tries to open it twice for write (exclusively) when replacing, which is not allowed in GEOM in FreeBSD. I can take a look at whether it's the former or the latter, but it should be fixed in ZFS itself, IMHO. Ok, it seems that it was fixed in ZFS itself already:
	/*
	 * If we are setting the vdev state to anything but an open state, then
	 * always close the underlying device. Otherwise, we keep accessible
	 * but invalid devices open forever. We don't call vdev_close() itself,
	 * because that implies some extra checks (offline, etc) that we don't
	 * want here. This is limited to leaf devices, because otherwise
	 * closing the device will affect other children.
	 */
	if (vdev_is_dead(vd) && vd->vdev_ops->vdev_op_leaf)
		vd->vdev_ops->vdev_op_close(vd);
The ZFS version from FreeBSD-CURRENT doesn't have this code yet; it's only in my perforce branch for now. I'll verify later today whether it really fixes the problem and report back if not.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] replacing a device with itself doesn't work
On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
-
# zpool scrub tank
# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Wed Oct 3 18:45:06 2007
config:
        NAME      STATE     READ WRITE CKSUM
        tank      ONLINE       0     0     0
          raidz1  ONLINE       0     0     0
            md0   UNAVAIL      0     0     0  corrupted data
            md1   ONLINE       0     0     0
            md2   ONLINE       0     0     0
errors: No known data errors
# zpool replace tank md0
invalid vdev specification
use '-f' to override the following errors:
md0 is in use (r1w1e1)
# zpool replace -f tank md0
invalid vdev specification
the following errors must be manually repaired:
md0 is in use (r1w1e1)
-
Well, the advice of 'zpool replace' doesn't work. At this point the user is stuck. There seems to be just no way to now use the existing device md0.
In Solaris NV b72, this works as you expect.
# zpool replace zwimming /dev/ramdisk/rd1
# zpool status -v zwimming
  pool: zwimming
 state: DEGRADED
 scrub: resilver completed with 0 errors on Wed Oct 3 11:55:36 2007
config:
        NAME                        STATE     READ WRITE CKSUM
        zwimming                    DEGRADED     0     0     0
          raidz1                    DEGRADED     0     0     0
            replacing               DEGRADED     0     0     0
              /dev/ramdisk/rd1/old  FAULTED      0     0     0  corrupted data
              /dev/ramdisk/rd1      ONLINE       0     0     0
            /dev/ramdisk/rd2        ONLINE       0     0     0
            /dev/ramdisk/rd3        ONLINE       0     0     0
errors: No known data errors
# zpool status -v zwimming
  pool: zwimming
 state: ONLINE
 scrub: resilver completed with 0 errors on Wed Oct 3 11:55:36 2007
config:
        NAME                  STATE     READ WRITE CKSUM
        zwimming              ONLINE       0     0     0
          raidz1              ONLINE       0     0     0
            /dev/ramdisk/rd1  ONLINE       0     0     0
            /dev/ramdisk/rd2  ONLINE       0     0     0
            /dev/ramdisk/rd3  ONLINE       0     0     0
errors: No known data errors

Good to know, but I think it's still a bit of a ZFS fault. The error message 'md0 is in use (r1w1e1)' means that something (I'm quite sure it's ZFS) keeps the device open. Why does it keep it open when it doesn't recognize it? Or maybe it tries to open it twice for write (exclusively) when replacing, which is not allowed in GEOM in FreeBSD. I can take a look at whether this is the former or the latter, but it should be fixed in ZFS itself, IMHO.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS (and quota)
On Mon, Oct 01, 2007 at 12:57:05PM +0100, Robert Milkowski wrote: Hello Neil, Thursday, September 27, 2007, 11:40:42 PM, you wrote: NP Roch - PAE wrote: Pawel Jakub Dawidek writes: I'm CCing zfs-discuss@opensolaris.org, as this doesn't look like a FreeBSD-specific problem. It looks like there is a problem with block allocation(?) when we are near the quota limit. The tank/foo dataset has quota set to 10m:
Without quota:
FreeBSD: # dd if=/dev/zero of=/tank/test bs=512 count=20480 (time: 0.7s)
Solaris: # dd if=/dev/zero of=/tank/test bs=512 count=20480 (time: 4.5s)
With quota:
FreeBSD: # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480 -> dd: /tank/foo/test: Disc quota exceeded (time: 306.5s)
Solaris: # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480 -> write: Disc quota exceeded (time: 602.7s)
CPU is almost entirely idle, but disk activity seems to be high. Yes, as we are near the quota limit, each transaction group will accept a small amount so as not to overshoot the limit. I don't know if we have the optimal strategy yet. -r NP Aside from the quota perf issue, has any analysis been done as to NP why FreeBSD is over 6X faster than Solaris without quotas? NP Do other perf tests show a similar disparity? NP Is there a difference in dd itself? NP I assume that it was identical hardware and pool config.

(I don't see this e-mail in my ZFS inbox; that's why I'm replying to Robert's e-mail.) Just to clarify: this was entirely different hardware. My e-mail was __only__ about quota performance in ZFS. Please do not try to use those results for any other purpose.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS (and quota)
I'm CCing zfs-discuss@opensolaris.org, as this doesn't look like a FreeBSD-specific problem. It looks like there is a problem with block allocation(?) when we are near the quota limit. The tank/foo dataset has quota set to 10m:
Without quota:
FreeBSD: # dd if=/dev/zero of=/tank/test bs=512 count=20480 (time: 0.7s)
Solaris: # dd if=/dev/zero of=/tank/test bs=512 count=20480 (time: 4.5s)
With quota:
FreeBSD: # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480 -> dd: /tank/foo/test: Disc quota exceeded (time: 306.5s)
Solaris: # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480 -> write: Disc quota exceeded (time: 602.7s)
CPU is almost entirely idle, but disk activity seems to be high. Any ideas?

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
[zfs-discuss] The ZFS-Man.
Hi. I gave a talk about ZFS during EuroBSDCon 2007, and because it won the best talk award and some found it funny, here it is: http://youtube.com/watch?v=o3TGM0T1CvE A bit better version is here: http://people.freebsd.org/~pjd/misc/zfs/zfs-man.swf BTW, inspired by the ZFS demos from the OpenSolaris page, I created a few demos of ZFS on FreeBSD: http://youtube.com/results?search_query=freebsd+zfs&search=Search And better versions: http://people.freebsd.org/~pjd/misc/zfs/

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS Evil Tuning Guide
On Mon, Sep 17, 2007 at 03:40:05PM +0200, Roch - PAE wrote: Tuning should not be done in general, and best practices should be followed. So get very much acquainted with this first: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide Then if you must, this could soothe or sting: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide So drive carefully. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss or application-level corruption. However, the ZFS pool integrity itself is NOT compromised by this tuning.

Are you sure? Once you turn off cache flushing, how can you tell that your disk didn't reorder writes so that the uberblock was updated before the new blocks were written? Will ZFS go to the previous blocks when the newest uberblock points at corrupted data?

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Thu, Sep 13, 2007 at 04:58:10AM +, Marc Bevand wrote: Pawel Jakub Dawidek pjd at FreeBSD.org writes: This is how RAIDZ fills the disks (follow the numbers):
Disk0 Disk1 Disk2 Disk3
D0    D1    D2    P3
D4    D5    D6    P7
D8    D9    D10   P11
D12   D13   D14   P15
D16   D17   D18   P19
D20   D21   D22   P23
D is data, P is parity. This layout assumes of course that large stripes have been written to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is possible for a single logical block to span only 2 disks (for those who don't know what I am talking about, see the red block occupying LBAs D3 and E3 on page 13 of these ZFS slides [1]).

Yes, I'm aware of that.

To read this logical block (and validate its checksum), only D_0 needs to be read (LBA E3). So in this very specific case, a RAIDZ read operation is as cheap as a RAID5 read operation. [...]

If you do single-sector writes - yes, but this is very inefficient, for two reasons: 1. Bandwidth - writing one sector at a time? Come on. 2. Space - when you write one sector and its parity, you consume two sectors. You may have more than one parity column in that case, e.g.:
Disk0 Disk1 Disk2 Disk3 Disk4 Disk5
D0    P0    D1    P1    D2    P2
In this case the space overhead is the same as in a mirror. [...]

The existence of these small stripes could explain why RAIDZ doesn't perform as badly as RAID5 in Pawel's benchmark...

No. As I said, the smallest block I used was 2kB, which means four 512b blocks plus one 512b of parity - each 2kB block uses all 5 disks.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote: On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote: And here are the results:
RAIDZ: Number of READ requests: 4. Number of WRITE requests: 0. Number of bytes to transmit: 695678976. Number of processes: 8. Bytes per second: 1305213. Requests per second: 75.
RAID5: Number of READ requests: 4. Number of WRITE requests: 0. Number of bytes to transmit: 695678976. Number of processes: 8. Bytes per second: 2749719. Requests per second: 158.
I'm a bit surprised by these results. Assuming relatively large blocks written, RAID-Z and RAID-5 should be laid out on disk very similarly, resulting in similar read performance.

Hmm, no. The data was organized very differently on the disks. The smallest block size used was 2kB, to ensure each block is written to all disks in the RAIDZ configuration. In the RAID5 configuration, however, a 128kB stripe size was used, which means each block was stored on one disk only. Now when you read the data, RAIDZ needs to read all disks for each block, and RAID5 needs to read only one disk for each block.

Did you compare the I/O characteristics of both? Was the bottleneck in the software or the hardware?

The bottleneck was definitely the disks. CPU was like 96% idle. To be honest, I expected, just like Jeff, a much bigger win in the RAID5 case.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote: On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote: Hi. I've a prototype RAID5 implementation for ZFS. It only works in non-degraded state for now. The idea is to compare RAIDZ vs. RAID5 performance, as I suspected that RAIDZ, because of full-stripe operations, doesn't work well for random reads issued by many processes in parallel. There is of course the write-hole problem, which can be mitigated by running scrub after a power failure or system crash. If I read your suggestion correctly, your implementation is much more like traditional raid-5, with a read-modify-write cycle? My understanding of the raid-z performance issue is that it requires full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and this is not the reason why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read parity. This is how RAIDZ fills the disks (follow the numbers):
Disk0 Disk1 Disk2 Disk3
D0    D1    D2    P3
D4    D5    D6    P7
D8    D9    D10   P11
D12   D13   D14   P15
D16   D17   D18   P19
D20   D21   D22   P23
D is data, P is parity. And RAID5 does this:
Disk0 Disk1 Disk2 Disk3
D0    D3    D6    P0,3,6
D1    D4    D7    P1,4,7
D2    D5    D8    P2,5,8
D9    D12   D15   P9,12,15
D10   D13   D16   P10,13,16
D11   D14   D17   P11,14,17
As you can see, even a small block is stored on all disks in RAIDZ, whereas on RAID5 a small block can be stored on one disk only.

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
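A tiny arithmetic sketch of the difference described above (my own model, not ZFS code; it assumes the benchmark's 5-disk layout, 512-byte sectors, and the 128kB RAID5 stripe unit):

```shell
#!/bin/sh
# Toy model: how many disks must a read of one logical block touch?

NDISKS=5              # 4 data disks + 1 parity
SECTOR=512            # bytes per sector
STRIPE_UNIT=131072    # RAID5 stripe unit used in the benchmark (128kB)
BLOCK=2048            # smallest ZFS block size used in the test (2kB)

# RAIDZ: the block is spread over up to NDISKS-1 data columns,
# so a read touches every data disk holding part of the block.
sectors=$((BLOCK / SECTOR))
data_disks=$((NDISKS - 1))
if [ "$sectors" -lt "$data_disks" ]; then
	raidz=$sectors
else
	raidz=$data_disks
fi

# RAID5: a block no larger than the stripe unit sits on a single disk.
raid5=$(( (BLOCK + STRIPE_UNIT - 1) / STRIPE_UNIT ))

echo "RAIDZ read touches $raidz disks, RAID5 read touches $raid5"
```

Run on the thread's numbers, this shows a 2kB random read touching 4 disks on RAIDZ but only 1 on RAID5, which is the imbalance the benchmark measured.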
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote: This is how RAIDZ fills the disks (follow the numbers):

Disk0  Disk1  Disk2  Disk3
D0     D1     D2     P3
D4     D5     D6     P7
D8     D9     D10    P11
D12    D13    D14    P15
D16    D17    D18    P19
D20    D21    D22    P23

D is data, P is parity. And RAID5 does this:

Disk0  Disk1  Disk2  Disk3
D0     D3     D6     P0,3,6
D1     D4     D7     P1,4,7
D2     D5     D8     P2,5,8
D9     D12    D15    P9,12,15
D10    D13    D16    P10,13,16
D11    D14    D17    P11,14,17

Surely the above is not accurate? You're showing the parity data only being written to disk3. In RAID5 the parity is distributed across all disks in the RAID5 set. What is illustrated above is RAID3. It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are differences in practice), but my point wasn't how the parity is distributed :) Ok, RAID5 once again:

Disk0  Disk1  Disk2      Disk3
D0     D3     D6         P0,3,6
D1     D4     D7         P1,4,7
D2     D5     D8         P2,5,8
D9     D12    P9,12,15   D15
D10    D13    P10,13,16  D16
D11    D14    P11,14,17  D17

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpjnuDDD5adp.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Tue, Sep 11, 2007 at 08:16:02AM +0100, Robert Milkowski wrote: Are you overwriting old data? I hope you're not... I am, I overwrite parity, this is the whole point. That's why the ZFS designers used RAIDZ instead of RAID5, I think. I don't think you should suffer from the above problem in ZFS due to COW. I do, because independent blocks share the same parity block. If you are not overwriting and you're just writing to new locations from the pool perspective those changes (both new data block and checksum block) won't be active until they are both flushed and the uberblock is updated... right? Assume a 128kB stripe size in RAID5. You have three disks: A, B and C. ZFS writes 128kB at offset 0. This makes RAID5 write data to disk A and parity to disk C (both at offset 0). Then ZFS writes 128kB at offset 128kB. RAID5 writes data to disk B (at offset 0) and updates the parity on disk C (also at offset 0). As you can see, two independent ZFS blocks share one parity block. COW won't help you here; you would need to be sure that each ZFS transaction goes to a different (and free) RAID5 row. This is, I believe, the main reason why plain RAID5 wasn't used in the first place. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpVx1begmkQi.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote: Hello Pawel, Excellent job! Now I guess it would be a good idea to get writes done properly, even if it means making them slow (like with SVM). The end result would be - if you want fast writes/slow reads go ahead with raid-z; if you need fast reads/slow writes go with raid-5. Writes in non-degraded mode already work; only degraded mode doesn't work yet. My implementation is based on RAIDZ, so I'm planning to support RAID6 as well. btw: I'm just thinking out loud - for raid-5 writes, couldn't you somehow utilize the ZIL to make writes safe? I'm asking because we've got the ability to put the ZIL somewhere else, like an NVRAM card... The problem with RAID5 is that different blocks share the same parity, which is not the case for RAIDZ. When you write a block in RAIDZ, you write the data and the parity, and then you switch the pointer in the uberblock. For RAID5, you write the data and you need to update the parity, which also protects some other data. Now if you write the data, but don't update the parity before a crash, you have a hole. If you update the parity before the data write and then crash, the parity is inconsistent with a different block in the same stripe. My idea was to have one sector every 1GB on each disk for a journal to keep a list of blocks being updated. For example, you want to write 2kB of data at offset 1MB. You first store offset+size in this journal, then write the data and update the parity, and then remove offset+size from the journal. Unfortunately, we would need to flush the write cache twice: after the offset+size addition and before the offset+size removal. We could optimize it by doing lazy removal, eg. wait for ZFS to flush the write cache as a part of a transaction and then remove old offset+size pairs. But I still expect this to add too much overhead. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
pgpKARqkGHZjL.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Bad Blocks Handling
On Mon, Aug 27, 2007 at 10:00:10PM -0700, RL wrote: Hi, Does ZFS flag blocks as bad so it knows to avoid using them in the future? No it doesn't. This would be a really nice feature to have, but currently when ZFS tries to write to a bad sector it simply retries a few times and gives up. With the COW model it shouldn't be very hard to try another block and mark this one as bad, but it's not implemented yet. During testing I had huge numbers of unrecoverable checksum errors, which I resolved by disabling write caching on the disks. After doing this, and confirming the errors had stopped occurring, I removed the test files. A few seconds after removing the test files, I noticed the used space dropped from 16GB to 11GB according to 'df', but it did not appear to ever drop below this value. Is this just normal file system overhead (this is a raidz with 8x 500GB drives), or has ZFS not freed some of the space allocated to bad files? Can you retry your test without the write cache, starting from recreating the pool? -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpFBsjIFy6F3.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New version of the ZFS test suite released
On Fri, Aug 03, 2007 at 10:56:53PM -0700, Jim Walker wrote: Version 1.8 of the ZFS test suite was released today on opensolaris.org. The ZFS test suite source tarballs, packages and baseline can be downloaded at: http://dlc.sun.com/osol/test/downloads/current/ The ZFS test suite source can be browsed at: http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/zfs/ More information on the ZFS test suite is at: http://opensolaris.org/os/community/zfs/zfstestsuite/ Questions about the ZFS test suite can be sent to zfs-discuss at: http://www.opensolaris.org/jive/forum.jspa?forumID=80 Is it in a Mercurial repository? I'm not able to download it, but maybe I'm using the wrong path: % hg clone ssh://[EMAIL PROTECTED]/hg/test/ontest-stc2 test remote: Repository 'hg/test/ontest-stc2' inaccessible: No such file or directory. abort: no suitable response from remote hg! -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpOxUl71BDgf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import minor bug in snv_64a
On Mon, Jun 25, 2007 at 02:34:21AM -0400, Dennis Clarke wrote: in /usr/src/cmd/zpool/zpool_main.c : at line 680 forwards we can probably check for this scenario :

if ((altroot != NULL) && (altroot[0] != '/')) {
        (void) fprintf(stderr, gettext("invalid alternate root '%s': "
            "must be an absolute path\n"), altroot);
        nvlist_free(nvroot);
        return (1);
}
/* some altroot has been specified
 * thus altroot[0] and altroot[1] exist */
else if ((altroot[0] = '/') && (altroot[1] = '\0')) {

s/=/==/

        (void) fprintf(stderr, "Do not specify / as alternate root.\n");

You need gettext() here.

        nvlist_free(nvroot);
        return (1);
}

not perfect .. but something along those lines. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpKWVUs2EH4y.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS Scalability/performance
On Sat, Jun 23, 2007 at 10:21:14PM -0700, Anton B. Rang wrote: Oliver Schinagl wrote: so basically, what you are saying is that on FBSD there's no performance issue, whereas on solaris there (can be if write caches aren't enabled) Solaris plays it safe by default. You can, of course, override that safety. FreeBSD plays it safe too. It's just that UFS, and other file systems on FreeBSD, understand write caches and flush at appropriate times. That's not true. None of the file systems in FreeBSD understands and flushes the disk write cache, except for ZFS and UFS+gjournal. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpuN3mkKFpNW.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Scalability/performance
On Wed, Jun 20, 2007 at 01:45:29PM +0200, Oliver Schinagl wrote: Pawel Jakub Dawidek wrote: On Tue, Jun 19, 2007 at 07:52:28PM -0700, Richard Elling wrote: On that note, I have a different first question to start with. I personally am a Linux fanboy, and would love to see/use ZFS on linux. I assume that I can use those ZFS disks later with any os that can work with/recognizes ZFS, correct? e.g. I can install/setup ZFS in FBSD, and later use it in OpenSolaris/Linux Fuse(native)? The on-disk format is an available specification and is designed to be platform neutral. We certainly hope you will be able to access the zpools from different OSes (one at a time). It would be nice not to have to EFI-label disks, though :) Currently there is a problem with this - a zpool created on Solaris is not recognized by FreeBSD, because FreeBSD claims the GPT label is corrupted. On the other hand, a pool created with ZFS on FreeBSD (on a raw disk) can be used under Solaris. I read this earlier, that it's recommended to use a whole disk instead of a partition with zfs; the thing that's holding me back however is the mixture of different sized disks I have. I suppose if I had a 300gb-per-disk raid-z going on 3 300gb disks and one 320gb disk, but only have a partition of 300gb on it (still with me), I could later expand that partition with fdisk and the entire raid-z would then expand to 320gb per disk (assuming the other disks magically gain 20gb, so this is a bad example in that sense :) ) Also what about full disk vs full partition, e.g. make 1 partition span the entire disk vs using the entire disk. Is there any significant performance penalty? (So not having a disk split into 2 partitions, but 1 disk, 1 partition) I read that with a full raw disk zfs will be better able to utilize the disk's write cache, but I don't see how. On FreeBSD (thanks to GEOM) there is no difference what you have under ZFS. On Solaris, ZFS turns on the disk's write cache when a whole disk is used.
On FreeBSD the write cache is enabled by default, and GEOM consumers can send a write-cache-flush (BIO_FLUSH) request to any GEOM provider. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpZkCuJUZmIl.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Scalability/performance
On Wed, Jun 20, 2007 at 09:48:08AM -0700, Eric Schrock wrote: On Wed, Jun 20, 2007 at 12:45:52PM +0200, Pawel Jakub Dawidek wrote: It would be nice not to have to EFI-label disks, though :) Currently there is a problem with this - a zpool created on Solaris is not recognized by FreeBSD, because FreeBSD claims the GPT label is corrupted. On the other hand, a pool created with ZFS on FreeBSD (on a raw disk) can be used under Solaris. FYI, the primary reason for using EFI labels is that they are endian-neutral, unlike Solaris VTOC. The secondary reason is that they are simpler and easier to use (at least on Solaris). I'm curious why FreeBSD claims the GPT label is corrupted. Is this because FreeBSD doesn't understand EFI labels, our EFI label is bad, or is there a bug in the FreeBSD EFI implementation? I haven't investigated this yet. FreeBSD should understand EFI, so it's one of the last two, or a bug in the Solaris EFI implementation :) I seem to recall similar problems on Linux with ZFS/FUSE... -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpd81Zg8xdCo.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Preparing to compare Solaris/ZFS and FreeBSD/ZFS performance.
On Thu, May 24, 2007 at 11:20:44AM +0100, Darren J Moffat wrote: Pawel Jakub Dawidek wrote: Hi. I'm all set for doing a performance comparison between Solaris/ZFS and FreeBSD/ZFS. I spent the last few weeks on FreeBSD/ZFS optimizations and I think I'm ready. The machine is 1xQuad-core DELL PowerEdge 1950, 2GB RAM, 15 x 74GB-FC-10K disks accessed via 2x2Gbit FC links. Unfortunately the links to the disks are the bottleneck, so I'm going to use not more than 4 disks, probably. I do know how to tune FreeBSD properly, but I don't know much about Solaris tuning. I just upgraded Solaris to: SunOS lab14.wheel.pl 5.11 opensol-20070521 i86pc i386 i86pc I took upgrades from: http://dlc.sun.com/osol/on/downloads/current/ I believe this is a version with some debugging options turned on. How can I turn debugging off? Can I, or do I need to install something else? What other tunings should I apply? Don't install from bfu archives; instead install Solaris Express directly from a DVD image. Or, if you do want to use bfu because you really want to match your source code revisions up to a given day, then you will need to build the ON consolidation yourself and you can then install the non-debug bfu archives (note you will need to download the non-debug closed bins to do that). The easiest way is to just use a DVD install. Ha, I originally installed from sol-nv-b55b-x86-dvd-iso-[a-e].zip, but then upgraded to OpenSolaris. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpi5q9PL2OoS.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Preparing to compare Solaris/ZFS and FreeBSD/ZFS performance.
On Thu, May 24, 2007 at 01:16:32PM +0200, Claus Guttesen wrote: I'm all set for doing a performance comparison between Solaris/ZFS and FreeBSD/ZFS. I spent the last few weeks on FreeBSD/ZFS optimizations and I think I'm ready. The machine is 1xQuad-core DELL PowerEdge 1950, 2GB RAM, 15 x 74GB-FC-10K disks accessed via 2x2Gbit FC links. Unfortunately the links to the disks are the bottleneck, so I'm going to use not more than 4 disks, probably. I do know how to tune FreeBSD properly, but I don't know much about Solaris tuning. I just upgraded Solaris to: I have just (re)installed FreeBSD amd64 current with gcc 4.2 with src from May 21st on a dual Dell PE 2850. Does the post-gcc-4.2 current include all your zfs optimizations? I have commented out INVARIANTS, INVARIANTS_SUPPORT, WITNESS and WITNESS_SKIPSPIN in my kernel and recompiled with CPUTYPE=nocona. A few weeks ago I installed FreeBSD but it panicked when I used iozone. So I installed solaris 10 on this box and wanted to keep it that way. But solaris lacks FreeBSD ports ;-) so when current upgraded gcc to 4.2 I re-installed FreeBSD and the box is so far very stable. I have imported a 3.9 GB compressed postgresql dump five times to tune io-performance, have copied 66 GB of data from another server using nfs, installed 117 packages from the ports-collection and it's *very* stable. A default install of solaris fares better io-wise compared to a default FreeBSD: writes could pass 100 MB/s (zpool iostat 1) where FreeBSD would write 30-40 MB/s. After adding the following to /boot/loader.conf, writes peak at 90-95 MB/s:

vm.kmem_size_max=2147483648
vfs.zfs.arc_max=1610612736

Now FreeBSD seems to perform almost as well as solaris io-wise, although I don't have any numbers to justify my statement. I did not import postgresql in solaris as one thing. Copying the 3.9 GB dump from $HOME to a subdir takes 1 min. 13 secs. which is approx. 55 MB/s. Reads peaked at 115 MB/s.
The storage is an atabeast with two raid-controllers connected via two qlogic 2300 hba's. Each controller has four raid5-arrays with five 400 GB disks each.

zetta~# zpool status
  pool: disk1
 state: ONLINE
 scrub: scrub completed with 0 errors on Thu May 24 21:39:46 2007
config:

        NAME        STATE     READ WRITE CKSUM
        disk1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0

errors: No known data errors

The atabeast is not the fastest storage-provider around, but this machine will primarily be a file- and mail-server. Are there any other tunables on FreeBSD I can look at? There is probably not much you can do to tune sequential I/Os. I'd suggest starting the investigation by benchmarking the drivers on both systems, using raw disks (without ZFS). There are some other things you could try to improve different workloads. To improve concurrency you should use shared locks for VFS lookups: # sysctl vfs.lookup_shared=1 This patch also improves concurrency in VFS: http://people.freebsd.org/~pjd/patches/vfs_shared.patch When you want to operate on mmap(2)ed files, you should disable the ZIL and re-mount the file systems: # sysctl vfs.zfs.zil_disable=1 # zpool export name # zpool import name I think the ZIL should be a dataset property, as the differences depending on the workload are huge. For example, the fsx test is like 15 _times_ faster when the ZIL is disabled. There are still some things to optimize, like using UMA for memory allocations, but then we run out of KVA too fast. Benchmarking a file system is not easy, as there are other subsystems involved, like the namecache or the VM. The fsstress test, which mostly operates on metadata (creates and removes files and directories, renames them, etc.), is 3 times faster on FreeBSD/ZFS than on Solaris/ZFS, but I believe it's mostly because of the namecache implementation. The Solaris guys should seriously look at improving DNLC or replacing it.
Another possibility is VFS, but Solaris VFS is much cleaner, and I somehow don't believe it's slower. fsx is about 20% faster on FreeBSD; this could be the VM's fault. Don't take these numbers too seriously - those were only first tries to see where my port stands, and I was using OpenSolaris for comparison, which has debugging turned on. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpylZji1JT6R.pgp Description: PGP signature
[zfs-discuss] Preparing to compare Solaris/ZFS and FreeBSD/ZFS performance.
Hi. I'm all set for doing a performance comparison between Solaris/ZFS and FreeBSD/ZFS. I spent the last few weeks on FreeBSD/ZFS optimizations and I think I'm ready. The machine is 1xQuad-core DELL PowerEdge 1950, 2GB RAM, 15 x 74GB-FC-10K disks accessed via 2x2Gbit FC links. Unfortunately the links to the disks are the bottleneck, so I'm going to use not more than 4 disks, probably. I do know how to tune FreeBSD properly, but I don't know much about Solaris tuning. I just upgraded Solaris to: SunOS lab14.wheel.pl 5.11 opensol-20070521 i86pc i386 i86pc I took upgrades from: http://dlc.sun.com/osol/on/downloads/current/ I believe this is a version with some debugging options turned on. How can I turn debugging off? Can I, or do I need to install something else? What other tunings should I apply? -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpy9uZcwLmAl.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
On Thu, May 03, 2007 at 02:15:45PM -0700, Bakul Shah wrote: [originally reported for ZFS on FreeBSD but Pawel Jakub Dawidek says this problem also exists on Solaris hence this email.] Thanks! Summary: on ZFS, the overhead of reading a hole seems far worse than actually reading from a disk. Small buffers are used to make this overhead more visible. I ran the following script on both ZFS and UFS2 filesystems. [Note that on FreeBSD cat uses a 4k buffer and md5 uses a 1k buffer. On Solaris you can replace them with dd with the respective buffer sizes for this test and you should see similar results.]

$ dd < /dev/zero bs=1m count=10240 > SPACY  # 10G zero bytes allocated
$ truncate -s 10G HOLEY                     # no space allocated

$ time dd < SPACY > /dev/null bs=1m  # A1
$ time dd < HOLEY > /dev/null bs=1m  # A2
$ time cat SPACY > /dev/null         # B1
$ time cat HOLEY > /dev/null         # B2
$ time md5 SPACY                     # C1
$ time md5 HOLEY                     # C2

I have summarized the results below.

                     ZFS                UFS2
                Elapsed  System    Elapsed  System   Test
dd SPACY bs=1m   110.26   22.52     340.38   19.11    A1
dd HOLEY bs=1m    22.44   22.41      24.24   24.13    A2
cat SPACY        119.64   33.04     342.77   17.30    B1
cat HOLEY        222.85  222.08      22.91   22.41    B2
md5 SPACY        210.01   77.46     337.51   25.54    C1
md5 HOLEY        856.39  801.21      82.11   28.31    C2

This is what I see on Solaris (the hole is 4GB):

# /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
real 23.7
# /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
real 21.2
# /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
real 31.4
# /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
real 7:32.2

-- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpHFXMS6aW7i.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: zfs performance on fuse (Linux) compared to other fs
On Mon, Apr 23, 2007 at 11:42:41PM -0700, Georg-W. Koltermann wrote: So, at this point in time that seems pretty discouraging for an everyday user, on Linux. nobody said that zfs-fuse is ready for an everyday user in its current state! ;) That's what I found out, wanted to share and get others' opinion on. I did not complain. I thought it might work, it might not, so I tried. BTW last night I tried ZFS on FreeBSD 7. I got a panic when trying to make it import my existing pool at first. [...] Can I see the panic message and backtrace? [...] Then I tried again another way and did get it to recognize it. My simple, non-representative performance measurement was even slower than zfs-fuse (something like 4-5 minutes for the find, no apparent caching effect), and I had many USB read errors along the way as well. It looks like FBSD 7 with ZFS is even more immature than zfs-fuse at this time. That's ok, it is a CVS snapshot of FreeBSD CURRENT after all. First of all, the CURRENT snapshot comes with a kernel which has some heavy debugging options turned on by default. Turning off WITNESS should make ZFS work a few times faster. Was find the only test you tried? Currently I'm using the ported DNLC namecache, but I already have working code that uses FreeBSD's namecache, and it performs much better for such a test. There were a few nits after the import, which are all (or mostly) fixed at this point, and I've had a huge number of reports from users that ZFS works very stably on FreeBSD. If you could reproduce the panic and send me the info I'd be grateful. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpDNnOyKEpF0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS committed to the FreeBSD base.
On Sun, Apr 08, 2007 at 08:03:11AM +0200, Bruno Damour wrote: hello, After csup, buildworld fails for me in libumem. Is this due to the zfs import? Or my config? Thanks for any clue, I'm dying to try your brand new zfs on amd64!! Bruno FreeBSD vil1.ruomad.net 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Fri Mar 23 07:33:56 CET 2007 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/VIL1 amd64 make buildworld: ===> cddl/lib/libumem (all) cc -O2 -fno-strict-aliasing -pipe -march=nocona -I/usr/src/cddl/lib/libumem/../../../compat/opensolaris/lib/libumem -D_SOLARIS_C_SOURCE -c /usr/src/cddl/lib/libumem/umem.c /usr/src/cddl/lib/libumem/umem.c:197: error: redefinition of 'nofail_cb' /usr/src/cddl/lib/libumem/umem.c:30: error: previous definition of 'nofail_cb' was here /usr/src/cddl/lib/libumem/umem.c:199: error: redefinition of `struct umem_cache' /usr/src/cddl/lib/libumem/umem.c:210: error: redefinition of 'umem_alloc' /usr/src/cddl/lib/libumem/umem.c:43: error: previous definition of 'umem_alloc' was here Did you use my previous patches? There is no cddl/lib/libumem/umem.c in HEAD; that was its old location and it was moved to compat/opensolaris/lib/libumem/. Delete your entire cddl/ directory and csup again. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpdFOvvrz3lO.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS committed to the FreeBSD base.
On Fri, Apr 06, 2007 at 05:54:37AM +0100, Ricardo Correia wrote: I'm interested in the cross-platform portability of ZFS pools, so I have one question: did you implement the Solaris ZFS whole-disk support (specifically, the creation and recognition of the EFI/GPT label)? Unfortunately some tools in Linux (parted and cfdisk) have trouble recognizing the EFI partition created by ZFS/Solaris.. I'm not yet set up to move disks between FreeBSD and Solaris, but my first goal was to integrate it with FreeBSD's GEOM framework. We support cache flushing operations on any GEOM provider (disk, partition, slice, anything disk-like), so basically I currently treat everything as a whole disk (because I simply can), but don't do any EFI/GPT labeling. I'll try to move data from a Solaris disk to FreeBSD and see what happens. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpIAGy8NZuKt.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS committed to the FreeBSD base.
On Fri, Apr 06, 2007 at 01:29:11PM +0200, Pawel Jakub Dawidek wrote: On Fri, Apr 06, 2007 at 05:54:37AM +0100, Ricardo Correia wrote: I'm interested in the cross-platform portability of ZFS pools, so I have one question: did you implement the Solaris ZFS whole-disk support (specifically, the creation and recognition of the EFI/GPT label)? Unfortunately some tools in Linux (parted and cfdisk) have trouble recognizing the EFI partition created by ZFS/Solaris.. I'm not yet set up to move disks between FreeBSD and Solaris, but my first goal was to integrate it with FreeBSD's GEOM framework. We support cache flushing operations on any GEOM provider (disk, partition, slice, anything disk-like), so basically I currently treat everything as a whole disk (because I simply can), but don't do any EFI/GPT labeling. I'll try to move data from a Solaris disk to FreeBSD and see what happens. First try: GEOM: ad6: corrupt or invalid GPT detected. GEOM: ad6: GPT rejected -- may not be recoverable. :) -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgp7PdZhefXya.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS committed to the FreeBSD base.
On Sat, Apr 07, 2007 at 12:39:14AM +0200, Bruno Damour wrote: Thanks, fantastically interesting! Currently ZFS is only compiled as a kernel module and is only available for the i386 architecture. Amd64 should be available very soon; the other archs will come later, as we implement the needed atomic operations. I'm eagerly waiting for the amd64 version Missing functionality. - There is no support for ACLs and extended attributes. Is this planned? Does that mean I cannot use it as a basis for a full-featured samba share? It is planned, but it's not trivial. Does samba support NFSv4-style ACLs? -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpbbjVRCmVwa.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Something like spare sectors...
Hi. What do you think about adding functionality similar to a disk's spare sectors - if a sector dies, a new one is assigned from the spare sectors pool. This would be very helpful, especially for laptops, where you have only one disk. I simulated returning EIO for one sector from a one-disk pool and, as you know, the system panicked: panic: ZFS: I/O failure (write on unknown off 0: zio 0xc436d400 [L0 zvol object] 2000L/2000P DVA[0]=0:4000:2000 fletcher2 uncompressed LE contiguous birth=11 fill=1 cksum=90519dcb617667ac:e96316f8a73d7efc:8ca812fc04509f9b:9b9632c6959cbd71): error 5 From what I saw, ZFS retried the write to this sector once more before panicking, but why not just try another block? And maybe remember the problematic block somewhere. Of course this won't save us when a read operation fails, but it should work quite well for writes. I'm not sure how vdev_mirror works exactly, ie. whether it needs both mirror components to be identical or whether the only guarantee is that they have the same data, but not necessarily in the same place. If the latter, the proposed mechanism could also be used as a part of the self-healing process, I think. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpsXiWvpsM1G.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS committed to the FreeBSD base.
Hi. I'm happy to inform you that the ZFS file system is now part of the FreeBSD operating system. ZFS is available in the HEAD branch and will be available in FreeBSD 7.0-RELEASE as an experimental feature. Commit log: Please welcome ZFS - The last word in file systems. The ZFS file system was ported from the OpenSolaris operating system. The code is under the CDDL license. I'd like to thank all the Sun developers who created this great piece of software. Supported by: Wheel LTD (http://www.wheel.pl/) Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/) Supported by: Sentex (http://www.sentex.net/) Limitations. Currently ZFS is only compiled as a kernel module and is only available for the i386 architecture. Amd64 should be available very soon; the other archs will come later, as we implement the needed atomic operations. Missing functionality. - We don't have an iSCSI target daemon in the tree, so sharing ZVOLs via iSCSI is also not supported at this point. This should be fixed in the future; we may also add support for sharing ZVOLs over ggate. - There is no support for ACLs and extended attributes. - There is no support for booting off of a ZFS file system. Other than that, ZFS should be fully functional. Enjoy! -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! pgpOhwEO3qYF2.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] User-defined properties.
Hi. How can a user-defined property be removed? I can't find a way...
Re: [zfs-discuss] User-defined properties.
On Sun, Apr 01, 2007 at 12:03:36PM -0700, Eric Schrock wrote: You should be able to get rid of it with 'zfs inherit'. It's not exactly intuitive, but it matches the native property behavior. If you have any advice for improving documentation, please let us know. Indeed, but I was looking for something as simple as 'zfs del property filesystem'. Your method won't work in this situation:

# zfs create tank/foo
# zfs create tank/foo/bar
# zfs set org.freebsd:test=test tank/foo
# zfs get -r org.freebsd:test tank/foo
NAME          PROPERTY          VALUE  SOURCE
tank/foo      org.freebsd:test  test   local
tank/foo/bar  org.freebsd:test  test   inherited from tank/foo

Now, how do I remove it only from tank/foo/bar? Assume I have many datasets under tank/foo/; I don't want to remove the property from tank/foo and re-add it to each dataset.
Re: [zfs-discuss] User-defined properties.
On Sun, Apr 01, 2007 at 02:20:29PM -0700, Eric Schrock wrote: This can't be done due to the way ZFS property inheritance works in the DSL. You can explicitly set it to the empty string, but you can't unset the property altogether. This is exactly why the 'zfs get -s local' option exists, so you can find only locally-set properties. Ok, thanks!
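Eric's answer can be illustrated with a toy model of property lookup (an analogy, not the real DSL code): a dataset without a local value walks up to its parent, so the only way to hide an inherited user property on one child is to set it locally, e.g. to the empty string, which masks the inherited value.

```python
# Toy model of ZFS property lookup (not the real DSL code): a dataset
# without a local value inherits from its parent, so an inherited user
# property can only be masked by a local value, never truly unset.

class Dataset:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.local = name, parent, {}

    def get(self, prop):
        if prop in self.local:
            return self.local[prop], "local"
        if self.parent is not None:
            value, source = self.parent.get(prop)
            if source != "-":
                return value, f"inherited from {self.parent.name}"
        return None, "-"

foo = Dataset("tank/foo")
bar = Dataset("tank/foo/bar", parent=foo)
foo.local["org.freebsd:test"] = "test"
# the only way to "remove" it on the child: mask with a local empty string
```

With this model, bar reports the value as inherited from tank/foo until bar gets a local value of its own, which is exactly what 'zfs get -s local' distinguishes.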
Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote: Again, thanks to devids, the autoreplace code would not kick in here at all. You would end up with an identical pool. Eric, maybe I'm missing something, but why does ZFS depend on devids at all? As I understand it, a devid is something that never changes for a block device, eg. the disk serial number, but on the other hand it is optional, so we can't rely on it always being there (I mean for all block devices we use). Why not simply forget about devids and just focus on on-disk metadata to detect pool components? The only reason I see is performance. This is probably why /etc/zfs/zpool.cache is used as well. In FreeBSD we have the GEOM infrastructure for storage. Each storage device (disk, partition, mirror, etc.) is simply a GEOM provider. If a GEOM provider appears (eg. a disk is inserted, a partition is configured), all interested parties are informed about it and can 'taste' the provider by reading the metadata specific to them. The same when a provider goes away - all interested parties are informed and can react accordingly. We don't see any performance problems related to the fact that each disk that appears is read by many GEOM classes.
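The tasting mechanism described above can be sketched roughly like this (a toy analogy in Python, not the GEOM API): when a provider appears, every interested class reads its metadata and claims it only if its own magic matches, so no device ids are needed at all.

```python
# Toy analogy of GEOM "tasting" (not the real GEOM API): each class
# inspects the metadata of every provider that appears and claims only
# the ones carrying its own magic - identification comes from on-disk
# metadata, not from device ids.

class Provider:
    def __init__(self, name, metadata):
        self.name, self.metadata = name, metadata

class GeomClass:
    def __init__(self, magic):
        self.magic, self.claimed = magic, []

    def taste(self, provider):
        if provider.metadata.get("magic") == self.magic:
            self.claimed.append(provider.name)

def announce(provider, classes):
    for cls in classes:              # every interested class gets to taste it
        cls.taste(provider)

zfs, mirror = GeomClass("ZFS"), GeomClass("GEOM::MIRROR")
announce(Provider("ada0p3", {"magic": "ZFS", "guid": 42}), [zfs, mirror])
```

The point of the sketch is the last argument of announce: many classes tasting the same provider costs one metadata read each, which in practice is cheap.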
Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote: [...] it is optional, so we can rely on the fact it's always there (I mean for all block devices we use). s/can/can't/
Re: [zfs-discuss] suggestion: directory promotion to filesystem
On Wed, Feb 21, 2007 at 10:11:43AM -0800, Matthew Ahrens wrote: Adrian Saul wrote: Not hard to work around - zfs create and a mv/tar command and it is done... some time later. If there was say a zfs graft directory newfs command, you could just break off the directory as a new filesystem and away you go - no copying, no risk of cleaning up the wrong files, etc. Yep, this idea was previously discussed on this list -- search for zfs split and see the following RFE: 6400399 want zfs split zfs join was also discussed but I don't think it's especially feasible or useful. 'zfs join' can be hard because of inode number collisions, but it may be useful. Imagine a situation where you have the following file systems: /tank /tank/foo /tank/bar and you want to move a huge amount of data from /tank/foo to /tank/bar. If you use mv/tar/dump, it will copy all the data. It would be much faster to 'zfs join tank tank/foo', 'zfs join tank tank/bar', then just mv the data and 'zfs split' them back :)
Re: crypto properties (Was: Re: [zfs-discuss] ZFS inode equivalent)
On Fri, Feb 02, 2007 at 08:46:34AM +, Darren J Moffat wrote: Pawel Jakub Dawidek wrote: On Thu, Feb 01, 2007 at 11:00:07AM +, Darren J Moffat wrote: Neil Perrin wrote: No it's not the final version or even the latest! The current on disk format version is 3. However, it hasn't diverged much and the znode/acl stuff hasn't changed. and it will get updated as part of zfs-crypto, I just haven't done so yet because I'm not finished designing yet. Do you consider adding a new property type (next to readonly and inherit) - a oneway property? Such a property could only be set if the dataset has no children, no snapshots and no data, and once set it can't be modified. oneway would be the type of the encryption property. On the other hand you may still want to support encryption algorithm change and most likely key change. I'm not sure I understand what you are asking for. I'm sorry, it seems I started my explanation from too deep. I started to play with encryption on my own by creating a crypto compression algorithm. Currently there are a few types of property (readonly, inherited, etc.), but none of them seems suitable for encryption. When you enable encryption there should be no data, or you know that existing data is going to be encrypted and the plaintext data securely removed automatically. Of course the latter is much more complex to implement. My current plan is that once set, the encryption property that describes which algorithm is used (mechanism actually: algorithm, key length and mode, eg aes-128-ccm) cannot be changed; it would be inherited by any clones. Creating new child file systems rooted in an encrypted filesystem, you would be allowed to turn it off (I'd like to have a policy like the acl one here) but by default it would be inherited. Right. I forgot that a dataset created under another dataset doesn't share data with the parent.
Key change is a very difficult problem because in some cases it can mean rewriting all previous data, while in other cases it just means start using the new key now but keep the old one. Which is correct depends on why you are doing a key change. Key change for data at rest is a very different problem space from rekey in a network protocol. Key change is nice, and the possibility of an algorithm change is also nice in case the one you use becomes broken. What I'm doing in geli (my disk encryption software for FreeBSD) is to use a random, strong master key, which is encrypted with the user's passphrase, keyfiles, etc. This is nice because changing the user's passphrase doesn't affect the master key, and thus doesn't cost any I/O operations. Another nice thing about it is that you can have many copies of the master key protected by different passphrases. For example, two persons can decrypt your data: you and the security officer in your company. On the other hand, changing the master key should also be possible. A good starting point IMHO would be to support changing the user's passphrase (keyfile, etc.) without touching the master key, and to document changing the master key, algorithm, key length, etc. via eg. local zfs send/recv. In theory the algorithm could be different per dnode_phys_t just like checksum/compression are today; however, having aes-128 on one dnode and aes-256 on another causes a problem because you also need different keys for them, and it gets even more complex if you consider the algorithm mode or choosing completely different algorithms. Having a different algorithm and key length will certainly be possible for different filesystems though (eg root with aes-128 and home with aes-256). Maybe keys should be pool properties? You add a new key to the pool and then assign selected keys to the given datasets. You can then unlock a key using zpool(1M), or you'll be asked to unlock all keys used by a dataset when you want to mount/attach it (file system or zvol).
Once the key is unlocked, the remaining datasets that use the same key can be mounted/attached automatically. Just a thought...
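The geli-style scheme described above can be sketched as follows (illustrative only: XOR wrapping and all names are simplifications, not geli's real on-disk format): data is encrypted with a random master key, and the master key is stored wrapped under a key derived from the user's passphrase, so a passphrase change only rewraps the master key and never rewrites the data.

```python
# Illustrative sketch of the geli-style master-key scheme (not geli's
# real format; real wrapping uses a cipher, XOR stands in for brevity):
# a passphrase change rewraps the master key without touching the data.

import hashlib, secrets

def derive(passphrase, salt):
    # key-encrypting key derived from the passphrase (32 bytes)
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 100_000)

def wrap(master_key, passphrase, salt):
    kek = derive(passphrase, salt)
    return bytes(a ^ b for a, b in zip(master_key, kek))

unwrap = wrap  # XOR is its own inverse

master = secrets.token_bytes(32)   # random, strong master key
salt = secrets.token_bytes(16)
blob = wrap(master, "old passphrase", salt)   # what gets stored on disk

# passphrase change: rewrap the same master key, zero data I/O
blob2 = wrap(unwrap(blob, "old passphrase", salt), "new passphrase", salt)
```

Multiple wrapped copies of the same master key, each under a different passphrase, give the "you and the security officer" property from the mail.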
Re: [zfs-discuss] zfs rewrite?
On Fri, Jan 26, 2007 at 06:08:50PM -0800, Darren Dunham wrote: What do you guys think about implementing a 'zfs/zpool rewrite' command? It would read every block older than the date when the command was executed and write it again (using the standard ZFS COW mechanism, similar to how resilvering works, but the data is read from the same disk it is written to). #1 How do you control I/O overhead? The same way it is handled for scrub and resilver. #2 Snapshot blocks are never rewritten at the moment. Most of your suggestions seem to imply working on the live data, but doing that for snapshots as well might be tricky. Good point, see below. 3. I created a file system with a huge amount of data, where most of the data is read-only. I change my server from an intel to a sparc64 machine. Adaptive endianness only changes byte order to native on write, and because the file system is mostly read-only, it'll need to byteswap all the time. And here comes 'zfs rewrite'! It's only the metadata that is modified anyway, not the file data. I would hope that this could be done more easily than a full tree rewrite (and again the issue with snapshots). Also, the overhead there probably isn't going to be very high (since the metadata will be cached in most cases). Agreed. Probably in this case there should be a rewrite-only-metadata mode. I agree the overhead is probably not high, but on the other hand, I'm quite sure there are workloads that will see the difference, eg. 'find / -name something'. Other than that, I'm guessing something like this will be necessary to implement disk evacuation/removal. If you have to rewrite data from one disk to elsewhere in the pool, then rewriting the entire tree shouldn't be much harder. How did I forget about this one? :) That's right. I believe ZFS will gain such an ability at some point, and the rewrite functionality fits very nicely here: mark the disk/mirror/raid-z as no-more-writes and start the rewrite process (probably limited to that entity).
To implement such functionality there also has to be a way to migrate snapshot data, so sooner or later there will be a need for moving snapshot blocks.
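The proposed behavior can be modeled in a few lines (hypothetical; no 'zfs rewrite' command exists): every block whose birth time predates the start of the command is read and written back through the normal COW path, which gives it a new location and a new birth time.

```python
# Model of the proposed (hypothetical) 'zfs rewrite': each block born
# before the command started is written again through the COW path,
# so it ends up with a new birth time (and, on a real pool, a new
# location - e.g. off an evacuating disk, or in native byte order).

class Pool:
    def __init__(self):
        self.clock = 0
        self.blocks = {}              # block id -> birth time

    def write(self, blk):
        self.clock += 1               # COW: every write is a fresh birth
        self.blocks[blk] = self.clock

    def rewrite(self):
        cutoff = self.clock           # only touch blocks born before now
        for blk, birth in list(self.blocks.items()):
            if birth <= cutoff:
                self.write(blk)       # same data, new birth/location

pool = Pool()
for blk in ("a", "b", "c"):
    pool.write(blk)
pool.rewrite()                        # afterwards no block predates the cutoff
```

Snapshot blocks are exactly what this model omits: they are referenced at their old locations, which is why the mail notes that migrating snapshot data needs separate machinery.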
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
On Mon, Jan 08, 2007 at 11:00:36AM -0600, [EMAIL PROTECTED] wrote: I have been looking at the zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be low-hanging fruit for block-squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows. [...] I like the idea, but I'd prefer such an option to be per-pool, not per-filesystem. I found somewhere in the ZFS documentation that clones are nice to use for a large number of diskless stations. That's fine, but after every upgrade more and more files are updated and fewer and fewer blocks are shared between the clones. Having such functionality for the entire pool would be a nice optimization in this case. Actually, this doesn't have to be a per-pool option, but per-filesystem-hierarchy, ie. all file systems under tank/diskless/. I'm not yet sure how to build the list of hash-to-block mappings for large pools quickly at boot...
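The block-squishing idea can be sketched minimally (an illustration of CAS, not the ZFS implementation): blocks are keyed by their SHA-256 digest, so identical blocks are stored once and only a reference count grows.

```python
# Minimal sketch of block-level CAS (an illustration, not ZFS code):
# a block's address is its content hash, so writing the same data
# twice stores one copy and bumps a reference count.

import hashlib

class CASStore:
    def __init__(self):
        self.blocks = {}   # digest -> block data
        self.refs = {}     # digest -> reference count

    def put(self, block):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:        # new content: store it once
            self.blocks[digest] = block
        self.refs[digest] = self.refs.get(digest, 0) + 1
        return digest

    def get(self, digest):
        return self.blocks[digest]

store = CASStore()
a = store.put(b"same data")
b = store.put(b"same data")    # deduplicated: same digest, one stored copy
```

The hash-to-block table is also the hard part flagged at the end of the mail: on a large pool it must be rebuilt or persisted somehow, and it has to fit in memory to be consulted on every write.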
Re: [zfs-discuss] Differences between ZFS and UFS.
On Sat, Dec 30, 2006 at 11:28:55AM +0100, [EMAIL PROTECTED] wrote: Basically ZFS passes all my tests (about 3000). I see one problem with UFS and two differences: That's good; do you have those tests published anywhere? I'll publish them once I finish with Linux. They already work for FreeBSD/UFS, FreeBSD/ZFS, Solaris/UFS and Solaris/ZFS. 1. The link(2) manual page states that privileged processes can make multiple links to a directory. This looks like a general comment, but it's only true for UFS. Solaris UFS doesn't deal gracefully with that. (Fsck will complain and fix the fs, and two fsck passes are generally needed.) An argument can be made to ban this for UFS too. (Some of the other fses do support this, like tmpfs.) Maybe it's just worth mentioning in the manual page which file systems support this feature. 2. link(2) in UFS allows removing directories, but ZFS doesn't allow this. Link with the target being a directory and the source any file, or only directories? And only as superuser? I'm sorry, I meant unlink(2) here.
[zfs-discuss] Differences between ZFS and UFS.
Hi. Here are some things my file system test suite discovered on Solaris ZFS and UFS. Basically ZFS passes all my tests (about 3000). I see one problem with UFS and two differences: 1. The link(2) manual page states that privileged processes can make multiple links to a directory. This looks like a general comment, but it's only true for UFS. 2. link(2) in UFS allows removing directories, but ZFS doesn't allow this. 3. An unsuccessful link(2) can update a file's ctime:

# fstest mkdir foo 0755
# fstest create foo/bar 0644
# fstest chown foo/bar 65534 -1
# ctime1=`fstest stat foo/bar ctime`
# sleep 1
# fstest -u 65534 link foo/bar foo/baz   --- this unsuccessful operation updates ctime
EACCES
# ctime2=`fstest stat ${n0} ctime`
# echo $ctime1 $ctime2
1167440797 1167440798
Re: [zfs-discuss] Re: [security-discuss] Thoughts on ZFS Secure Delete - without using Crypto
On Tue, Dec 19, 2006 at 02:04:37PM +, Darren J Moffat wrote: In case it wasn't clear I am NOT proposing a UI like this: $ zfs bleach ~/Documents/company-finance.odp Instead ~/Documents or ~ would be a ZFS file system with a policy set something like this: # zfs set erase=file:zero Or maybe more like this: # zfs create -o erase=file -o erasemethod=zero homepool/darrenm The goal is the same as the goal for things like compression in ZFS, no application change, it is free for the applications. I like the idea, I really do, but it will be so expensive because of ZFS' COW model. Not only will file removal or truncation trigger bleaching, but every single file system modification... Heh, well, if the privacy of your data is important enough, you probably don't care too much about performance. I for one would prefer encryption, which may turn out to be much faster than bleaching and also more secure.
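Why COW makes bleaching so expensive can be shown with a toy model (plain Python, all names hypothetical): because a COW filesystem never overwrites in place, every update frees the old block, and a zero-on-free policy turns each of those frees into an extra bleaching write - not just file removals.

```python
# Toy model (hypothetical names, not ZFS code) of why an erase policy
# is costly under COW: every overwrite allocates a new block and frees
# the old one, and "zero on free" bleaches every freed block.

class COWDevice:
    def __init__(self):
        self.next_block = 0
        self.bleached = 0          # count of blocks zeroed on free

    def alloc(self, payload):
        self.next_block += 1       # COW: always a fresh block
        return self.next_block

    def free(self, blk, bleach):
        if bleach:
            self.bleached += 1     # the extra write the policy costs

class File:
    def __init__(self, dev, bleach):
        self.dev, self.bleach, self.blk = dev, bleach, None

    def write(self, payload):      # COW: never overwrite in place
        old = self.blk
        self.blk = self.dev.alloc(payload)
        if old is not None:
            self.dev.free(old, self.bleach)

dev = COWDevice()
f = File(dev, bleach=True)
for i in range(10):                # ten updates to the same file
    f.write(b"v%d" % i)
```

Ten updates of one block cost nine bleaching writes here even though nothing was ever deleted, which is the objection raised in the mail.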
Re: [zfs-discuss] Re: ZFS goes catatonic when drives go dead?
On Thu, Nov 23, 2006 at 12:09:09PM +0100, Pawel Jakub Dawidek wrote: On Wed, Nov 22, 2006 at 03:38:05AM -0800, Peter Eriksson wrote: There is nothing in the ZFS FAQ about this. I also fail to see how FMA could make any difference since it seems that ZFS is deadlocking somewhere in the kernel when this happens... It works if you wrap all the physical devices inside SVM metadevices and use those for your ZFS/zpool instead. Ie:

metainit d101 1 1 c1t5d0s0
metainit d102 1 1 c1t5d1s0
metainit d103 1 1 c1t5d2s0
zpool create foo raidz /dev/md/dsk/d101 /dev/md/dsk/d102 /dev/md/dsk/d103

Another unrelated observation - I've noticed that ZFS often works *faster* if I wrap a physical partition inside a metadevice and then feed that to zpool instead of using the raw partition directly with zpool... Example: Testing ZFS on a spare 40GB partition of the boot ATA disk in a Sun Ultra 10/440 gives horrible performance numbers. If I wrap that into a simple metadevice and feed it to ZFS, things work much faster... Ie: Zpool containing one normal disk partition:

# /bin/time mkfile 1G 1G
real  2:46.5
user     0.4
sys     24.1

-- 6MB/s (that was actually the best number I got - the worst was 3:03 minutes) Zpool containing one SVM metadevice containing the same disk partition:

# /bin/time mkfile 1G 1G
real  1:41.6
user     0.3
sys     23.3

-- 10MB/s (Idle machine in both cases, mkfile rerun a couple of times, with the same results. I removed the 1G file between reruns of course.) It may be because for raw disks ZFS flushes the write cache (via DKIOCFLUSHWRITECACHE), which can be an expensive operation and depends highly on the disks/controllers used. I doubt it does the same for metadevices, but I may be wrong. Oops, you operate on partitions... I think for partitions ZFS disables the write cache on disks... Anyway, I'll leave the answer to someone more clueful.
[zfs-discuss] ZFS patches for FreeBSD.
Just to let you know that the first set of ZFS patches for FreeBSD is now available: http://lists.freebsd.org/pipermail/freebsd-fs/2006-November/002385.html
[zfs-discuss] Re: Porting ZFS file system to FreeBSD.
On Tue, Sep 05, 2006 at 10:49:11AM +0200, Pawel Jakub Dawidek wrote: On Tue, Aug 22, 2006 at 12:45:16PM +0200, Pawel Jakub Dawidek wrote: Hi. I started porting the ZFS file system to the FreeBSD operating system. [...] Just a quick note about progress in my work. I needed to slow down a bit, but: Here is another update: After way too much time spent fighting the buffer cache, I finally made mmap(2)ed reads/writes work and (which is also very important) kept regular reads/writes working. Now I'm able to build FreeBSD's kernel and userland with both sources and objects placed on a ZFS file system. I also tried to crash it with fsx, fsstress and postmark, but no luck; it runs stably. On the other hand, I'm quite sure there are still many problems in the ZPL, but fixing mmap(2) allows me to move forward. As a side note, ZVOL seems to be fully functional. I need to find a way to test the ZIL, so if you guys at Sun have some ZIL tests (like an uncleanly stopped file system which at mount time will exercise the entire ZIL functionality, so we can verify that my FS was fixed properly), that would be great. PS. There is still a lot to do, so please, don't ask me for patches yet.
Re: [zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote: Matthew Ahrens wrote: [...] Given the overwhelming criticism of this feature, I'm going to shelve it for now. I'd really like to see this feature. You say ZFS should change our view of filesystems; I say: be consistent. In the ZFS world we create one big pool out of all our disks and create filesystems on top of it. This way we don't have to care about resizing them, etc. But this way we also define redundancy at the pool level for all our filesystems. It is quite common to have data we don't really care about as well as data we care about a lot in the same pool. Before ZFS, I'd just create RAID0 for the former and RAID1 for the latter, but that is not the ZFS way, right? My question is: how can I express my intent of defining the redundancy level based on the importance of my data, while still following the ZFS way, without the 'copies' feature? Please reconsider your choice.
[zfs-discuss] Re: Porting ZFS file system to FreeBSD.
On Tue, Aug 22, 2006 at 12:45:16PM +0200, Pawel Jakub Dawidek wrote: Hi. I started porting the ZFS file system to the FreeBSD operating system. [...] Just a quick note about progress in my work. I needed to slow down a bit, but: All file system operations seem to work. The only exceptions are the operations needed for mmap(2). Basically the file system works quite stably even under heavy load. I have a problem with two assertions I'm hitting when running some heavy regression tests. I've spent a couple of days fighting with snapshots. To be able to implement them I needed to port GFS (the generic pseudo-filesystem) from Solaris. Now snapshots (and clones) seem to work just fine. Some other minor bits, like zpool import/export, etc., now also work. The file system is not yet marked MPSAFE (it still operates under the Giant lock).
[zfs-discuss] Porting ZFS file system to FreeBSD.
Hi. I started porting the ZFS file system to the FreeBSD operating system. There is a lot to do, but I'm making good progress, I think. I'm doing my work in these directories:

contrib/opensolaris/ - userland files taken directly from OpenSolaris (libzfs, zpool, zfs and others)
sys/contrib/opensolaris/ - kernel files taken directly from OpenSolaris (zfs, taskq, callb and others)
compat/opensolaris/ - compatibility userland layer, so I can reduce diffs against vendor files
sys/compat/opensolaris/ - compatibility kernel layer, so I can reduce diffs against vendor files (kmem based on malloc(9) and uma(9), mutexes based on our sx(9) locks, condvars based on sx(9) locks and more)
cddl/ - FreeBSD-specific makefiles for userland bits
sys/modules/zfs/ - FreeBSD-specific makefile for the kernel module

You can find all of this on the FreeBSD perforce server: http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/pjd/zfs&HIDEDEL=NO Ok, so where am I? I ported the userland bits (libzfs, zfs and zpool). I had ztest and libzpool compiling and working as well, but I left them behind for now to focus on the kernel bits. I'm building all (except 2) files into zfs.ko (kernel module). I created a new VDEV - vdev_geom, which fits into FreeBSD's GEOM infrastructure, so basically you can use any GEOM provider to build your ZFS pool. VDEV_GEOM is implemented as a consumers-only GEOM class. I reimplemented ZVOL to also export storage as a GEOM provider. This time it is a providers-only GEOM class. This way one can create, for example, RAID-Z on top of GELI-encrypted disks, or encrypt a ZFS volume. The order is free. Basically you can already put UFS on ZFS volumes and it behaves really stably even under heavy load. Currently I'm working on the file system bits (ZPL), which is the hardest part of the entire ZFS port, because it talks to one of the most complex parts of the FreeBSD kernel - VFS.
I can already mount ZFS-created file systems (with the 'zfs create' command), create files/directories, change permissions/owner/etc., list directory contents, and perform a few other minor operations. Some screenshots:

lcf:root:~# uname -a
FreeBSD lcf 7.0-CURRENT FreeBSD 7.0-CURRENT #74: Tue Aug 22 03:04:01 UTC 2006 [EMAIL PROTECTED]:/usr/obj/zoo/pjd/lcf/sys/LCF i386
lcf:root:~# zpool create tank raidz /dev/ad4a /dev/ad6a /dev/ad5a
lcf:root:~# zpool list
NAME    SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
tank   35,8G  11,7M   35,7G    0%   ONLINE   -
lcf:root:~# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad4a    ONLINE       0     0     0
            ad6a    ONLINE       0     0     0
            ad5a    ONLINE       0     0     0

errors: No known data errors
lcf:root:# zfs create -V 10g tank/vol
lcf:root:# newfs /dev/zvol/tank/vol
lcf:root:# mount /dev/zvol/tank/vol /mnt/test
lcf:root:# zfs create tank/fs
lcf:root:~# mount -t zfs,ufs
tank on /tank (zfs, local)
tank/fs on /tank/fs (zfs, local)
/dev/zvol/tank/vol on /mnt/test (ufs, local)
lcf:root:~# df -ht zfs,ufs
Filesystem            Size   Used   Avail  Capacity  Mounted on
tank                  13G    34K    13G    0%        /tank
tank/fs               13G    33K    13G    0%        /tank/fs
/dev/zvol/tank/vol    9.7G   4.0K   8.9G   0%        /mnt/test
lcf:root:~# mkdir /tank/fs/foo
lcf:root:~# touch /tank/fs/foo/bar
lcf:root:~# chown root:operator /tank/fs/foo /tank/fs/foo/bar
lcf:root:~# chmod 500 /tank/fs/foo
lcf:root:~# ls -ld /tank/fs/foo /tank/fs/foo/bar
dr-x------  2 root  operator  3 22 sie 05:41 /tank/fs/foo
-rw-r--r--  1 root  operator  0 22 sie 05:42 /tank/fs/foo/bar

The most important missing pieces: - Most of the ZPL layer. - Autoconfiguration. I need to implement vdev discovery based on GEOM's taste mechanism. - The .zfs/ control directory (entirely commented out for now). And many more, but hey, this is after 10 days of work. PS. Please contact me privately if your company would like to donate to the ZFS effort. Even without sponsorship the work will be finished, but your contributions will allow me to spend more time working on ZFS.
Re: [zfs-discuss] Porting ZFS file system to FreeBSD.
On Tue, Aug 22, 2006 at 12:22:44PM +0100, Dick Davies wrote: This is fantastic work! How long have you been at it? As I said, 10 days, but this is really far from being finished.
[zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD.
On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote: I don't know much about ZFS, but Sun states this is a 128-bit filesystem. How will you handle this with regard to the FreeBSD kernel interface, which is already struggling to be 64-bit compliant? (I'm stating this based on this URL [1], but maybe it's not fully up-to-date.) 128 bits is not my goal, but I do want all the other goodies :)