Re: [zfs-discuss] benefits of zfs root over ufs root
On 03/31/10 17:53, Erik Trimble wrote:
> Brett wrote:
>> Hi Folks, I'm in a shop that's very resistant to change. The management here
>> are looking for major justification of a move away from ufs to zfs for root
>> file systems. Does anyone know if there are any whitepapers/blogs/discussions
>> extolling the benefits of ZFS root over UFS root? Regards in advance, Rep
>
> I can't give you any links, but here's a short list of advantages:
>
> (1) all the standard ZFS advantages over UFS
> (2) LiveUpgrade/beadm related improvements:
>     (a) much faster on ZFS
>     (b) no dedicated slice needed per OS instance, so it's far simpler to have
>         N different OS installs
>     (c) very easy to keep track of which OS instance is installed where WITHOUT
>         having to mount each one
>     (d) huge space savings (snapshots save lots of space on upgrades)
> (3) much more flexible swap space allocation (no hard-boundary slices)
> (4) simpler layout of filesystem partitions, and more flexibility in changing
>     directory size limits (e.g. /var)
> (5) mirroring a boot disk is simple under ZFS - much more complex under SVM/UFS
> (6) root-pool snapshots make backups trivially easy

ZFS root will be the supported root filesystem for Solaris Next; we've been
using it for OpenSolaris for a couple of years.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
bart.smaald...@oracle.com       http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
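As a sketch of points (2) and (6) above, the workflows boil down to a few
commands (the BE, pool, and host names here are placeholders, not from the
original post):

    # beadm create testbe                # clone the running boot environment
    # beadm activate testbe              # boot into the clone on next reboot
    # beadm list                         # see every OS instance without mounting it
    # zfs snapshot -r rpool@backup       # snapshot the entire root pool
    # zfs send -R rpool@backup | ssh backuphost 'cat > rpool.zsend'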
Re: [zfs-discuss] zfs diff
On 03/29/10 16:44, Mike Gerdts wrote:
> On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams nicolas.willi...@sun.com wrote:
>> One really good use for zfs diff would be: as a way to index zfs send
>> backups by contents.
>
> Or to generate the list of files for incremental backups via NetBackup or
> similar. This is especially important for file systems with millions of
> files and relatively few changes. Or to, say, keep indexing files on your
> desktop.

This gives everyone a way to access the changes in a filesystem in
O(number of files changed) instead of O(number of files extant).

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
bart.smaald...@oracle.com       http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
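For reference, the usage under discussion looks roughly like this (snapshot
and path names are hypothetical; the output format is illustrative - M for
modified, + added, - removed, R renamed):

    # zfs diff tank/home@monday tank/home@tuesday
    M       /tank/home/jeff/report.txt
    +       /tank/home/jeff/scratch/new.c
    -       /tank/home/jeff/old.log
    R       /tank/home/jeff/a.txt -> /tank/home/jeff/b.txt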
Re: [zfs-discuss] snv_133 - high cpu
On 02/23/10 15:20, Chris Ridd wrote:
> On 23 Feb 2010, at 19:53, Bruno Sousa wrote:
>> The system becomes really slow during the data copy using the network, but
>> if I copy data between 2 pools on the box I don't notice that issue, so I
>> may be hitting some sort of interrupt conflict in the network cards... This
>> system is configured with a lot of interfaces: 4 internal Broadcom gigabit;
>> 1 PCIe 4x Intel Dual Pro gigabit; 1 PCIe 4x Intel 10GbE card; 2 PCIe 8x Sun
>> non-RAID HBAs. With all of this, is there any way to check if there is
>> indeed an interrupt conflict or some other type of conflict that leads to
>> this high load? I also noticed some messages about ACPI... can ACPI also
>> affect the performance of the system?
>
> To see what interrupts are being shared:
>
> # echo "::interrupts -d" | mdb -k
>
> Running intrstat might also be interesting.
>
> Cheers, Chris

Is this using the mpt driver? There's an issue w/ the fix for 6863127 that
causes performance problems on larger memory machines, filed as 6908360.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
ba...@cyber.eng.sun.com         http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
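intrstat(1M) shows per-CPU time spent in each device's interrupt handler, so
a shared-interrupt hot spot stands out immediately. A run looks roughly like
this (device names and figures are illustrative, not from Bruno's box):

    # intrstat 5
          device |      cpu0 %tim      cpu1 %tim
    -------------+------------------------------
           bnx#0 |       234  1.2         0  0.0
          nxge#0 |     18755 42.7         3  0.1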
Re: [zfs-discuss] snv_133 - high cpu
On 02/24/10 12:57, Bruno Sousa wrote:
> Yes, I'm using the mpt driver. In total this system has 3 HBAs: 1 internal
> (Dell PERC) and 2 Sun non-RAID HBAs. I'm also using multipath, but if I
> disable multipath I have pretty much the same results..
>
> Bruno

From what I understand, the fix is expected very soon; your performance is
getting killed by the over-aggressive use of bounce buffers...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
ba...@cyber.eng.sun.com         http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] zfs related google summer of code ideas - your vote
> I would really like to see a feature like 'zfs diff f...@snap1 f...@othersnap'
> that would report the paths of files that have either been added, deleted,
> or changed between snapshots. If this could be done at the ZFS level instead
> of the application level it would be very cool.

AFAIK, this is being actively developed, w/ a prototype working...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
ba...@cyber.eng.sun.com         http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Pause Solaris with ZFS compression busy by doing a cp?
Neil Perrin wrote:
>> I also noticed (perhaps by design) that a copy with compression off almost
>> instantly returns, but the writes continue LONG after the cp process claims
>> to be done. Is this normal?
>
> Yes, this is normal. Unless the application is doing synchronous writes
> (e.g. a DB) the file will be written to disk at the convenience of the FS.
> Most filesystems operate this way. It's too expensive to synchronously write
> out data, so it's batched up and written asynchronously.
>
>> Wouldn't closing the file ensure it was written to disk?
>
> No.
>
>> Is that tunable somewhere?
>
> No. For ZFS you can use sync(1M), which will force out all transactions for
> all files in the pool. That is expensive though.
>
> Neil.

Your application can call f[d]sync when it's done writing the file and before
it does the close if it wants all the data on disk. This has been standard
operating procedure for many, many years. From TFMP:

DESCRIPTION
    The fsync() function moves all modified data and attributes of the file
    descriptor fildes to a storage device. When fsync() returns, all in-memory
    modified copies of buffers associated with fildes have been written to the
    physical medium. The fsync() function is different from sync(), which
    schedules disk I/O for all files but returns before the I/O completes. The
    fsync() function forces all outstanding data operations to synchronized
    file integrity completion (see fcntl.h(3HEAD) definition of O_SYNC.)
    ...
USAGE
    The fsync() function should be used by applications that require that a
    file be in a known state. For example, an application that contains a
    simple transaction facility might use fsync() to ensure that all changes
    to a file or files caused by a given transaction were recorded on a
    storage medium.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
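A minimal C sketch of the write-then-fsync-then-close pattern described above
(the filename is a placeholder; error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *buf = "important data\n";
            int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0) {
                    perror("open");
                    exit(1);
            }
            if (write(fd, buf, strlen(buf)) < 0) {
                    perror("write");
                    exit(1);
            }
            /* Force data and attributes to stable storage before close. */
            if (fsync(fd) < 0) {
                    perror("fsync");
                    exit(1);
            }
            (void) close(fd);
            return (0);
    }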
Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann wrote:
> | There are two issues here. One is the number of pools, but the other
> | is the small amount of RAM in the server. To be honest, most laptops
> | today come with 2 GBytes, and most servers are in the 8-16 GByte range
> | (hmmm... I suppose I could look up the average size we sell...)
>
> Speaking as a sysadmin (and a Sun customer), why on earth would I have to
> provision 8 GB+ of RAM on my NFS fileservers? I would much rather have that
> memory in the NFS client machines, where it can actually be put to work by
> user programs.

This depends entirely on the amount of disk & CPU on the fileserver... A
Thumper w/ 48 TB of disk and two dual-core CPUs is prob. somewhat
under-provisioned w/ 8 GB of RAM.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Repairing known bad disk blocks before zfs encounters them
Richard Elling wrote:
> David wrote:
>> I have some code that implements background media scanning, so I am able to
>> detect bad blocks well before zfs encounters them. I need a script or
>> something that will map the known bad block(s) to a logical block so I can
>> force zfs to repair the bad block from redundant/parity data. I can't find
>> anything that isn't part of a draconian scanning/repair mechanism. Granted,
>> the zfs architecture can map physical block X to logical block Y, Z, and
>> other letters of the alphabet... but I want to go backwards. 2nd part of
>> the question: assuming I know /dev/dsk/c0t0d0 has an ECC error on block n,
>> and I now have the appropriate storage pool info offset that corresponds to
>> that block, how do I force the file system to repair the offending block?
>> This was easy to address in Linux, assuming the filesystem was built on the
>> /dev/md driver, because all I had to do was force a read and twiddle with
>> the parameters to force a non-cached I/O and subsequent repair.
>
> Just read it.
> -- richard
>
>> It seems as if ZFS is too smart for its own good and won't let me fix
>> something that I know is bad before ZFS has a chance to discover it for
>> itself. :)

I think what the OP was saying is that he somehow knows that an unallocated
block on the disk is bad, and he'd like to tell ZFS about it ahead of time.
But "repair" implies there's data to read on the disk; ZFS won't read disk
blocks it didn't write.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] ZFS performance lower than expected
Bart Van Assche wrote:
> Hello,
>
> I just made a setup in our lab which should make ZFS fly, but unfortunately
> performance is significantly lower than expected: for large sequential data
> transfers write speed is about 50 MB/s, while I was expecting at least
> 150 MB/s.
>
> Setup
> -----
> The setup consists of five servers in total: one OpenSolaris ZFS server and
> four SAN servers. ZFS accesses the SAN servers via iSCSI and IPoIB.
>
> * ZFS server: OpenSolaris build 78; two Intel Xeon CPUs, eight cores in
>   total; 16 GB RAM; local disks not relevant for this test.
>
> * SAN servers: Linux 2.6.22.18 kernel, 64-bit, plus iSCSI Enterprise Target
>   (IET), configured to perform both read and write caching. Intel Xeon E5310
>   CPU, 1.60 GHz, four cores in total. RAM: two servers with 8 GB, one with
>   4 GB, one with 2 GB. Disks: 16 in total - two with the Linux OS and 14 set
>   up in RAID-0 via LVM. The LVM volume is exported via iSCSI and used by ZFS.
>   These SAN servers give excellent performance results when accessed via
>   Linux's open-iscsi initiator.
>
> * Network: 4x SDR InfiniBand; the raw transfer speed of this network is
>   8 Gbit/s. Netperf reports 1.6 Gbit/s between the ZFS server and one SAN
>   server (IPoIB, single-threaded). iSCSI transfer speed between the ZFS
>   server and one SAN server is about 150 MB/s.
>
> Performance test
> ----------------
> Software: xdd (see also http://www.ioperformance.com/products.htm). I
> modified xdd such that the -dio command line option enables O_RSYNC and
> O_DSYNC in open() instead of calling directio(). Test command:
>
>   xdd -verbose -processlock -dio -op write -targets 1 testfile -reqsize 1 \
>       -blocksize $((2**20)) -mbytes 1000 -passes 3
>
> This test command triggers synchronous writes with a block size of 1 MB
> (verified with truss). I am using synchronous writes because these give the
> same performance results as very large buffered writes (large compared to
> ZFS' cache size). Write performance reported by xdd for synchronous
> sequential writes: 50 MB/s, which is lower than expected.
>
> Any help with improving the performance of this setup is highly appreciated.
>
> Bart Van Assche.

If I understand this correctly, you've striped the disks together w/ Linux
LVM, then exported a single iSCSI volume to ZFS (or two for mirroring; which
isn't clear). I don't know how many concurrent IOs Solaris thinks your iSCSI
volumes will handle, but that's one area to examine. The only way to realize
full performance is going to be to get ZFS to issue multiple IOs to the iSCSI
boxes at once.

I'd also suggest just exporting the raw disks to ZFS, and having it do the
striping. On 4 commodity 500 GB SATA drives set up w/ RAID-Z, my 2.6 GHz
dual-core AMD box sustains 100+ MB/sec read or write; it happily saturates a
GbE NIC w/ multiple concurrent reads over Samba. W/ 16 drives direct attached
you should see close to 500 MB/sec sustained IO throughput.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
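If the SAN servers exported their 14 data disks individually, handing the
striping to ZFS might look something like this (the device names are
placeholders for the iSCSI LUNs as they appear on the ZFS server):

    # zpool create tank \
          raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
          raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0
    # zpool iostat -v tank 5      # watch per-vdev throughput during the test

With the LUNs visible as separate vdevs, ZFS can keep many IOs outstanding to
each SAN box at once instead of funneling everything through one queue.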
Re: [zfs-discuss] ZFS partition info makes my system not boot
Frank Bottone wrote:
> I'm using the latest build of OpenSolaris Express available from
> opensolaris.org. I had no problems with the install (it's an AMD64 x2 3800+,
> 1GB physical RAM, 1 IDE drive for the OS and 4x 250GB SATA drives attached
> to the motherboard - nForce based chipset).
>
> I create a zfs pool on the 4 SATA drives as a raidz and the pool works fine.
> If I reboot with any of the 4 drives connected, the system hangs right after
> all the drives are detected on the POST screen. I need to put them in a
> different system and zero them with dd in order to be able to reconnect them
> to my server and still have the system boot properly.
>
> Any ideas on how I can get around this? It seems like the onboard system
> itself is getting confused by the metadata ZFS is adding to the drive. The
> system already has the latest available BIOS from the manufacturer - I'm not
> using any hardware raid of any sort.

This is likely the BIOS getting confused by the EFI label on the disks. Since
there's no newer BIOS available, there are two ways around this problem:

1) put a normal label on the disk and give zfs slice 2, or
2) don't have the BIOS do auto-detect on those drives.

Many BIOSes let you select None for the disk type; this will allow the system
to boot. Solaris has no problem finding the drives even w/o the BIOS's help...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
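Option 1 looks something like this in practice (device names are placeholders;
format -e offers the SMI/EFI choice under its label command):

    # format -e c2t1d0            # label -> select "SMI label", repeat per disk
    # zpool create tank raidz c2t1d0s2 c2t2d0s2 c2t3d0s2 c2t4d0s2

Slice 2 conventionally covers the whole disk, so little space is lost; the
trade-off is that ZFS only enables the drive write cache automatically when
it is given a whole disk rather than a slice.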
Re: [zfs-discuss] zfs 32bits
Brian D. Horn wrote:
> Take a look at CR 6634371. It's worse than you probably thought.

Actually, almost all of the problems noted in that bug are statistics.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote:
> Bart Smaalders [EMAIL PROTECTED] wrote:
>> UTF8 is the answer here. If you care about anything more than simple ascii
>> and you work in more than a single locale/encoding, use UTF8. You may not
>> understand the meaning of a filename, but at least you'll see the same
>> characters as the person who wrote it.
>
> I think you are a bit confused. A) If you meant that _I_ should use UTF-8
> then that alone won't help. Let's say the person who created the file used
> ISO-8859-1 and named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when
> displaying the filename, my program will be faced with the problem of what
> to do with the second byte, 0xe4, which can't be decoded using UTF-8.
> ('häst' is 0x68c3a47374 in UTF-8, in case someone wonders.)

What I mean is very simple: the OS has no way of merging your various
encodings. If I create a directory and have people from around the world
create a file in that directory named after themselves in their own character
sets, what should I see when I invoke

    % ls -l | less

in that directory? If you wish to share filenames across locales, I suggest
you and everyone else writing to that directory use an encoding that will
work across all those locales. The encoding that works well for this on Unix
systems is UTF8, since it leaves '/' and NUL alone.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Problems replacing a failed drive.
Michael Stalnaker wrote:
> I have a 24 disk SATA array running on Open Solaris Nevada, b78. We had a
> drive fail, and I've replaced the device but can't get the system to
> recognize that I replaced the drive. zpool status -v shows the failed drive:
>
> [EMAIL PROTECTED] ~]$ zpool status -v
>   pool: LogData
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>  scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         LogData      DEGRADED     0     0     0
>           raidz2     DEGRADED     0     0     0
>             c0t12d0  ONLINE       0     0     0
>             c0t5d0   ONLINE       0     0     0
>             c0t0d0   ONLINE       0     0     0
>             c0t4d0   ONLINE       0     0     0
>             c0t8d0   ONLINE       0     0     0
>             c0t16d0  ONLINE       0     0     0
>             c0t20d0  ONLINE       0     0     0
>             c0t1d0   ONLINE       0     0     0
>             c0t9d0   ONLINE       0     0     0
>             c0t13d0  ONLINE       0     0     0
>             c0t17d0  ONLINE       0     0     0
>             c0t20d0  FAULTED      0     0     0  too many errors
>             c0t2d0   ONLINE       0     0     0
>             c0t6d0   ONLINE       0     0     0
>             c0t10d0  ONLINE       0     0     0
>             c0t14d0  ONLINE       0     0     0
>             c0t18d0  ONLINE       0     0     0
>             c0t22d0  ONLINE       0     0     0
>             c0t3d0   ONLINE       0     0     0
>             c0t7d0   ONLINE       0     0     0
>             c0t11d0  ONLINE       0     0     0
>             c0t15d0  ONLINE       0     0     0
>             c0t19d0  ONLINE       0     0     0
>             c0t23d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> I tried doing a zpool clear with no luck:
>
> [EMAIL PROTECTED] ~]# zpool clear LogData c0t20d0
> [EMAIL PROTECTED] ~]# zpool status -v
>   pool: LogData
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>  scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         LogData      DEGRADED     0     0     0
>           raidz2     DEGRADED     0     0     0
>             c0t12d0  ONLINE       0     0     0
>             c0t5d0   ONLINE       0     0     0
>             c0t0d0   ONLINE       0     0     0
>             c0t4d0   ONLINE       0     0     0
>             c0t8d0   ONLINE       0     0     0
>             c0t16d0  ONLINE       0     0     0
>             c0t20d0  ONLINE       0     0     0
>             c0t1d0   ONLINE       0     0     0
>             c0t9d0   ONLINE       0     0     0
>             c0t13d0  ONLINE       0     0     0
>             c0t17d0  ONLINE       0     0     0
>             c0t20d0  FAULTED      0     0     0  too many errors
>             c0t2d0   ONLINE       0     0     0
>             c0t6d0   ONLINE       0     0     0
>             c0t10d0  ONLINE       0     0     0
>             c0t14d0  ONLINE       0     0     0
>             c0t18d0  ONLINE       0     0     0
>             c0t22d0  ONLINE       0     0     0
>             c0t3d0   ONLINE       0     0     0
>             c0t7d0   ONLINE       0     0     0
>
> And I've tried zpool replace:
>
> [EMAIL PROTECTED] ~]# zpool replace -f LogData c0t20d0
> invalid vdev specification
> the following errors must be manually repaired:
> /dev/dsk/c0t20d0s0 is part of active ZFS pool LogData. Please see zpool(1M).
>
> So.. What am I missing here folks? Any help would be appreciated.

Did you pull out the old drive and add a new one in its place hot? What does
cfgadm -al report? Your drives should look like this:

    sata0/0::dsk/c7t0d0      disk      connected    configured   ok
    sata0/1::dsk/c7t1d0      disk      connected    configured   ok
    sata1/0::dsk/c8t0d0      disk      connected    configured   ok
    sata1/1::dsk/c8t1d0      disk      connected    configured   ok

If c0t20d0 isn't configured, use

    # cfgadm -c configure sata1/1::dsk/c0t20d0

before attempting the zpool replace.

hth,

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote:
> I'm unable to find more info about this. E.g., what does "reject file names"
> mean in practice? E.g., if a program tries to create a file using an
> utf8-incompatible filename, what happens? Does the fopen() fail? Would this
> normally be a problem? E.g., do tar and similar programs convert
> utf8-incompatible filenames to utf8 upon extraction if my locale (or
> wherever the fs encoding is taken from) is set to use utf-8? If they don't,
> then what happens with archives containing utf8-incompatible filenames?

Note that the normal ZFS behavior is exactly what you'd expect: you get the
same filenames back that you put in. The trick is that in order to support
such things as casesensitivity=false for CIFS, the OS needs to know what
characters are uppercase vs lowercase, which means it needs to know about
encodings, and reject codepoints which cannot be classified as uppercase vs
lowercase. If you're not running a CIFS server, the defaults will allow you
to create files w/ utf8 names very happily:

    : [EMAIL PROTECTED]; cat test
    Τη γλώσσα μου έδωσαν ελληνική
    : [EMAIL PROTECTED]; cat `cat test`
    this is a test w/ a utf8 filename
    : [EMAIL PROTECTED]; ls -l
    total 10
    -rw-r--r--   1 barts    staff         37 Oct 22 15:45 Makefile
    -rw-r--r--   1 barts    staff          0 Oct 22 15:46 bar
    -rw-r--r--   1 barts    staff          0 Oct 22 15:46 foo
    -rw-r--r--   1 barts    staff         55 Feb 27 19:45 test
    -rw-r--r--   1 barts    staff        301 Feb 27 19:44 test~
    -rw-r--r--   1 barts    staff         34 Feb 27 19:46 Τη γλώσσα μου έδωσαν ελληνική
    : [EMAIL PROTECTED]; df -h .
    Filesystem             size   used  avail capacity  Mounted on
    zfs/home               228G   136G    48G    74%    /export/home/cyber
    : [EMAIL PROTECTED];

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
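The knobs in question are per-dataset properties set at creation time; a
sketch (the dataset name is a placeholder):

    # zfs create -o utf8only=on -o normalization=formD \
          -o casesensitivity=insensitive tank/cifs-share

With utf8only=on, an attempt to create a name that isn't valid UTF-8 fails
with EILSEQ ("Illegal byte sequence"), which is the error the application's
open() or fopen() sees. With the default utf8only=off, any byte string free
of '/' and NUL is accepted, so tar extractions of odd archives just work.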
Re: [zfs-discuss] Possible interest for ZFS encryption
Ian Collins wrote:
> Disk encryption easily defeated, research shows
> http://www.itpro.co.uk/storage/news/170304/disk-encryption-easily-defeated-research-shows.html
>
> Freezing RAM, whatever next?
>
> Ian

Interesting... although not leaving systems suspended to RAM, and zeroing RAM
on shutdown, would seem to be simple-to-implement safeguards. Yes, if someone
steals the laptop while you're using it you've got problems :-)

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Plans for swapping to part of a pool
Lori Alt wrote:
> Darren J Moffat wrote:
>> As part of the ARC inception review for ZFS crypto we were asked to follow
>> up on PSARC/2006/370, which indicates that swap & dump will be done using a
>> means other than a ZVOL. Currently I have the ZFS crypto project allowing
>> for ephemeral keys to support using a ZVOL as a swap device. Since it seems
>> that we won't be swapping on ZVOLs, I need to find out more about how we
>> will be providing swap and dump space in a root pool.
>
> The current plan is to provide what we're calling (for lack of a better
> term; I'm open to suggestions) a "pseudo-zvol". It's preallocated space
> within the pool, logically concatenated by a driver to appear like a disk or
> a slice. It's meant to be a low-overhead way to emulate a slice within a
> pool, so no COW or related zfs features are provided, except for the ability
> to change its size without having to re-partition a disk. A pseudo-zvol will
> support both swap and dump.
>
> It will also be possible to use a slice for swapping, just as is done now
> with ufs roots. But we're hoping that the overhead of a pseudo-zvol will be
> low enough that administrators will take advantage of it to simplify
> installation (it allows a user to dedicate an entire disk to a root pool,
> without having to carve out part of it for swapping.) Eventually, swapping
> on true zvols might be supported (the problems with swapping to zvols are
> considered bugs), but fixing those bugs is a bigger task than we want to
> take on for the zfs-boot project. We decided on pseudo-zvols as a lower-risk
> approach for the time being.
>
>> I suspect that the best answer to encrypted swap is that we do it
>> independently of which filesystem/device is being used as the swap device -
>> i.e. do it inside the VM system.
>
> Treat a pseudo-zvol like you would a slice.

So these new zvol-like things don't support snapshots, etc., right? I take it
they work by allowing overwriting of the data, correct? Are these a "zslice"?

(An aside: for those of us who've been swapping to zvols for some time, can
you describe the failure modes?)

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
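For reference, the zvol-swap practice the aside refers to looks like this
(pool and volume names, and the size, are placeholders):

    # zfs create -V 2g rpool/swapvol
    # swap -a /dev/zvol/dsk/rpool/swapvol
    # swap -l                              # verify the new swap device

Dump is the piece this thread notes wasn't solved yet; swap on a zvol works
today modulo the bugs Lori mentions.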
Re: [zfs-discuss] ZFS raid is very slow???
Orvar Korvar wrote:
> When I copy that file from ZFS to /dev/null I get this output:
>
>     real    0m0.025s
>     user    0m0.002s
>     sys     0m0.007s
>
> which can't be correct. Is it wrong of me to use "time cp fil fil2" when
> measuring disk performance?

(replying to just this part of your message for now)

cp opens the source file, mmaps it, opens the target file, and does a single
write of the entire file contents. /dev/null's write routine doesn't actually
do a copy into the kernel; it just returns success. As a result, the source
file is never paged into memory.

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
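To actually measure read throughput, force the data through a read() loop
instead of the mmap-to-null shortcut (filename and block size are arbitrary):

    % /bin/ptime dd if=fil of=/dev/null bs=1024k

Note that on a second run the file may come from the ARC rather than the
disks, so use a freshly written file, or one larger than RAM, to measure the
disks themselves.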
Re: [zfs-discuss] Slow write speed to ZFS pool (via NFS)
Joe S wrote:
> I have a couple of performance questions. Right now, I am transferring about
> 200GB of data via NFS to my new Solaris server. I started this YESTERDAY.
> When writing to my ZFS pool via NFS, I notice what I believe to be slow
> write speeds. My client hosts vary between a MacBook Pro running Tiger and a
> FreeBSD 6.2 Intel server. All clients are connected to a 10/100/1000 switch.
>
> * Is there anything I can tune on my server?
> * Is the problem with NFS?
> * Do I need to provide any other information?

If you have a lot of small files, doing this sort of thing over NFS can be
pretty painful... for a speedup, consider:

    (cd <oldroot on client>; tar cf - .) | \
        ssh [EMAIL PROTECTED] '(cd <newroot on server>; tar xf -)'

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
michael T sedwick wrote:
> Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an
> extreme amount of small (20K) random reads (more than twice as many reads
> as writes):
>
> 1) What performance gains, if any, does Z-Raid offer over other RAID or
>    large filesystem configurations?
> 2) What hindrance is Z-Raid to this configuration, given the complete
>    randomness and size of these accesses?
>
> Would there be a better means of configuring a ZFS environment for this type
> of activity?
>
> thanks;

A 6 disk raidz set is not optimal for random reads, since each disk in the
raidz set needs to be accessed to retrieve each item. Note that if the reads
are single threaded, this doesn't apply. However, if multiple reads are
extant at the same time, configuring the disks as 2 sets of 3-disk raidz
vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total
parallel random read throughput.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
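Expressed as pool layouts, the two alternatives above are (device names are
placeholders):

    # zpool create tank raidz c0d0 c1d0 c2d0 raidz c3d0 c4d0 c5d0
    # zpool create tank mirror c0d0 c1d0 mirror c2d0 c3d0 mirror c4d0 c5d0

ZFS stripes across top-level vdevs, and each top-level vdev can service an
independent random read; hence the approximate 2x and 3x figures.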
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
Ian Collins wrote:
> Bart Smaalders wrote:
>> michael T sedwick wrote:
>>> Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an
>>> extreme amount of small (20K) random reads (more than twice as many reads
>>> as writes): 1) What performance gains, if any, does Z-Raid offer over
>>> other RAID or large filesystem configurations? 2) What hindrance is Z-Raid
>>> to this configuration, given the complete randomness and size of these
>>> accesses? Would there be a better means of configuring a ZFS environment
>>> for this type of activity?
>>
>> A 6 disk raidz set is not optimal for random reads, since each disk in the
>> raidz set needs to be accessed to retrieve each item. Note that if the
>> reads are single threaded, this doesn't apply. However, if multiple reads
>> are extant at the same time, configuring the disks as 2 sets of 3-disk
>> raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx)
>> total parallel random read throughput.
>
> I'm not sure why, but when I was testing various configurations with
> bonnie++, 3 pairs of mirrors did give about 3x the random read performance
> of a 6 disk raidz, but with 4 pairs the random read performance dropped
> by 50%:
>
>     3x2    Block read: 220464    Random read: 1520.1
>     4x2    Block read: 295747    Random read:  765.3
>
> Ian

Interesting... I wonder if the blocks being read were striped across two
mirror pairs; this would result in having to read from 2 sets of mirror
pairs, which would produce the reported results...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Best use of 4 drives?
Ian Collins wrote:
> Rick Mann wrote:
>> BTW, I don't mind if the boot drive fails, because it will be fairly easy
>> to replace, and this server is only mission-critical to me and my friends.
>> So... suggestions? What's a good way to utilize the power and glory of ZFS
>> in a 4x 500 GB system, without unnecessary waste?
>
> Bung in a small boot drive (add a USB one if you don't have space) and use
> all the others for ZFS.
>
> Ian

This is how I run my home server w/ 4 500GB drives - a small 40GB IDE drive
provides root and the swap/dump device; the 4 500GB drives are a RAIDZ
containing all the data. I ran out of drive bays, so I used one of those
5 1/4" to 3.5" adaptor brackets to hang the boot drive where a second DVD
drive would go...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Best use of 4 drives?
Ian Collins wrote:
> Rick Mann wrote:
>> Ian Collins wrote:
>>> Bung in (add a USB one if you don't have space) a small boot drive and use
>>> all the others for ZFS.
>>
>> Not a bad idea; I'll have to see where I can put one. But I thought I read
>> somewhere that one can't use ZFS for swap. Or maybe I read this:
>
> I wouldn't bother; just spec the machine with enough RAM so swap's only real
> use is as a dump device. You can always use a swap file if you have to.
>
> Ian

If you compile stuff (like OpenSolaris), you'll want swap space. Esp. if you
use dmake; 30 parallel C++ compilations can use up a lot of RAM.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Thoughts on CF/SSDs [was: ZFS - Use h/w raid or not?Thoughts.Considerations.]
Frank Cusack wrote:
> On May 31, 2007 1:59:04 PM -0700 Richard Elling [EMAIL PROTECTED] wrote:
>> CF cards aren't generally very fast, so the solid state disk vendors are
>> putting them into hard disk form factors with SAS/SATA interfaces. These
>
> If CF cards aren't fast, how will putting them into a different form factor
> make them faster?

Well, if I were doing that I'd use DRAM and provide enough on-board
capacitance and a small processor to copy the contents of the DRAM to flash
on power failure.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: How does ZFS write data to disks?
Bill Moloney wrote:
> for example, doing sequential 1MB writes to a previously written zvol
> (simple catenation of 5 FC drives in a JBOD), writing 2GB of data induced
> more than 4GB of IO to the drives (with smaller write sizes this ratio gets
> progressively worse)

How did you measure this? This would imply that rewriting a zvol would be
limited to below 50% of disk bandwidth, not something I'm seeing at all.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
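One way to check such a ratio is to compare the bytes the application issued
against what the pool actually wrote (pool/volume names and sizes here are
placeholders):

    # zpool iostat tank 5 &
    # dd if=/dev/zero of=/dev/zvol/rdsk/tank/vol bs=1024k count=2048

Summing zpool iostat's write-bandwidth column over the run and dividing by
the 2 GB issued gives the inflation factor; iostat -xn on the member disks
answers the same question from the device side.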
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> I missed an important conclusion from j's data, and that is that single disk
> raw access gives him 56MB/s, and the RAID 0 array gives him 961/46=21MB/s
> per disk, which comes in at 38% of potential performance. That is in the
> ballpark of the 45% of potential performance I am seeing with my puny setup
> of single or dual drives. Of course, I don't expect a complex file system to
> match raw disk dd performance, but it doesn't compare favourably to common
> file systems like UFS or ext3, so the question remains: is ZFS overhead
> normally this big? That would mean one needs at least a 4-5 way stripe to
> generate enough data to saturate gigabit ethernet, compared to a 2-3 way
> stripe on a lesser filesystem - a possibly important consideration in a
> SOHO situation.

I don't see this on my system, but it has more CPU (dual core 2.6 GHz). It
saturates a GbE net w/ 4 drives & Samba, not working hard at all. A Thumper
does 2 GB/sec w/ 2 dual-core CPUs. Do you have compression enabled? This can
be a choke point for weak CPUs.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
Adam Leventhal wrote:
> On Wed, May 09, 2007 at 11:52:06AM +0100, Darren J Moffat wrote:
>> Can you give some more info on what these problems are?
>
> I was thinking of this bug:
>
>   6460622 zio_nowait() doesn't live up to its name
>
> which I was surprised to find was fixed by Eric in build 59.
>
> Adam

It was pointed out by Jürgen Keil that using ZFS compression submits a lot of
prio 60 tasks to the system task queues; this would clobber interactive
performance.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts

---BeginMessage---

With recent bits ZFS compression is now handled concurrently, with many CPUs
working on different records. So this load will burn more CPUs and achieve
its results (compression) faster. The observed pauses should therefore be
consistent with a load generating high system time. The assumption is that
compression now goes faster than when it was single threaded. Is this
undesirable? We might seek a way to slow down compression in order to limit
the system load.

According to this dtrace script

    #!/usr/sbin/dtrace -s

    sdt:genunix::taskq-enqueue
    /((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)`zio_write_compress/
    {
            @where[stack()] = count();
    }

    tick-5s
    {
            printa(@where);
            trunc(@where);
    }

... I see bursts of ~ 1000 zio_write_compress() [gzip] taskq calls enqueued
into the spa_zio_issue taskq by zfs`spa_sync() and its children:

      0  76337                         :tick-5s
    ...
                  zfs`zio_next_stage+0xa1
                  zfs`zio_wait_for_children+0x5d
                  zfs`zio_wait_children_ready+0x20
                  zfs`zio_next_stage_async+0xbb
                  zfs`zio_nowait+0x11
                  zfs`dbuf_sync_leaf+0x1b3
                  zfs`dbuf_sync_list+0x51
                  zfs`dbuf_sync_indirect+0xcd
                  zfs`dbuf_sync_list+0x5e
                  zfs`dbuf_sync_indirect+0xcd
                  zfs`dbuf_sync_list+0x5e
                  zfs`dnode_sync+0x214
                  zfs`dmu_objset_sync_dnodes+0x55
                  zfs`dmu_objset_sync+0x13d
                  zfs`dsl_dataset_sync+0x42
                  zfs`dsl_pool_sync+0xb5
                  zfs`spa_sync+0x1c5
                  zfs`txg_sync_thread+0x19a
                  unix`thread_start+0x8
                 1092

      0  76337                         :tick-5s

It seems that after such a batch of compress requests is submitted to the
spa_zio_issue taskq, the kernel is busy for several seconds working on these
taskq entries. It seems that this blocks all other taskq activity inside the
kernel...

This dtrace script counts the number of zio_write_compress() calls enqueued /
execed by the kernel per second:

    #!/usr/sbin/dtrace -qs

    sdt:genunix::taskq-enqueue
    /((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)`zio_write_compress/
    {
            this->tqe = (taskq_ent_t *)arg1;
            @enq[this->tqe->tqent_func] = count();
    }

    sdt:genunix::taskq-exec-end
    /((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)`zio_write_compress/
    {
            this->tqe = (taskq_ent_t *)arg1;
            @exec[this->tqe->tqent_func] = count();
    }

    tick-1s
    {
            /* printf("%Y\n", walltimestamp); */
            printf("TS(sec): %u\n", timestamp / 1000000000);
            printa("enqueue %a: %@u\n", @enq);
            printa("exec    %a: %@u\n", @exec);
            trunc(@enq);
            trunc(@exec);
    }

I see bursts of zio_write_compress() calls enqueued / execed, and periods of
time where no zio_write_compress() taskq calls are enqueued or execed.

    10# ~jk/src/dtrace/zpool_gzip7.d
    TS(sec): 7829
    TS(sec): 7830
    TS(sec): 7831
    TS(sec): 7832
    TS(sec): 7833
    TS(sec): 7834
    TS(sec): 7835
    enqueue zfs`zio_write_compress: 1330
    exec    zfs`zio_write_compress: 1330
    TS(sec): 7836
    TS(sec): 7837
    TS(sec): 7838
    TS(sec): 7839
    TS(sec): 7840
    TS(sec): 7841
    TS(sec): 7842
    TS(sec): 7843
    TS(sec): 7844
    enqueue zfs`zio_write_compress: 1116
    exec    zfs`zio_write_compress: 1116
    TS(sec): 7845
    TS(sec): 7846
    TS(sec): 7847
    TS(sec): 7848
    TS(sec): 7849
    TS(sec): 7850
    TS(sec): 7851
    TS(sec): 7852
    TS(sec): 7853
    TS(sec): 7854
    TS(sec): 7855
    TS(sec): 7856
    TS(sec): 7857
    enqueue zfs`zio_write_compress: 932
    exec    zfs`zio_write_compress: 932
    TS(sec): 7858
    TS(sec): 7859
    TS(sec): 7860
    TS(sec): 7861
    TS(sec): 7862
    TS(sec): 7863
    TS(sec): 7864
    TS(sec): 7865
    TS(sec): 7866
    TS(sec): 7867
    enqueue zfs`zio_write_compress: 5
    exec    zfs`zio_write_compress: 5
    TS(sec): 7868
    enqueue zfs`zio_write_compress: 774
    exec    zfs`zio_write_compress: 774
    TS(sec): 7869
    TS(sec): 7870
    TS(sec): 7871
    TS(sec): 7872
    TS(sec): 7873
    TS(sec): 7874
    TS(sec): 7875
    TS(sec): 7876
    enqueue zfs`zio_write_compress: 653
    exec    zfs`zio_write_compress: 653
    TS(sec): 7877
    TS(sec): 7878
    TS(sec): 7879
    TS(sec): 7880
    TS(sec): 7881

And a final dtrace script, which monitors scheduler activity while filling a
gzip compressed pool:

    #!/usr/sbin/dtrace -qs

    sched:::off-cpu,
    sched:::on-cpu,
    sched:::remain-cpu,
    sched:::preempt
Re: [zfs-discuss] Force rewriting of all data, to push stripes onto newly added devices?
Mario Goebbels wrote:
> I'm in sort of a scenario where I've added devices to a pool and would now
> like the existing data to be spread across the new drives, to increase the
> performance. Is there a way to do it, like a scrub? Or would I have to have
> all files copy over themselves, or similar hacks?
>
> Thanks, -mg

This requires rewriting the block pointers; it's the same problem as
supporting vdev removal. I would guess that they'll be solved at the same
time.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
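Until then, the usual workaround really is the copy-over hack Mario mentions:
for example, recreating a dataset via send/receive so everything is rewritten
across the current set of vdevs (dataset and snapshot names are placeholders):

    # zfs snapshot tank/data@respread
    # zfs send tank/data@respread | zfs receive tank/data.new
    # zfs rename tank/data tank/data.old
    # zfs rename tank/data.new tank/data

This needs as much free space as the dataset occupies, and misses any writes
made after the snapshot.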
Re: [zfs-discuss] gzip compression throttles system?
Ian Collins wrote:
> I just had a quick play with gzip compression on a filesystem, and the
> result was the machine grinding to a halt while copying some large (.wav)
> files to it from another filesystem in the same pool. The system became very
> unresponsive, taking several seconds to echo keystrokes. The box is a
> maxed-out AMD QuadFX, so it should have plenty of grunt for this. Comments?
>
> Ian

How big were the files, what OS build are you running, and how much memory is
on the system? Were you copying in parallel?

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Hi folks. I'm looking at putting together a 16-disk ZFS array as a server,
> and after reading Richard Elling's writings on the matter, I'm now left
> wondering if it'll have the performance we expect of such a server. Looking
> at his figures, 5x 3-disk RAIDZ sets seems like it *might* be made to do
> what we want (saturate a GigE link), but not without some tuning.
>
> Am I right in my understanding of relling's small, random read model? For
> mirrored configurations, read performance is proportional to the number of
> disks; write performance is proportional to the number of mirror sets. For
> parity configurations, read performance is proportional to the number of
> RAID sets; write performance is roughly the same.
>
> Clearly, there are elements of the model that don't apply to our sustained
> read/writes, so does anyone have any guidance (theoretical or empirical) on
> what we could expect in that arena? I've seen some references to a different
> ZFS mode of operation for sustained and/or contiguous transfers. What should
> I know about them?
>
> Finally, some requirements I have in speccing up this server:
>
> My requirements:
> . Saturate a 1GigE link for sustained reads _and_ writes
>   (long story... let's just imagine uncompressed HD video)
> . Do it cheaply
>
> My strong desires:
> . ZFS for its reliability, redundancy, flexibility, and ease of use
> . Maximise the amount of usable space
>
> My resources:
> . a server with 16x 500GB SATA drives usable for RAID

What you need to know is what part of your workload is random reads; this
will directly determine the number of spindles required. Otherwise, if your
workload is sequential reads or writes, you can pretty much just use an
average value for disk throughput; with your drives and adequate CPU, you'll
have absolutely no problems _melting_ a 1GbE net.

You want to think about how many disk failures you want to handle before
things go south... there's always a tension between reliability, storage, and
performance. Consider 2 striped sets of raidz2 drives - w/ 6+2 drives in each
set, you get 12 drives' worth of streaming IO (read or write). That will be
about 500 MB/sec, rather more than you can get through a 1GbE net. That's the
aggregate bandwidth; you should be able to both sink and source data at
1Gb/sec w/o any difficulties at all.

If you do a lot of random reads, however, that config will behave like 2
disks in terms of IOPs. To do lots of IOPs, you want to be striped across
lots of 2-disk mirror pairs. My guess is that if you're doing video, you're
doing lots of streaming IO (e.g. you may be reading 20 files at once, but
those files are all being read sequentially). If that's the case, ZFS can do
lots of clever prefetching; on the write side, ZFS, due to its COW behavior,
will handle both random and sequential writes pretty much the same way.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
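The suggested layout as a pool-creation sketch (controller/target names are
placeholders):

    # zpool create tank \
          raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
          raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

16 drives, 12 of them data: any two disks per set can fail, usable space is
12 x 500GB (roughly 6TB), and streaming bandwidth scales with the 12 data
spindles.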
Re: [zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Okay, the way you say it, it sounds like a good thing. I misunderstood the
> performance ramifications of COW and ZFS's opportunistic write locations,
> and came up with a much more pessimistic guess that it would approach random
> writes. As it is, I have upper (number of data spindles) and lower (number
> of disk sets) bounds to deal with. I suppose the available caching memory is
> what controls the resilience to the demands of random reads?

W/ that many drives (16), if you hit in RAM the reads are not really random
:-), or they span only a tiny fraction of the available disk space. Are you
reading and writing the same file at the same time? Your cache hit rate will
be much better then.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Bart Smaalders wrote:
>> Adam Lindsay wrote:
>>> Okay, the way you say it, it sounds like a good thing. I misunderstood the
>>> performance ramifications of COW and ZFS's opportunistic write locations,
>>> and came up with a much more pessimistic guess that it would approach
>>> random writes. As it is, I have upper (number of data spindles) and lower
>>> (number of disk sets) bounds to deal with. I suppose the available caching
>>> memory is what controls the resilience to the demands of random reads?
>>
>> W/ that many drives (16), if you hit in RAM the reads are not really random
>> :-), or they span only a tiny fraction of the available disk space.
>
> Clearly I hadn't thought that comment through. :) I think my mental model
> included imagined bottlenecks elsewhere in the system, but I haven't got to
> discussing those yet.

Hmmm... that _was_ prob. more opaque than necessary. What I meant was that
you've got something on the order of 5TB or better of disk space; assuming
uniformly distributed reads of data and 4 GB of RAM, the odds of hitting in
the cache are essentially zero wrt performance.

>> Are you reading and writing the same file at the same time? Your cache hit
>> rate will be much better then.
>
> Not in the general case. Hmm, but there are some scenarios with multimedia
> caching boxes, so that could be interesting to leverage eventually.
>
> Thanks, adam

You're welcome.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: CAD application not working with zfs
Dirk Jakobsmeier wrote:
> Hello Bart, thanks for your answer. The filesystems on different projects
> are sized between 20 and 400 GB. Those filesystem sizes were no problem on
> the earlier installation (vxfs) and should not be a problem now. I can
> reproduce this error with the 20 GB filesystem. Regards.

Are you using nfsv4 for the mount? Or nfsv3?

Some idea of the failing app's system calls just prior to failure may yield
the answer as to what's causing the problem. These problems are usually
mishandled error conditions...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
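Capturing those system calls is straightforward with truss, which exists on
both Solaris and AIX (the output path and application command here are
placeholders):

    # truss -f -o /tmp/catia.truss <command that starts catia>
    # tail -100 /tmp/catia.truss

The last few calls and their errnos before the crash usually point straight
at the mishandled error condition.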
Re: [zfs-discuss] CAD application not working with zfs
Dirk Jakobsmeier wrote:
> Hello, we use several CAD applications, and with one of those we have
> problems using zfs. OS and hardware are SunOS 5.10 Generic_118855-36 on a
> Fire X4200; the CAD application is Catia V4. There are several configuration
> and data files stored on the server and shared via nfs to Solaris and AIX
> clients. The application crashes on the AIX client unless the server is
> sharing those files from a ufs filesystem. Does anybody have any information
> on this?

What are the sizes of the filesystems being exported? Perhaps the AIX client
cannot cope w/ large filesystems?

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Is there any performance problem with hard links in ZFS?
Viktor Turskyi wrote:
> Is there any performance problem with hard links in ZFS? I have a large
> storage. There will be near 5 hard links for every file. Is it ok for ZFS?
> Maybe some problems with snapshots (every 30 minutes there will be a
> snapshot created)? What about the difference in speed while working with 5
> hardlinks vs 5 different files?
>
> ps: It would be very useful if you give me some links about hardlink
> low-level processing.

On my 2 GHz Opteron w/ 2 mirrored zfs disks:

cyber% cat test.c
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <strings.h>

int
main(int argc, char *argv[])
{
        int i;
        char *filename;
        char buffer[1024];

        if (argc != 2) {
                fprintf(stderr, "usage: %s filename\n", argv[0]);
                exit(1);
        }

        strcpy(buffer, argv[1]);
        filename = buffer + strlen(buffer);     /* point past the base name */

        for (i = 0; i < 50000; i++) {
                sprintf(filename, "_%d", i);    /* foo_0, foo_1, ... */
                if (link("foo", buffer) < 0) {
                        perror("link:");
                        exit(1);
                }
        }
        return (0);
}
cyber% ls
test    test.c
cyber% cc -o test test.c
cyber% mkfile 10k foo
cyber% /bin/ptime ./test foo

real        0.976
user        0.039
sys         0.936
cyber% ls | wc
   50003   50003  538906
cyber% /bin/ptime rm foo_*

real        1.869
user        0.110
sys         1.757
cyber%

So it takes just under 1 second to create 50,000 hardlinks to a file; it
takes just under 2 seconds to delete 'em w/ rm. It would prob. be faster to
use a program to delete them.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: update on zfs boot support
Richard Elling wrote:
> [sorry for the late reply, the original got stuck in the mail]
> clarification below...
>
> Ian Collins wrote:
>> Thanks for the heads up. I'm building a new file server at the moment and
>> I'd like to make sure I can migrate to ZFS boot when it arrives. My current
>> plan is to create a pool on 4 500GB drives and throw in a small boot drive.
>> Will I be able to drop the boot drive and move / over to the pool when ZFS
>> boot ships?
>>
>> Yes, you should be able to, given that you have already had a UFS boot
>> drive running root.
>>
>> Hi, however this raises another concern: during recent discussions
>> regarding the disk layout of a zfs system
>> (http://www.opensolaris.org/jive/thread.jspa?threadID=25679&tstart=0)
>> it was said that currently we'd better give zfs the whole device (rather
>> than slices) and keep swap off zfs devices for better performance. If the
>> above recommendation still holds, we still have to have a swap device
>> somewhere other than the devices managed by zfs. Is this limited by the
>> design or the implementation of zfs?
>
> We've updated the wiki to help clarify this confusion. The consensus best
> practice is to have enough RAM that you don't need to swap. If you need to
> swap, your life will be sad no matter what your disk config is. For those
> systems with limited numbers of disks, you really don't have much choice
> about where swap is located, so keep track of your swap *usage* and adjust
> the system accordingly.
> -- richard

One thing several of us want to do in Nevada is allocate swap space
transparently out of the root pool. Yes, there'd be reservations/allocations,
etc. All we need then is a way to have a dedicated dump device in the same
pool...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] X2200-M2
Jason J. W. Williams wrote:
> Hi Brian, to my understanding the X2100 M2 and X2200 M2 are basically the
> same board OEM'd from Quanta... except the X2200 M2 has two sockets. As to
> ZFS and their weirdness, it would seem to me that fixing it would be more an
> issue for the SATA/SCSI driver. I may be wrong here.

Actually, what has to happen is that we stop using the SATA chipset in IDE
compat mode and write proper SATA drivers for it... and manage the upgrade
issues, driver name changes, etc.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Efficiency when reading the same file blocks
Jeff Davis wrote:
>> Given your question, are you about to come back with a case where you are
>> not seeing this?
>
> Actually, the case where I saw the bad behavior was in Linux using the CFQ
> I/O scheduler. When reading the same file sequentially, adding processes
> drastically reduced total disk throughput (single-disk machine). Using the
> Linux anticipatory scheduler worked just fine: no additional I/O costs for
> more processes. That got me worried about the project I'm working on, and I
> wanted to understand ZFS's caching behavior better to prove to myself that
> the problem wouldn't happen under ZFS. Clearly the block will be in cache on
> the second read, but what I'd like to know is whether ZFS will ask the disk
> to do a long, efficient sequential read, or whether it will somehow not
> recognize that the read is sequential because the requests are coming from
> different processes.

ZFS has a pretty clever IO scheduler; it will handle multiple readers of the
same file, readers of different files, etc.; in each case prefetch is done
correctly. It also handles programs that skip blocks... You can see this
pretty simply: for small configs (where a single CPU can saturate all the
drives) the net throughput of the drives doesn't vary significantly whether
one is reading a single file or reading 10 files in parallel.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: ZFS failed Disk Rebuild time on x4500
> I've measured resync on some slow IDE disks (*not* an X4500) at an average
> of 20 MBytes/s. So if you have a 500 GByte drive, that would resync a 100%
> full file system in about 7 hours, versus 11 days for some other systems.

My experience is that a set of 80% full 250 GB drives took a bit less than 2
hours each to replace in a 4x raidz config. The majority of space used was
taken by large files (ISOs, music, and movie files (yes, I have teenagers)),
although there's a large number of small files as well. This makes for a
performance of a bit less than 40 MB/sec during resilvering. The system was
pretty sluggish during this operation, but it only had 1GB of RAM, half of
which firefox wanted :-/. This was build 55 of Nevada.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Peter Schuller wrote:
> Hello,
>
> Often fsync() is used not because one cares that some piece of data is on
> stable storage, but because one wants to ensure that subsequent I/O
> operations are performed after previous I/O operations are on stable
> storage. In these cases the latency introduced by an fsync() is completely
> unnecessary. An fbarrier() or similar would be extremely useful to get the
> proper semantics while still allowing for better performance than what you
> get with fsync().
>
> My assumption has been that this has not traditionally been implemented for
> reasons of implementation complexity. Given ZFS's copy-on-write
> transactional model, would it not be almost trivial to implement fbarrier()?
> Basically just choose to wrap up the transaction at the point of fbarrier()
> and that's it. Am I missing something?
>
> (I do not actually have a use case for this on ZFS, since my experience with
> ZFS is thus far limited to my home storage server. But I have wished for an
> fbarrier() many many times over the past few years...)

Hmmm... is store ordering what you're looking for? E.g., make sure that in
the case of power failure, all previous writes will be visible after reboot
if any subsequent writes are visible.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Re: Cheap ZFS homeserver.
Tom Buskey wrote: Tom Buskey wrote: As a followup, the system I'm trying to use this on is a dual PII 400 with 512MB. Real low budget. Hmm... that's lower than I would have expected. Something is likely wrong. These machines do have very limited memory. How fast can you dd from the raw device to /dev/null? Roughly 230Mb/s. Do you mean ~28MB/sec? Something is definitely bogus. What happens when you do dd from both drives at once? - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
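A sketch of the both-drives-at-once test (the device names are placeholders for your two drives; reading from the raw devices is non-destructive):

    # read the first 1 GB of each drive in parallel and see if throughput halves
    dd if=/dev/rdsk/c0d0p0 of=/dev/null bs=1024k count=1024 &
    dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=1024k count=1024 &
    wait

If the combined rate is no better than a single drive, the bottleneck is the controller or bus rather than ZFS.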
Re: [zfs-discuss] FYI: ZFS on USB sticks (from Germany)
Constantin Gonzalez wrote: Hi Richard, Richard Elling wrote: FYI, here is an interesting blog on using ZFS with a dozen USB drives from Constantin. http://blogs.sun.com/solarium/entry/solaris_zfs_auf_12_usb thank you for spotting it :). We're working on translating the video (hope we get the lip-syncing right...) and will then re-release it in an english version. BTW, we've now hosted the video on YouTube so it can be embedded in the blog. Of course, I'll then write an english version of the blog entry with the tech details. Please hang on for a week or two... :). Best regards, Constantin Brilliant video, guys. I particularly liked the fellow in the background with the hardhat and snow shovel :-). The USB stick machinations were pretty cool, too. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: can I use zfs on just a partition?
Tim Cook wrote: I guess I should clarify what I'm doing. Essentially I'd like to have the / and swap on the first 60GB of the disk. Then use the remaining 100GB as a zfs partition to set up zones on. Obviously the snapshots are extremely useful in such a setup :) Does my plan sound feasible from both a usability and performance standpoint? That's exactly how I'm running my Ferrari laptop. Works like a charm. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
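A minimal sketch of that layout (the slice, pool and dataset names are only examples): let the installer have the first slices for / and swap, then hand the leftover slice to ZFS for the zones:

    # create a pool on the remaining slice and a filesystem for zone roots
    zpool create tank c0t0d0s7
    zfs create tank/zones
    zfs set mountpoint=/zones tank/zones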
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for mass storage
Frank Cusack wrote: yes I am an experienced Solaris admin and know all about devfsadm :-) and the older disks command. It doesn't help in this case. I think it's a BIOS thing. Linux and Windows can't see IDE drives that aren't there at boot time either, and on Solaris the SATA controller runs in some legacy mode so I guess that's why you can't see the newly added drive. Unfortunately all my x2100 hardware is in production and I can't readily retest this to verify. -frank This is exactly the issue; some of the simple SATA controllers are run in PATA compatibility mode. The ide driver doesn't know a thing about hot anything, so we would need a proper SATA driver for these chips. Since they work (with the exception of hot *) it is difficult to prioritize this work above getting some other piece of hardware working under Solaris. In addition, switching drivers/BIOS configs during upgrade is a non-trivial exercise. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
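For what it's worth, on controllers that do run under a native SATA driver (an assumption about other hardware, not about the X2100's legacy-mode ports), a hot-added disk shows up as an attachment point that can be configured by hand; a rough sketch:

    cfgadm -al | grep sata      # list SATA attachment points, if the driver exposes any
    cfgadm -c configure sata0/3 # bring a newly inserted drive online (example ap_id)

With the ide driver in compatibility mode nothing like this appears, which is exactly the limitation described above.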
Re: [zfs-discuss] Re: ZFS direct IO
[EMAIL PROTECTED] wrote: In order to protect the user pages while a DIO is in progress, we want support from the VM that isn't presently implemented. To prevent a page from being accessed by another thread, we have to unmap the TLB/PTE entries and lock the page. There's a cost associated with this, as it may be necessary to cross-call other CPUs. Any thread that accesses the locked pages will block. While it's possible to lock pages in the VM today, there isn't a neat set of interfaces the filesystem can use to maintain the integrity of the user's buffers. Without an experimental prototype to verify the design, it's impossible to say whether the overhead of manipulating the page permissions is more than the cost of bypassing the cache. Note also that for most applications, the size of their IO operations would often not match the current page size of the buffer, causing additional performance and scalability issues. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X2100 not hotswap
Frank Cusack wrote: It's interesting the topics that come up here, which really have little to do with zfs. I guess it just shows how great zfs is. I mean, you would never have a ufs list that talked about the merits of sata vs sas and what hardware do i buy. Also interesting is that zfs exposes hardware bugs yet I don't think that's what really drives the hardware questions here. Actually, I think it's the easy admin of more than a simple mirror; all of a sudden it's simple to deal with multiple drives, add more later, etc... so connectivity to low end boxes becomes important. Also, of course, SATA is still relatively new and we don't yet have extensive controller support (understatement). - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on PC Based Hardware for NAS?
Elm, Rob wrote: Hello ZFS Discussion Members, I'm looking for help or advice on a project I'm working on: I have a 939 Gigabyte motherboard, with 4 SATAII ports on the nForce4 chipset, and 4 SATA ports off the SIL3114 controller. I recently purchased 5, 320gig SATAII drives... http://tinyurl.com/yf5z9o I wanted to install Solaris x86 on this machine and turn it into a NAS server. ZFS looks very attractive, but I don't believe it can be used for a boot drive. How would you set up a system like this? I'd boot on IDE for now. In the future, we'll be able to boot from a ZFS mirror, but since most root drives don't get much use, sticking w/ two IDE drives there would probably be fine. There are performance/space/safety tradeoffs to be made. What are your goals wrt these attributes? I can purchase additional SATA or IDE hard drives... For example, I could get 3 more 320gig SATAII drives, and fill all the SATA ports. And hook up an IDE drive as the system boot drive. Sincerely, You may wish to take a look at my latest blog post: http://blogs.sun.com/barts - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
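Once the space/safety tradeoff is settled, a starting point for the NAS layout might look like the sketch below (controller/target numbers and dataset names are made up; a single raidz of the five SATA drives gives four drives' worth of space and survives one failure):

    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
    zfs create tank/media
    zfs set sharenfs=on tank/media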
Re: [zfs-discuss] Re: Help understanding some benchmark results
Chris Smith wrote: What build/version of Solaris/ZFS are you using? Solaris 11/06. What block size are you using for writes in bonnie++? I find performance on streaming writes is better w/ larger writes. I'm afraid I don't know what block size bonnie++ uses by default - I'm not specifying one on the commandline. trussing bonnie will tell you... I just built bonnie++; by default it uses 8k. What happens when you run two threads at once? Does write performance improve? If I use raidz, no (overall throughput is actually nearly halved!). If I use RAID0 (just striped disks, no redundancy) it improves (significantly in some cases). Increasing the blocksize will help. You can do that on bonnie++ like this: ./bonnie++ -d /internal/ -s 8g:128k ... Make sure you don't have compression on. Some observations: * This machine only has 32 bit CPUs. Could that be limiting performance ? It will, but it shouldn't be too awful here. You can lower kernelbase to let the kernel have more of the RAM on the machine. You're more likely going to run into problems w/ the front side bus; my experience w/ older Xeons is that one CPU could easily saturate the FSB; using the other would just make things worse. You should not be running into that yet, either, though. Offline one of the CPUs w/ psradm -f 1; reenable w/ psradm -n 1. * A single drive will hit ~60MB/s read and write. Since these are only 7200rpm SATA disks, that's probably all they've got to give. On a good day on the right part of the drive... slowest to fastest sectors can be 2:1 in performance... What can you get with your drives w/ dd to the raw device when not part of a pool? E.g. /bin/ptime dd if=/dev/zero of=/dev/rdsk/... bs=128k count=2 You can also do this test to a file to see what the peak is going to be... What kind of write performance do people get out of those honkin' big x4500s ? ~2GB/sec locally, 1 GB/sec over the network. This requires multiple writing threads; a single CPU just isn't fast enough to write 2GB/sec. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
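Spelled out with dd's count= operand and a realistic amount of data (a sketch only: the device name is a placeholder, and writing to a raw disk device destroys whatever is on it, so use a scratch disk):

    # write 1 GB of zeros to a scratch disk, timed (8192 x 128k = 1 GB)
    /bin/ptime dd if=/dev/zero of=/dev/rdsk/c2t0d0p0 bs=128k count=8192
    # the same test against a file in the pool shows the filesystem-level peak
    /bin/ptime dd if=/dev/zero of=/internal/ddtest bs=128k count=8192

Comparing the two numbers separates the drives/controller from ZFS itself.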
Re: [zfs-discuss] Help understanding some benchmark results
Chris Smith wrote: G'day, all, So, I've decided to migrate my home server from Linux+swRAID+LVM to Solaris+ZFS, because it seems to hold much better promise for data integrity, which is my primary concern. However, naturally, I decided to do some benchmarks in the process, and I don't understand why the results are what they are. I thought I had a reasonable understanding of ZFS, but now I'm not so sure. I've used bonnie++ and a variety of Linux RAID configs below to approximate equivalent ZFS configurations and compare. I do realise they're not exactly the same thing, but it seems to me they're reasonable comparisons and should return at least somewhat similar performance. I also realise bonnie++ is not an especially comprehensive or complex benchmark, but ultimately I don't really care about performance and this was only done out of curiosity. The executive summary is that ZFS write performance appears to be relatively awful (all the time), and its read performance is relatively good most of the time (with striping, mirroring and raidz[2]'s with fewer numbers of disks). Examples:
* 8-disk RAID0 on Linux returns about 190MB/s write and 245MB/sec read, while a ZFS raidz using the same disks returns about 120MB/sec write, but 420MB/sec read.
* 16-disk RAID10 on Linux returns 165MB/sec and 440MB/sec write and read, while a ZFS pool with 8 mirrored disks returns 140MB/sec write and 410MB/sec read.
* 16-disk RAID6 on Linux returns 126MB/sec write, 162MB/sec read, while a 16-disk raidz2 returns 80MB/sec write and 142MB/sec read.
The biggest problem I am having understanding why it is so, is because I was under the impression with ZFS's CoW, etc, that writing (*especially* writes like this, to a raidz array) should be much faster than a regular old-fashioned RAID6 array. I certainly can't complain about the read speed, however - 400-odd MB/sec out of this old beastie is pretty impressive :). Help? Have I missed something obvious or done something silly? (Additionally, from the Linux perspective, why are reads so slow?) What build/version of Solaris/ZFS are you using? What block size are you using for writes in bonnie++? I find performance on streaming writes is better w/ larger writes. What happens when you run two threads at once? Does write performance improve? Does zpool iostat -v 1 report anything interesting during the benchmark? What about iostat -x 1? Is one disk significantly more busy than the others? I have a 4x 500GB disk raidz config w/ a 2.6 GHz dual core at home; this config sustains approx 120 MB/sec on reads and writes on single or multiple streams. I'm running build 55; the box has a SI controller running in PATA compat. mode. One of the challenging aspects of performance work on these sorts of things is separating out drivers vs cpus vs memory bandwidth vs disk behavior vs intrinsic FS behavior. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
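To gather what Bart is asking about while bonnie++ runs (a sketch; the pool name tank is a placeholder), run the monitors in a second window:

    zpool iostat -v tank 1   # per-vdev bandwidth and IOPS, one-second samples
    iostat -xn 1             # per-device %busy and service times

A single disk that is much busier than its peers, or service times far above the rest, usually points at a slow drive or controller port rather than the filesystem.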
Re: [zfs-discuss] use the same zfs filesystem with differnet mountpoint
Fabian Wörner wrote: I'm thinking of having Solaris and Mac OS 10.5 on the same machine and mounting the same filesystem at a different point on each OS. Is/will this be possible, or do I have to use symlinks? Since the mount point is stored in the ZFS pool, you'll need to use legacy mounts to do this. This works fine between different Solaris versions; if the Mac folks didn't change their on-disk format it might just work between OS X and Solaris as well. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
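On the Solaris side, the legacy-mount arrangement looks roughly like this (dataset and path names are examples):

    # stop ZFS from managing the mountpoint itself
    zfs set mountpoint=legacy tank/shared
    # then mount it wherever this OS wants it (or via /etc/vfstab)
    mount -F zfs tank/shared /export/shared

Each OS can then pick its own mount path instead of inheriting the one recorded in the pool.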
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
[EMAIL PROTECTED] wrote: I have been looking at zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be a low hanging fruit for block squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows. I am writing to see if I can get some feedback from people that know the code better than I -- are there any gotchas in my logic? Assumptions: SHA256 hash used (Fletcher2/4 have too many collisions, SHA256 is 2^128 if I remember correctly) SHA256 hash is taken on the data portion of the block as it exists on disk. the metadata structure is hashed separately. In the current metadata structure, there is a reserved bit portion to be used in the future. Description of change: Creates: The filesystem goes through its normal process of writing a block, and creating the checksum. Before the step where the metadata tree is pushed, the checksum is checked against a global checksum tree to see if there is any match. If match exists, insert a metadata placeholder for the block, that references the already existing block on disk, increment a number_of_links pointer on the metadata blocks to keep track of the pointers pointing to this block. free up the new block that was written and check-summed to be used in the future. else if no match, update the checksum tree with the new checksum and continue as normal. Deletes: normal process, except verifying that the number_of_links count is lowered and if it is non zero then do not free the block. clean up checksum tree as needed. What this requires: A new flag in metadata that can tag the block as a CAS block. A checksum tree that allows easy fast lookup of checksum keys. a counter in the metadata or hash tree that tracks links back to blocks. Some additions to the userland apps to push the config/enable modes. Does this seem feasible? Are there any blocking points that I am missing or unaware of? I am just posting this for discussion, it seems very interesting to me. Note that you'd actually have to verify that the blocks were the same; you cannot count on the hash function. If you didn't do this, anyone discovering a collision could destroy the colliding blocks/files. Val Henson wrote a paper on this topic; there's a copy here: http://infohost.nmt.edu/~val/review/hash.pdf - Bart Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
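To make the verify step concrete at user level (a sketch only: the file names are placeholders, and digest(1) is the Solaris userland command, not the in-kernel checksum path ZFS would use):

    # even when the SHA256 digests match, compare the actual bytes before sharing a block
    a=`digest -a sha256 block_a.bin`
    b=`digest -a sha256 block_b.bin`
    if [ "$a" = "$b" ] && cmp -s block_a.bin block_b.bin; then
            echo "identical blocks: safe to store one copy"
    fi

The cmp step is the point Bart and the Henson paper are making: the hash narrows the candidates, but only a byte-for-byte comparison makes collapsing two blocks safe.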
Re: [zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote: Quick update, since my original post I've confirmed via DTrace (rwtop script in toolkit) that the application is not generating 150MB/S * compressratio of I/O. What then is causing this much I/O in our system? This message posted from opensolaris.org Are you doing random IO? Appending or overwriting? - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
Jason J. W. Williams wrote: Not sure. I don't see an advantage to moving off UFS for boot pools. :-) -J Except of course that snapshots and clones will surely be a nicer way of recovering from adverse administrative events... - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hardware planning for storage server
Jakob Praher wrote: hi all, I'd like to build a solid storage server using zfs and opensolaris. The server more or less should have a NAS role, thus using nfsv4 to export the data to other nodes. ... what would be your reasonable advice? First of all, figure out what you need in terms of capacity and IOPS/sec. This will determine the number of spindles, cpus, network adaptors, etc. Keep in mind, for large sequential reads and large writes you can get a significant fraction of the max throughput of the drives; my 4 x 500 GB RAIDZ configuration does 150 MB/sec pretty consistently. If you're doing small random reads or writes, you'll be much more limited by the number of spindles and the way you configure them. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz DEGRADED state
Krzys wrote: my drive did go bad on me, how do I replace it? I am running Solaris 10 U2 (by the way, I thought U3 would be out in November, will it be out soon? does anyone know?)
[11:35:14] server11: /export/home/me zpool status -x
  pool: mypool2
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool2     DEGRADED     0     0     0
          raidz     DEGRADED     0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
            c3t6d0  UNAVAIL      0   679     0  cannot open

errors: No known data errors ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Shut down the machine, replace the drive, reboot and type: zpool replace mypool2 c3t6d0 On earlier versions of ZFS I found it useful to do this at the login prompt; it seemed fairly memory intensive. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
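For completeness, using the pool and device names above (the status check is just a suggestion for watching the rebuild):

    zpool replace mypool2 c3t6d0
    zpool status -x mypool2    # shows resilver progress until the pool reports healthy again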
Re: [zfs-discuss] Is there inotify for ZFS?
LingBo Tang wrote: Hi all, As with inotify on Linux, is there a similar mechanism in Solaris for ZFS? I think this functionality is helpful for a desktop search engine. I know one engineer at Sun is working on a file event monitor, which will provide some information about file events, but it is not aimed at search because it might have problems while monitoring a large system. The right way to implement a desktop search engine w/ ZFS is an API that would let you cheaply discover all the files modified after an arbitrary file in that filesystem. Note that a filesystem is capable of being modified far faster than an indexing program can process those modifications. As a result, any notification scheme must either block further filesystem changes (unacceptable), provide infinite storage of pending change notification events (difficult in practice), or provide a means for re-discovering what has changed since the last time the changes were examined. Since the latter mechanism is needed anyway to handle initialization or modifications during periods when the search engine isn't running, making finding modified files cheap seems like the easiest and most robust approach. Since ZFS uses COW semantics, it is possible to provide a means to very cheaply discover files that have been modified since another file in the filesystem. The cost of this discovery is, very roughly, on the order of the number of modified files times the average number of files in a directory times the mean directory depth of the modified files. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
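For comparison, the clumsy user-level approximation available today (a sketch; the paths are examples, and unlike the COW-based walk described above it has to visit every file, not just the changed ones):

    # files modified more recently than the marker left by the last index run
    find /export/home -newer /export/home/.last-indexed -type f
    # reset the marker for the next pass
    touch /export/home/.last-indexed

The point of a ZFS-level API would be to get the same answer in time proportional to the number of changed files instead of the number of files that exist.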
Re: [zfs-discuss] Re: Recommended Minimum Hardware for ZFS Fileserver?
Wes Williams wrote: Thanks gents for your replies. I've been using a very large config W2100z and ZFS for a while but didn't know how low you can go for ZFS to shine, though a 64-bit CPU seems to be the minimum performance threshold. Now that Sun's store is [sort of] working again, I can see some X2100's with custom configuration and a very low starting price of only $450 sans CPU, drives, and memory. Great!! If only we could get a basic X2100-ish designed, custom build priced, server from Sun that could also hold 3-5 drives internally, I could see a bunch of those being used as ZFS file servers. This would also be a good price point for small office and home users since the X4100 is certainly overkill in this application, though I wouldn't refuse one offered to me. =) I built my own, using essentially the same mobo (Tyan 2865). The Ultra 20 is slightly different, but not enough to matter. I put it in a case that would hold more drives and a larger power supply, and I've got a nice home server w/ a TB of disk (effective space 750GB). Very simple and easy. Right now I'm still using a single disk for /, since I'm worried about safeguarding data, not making sure I have max availability. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Where is the ZFS configuration data stored?
Sergey wrote: + a little addition to the original question: Imagine that you have a RAID attached to a Solaris server. There's ZFS on the RAID. And someday you lose your server completely (fried motherboard, physical crash, ...). Is there any way to connect the RAID to some other server and restore the ZFS layout (not losing all the data on the RAID)? If the RAID controller is undamaged, just hook it up and go; you can import the ZFS pool on another system seamlessly. If the RAID controller gets damaged, you'll need to follow the manufacturer's documentation to restore your data. JBODs are simple, easy and relatively foolproof when used w/ ZFS. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
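The mechanics of moving a pool to another box are just (a sketch; the pool name is an example):

    # on the old host, if it is still alive
    zpool export tank
    # on the new host; -f is needed if the old host died without exporting
    zpool import -f tank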
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per file copies, but it's more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more than one copy in case of a problem down the road. Actually, this is a perfect use case for setting the copies=2 property after installation. The original binaries are quite replaceable; the customizations and personal files created later on are not. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
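Assuming the copies property lands as proposed, that post-install step would be something like this (pool and dataset names are examples; the setting applies to data written from then on):

    # keep two copies of everything under the home directories going forward
    zfs set copies=2 tank/export/home
    # verify the setting
    zfs get copies tank/export/home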
Re: [zfs-discuss] Problem with ZFS's performance
Josip Gracin wrote: Hello! Could somebody please explain the following bad performance of a machine running ZFS. I have a feeling it has something to with the way ZFS uses memory, because I've checked with ::kmastat and it shows that ZFS uses huge amounts of memory which I think is killing the performance of the machine. This is the test program:

#include <malloc.h>
#include <stdio.h>

int main() {
    char *buf = calloc(51200, 1);
    if (buf == NULL) {
        printf("Allocation failed.\n");
    }
    return 0;
}

I've run the test program on the following two different machines, both under light load: Machine A is AMD64 3000+ (2.0GHz), 1GB RAM running snv_46. Machine B is Pentium 4, 2.4GHz, 512MB RAM running Linux. Execution times on several consecutive runs are:
Machine A
time ./a.out
./a.out  0.49s user 1.39s system 2% cpu 1:03.25 total
./a.out  0.48s user 1.28s system 3% cpu 50.691 total
./a.out  0.48s user 1.27s system 4% cpu 38.225 total
./a.out  0.48s user 1.24s system 5% cpu 30.694 total
./a.out  0.47s user 1.20s system 5% cpu 28.640 total
./a.out  0.47s user 1.23s system 6% cpu 28.210 total
./a.out  0.47s user 1.21s system 6% cpu 27.700 total
./a.out  0.47s user 1.19s system 9% cpu 17.875 total
./a.out  0.46s user 1.15s system 12% cpu 12.784 total
On machine B [the first run took approx. 10 seconds, I forgot to paste it]
./a.out  0.14s user 0.89s system 27% cpu 3.711 total
./a.out  0.13s user 0.87s system 25% cpu 3.926 total
./a.out  0.11s user 0.90s system 29% cpu 3.456 total
./a.out  0.11s user 0.91s system 29% cpu 3.435 total
./a.out  0.10s user 0.91s system 38% cpu 2.597 total
./a.out  0.11s user 0.93s system 35% cpu 2.913 total
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss There are several things going on here, and part of that may well be the memory utilization of ZFS. Have you tried the same experiment when not using ZFS? Keep in mind that Solaris doesn't always use the most efficient strategies for paging applications... this is something we're actively working on fixing as part of the VM work going on... -Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Write cache
Jesus Cea wrote: Neil Perrin wrote: I suppose if you know the disk only contains zfs slices then write caching could be manually enabled using format -e -> cache -> write_cache -> enable When will we have write cache control over ATA/SATA drives? :-). A method of controlling write cache independent of drive type, color or flavor is being developed; I'll ping the responsible parties (bcc'd). - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
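For reference, the menu path quoted above looks roughly like this once a disk is selected (a sketch; the exact menus offered depend on the drive and driver):

    # format -e
      (choose the disk from the list)
    format> cache
    cache> write_cache
    write_cache> enable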
Re: [zfs-discuss] zfs sucking down my memory!?
Joseph Mocker wrote: Bart Smaalders wrote: How much swap space is configured on this machine? Zero. Is there any reason I would want to configure any swap space? --joe Well, if you want to allocate 500 MB in /tmp, and your machine has no swap, you need 500M of physical memory or the write _will_ fail. W/ no swap configured, every malloc'd allocation (and the like) in every process has to be backed by RAM, so it is effectively locked there. I just swap on a zvol w/ my ZFS root machine. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
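The zvol swap setup is just (a sketch; the pool name, volume name and size are examples):

    # create a 2 GB volume and add it as swap
    zfs create -V 2g rpool/swapvol
    swap -a /dev/zvol/dsk/rpool/swapvol
    # confirm it is in use
    swap -l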
Re: [zfs-discuss] Proposal: delegated administration
Matthew Ahrens wrote: On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote: Mark Shellenbaum wrote: PERMISSION GRANTING zfs allow -c ability[,ability...] dataset -c Create means that the permission will be granted (Locally) to the creator on any newly-created descendant filesystems. ALLOW EXAMPLE Let's set up a public build machine where engineers in group staff can create ZFS file systems, clones, snapshots and so on, but you want to allow only the creator of the file system to destroy it. # zpool create sandbox disks # chmod 1777 /sandbox # zfs allow -l staff create sandbox # zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox So as administrator what do I need to do to set /export/home up for users to be able to create their own snapshots, create dependent filesystems (but still mounted underneath their /export/home/username)? In other words, is there a way to specify the rights of the owner of a filesystem rather than the individual - eg, delayed evaluation of the owner? I think you're asking for the -c Creator flag. This allows permissions (eg, to take snapshots) to be granted to whoever creates the filesystem. The above example shows how this might be done. --matt Actually, I think I mean owner. I want root to create a new filesystem for a new user under the /export/home filesystem, but then have that user get the right privs via inheritance rather than requiring root to run a set of zfs commands. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: delegated administration
Matthew Ahrens wrote: On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote: So as administrator what do I need to do to set /export/home up for users to be able to create their own snapshots, create dependent filesystems (but still mounted underneath their /export/home/usrname)? In other words, is there a way to specify the rights of the owner of a filesystem rather than the individual - eg, delayed evaluation of the owner? I think you're asking for the -c Creator flag. This allows permissions (eg, to take snapshots) to be granted to whoever creates the filesystem. The above example shows how this might be done. --matt Actually, I think I mean owner. I want root to create a new filesystem for a new user under the /export/home filesystem, but then have that user get the right privs via inheritance rather than requiring root to run a set of zfs commands. In that case, how should the system determine who the owner is? We toyed with the idea of figuring out the user based on the last component of the filesystem name, but that seemed too tricky, at least for the first version. FYI, here is how you can do it with an additional zfs command: # zfs create tank/home/barts # zfs allow barts create,snapshot,... tank/home/barts --matt Owner of the top level directory is the owner of the filesystem? - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
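A sketch of wrapping those two steps so adding a user stays a one-liner for root (the script name, pool layout and ability list are examples drawn from the proposal above, not part of it):

    #!/bin/sh
    # usage: newhome <username>
    # creates a home dataset and delegates the common abilities to its owner
    u=$1
    zfs create tank/home/$u
    zfs allow $u create,destroy,snapshot,clone,promote,mount tank/home/$u
    chown $u /tank/home/$u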
Re: [zfs-discuss] ZFS home/JDS interaction issue
Thomas Maier-Komor wrote: Hi, I just upgraded my machine at home to Solaris 10U2. As I already had a ZFS, I wanted to migrate my home directories at once to a ZFS from a local UFS metadisk. Copying and changing the config of the automounter succeeded without any problems. But when I tried to log in to JDS, login succeeded, but JDS did not start and the X session always gets terminated after a couple of seconds. /var/dt/Xerrors says that /dev/fb could not be accessed, although it works without any problem when running from the UFS filesystem. Switching back to my UFS based home resolved this issue. I even tried switching over to ZFS and rebooted the machine to make 100% sure everything is in a sane state (i.e. no gconfd etc.), but the issue persisted and switching back to UFS again resolved this issue. Has anybody else had similar problems? Any idea how to resolve this? TIA, Tom I'm running w/ ZFS mounted home directories both on my home and work machines; my work desktop has ZFS root as well. Are you sure you moved just your home directory? Is the automounter config the same (wrt setuid, etc)? Can you log in as root when ZFS is your home directory? If not, there's something else going on. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and Storage
Gregory Shaw wrote: On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote: How would ZFS self heal in this case? You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid. If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software raid (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime. In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive write cache
Gregory Shaw wrote: I had a question for the group: In the different ZFS discussions in zfs-discuss, I've seen a recurring theme of disabling write cache on disks. I would think that the performance increase of using write cache would be an advantage, and that write cache should be enabled. Realistically, I can see only one situation where write cache would be an issue. If there is no way to flush the write cache, it would be possible for corruption to occur due to a power loss. There are two failure modes associated with disk write caches: 1) the disk write cache, for performance reasons, doesn't write back data (to diff. blocks) to the platter in the order it was received, so transactional ordering isn't maintained and corruption can occur. 2) writes to different disks can have different caching policies, so transactions to files on different filesystems may not complete correctly during a power failure. ZFS enables the write cache and flushes it when committing transaction groups; this ensures that all of a transaction group appears or does not appear on disk. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss