[zfs-discuss] A simple script to measure SYNC writes
Hi,

There was a requirement to measure all the O_SYNC writes. Attached is a simple DTrace script which does this using the fsinfo provider and fbt::fop_write. I was wondering if this is accurate enough or if I missed any other cases. I am sure this can be improved in many ways.

Thanks and regards,
Sanjeev

#!/usr/sbin/dtrace -Cs
/* CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License, Version 1.0 only
 * (the "License"). You may not use this file except in compliance
 * with the License.
 *
 * You can obtain a copy of the license at Docs/cddl1.txt
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * CDDL HEADER END
 *
 * Author: Sanjeev Bagewadi [Bangalore, India]
 */
#pragma D option quiet

#include <sys/file.h>

BEGIN
{
        secs = 10;
}

fbt::fop_write:entry
/arg2 & (FSYNC | FDSYNC)/
{
        self->trace = 1;
}

fbt::fop_write:return
/self->trace/
{
        self->trace = 0;
}

fsinfo::fop_write:write
/self->trace/
{
        vp = (vnode_t *)arg0;
        vfs = (vfs_t *)vp->v_vfsp;
        mnt_pt = (char *)((refstr_t *)vfs->vfs_mntpt)->rs_string;
        uio = (uio_t *)arg1;
        /* @writes[stringof(mnt_pt)] = sum(uio->uio_resid); */
        @writes[args[0]->fi_mount] = sum(args[1]);
}

tick-1s
/secs != 0/
{
        secs--;
}

tick-1s
/secs == 0/
{
        exit(0);
}

END
{
        printa("%s %@8d\n", @writes);
}

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot remove a file on a GOOD ZFS filesystem
Marcelo,

Thanks for the details ! This rules out a bug that I was suspecting : http://bugs.opensolaris.org/view_bug.do?bug_id=6664765

This needs more analysis. What does the rm command fail with ? We could probably run truss on the rm command, like :

truss -o /tmp/rm.truss rm filename

You can then pass on the file /tmp/rm.truss. This would show us which system call is failing and why, and give us a good idea of what is going wrong.

Thanks and regards,
Sanjeev.

Marcelo Leal wrote:
Hello all,
# zpool status
  pool: mypool
 state: ONLINE
 scrub: scrub completed after 0h2m with 0 errors on Fri Dec 19 09:32:42 2008
config:
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t8d0  ONLINE       0     0     0
            c0t9d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t10d0 ONLINE       0     0     0
            c0t11d0 ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t12d0 ONLINE       0     0     0
            c0t13d0 ONLINE       0     0     0
        logs        ONLINE       0     0     0
          c0t1d0    ONLINE       0     0     0
errors: No known data errors

- zfs list -r shows eight filesystems, and nine snapshots per filesystem.
...
mypool/colorado                                 1.83G  4.00T  1.13G  /mypool/colorado
mypool/colorado@centenario-2008-12-28-01:00:00  40.3M      -  1.46G  -
mypool/colorado@centenario-2008-12-29-01:00:00  30.0M      -  1.54G  -
mypool/colorado@campeao-2008-12-29-09:00:00     10.4M      -  1.24G  -
mypool/colorado@campeao-2008-12-29-13:00:00     31.5M      -  1.29G  -
mypool/colorado@campeao-2008-12-29-17:00:00     5.46M      -  1.10G  -
mypool/colorado@campeao-2008-12-29-21:00:00     4.23M      -  1.13G  -
mypool/colorado@centenario-2008-12-30-01:00:00      0      -  1.16G  -
mypool/colorado@campeao-2008-12-30-01:00:00         0      -  1.16G  -
mypool/colorado@campeao-2008-12-30-05:00:00     6.24M      -  1.16G  -
...

- How many entries does it have ?
Now there is just one file, the problematic one... but before the whole problem, four or five small files (the whole pool is pretty empty).
- Which filesystem (of the zpool) does it belong to ?
See above...
Thanks a lot!
Re: [zfs-discuss] Cannot remove a file on a GOOD ZFS filesystem
Marcelo,

Thanks for the details. Comments inline...

Marcelo Leal wrote:
execve(/usr/bin/rm, 0x08047DBC, 0x08047DC8)  argc = 2
mmap(0x, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEFF
resolvepath(/usr/lib/ld.so.1, /lib/ld.so.1, 1023) = 12
resolvepath(/usr/bin/rm, /usr/bin/rm, 1023) = 11
sysconfig(_CONFIG_PAGESIZE) = 4096
xstat(2, /usr/bin/rm, 0x08047A68) = 0
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
xstat(2, /lib/libc.so.1, 0x080471C8) = 0
resolvepath(/lib/libc.so.1, /lib/libc.so.1, 1023) = 14
open(/lib/libc.so.1, O_RDONLY) = 3
fstatat64(AT_FDCWD, Arquivos.file, 0x08047C80, 0x1000) Err#2 ENOENT

This is interesting ! Note that the fstatat64() call is failing with ENOENT. So, there is something we are missing. I assume you are able to list the directory contents and ascertain that the file exists. Can you please provide the directory listing (ls -l) of the directory in question ?

Note that an ls -l would use fstat64() to get the stats of the files. So, truss output of ls -l would also help.

Thanks and regards,
Sanjeev.

fstat64(2, 0x08046CE0) = 0
write(2, r m : , 4) = 4
write(2, Arquivos . fil.., 13) = 13
write(2, : , 2) = 2
write(2, N o s u c h f i l e.., 25) = 25
write(2, \n, 1) = 1
_exit(2)
Re: [zfs-discuss] Cannot remove a file on a GOOD ZFS filesystem
Marcelo,

Marcelo Leal wrote:
Hello all... Can that be caused by some cache on the LSI controller? Some flush that the controller or disk did not honour?

More details on the problem would help. Can you please give the following details :
- zpool status
- zfs list -r
- The details of the directory :
  - How many entries does it have ?
  - Which filesystem (of the zpool) does it belong to ?

Thanks and regards,
Sanjeev.
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
Jel,

Jens Elkner wrote:
On Fri, Nov 21, 2008 at 03:42:17PM -0800, David Pacheco wrote:
Pawel Tecza wrote:
But I still don't understand why `zfs list` doesn't display snapshots by default. I saw it in the Net many times at the examples of zfs usage.

This was PSARC/2008/469 - excluding snapshot info from 'zfs list' http://opensolaris.org/os/community/on/flag-days/pages/2008091003/

The uncomplete one - where is the '-t all' option? It's really annoying, error prone, time consuming to type stories on the command line ... Does anybody remember the keep it small and simple thing?

This change was made because there were a lot of users who have a large number of snapshots. That would cause two problems :
- The listing of all the snapshots and filesystems would be really long.
- It would also take a rather long time.

Those who want the older behaviour can still set the listsnapshots property accordingly.

Hope that helps.

Regards,
Sanjeev.

Regards,
jel.
Re: [zfs-discuss] How can i make my zpool as faulted.
Yuvraj,

I see that you are using files as disks. You could write a few random bytes into one of the files and that would induce corruption. To make a particular disk faulty you could mv the file to a new name.

Also, you can explore zinject from the ZFS test suite. It probably has a way to induce faults.

Thanks and regards,
Sanjeev

yuvraj wrote:
Hi Sanjeev,
I am herewith giving all the details of my zpool by firing the #zpool status command on the command line. Please go through the same and help me out. Thanks in advance.

Regards,
Yuvraj Balkrishna Jadhav.
==
# zpool status
  pool: mypool1
 state: ONLINE
 scrub: none requested
config:
        NAME      STATE     READ WRITE CKSUM
        mypool1   ONLINE       0     0     0
          /disk1  ONLINE       0     0     0
          /disk2  ONLINE       0     0     0
errors: No known data errors

  pool: zpool21
 state: ONLINE
 scrub: scrub completed with 0 errors on Sat Oct 18 13:01:52 2008
config:
        NAME      STATE     READ WRITE CKSUM
        zpool21   ONLINE       0     0     0
          /disk3  ONLINE       0     0     0
          /disk4  ONLINE       0     0     0
errors: No known data errors
--
This message posted from opensolaris.org
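To make the "write a few random bytes" step concrete, here is a minimal Python sketch. The path /tmp/disk1 is a stand-in for one of the pool's backing files (the thread uses /disk1 etc.); after corrupting a real vdev file this way, a zpool scrub should surface checksum errors.

```python
import os
import random

def corrupt_file(path, nbytes=64):
    """Overwrite nbytes at a random offset with random data, in place."""
    size = os.path.getsize(path)
    offset = random.randrange(0, max(1, size - nbytes))
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(os.urandom(nbytes))
    return offset

# Demo against a scratch file standing in for /disk1:
with open("/tmp/disk1", "wb") as f:
    f.write(b"\x00" * 4096)
corrupt_file("/tmp/disk1")
```

Note the file size is unchanged -- only the contents are damaged, which is exactly what a checksum-level scrub is meant to detect.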
Re: [zfs-discuss] resilver running for 35 trillion years
Mike,

Indeed an interesting result :) ! This is a known problem with VirtualBox :) They have fixed it in the latest release

-- snip --
#1639: Solaris Virtual box guest keeps getting its time reset after resuming VM from suspend
 Reporter: nagki      | Owner:
 Type: defect         | Status: closed
 Priority: major      | Version: VirtualBox 1.6.0
 Resolution: duplicate | Keywords: time reset guest

Changes (by sandervl73):
 * status: new => closed
 * resolution: => duplicate

Comment:
Duplicate and fixed in 1.6.2 (due out in a day or two)
-- snip --

Cheers,
Sanjeev.

Mike Gerdts wrote:
This is good for a chuckle.

# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 307445734561488536h47m, 19.31% done, 307445734560416371h48m to go
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7d0s0  ONLINE       0     0     0
            c7d1s0  ONLINE       0     0     0
errors: No known data errors

I'm all for the 128 bit file system being able to use every atom in the universe for storage, but I doubt that this pool has been resilvering for over 35 trillion years. If it has, I'm certainly not staying up to wait for it to finish... How did this happen? According to the timestamps in my prompt, I'm thinking that virtualbox reset the time to zero while the command was running. This seems to happen from time to time, but this is the most entertaining result I have seen.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Re: [zfs-discuss] ZFS fragmentation
Lance,

This could be bug# 6596237 "Stop looking and start ganging" (http://monaco.sfbay/detail.jsf?cr=6596237). The fix is in progress and Victor Latushkin is working on it.

We have an IDR based on the patches 127127-11/127128-11 which has the first cut of the fix. You could raise an escalation and get these IDRs.

Thanks and regards,
Sanjeev.

Lance wrote:
Any progress on a defragmentation utility? We appear to be having a severe fragmentation problem on an X4500, vanilla S10U4, no additional patches. 500GB disks in 4 x 11 disk RAIDZ2 vdevs. It hit 97% full and fell off a cliff... about 50KB/sec on writes. Deleting files so the zpool is at 92% has not helped. I rebooted the host... no difference. I lowered the recordsize from 128KB to 8KB. That has boosted performance to 250-500KB/sec on writes (still 10x-100x too slow). Reads have been fine all along.

This is one big zpool and one file system of 16TB. Approximately 25-30M files, some of which change often. Lots of small, changing files, which are probably aggravating the problem. Due to the Marvell driver bug, I have SATA NCQ turned off in /etc/system via set sata:sata_func_enable=0x5. We plan to go to the most recent patch set so I can remove that, but I'm not convinced patching will fix the slowness we're seeing.

We'll try to delete more files, but having a defragmentation utility might help in this case. It seems a shame to waste 10-20% of your disk space to maintain moderate performance, though I guess that's what we'll have to do.
Re: [zfs-discuss] ZFS space map causing slow performance
Scott,

This looks more like bug# 6596237 "Stop looking and start ganging" (http://monaco.sfbay/detail.jsf?cr=6596237).

What version of Solaris are the production servers running (S10 or OpenSolaris) ?

Thanks and regards,
Sanjeev.

Scott wrote:
Hello,
I have several ~12TB storage servers using Solaris with ZFS. Two of them have recently developed performance issues where the majority of time in an spa_sync() will be spent in the space_map_*() functions. During this time, zpool iostat will show 0 writes to disk, while it does hundreds or thousands of small (~3KB) reads each second, presumably reading space map data from disk to find places to put the new blocks. The result is that it can take several minutes for an spa_sync() to complete, even if I'm only writing a single 128KB block.

Using DTrace, I can see that space_map_alloc() frequently returns -1 for 128KB blocks. From my understanding of the ZFS code, that means that one or more metaslabs has no 128KB blocks available. Because of that, it seems to be spending a lot of time going through different space maps which aren't able to all be cached in RAM at the same time, thus causing bad performance as it has to read from the disks. The on-disk space map size seems to be about 500MB.

I assume the simple solution is to leave enough free space available so that the space map functions don't have to hunt around so much. This problem starts happening when there's about 1TB free out of the 12TB. It seems like such a shame to waste that much space, so if anyone has any suggestions, I'd be glad to hear them.

1) Is there anything I can do to temporarily fix the servers that are having this problem? They are production servers, and I have customers complaining, so a temporary fix is needed.
2) Is there any sort of tuning I can do with future servers to prevent this from becoming a problem? Perhaps a way to make sure all the space maps are always in RAM?
3) I set recordsize=32K and turned off compression, thinking that should fix the performance problem for now. However, using a DTrace script to watch calls to space_map_alloc(), I see that it's still looking for 128KB blocks (!!!) for reasons that are unclear to me, thus it hasn't helped the problem.

Thanks,
Scott
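To illustrate what space_map_alloc() returning -1 means, here is a toy first-fit model (purely illustrative, not the actual ZFS allocator): a badly fragmented free list can hold plenty of total free space while containing no single 128KB extent.

```python
KB = 1024

def first_fit(free_extents, size):
    """Return the offset of the first free extent >= size, or -1 if none fits."""
    for offset, length in free_extents:
        if length >= size:
            return offset
    return -1

# 1000 scattered 32KB holes: 32000KB free in total, yet no 128KB extent.
free = [(i * 256 * KB, 32 * KB) for i in range(1000)]

print(sum(length for _, length in free) // KB)  # 32000 (KB free overall)
print(first_fit(free, 128 * KB))                # -1: no extent large enough
print(first_fit(free, 32 * KB))                 # 0: a 32KB allocation still fits
```

This is also consistent with the observation in point 3: recordsize only affects newly written file data, so metadata and already-allocated files can still drive 128KB allocation attempts.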
Re: [zfs-discuss] ZFS Performance Issue
William,

It should be fairly easy to find the record size using DTrace. Take an aggregation of the writes happening (aggregate on size for all the write(2) system calls). This would give a fair idea of the I/O size pattern.

Does RRD4J have a record size mentioned ? Usually, if it is a database application, there is a record-size option when the DB is created (based on my limited knowledge about DBs).

Thanks and regards,
Sanjeev.

PS : Here is a simple script which just aggregates on the write size and executable name :
-- snip --
#!/usr/sbin/dtrace -s

syscall::write:entry
{
        wsize = (size_t) arg2;
        @write[wsize, execname] = count();
}
-- snip --

William Fretts-Saxton wrote:
Unfortunately, I don't know the record size of the writes. Is it as simple as looking @ the size of a file, before and after a client request, and noting the difference in size? This is binary data, so I don't know if that makes a difference, but the average write size is a lot smaller than the file size.

Should the recordsize be in place BEFORE data is written to the file system, or can it be changed after the fact? I might try a bunch of different settings for trial and error.

The I/O is actually done by RRD4J, which is a round-robin database library. It is a Java version of 'rrdtool' which saves data into a binary format, but also cleans up the data according to its age, saving less of the older data as time goes on.
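For readers without DTrace at hand, the D aggregation above is just a count per (write size, executable) pair. The same bookkeeping in Python, with made-up events for illustration:

```python
from collections import Counter

# Hypothetical trace events: (write size in bytes, process name)
events = [(4096, "java"), (4096, "java"), (512, "rrd4j"), (4096, "java")]

write_agg = Counter(events)  # mirrors @write[wsize, execname] = count()
for (size, exe), n in sorted(write_agg.items()):
    print(f"{size:8d} {exe:<10s} {n}")
```

A strong spike at one size in such a histogram is the value you would consider for the recordsize property.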
Re: [zfs-discuss] zfs device busy
Carol,

Probably /mnt is already in use, i.e. some other filesystem is mounted there. Can you please verify ? What is the original mountpoint of pool/zfs1 ?

Regards,
Sanjeev.

Caroline Carol wrote:
Hi all,
When I modify zfs FS properties I get device busy :

-bash-3.00# zfs set mountpoint=/mnt1 pool/zfs1
cannot unmount '/mnt': Device busy

Do you know how to identify processes accessing this FS ? fuser doesn't work with zfs!

Thanks a lot
regards
Carol
Re: [zfs-discuss] zfs won't import a pool automatically at boot
Michael,

If you don't call zpool export -f tank it should work. However, it would be necessary to understand why you are using the above command after creation of the zpool. Can you avoid exporting after the creation ?

Regards,
Sanjeev

Michael Goff wrote:
Hi,
When jumpstarting s10x_u4_fcs onto a machine, I have a postinstall script which does:

zpool create tank c1d0s7 c2d0s7 c3d0s7 c4d0s7
zfs create tank/data
zfs set mountpoint=/data tank/data
zpool export -f tank

When jumpstart finishes and the node reboots, the pool is not imported automatically. I have to do:

zpool import tank

for it to show up. Then on subsequent reboots it imports and mounts automatically. What can I do to get it to mount automatically the first time? When I didn't have the zpool export I would get a message that I needed to use zpool import -f because it wasn't exported properly from another machine. So it looks like the state of the pool created during the jumpstart install was lost.

BTW, I love using zfs commands to manage filesystems. They are so easy and intuitive!

thanks,
Mike
Re: [zfs-discuss] zfs won't import a pool automatically at boot
Thanks Robert ! I missed that part.

-- Sanjeev.

Michael Goff wrote:
Great, thanks Robert. That's what I was looking for. I was thinking that I would have to transfer the state somehow from the temporary jumpstart environment to /a so that it would be persistent. I'll test it out tomorrow.

Sanjeev, when I did not have the zpool export, it still did not import automatically upon reboot after the jumpstart. And when I imported it manually, it gave an error. So that's why I added the export.

Mike

Robert Milkowski wrote:
Hello Sanjeev,
Tuesday, October 16, 2007, 10:14:01 AM, you wrote:
SB> Michael,
SB> If you don't call zpool export -f tank it should work.
SB> However, it would be necessary to understand why you are using the above
SB> command after creation of the zpool.
SB> Can you avoid exporting after the creation ?

It won't help during jumpstart as /etc is not the same one as after he will boot. Before you export a pool put in your finish script:

cp -p /etc/zfs/zpool.cache /a/etc/zfs/

Then export a pool. It should do the trick.
Re: [zfs-discuss] ZFS Directory
Kanishk,

Directories are implemented as ZAP objects. Look at the routines in this order :
- zfs_lookup()
- zfs_dirlook()
- zfs_dirent_lock()
- zap_lookup()

Hope that helps.

Regards,
Sanjeev.

kanishk wrote:
I wanted to know how ZFS finds an entry for a file in its directory object. Any links to the code will be highly appreciated.

thankx
regards
kanishk
Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS
Nathan,

Some answers inline...

Nathan Huisman wrote:

= PROBLEM
To create a disk storage system that will act as an archive point for user data (non-recoverable data), and also act as a back end storage unit for virtual machines at a block level.

= BUDGET
Currently I have about 25-30k to start the project; more could be allocated in the next fiscal year for perhaps a backup solution.

= TIMEFRAME
I have 8 days to cut a P.O. before our fiscal year ends.

= STORAGE REQUIREMENTS
5-10tb of redundant, fairly high speed storage.

= QUESTION #1
What is the best way to mirror two zfs pools in order to achieve a sort of HA storage system? I don't want to have to physically swap my disks into another system if any of the hardware on the ZFS server dies. If I have the following configuration, what is the best way to mirror these in near real time?

BOX 1 (JBOD-ZFS)  BOX 2 (JBOD-ZFS)

I've seen the zfs send and receive commands but I'm not sure how well that would work with a close to real time mirror.

If you want close to realtime mirroring (across pools in this case), AVS would be a better option in my opinion. Refer to : http://www.opensolaris.org/os/project/avs/Demos/AVS-ZFS-Demo-V1/

= QUESTION #2
Can ZFS be exported via iscsi and then imported as a disk to a linux system and then be formatted with another file system? I wish to use ZFS as a block level file system for my virtual machines, specifically using xen. If this is possible, how stable is it? How is error checking handled if the zfs is exported via iscsi and then the block device formatted to ext3? Will zfs still be able to check for errors? If this all works, then are there ways to expand a zfs iscsi exported volume and then expand the ext3 file system on the remote host?

Yes, you can create volumes (ZVOLs) in a zpool and export them over iSCSI. The ZVOL would guarantee the data consistency at the block level. Expanding the ZVOL should be possible. However, I am not sure if/how iSCSI behaves here.
You might need to try it out.

= QUESTION #3
How does zfs handle a bad drive? What process must I go through in order to take out a bad drive and replace it with a good one?

# zpool replace poolname bad-drive new-good-drive

The other option would be to configure hot spares; they will kick in automatically when a bad drive is detected.

= QUESTION #4
What is a good way to back up this HA storage unit? Snapshots will provide an easy way to do it live, but should it be dumped into a tape library, or a third offsite zfs pool using zfs send/receive, or ?

= QUESTION #5
Does the following setup work? BOX 1 (JBOD) - iscsi export - BOX 2 ZFS. In other words, can I set up a bunch of thin storage boxes with low cpu and ram instead of using sas or fc to supply the jbod to the zfs server?

Should be feasible. Just that you would then need a robust LAN, as all the storage traffic would go over it.

Thanks and regards,
Sanjeev.
Re: [zfs-discuss] Contents of transaction group?
Atul,

Atul Vidwansa wrote:
Hi, I have a few questions about the way a transaction group is created.

1. Is it possible to group transactions related to multiple operations in the same group? For example, an rmdir foo followed by mkdir bar - can these end up in the same transaction group?

Each TXG is 5 sec long (in normal cases, unless some operation forces it closed early). So, it is quite possible that the 2 syscalls end up in the same TXG, but it is not guaranteed. If it has to be guaranteed then this logic will have to be built into the VNODE ops code, i.e. the ZPL code. However, that would be tricky, as rmdir and mkdir are 2 different syscalls and I am not sure what locking issues you would need to take care of.

2. Is it possible for an operation (say write()) to occupy multiple transaction groups?

Yes.

3. Is it possible to know the thread id(s) for every committed txg_id?

The TXG is always synced by the txg threads. Not sure why you want it.

Regards,
Sanjeev.

Regards,
-Atul
Re: [zfs-discuss] Re: simple Raid-Z question
MC,

If you originally had 4 * 500GB disks configured in RAID-Z, you cannot add a single disk and grow the capacity of the pool (with the same protection). This is not allowed.

Regards,
Sanjeev.

MC wrote:
Two conflicting answers to the same question? I guess we need someone to break the tie :)

Hello, I have been reading a lot of good things about Raid-z, but before I jump into it I have one unanswered question I can't find a clear answer for. Is it possible to enlarge the initial RAID size by adding single drives later on? If I start off with 4*500gb (should give me 1.5tb), can I add one 500gb to the raid later, and will the total size then grow 500gb and still have the same protection?
Re: [zfs-discuss] ZFS and Kstats
Atul,

libkstat(3LIB) is the library. man -s 3KSTAT kstat should give a good start.

Regards,
Sanjeev.

Atul Vidwansa wrote:
Peter, How do I get those stats programmatically? Any clues?
Regards,
_Atul

On 3/27/07, Peter Tribble [EMAIL PROTECTED] wrote:
On 3/27/07, Atul Vidwansa [EMAIL PROTECTED] wrote:
Hi, Does ZFS have support for kstats? If I want to extract information like the number of files committed to disk during an interval, the number of transactions performed, I/O bandwidth etc., how can I get that information?

From the command line, look at the fsstat utility. If you want the raw kstats then you need to look for ones of the form 'unix:0:vopstats_*' where there are two forms: with the name of the filesystem type (eg zfs or ufs) on the end, or the device id of the individual filesystem.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [zfs-discuss] ZFS and Firewire/USB enclosures
Mike,

We have used 4 disks (2 x 80GB disks and 2 x 250GB disks) on USB and things worked well. Hot plugging the disks was not all that smooth for us. Other than that we had no issues using the disks. We used this setup for demos at the FOSS 2007 conference at Bangalore; it went through several destructive tests over a period of 3 days and the setup survived well. (It never let us down in front of the customers :-)

The disks we used had individual enclosures, which was a bit clunky. It would be nice to have a single enclosure for all the disks (which can power the disks).

Thanks and regards,
Sanjeev.

Bev Crair wrote:
Mike,
Take a look at http://video.google.com/videoplay?docid=8100808442979626078q=CSI%3Amunich
Granted, this was for demo purposes, but the team in Munich is clearly leveraging USB sticks for their purposes.
HTH,
Bev.

mike wrote:
I still haven't got any warm and fuzzy responses yet solidifying ZFS in combination with Firewire or USB enclosures. I am looking for 4-10 drive enclosures for quiet SOHO desktop-ish use. I am trying to confirm that OpenSolaris+ZFS would be stable with this, if exported out as JBOD and ZFS allowed to manage each disk individually.

Enclosure idea (choose one): http://fwdepot.com/thestore/default.php/cPath/1_88

Would be looking to use 750GB SATA2 drives, or IDE is fine too. Would anyone be willing to speak up and give me some faith in this before I invest money into a solution that won't work? I don't intend on hot-plugging any of these devices, just using Firewire (or USB, if I can find a big enclosure) since it is a cheap and reliable interconnect (eSATA seems to be a little too new for use with OpenSolaris unless I have some PCI-X slots).

Any help is appreciated. I'd most likely use a Shuttle XPC as the head unit for all of this - it is quiet and small. (I'm looking to downsize my beefy huge noisy heavy tower with limited space availability) - obviously bandwidth on the bus would be limited the more drives sharing the same cable.
That would be my only design constraint. Thanks a ton. Again, any input (good, bad, ugly, personal experiences or opinions) is appreciated A LOT!
- mike
Re: [zfs-discuss] DMU interfaces
Manoj,

Welcome back on the alias :-)

I don't think the interfaces are documented. However, referring to the ZPL should be a good place to start. The ZPL code interacts with the DMU and obviously it is using the DMU interfaces. However, I am not sure whether there is any guarantee that they will not change.

Thanks and regards,
Sanjeev.

Manoj Joseph wrote:
Hi,
I believe ZFS, at least in the design ;), provides APIs other than POSIX (for databases and other applications) to directly talk to the DMU. Are such interfaces ready/documented? If this is documented somewhere, could you point me to it?

Regards,
Manoj
Re: [zfs-discuss] suggestion: directory promotion to filesystem
Adrian,

Seems like a cool idea to me :-) Not sure if there is anything of this kind being thought about... Would be a good idea to file an RFE.

Regards,
Sanjeev

Adrian Saul wrote:
Not sure how technically feasible it is, but something I thought of while shuffling some files around my home server. My poor understanding of ZFS internals is that the entire pool is effectively a tree structure, with nodes either being data or metadata. Given that, couldn't ZFS just change a directory node to a filesystem with little effort, allowing me to do everything ZFS does with filesystems on a subset of my filesystem :)

Say you have some filesystems you created early on before you had a good idea of usage. Say for example I made a large share filesystem and started filling it up with photos and movies and some assorted downloads. A few months later I realise it would be so much nicer to be able to snapshot my movies and photos separately for backups, instead of doing the whole share. Not hard to work around - a zfs create and a mv/tar command and it is done... some time later.

If there was say a zfs graft directory newfs command, you could just break off the directory as a new filesystem and away you go - no copying, no risking cleaning up the wrong files etc.

Corollary - zfs merge - take a filesystem and merge it into an existing filesystem.

Just a thought - any comments welcome.
Re: [zfs-discuss] Re: Limit ZFS Memory Utilization
Richard, Richard L. Hamilton wrote: If I understand correctly, at least some systems claim not to guarantee consistency between changes to a file via write(2) and changes via mmap(2). But historically, at least in the case of regular files on local UFS, since Solaris used the page cache for both cases, the results should have been consistent. Since zfs uses somewhat different mechanisms, does it still have the same consistency between write(2) and mmap(2) that was historically present (whether or not guaranteed) when using UFS on Solaris? Yes, it does have the consistency. There is specific code to keep the page cache (needed in case of mmaped files) and the ARC caches consistent. Thanks and regards, Sanjeev. -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel:x27521 +91 80 669 27521 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
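The coherence being discussed can be observed from user space. Below is a minimal, OS-agnostic sketch (plain Python, nothing ZFS-specific; the file and sizes are arbitrary): it creates a shared read-only mapping, then updates the file through write(2) and checks whether the existing mapping sees the change. On Solaris/ZFS, per the reply above, and on systems with a unified page cache, this should report True; POSIX itself does not promise it without an intervening msync().

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"A" * 4096)                         # one page of initial data
    m = mmap.mmap(fd, 4096, access=mmap.ACCESS_READ)  # existing shared mapping
    os.lseek(fd, 0, os.SEEK_SET)
    os.write(fd, b"B" * 16)                           # update via write(2)
    coherent = (m[:16] == b"B" * 16)                  # does the mapping see it?
    print("mmap view sees write(2) update:", coherent)
    m.close()
finally:
    os.close(fd)
    os.unlink(path)
```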
Re: [zfs-discuss] Meta data corruptions on ZFS.
Masthan, */dudekula mastan [EMAIL PROTECTED]/* wrote: Hi All, In my test set up, I have one zpool of size 1000M bytes. Is this the size given by zfs list ? Or is it the amount of disk space that you had ? The reason I ask this is because ZFS/zpool takes up some amount of space for its housekeeping. So, if you add 1G worth of disk space to the pool, the effective space available is a little less (a few MBs) than 1G. On this zpool, my application writes 100 files each of size 10 MB. First 96 files were written successfully without any problem. Here you are filling the FS to the brim. This is a border case, and the copy-on-write nature of ZFS could lead to the behaviour that you are seeing. But the 97th file was not written successfully; only 5 MB were written (the return value of the write() call). Since it was a short write, my application tried to truncate it to 5 MB. But ftruncate is failing with an error message saying that there is no space on the device. This is expected because of the copy-on-write nature of ZFS. During the truncate it is trying to allocate new disk blocks, probably to write the new metadata, and fails to find them. Have you people ever seen this kind of error message ? Yes, there are others who have seen these errors. After the ftruncate failure I checked the size of the 97th file, and it is strange. The size is 7 MB but the expected size is only 5 MB. Is there any particular reason that you are pushing the filesystem to the brim ? Is this part of some test ? Please help us understand what you are trying to test. Thanks and regards, Sanjeev.
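The failure mode described above (a short write() followed by an ftruncate() that itself needs space on a copy-on-write filesystem) is worth handling explicitly in application code. The sketch below is illustrative, not the original application's code: a helper that retries short writes and stops cleanly on ENOSPC, so the caller knows exactly how many bytes landed instead of assuming a later truncate will succeed.

```python
import errno
import os
import tempfile

def write_fully(fd, data):
    """Keep writing until all bytes land; stop early on ENOSPC.

    Returns the number of bytes actually written, which is what the
    caller should treat as the file's good length."""
    done = 0
    while done < len(data):
        try:
            n = os.write(fd, data[done:])
        except OSError as e:
            if e.errno == errno.ENOSPC:
                break            # caller decides: truncate, delete, retry
            raise
        done += n
    return done

fd, path = tempfile.mkstemp()
try:
    written = write_fully(fd, b"x" * (1 << 20))  # 1 MB, fits on a normal fs
    print("wrote", written, "bytes")
finally:
    os.close(fd)
    os.unlink(path)
```

On a nearly full pool the caller would then compare the return value against the intended length and treat anything short as a failed write, rather than trusting the file size reported afterwards.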
Re: [zfs-discuss] Limit ZFS Memory Utilization
Robert, Comments inline... Robert Milkowski wrote: Hello Jason, Wednesday, January 10, 2007, 9:45:05 PM, you wrote: JJWW Sanjeev Robert, JJWW Thanks guys. We put that in place last night and it seems to be doing JJWW a lot better job of consuming less RAM. We set it to 4GB and each of JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully slush JJWW of 4GB on the Thumper is enough. I would be interested in what the JJWW other ZFS modules memory behaviors are. I'll take a perusal through JJWW the archives. In general it seems to me that a max cap for ZFS whether JJWW set through a series of individual tunables or a single root tunable JJWW would be very helpful. Yes it would. Better yet would be if the memory consumed by ZFS for caching (dnodes, vnodes, data, ...) behaved similarly to the page cache, like with UFS, so applications would be able to get back almost all memory used for ZFS caches if needed. I guess (and it's really a guess only based on some emails here) that in the worst-case scenario ZFS caches would consume about: arc_max + 3*arc_max + memory lost for fragmentation This is not true from what I know :-) How did you get to this number ? From my knowledge it uses : c_max + (some memory for other caches) NOTE : (some memory for other caches) is not as large as c_max. It is probably just x% of it and not multiples of c_max. So I guess with arc_max set to 1GB you can lose even 5GB (or more) and currently only that first 1GB can be reclaimed automatically. This doesn't seem right based on my knowledge of ZFS. Regards, Sanjeev.
Re: [zfs-discuss] Limit ZFS Memory Utilization
Jason, Jason J. W. Williams wrote: Hi Robert, We've got the default ncsize. I didn't see any advantage to increasing it outside of NFS serving...which this server is not. For speed the X4500 is proving to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous. If MySQL cluster had full disk-based support, scale-out with X4500s a-la Greenplum would be a terrific solution. At this point, the ZFS memory gobbling is the main roadblock to being a good database platform. Regarding the paging activity, we too saw tremendous paging of up to 24% of the X4500s CPU being used for that with the default arc_max. After changing it to 4GB, we haven't seen anything much over 5-10%. Remember that ZFS does not use the standard Solaris paging architecture for caching. Instead it uses the ARC for all its caching. And that is the reason tuning the ARC should help in your case. The zio_bufs that you referred to in the previous mail are the caches used by the ARC for caching various things (including the metadata and the data). Thanks and regards, Sanjeev. Best Regards, Jason On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Thursday, January 11, 2007, 12:36:46 AM, you wrote: JJWW Hi Robert, JJWW Thank you! Holy mackerel! That's a lot of memory. With that type of a JJWW calculation my 4GB arc_max setting is still in the danger zone on a JJWW Thumper. I wonder if any of the ZFS developers could shed some light JJWW on the calculation? JJWW That kind of memory loss makes ZFS almost unusable for a database system. If you leave ncsize with the default value then I believe it won't consume that much memory. JJWW I agree that a page cache similar to UFS would be much better. Linux JJWW works similarly to free pages, and it has been effective enough in the JJWW past.
Though I'm equally unhappy about Linux's tendency to grab every JJWW bit of free RAM available for filesystem caching, and then cause JJWW massive memory thrashing as it frees it for applications. A page cache won't be better - just better memory control for ZFS caches is strongly desired. Unfortunately from time to time ZFS makes servers page enormously :( -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Limit ZFS Memory Utilization
Jason, Apologies... I missed this mail yesterday. I am not too familiar with the options; someone else will have to answer this. Thanks and regards, Sanjeev. Jason J. W. Williams wrote: Sanjeev, Could you point me in the right direction as to how to convert the following GCC compile flags to Studio 11 compile flags? Any help is greatly appreciated. We're trying to recompile MySQL to give a stacktrace and core file to track down exactly why it's crashing... hopefully it will illuminate whether memory truly is the issue. Thank you very much in advance! -felide-constructors -fno-exceptions -fno-rtti Best Regards, Jason On 1/7/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote: Jason, There is no documented way of limiting the memory consumption. The ARC section of ZFS tries to adapt to the memory pressure of the system. However, in your case it is probably not quick enough, I guess. One way of limiting the memory consumption would be to limit arc.c_max. This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than the memory available). This is done when the ZFS module is loaded (arc_init()). You should be able to change the value of arc.c_max through mdb and set it to the value you want. Exercise caution while setting it. Make sure you don't have active zpools during this operation. Thanks and regards, Sanjeev. Jason J. W. Williams wrote: Hello, Is there a way to set a max memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and it's crashing MySQL as a result, we think. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated. Best Regards, Jason -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521
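For the GCC flags asked about above, the commonly quoted mapping onto Sun Studio's CC is via the -features option. Treat this as an unverified config fragment (check the Studio 11 CC man page before relying on it; the file name is just an example): -felide-constructors has no direct Studio counterpart because Studio's CC elides those copies by default.

```shell
# GCC flag                Sun Studio 11 CC (believed equivalent)
# -fno-exceptions     ->  -features=no%except
# -fno-rtti           ->  -features=no%rtti
# -felide-constructors -> (no flag needed; CC elides these copies by default)
CC -features=no%except -features=no%rtti -g -c example.cc
```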
Re: [zfs-discuss] ZFS Hot Spare Behavior
Rob, It (the hot-spare) should have kicked in. How long did you wait for it ? Was there any IO happening on the pool ? Try doing some IO to the disk and see if it kicks in. Also, another point to note is the size of the hot-spares. Please ensure that the hot-spares are of the same size as the mirrors. I think the hot-spares don't kick in if there is a size mismatch. If none of the above works then we will have to take a closer look at the details :-) Regards, Sanjeev. Rob wrote: I physically removed a disk (c3t8d0 used by ZFS 'pool01') from a 3310 JBOD connected to a V210 running s10u3 (11/06) and 'zpool status' reported this:

# zpool status
  pool: pool01
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Mon Jan 8 15:56:20 2007
config:

        NAME         STATE     READ WRITE CKSUM
        pool01       DEGRADED     0     0     0
          mirror     DEGRADED     0     0     0
            c2t4d0   ONLINE       0     0     0
            c3t8d0   UNAVAIL      0     0     0  cannot open
          mirror     ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
        spares
          c2t8d0     AVAIL
          c3t10d0    AVAIL

Why doesn't ZFS automatically use one of the hot spares? Is this expected behavior or a bug? Rob
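The size rule suggested above can be stated as a tiny predicate: a spare is only usable for a failed device if it is at least as large. This is an illustrative sketch only (the device names and sizes below are made up, and whether ZFS applies exactly this rule should be confirmed against the spa code):

```python
def usable_spares(failed_dev_size, spares):
    """Return the spares large enough to replace the failed device."""
    return [name for name, size in spares.items() if size >= failed_dev_size]

GB = 1 << 30
spares = {"c2t8d0": 73 * GB, "c3t10d0": 36 * GB}  # hypothetical sizes
print(usable_spares(73 * GB, spares))             # only the 73 GB spare qualifies
```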
Re: [zfs-discuss] Limit ZFS Memory Utilization
Jason, There is no documented way of limiting the memory consumption. The ARC section of ZFS tries to adapt to the memory pressure of the system. However, in your case it is probably not quick enough, I guess. One way of limiting the memory consumption would be to limit arc.c_max. This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than the memory available). This is done when the ZFS module is loaded (arc_init()). You should be able to change the value of arc.c_max through mdb and set it to the value you want. Exercise caution while setting it. Make sure you don't have active zpools during this operation. Thanks and regards, Sanjeev. Jason J. W. Williams wrote: Hello, Is there a way to set a max memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and it's crashing MySQL as a result, we think. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated. Best Regards, Jason
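To make the default above concrete, here is a back-of-the-envelope sketch of "3/4 of the memory available, or 1GB less than the memory available", taking the larger of the two. The exact formula is an assumption to verify against arc_init() in arc.c; this is only arithmetic, not the kernel code:

```python
GB = 1 << 30

def default_arc_c_max(physmem_bytes):
    # larger of: 3/4 of physical memory, or physical memory minus 1 GB
    return max(physmem_bytes * 3 // 4, physmem_bytes - GB)

for mem in (2, 4, 16):
    cap = default_arc_c_max(mem * GB)
    print(f"{mem} GB RAM -> ARC c_max ~ {cap / GB:.2f} GB")
```

So on a small box the 3/4 term dominates, while on a 16 GB Thumper-class machine the default cap is nearly all of RAM, which is why lowering c_max by hand helps.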
Re: [zfs-discuss] Re: zfs hot spare not automatically getting used
Jim, That is good news !! Let us know how it goes. Regards, Sanjeev. PS : I am out of office for a couple of days. Jim Hranicky wrote: OK, spun down the drives again. Here's that output: http://www.cise.ufl.edu/~jfh/zfs/threads I just realized that I changed the configuration, so that doesn't reflect a system with spares, sorry. However, I reinitialized the pool and spun down one of the drives and everything is working as it should:

  pool: zmir
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Wed Nov 29 16:29:53 2006
config:

        NAME          STATE     READ WRITE CKSUM
        zmir          DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            c0t0d0    ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c3t1d0  UNAVAIL     10 28.88     0  cannot open
              c3t3d0  ONLINE       0     0     0
        spares
          c3t3d0      INUSE     currently in use
          c3t4d0      AVAIL

errors: No known data errors

I'm just not sure if it will always work. I'll try a few different configs and see what happens.
Re: [zfs-discuss] zfs hot spare not automatically getting used
Jim, James F. Hranicky wrote: Sanjeev Bagewadi wrote: Jim, We did hit a similar issue yesterday on build 50 and build 45, although the node did not hang. In one of the cases we saw that the hot spare was not of the same size... can you check if this is true ? It looks like they're all slightly different sizes. Interestingly, during our demo runs at the recent FOSS event (http://foss.in) we had no issues with this (snv build 45). We had a RAIDZ config of 3 disks and 1 spare disk. And what we found was that the spare kicked in. Here is how we tried it : - Plugged out one of the 3 disks - Kicked off a write to the FS on the pool (ie. dd to a new file in the FS). - The spare kicked in after a while. I guess there is some delay in the detection. I am not sure if there is some threshold beyond which it kicks in. Need to check the code for this. Do you have a threadlist from the node when it was hung ? That would reveal some info. Unfortunately I don't. Do you mean the output of ::threadlist -v from mdb -k ? Yes. That would be useful. Also, check the zpool status output. Run the following : # echo ::threadlist -v | mdb -k > /var/tmp/threadlist.out Regards, Sanjeev.
Re: [zfs-discuss] zfs hot spare not automatically getting used
Jim, We did hit a similar issue yesterday on build 50 and build 45, although the node did not hang. In one of the cases we saw that the hot spare was not of the same size... can you check if this is true ? Do you have a threadlist from the node when it was hung ? That would reveal some info. Thanks and regards, Sanjeev. Jim Hranicky wrote: OS: Nevada build 51 x86 I recently upgraded Sol10x86 6/6 to Nevada build 51. I'm testing out zfs on a machine and set up a pool with a mirror of two drives and two hot spares. I then spun down a drive in the mirror which caused the machine to hang, so I rebooted the host. After a reboot, the mirror came up in degraded mode but neither of the spares were automatically used. Is there something I need to tweak to get this to work?
Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48
Tomas, Apologies for the delayed response... Tomas Ögren wrote: Interesting ! So, it is not the ARC which is consuming too much memory. It is some other piece (not sure if it belongs to ZFS) which is causing the crunch... Or the other possibility is that the ARC ate up too much and caused a near-crunch situation, and the kmem hit back and caused the ARC to free up its buffers (hence the no_grow flag enabled). So, it (the ARC) could be oscillating between large caching and then purging the caches. You might want to keep track of these values (ARC size and no_grow flag) and see how they change over a period of time. This would help us understand the pattern. I would guess it grows after boot until it hits some max and then stays there.. but I can check it out.. No, that is not true. It shrinks when there is memory pressure. The values of 'c' and 'p' are adjusted accordingly. And if we know it is the ARC which is causing the crunch we could manually change the value of c_max to a comfortable value and that would limit the size of the ARC. But in the ZFS world, DNLC is part of the ARC, right? Not really... ZFS uses the regular DNLC for lookup optimization. However, the metadata/data is cached in the ARC. My original question was how to get rid of data cache, but keep metadata cache (such as DNLC)... This is a good question. AFAIK the ARC does not really differentiate between metadata and data. So, I am not sure if we can control it. However, as I mentioned above, ZFS still uses the DNLC caching. However, I would suggest that you try it out on a non-production machine first. By default, c_max is set to 75% of physmem and that is the hard limit. c is the soft limit and the ARC will try and grow up to 'c'. The value of c is adjusted when there is a need to cache more, but it will never exceed c_max. Regarding the huge number of reads, I am sure you have already tried disabling the VDEV prefetch. If not, it is worth a try. That was part of my original question, how?
:) Apologies :-) I was digging around the code and I found that zfs_vdev_cache_bshift is the one which controls the amount that is read. Currently it is set to 16. So, we should be able to modify this and reduce the prefetch. However, I will have to double-check with more people and get back to you. Thanks and regards, Sanjeev. /Tomas
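Since zfs_vdev_cache_bshift is a shift count, the value 16 quoted above means small reads get inflated to 2^16 = 64 KB by the vdev cache. A quick sketch of the arithmetic (13 below is just a hypothetical smaller setting for illustration, not a recommendation):

```python
def vdev_inflated_read(bshift):
    """Bytes actually read per small I/O with the given shift count."""
    return 1 << bshift

print(vdev_inflated_read(16))  # 65536 bytes (64 KB), the quoted default
print(vdev_inflated_read(13))  # 8192 bytes (8 KB), a hypothetical reduction
```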
Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48
Tomas, comments inline... Tomas Ögren wrote: On 10 November, 2006 - Sanjeev Bagewadi sent me these 3,5K bytes: 1. DNLC-through-ZFS doesn't seem to listen to ncsize. The filesystem currently has ~550k inodes and large portions of it are frequently looked over with rsync (over nfs). mdb said ncsize was about 68k and vmstat -s said we had a hitrate of ~30%, so I set ncsize to 600k and rebooted.. Didn't seem to change much, still seeing hitrates at about the same and manual find(1) doesn't seem to be that cached (according to vmstat and dnlcsnoop.d). When booting, the following message came up, not sure if it matters or not: NOTICE: setting nrnode to max value of 351642 NOTICE: setting nrnode to max value of 235577 Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is that it has its own implementation which is integrated with the rest of the ZFS cache which throws out metadata cache in favour of data cache.. or something.. Current memory usage (for some values of usage ;):

# echo ::memstat|mdb -k
Page Summary                Pages        MB   %Tot
Kernel                      95584       746    75%
Anon                        20868       163    16%
Exec and libs                1703        13     1%
Page cache                   1007         7     1%
Free (cachelist)               97         0     0%
Free (freelist)              7745        60     6%
Total                      127004       992
Physical                   125192       978

/Tomas This memory usage shows nearly all of memory consumed by the kernel and probably by ZFS. ZFS can't add any more DNLC entries due to lack of memory without purging others. This can be seen from the number of dnlc_nentries being way less than ncsize. I don't know if there's a DMU or ARC bug to reduce the memory footprint of their internal structures for situations like this, but we are aware of the issue. Can you please check the zio buffers and the arc status ? Here is how you can do it : - Start mdb : ie. # mdb -k - Run : ::kmem_cache - In the output generated above check the amount consumed by the zio_buf_*, arc_buf_t and arc_buf_hdr_t.
ADDR         NAME             FLAG  CFLAG  BUFSIZE  BUFTOTL
030002640a08 zio_buf_512            02         512   102675
030002640c88 zio_buf_1024     0200  02        1024       48
030002640f08 zio_buf_1536     0200  02        1536       70
030002641188 zio_buf_2048     0200  02        2048       16
030002641408 zio_buf_2560     0200  02        2560        9
030002641688 zio_buf_3072     0200  02        3072       16
030002641908 zio_buf_3584     0200  02        3584       18
030002641b88 zio_buf_4096     0200  02        4096       12
030002668008 zio_buf_5120     0200  02        5120       32
030002668288 zio_buf_6144     0200  02        6144        8
030002668508 zio_buf_7168     0200  02        7168     1032
030002668788 zio_buf_8192     0200  02        8192        8
030002668a08 zio_buf_10240    0200  02       10240        8
030002668c88 zio_buf_12288    0200  02       12288        4
030002668f08 zio_buf_14336    0200  02       14336      468
030002669188 zio_buf_16384    0200  02       16384     3326
030002669408 zio_buf_20480    0200  02       20480       16
030002669688 zio_buf_24576    0200  02       24576        3
030002669908 zio_buf_28672    0200  02       28672       12
030002669b88 zio_buf_32768    0200  02       32768     1935
03000266c008 zio_buf_40960    0200  02       40960       13
03000266c288 zio_buf_49152    0200  02       49152        9
03000266c508 zio_buf_57344    0200  02       57344        7
03000266c788 zio_buf_65536    0200  02       65536     3272
03000266ca08 zio_buf_73728    0200  02       73728       10
03000266cc88 zio_buf_81920    0200  02       81920        7
03000266cf08 zio_buf_90112    0200  02       90112        5
03000266d188 zio_buf_98304    0200  02       98304        7
03000266d408 zio_buf_106496   0200  02      106496       12
03000266d688 zio_buf_114688   0200  02      114688        6
03000266d908 zio_buf_122880   0200  02      122880        5
03000266db88 zio_buf_131072   0200  02      131072       92
030002670508 arc_buf_hdr_t    00             128     11970
030002670788 arc_buf_t        00              40      7308

- Dump the values of arc : > arc::print struct arc
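A quick way to interpret the ::kmem_cache figures above: multiplying each cache's BUFSIZE by its BUFTOTL estimates the memory held in that cache. A sketch using a handful of the larger zio_buf_* rows quoted in this thread:

```python
# (size, count) pairs taken from the ::kmem_cache output above
caches = {
    "zio_buf_512":   (512, 102675),
    "zio_buf_14336": (14336, 468),
    "zio_buf_16384": (16384, 3326),
    "zio_buf_32768": (32768, 1935),
    "zio_buf_65536": (65536, 3272),
}
total = sum(size * count for size, count in caches.values())
print(round(total / 2**20), "MB held in just these zio_buf caches")
```

That already accounts for several hundred MB on a box with under 1 GB of physical memory, which is consistent with the memory crunch being discussed.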
Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48
Comments inline... Neil Perrin wrote: 1. DNLC-through-ZFS doesn't seem to listen to ncsize. The filesystem currently has ~550k inodes and large portions of it are frequently looked over with rsync (over nfs). mdb said ncsize was about 68k and vmstat -s said we had a hitrate of ~30%, so I set ncsize to 600k and rebooted.. Didn't seem to change much, still seeing hitrates at about the same and manual find(1) doesn't seem to be that cached (according to vmstat and dnlcsnoop.d). When booting, the following message came up, not sure if it matters or not: NOTICE: setting nrnode to max value of 351642 NOTICE: setting nrnode to max value of 235577 Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is that it has its own implementation which is integrated with the rest of the ZFS cache which throws out metadata cache in favour of data cache.. or something.. Current memory usage (for some values of usage ;):

# echo ::memstat|mdb -k
Page Summary                Pages        MB   %Tot
Kernel                      95584       746    75%
Anon                        20868       163    16%
Exec and libs                1703        13     1%
Page cache                   1007         7     1%
Free (cachelist)               97         0     0%
Free (freelist)              7745        60     6%
Total                      127004       992
Physical                   125192       978

/Tomas This memory usage shows nearly all of memory consumed by the kernel and probably by ZFS. ZFS can't add any more DNLC entries due to lack of memory without purging others. This can be seen from the number of dnlc_nentries being way less than ncsize. I don't know if there's a DMU or ARC bug to reduce the memory footprint of their internal structures for situations like this, but we are aware of the issue. Can you please check the zio buffers and the arc status ? Here is how you can do it : - Start mdb : ie. # mdb -k - Run : ::kmem_cache - In the output generated above check the amount consumed by the zio_buf_*, arc_buf_t and arc_buf_hdr_t. - Dump the values of arc : > arc::print struct arc - This should give you something like the below.
-- snip --
> arc::print struct arc
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0x3e2        <- the current memory consumed by the ARC buffers (including the memory consumed for the cached data, ie. the zio_buf_* caches)
    p = 0x1d06a06
    c = 0x400
    c_min = 0x400
    c_max = 0x2f9aa800
    hits = 0x2fd2
    misses = 0xd1c
    deleted = 0x296
    skipped = 0
    hash_elements = 0xa85
    hash_elements_max = 0xcc0
    hash_collisions = 0x173
    hash_chains = 0xbe
    hash_chain_max = 0x2
    no_grow = 0         <- this would be set to 1 if we have a memory crunch
}
-- snip --

And as Neil pointed out, we would probably need some way of limiting the ARC consumption. Regards, Sanjeev. Neil.
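The invariant worth checking in a dump like the one above is size <= c <= c_max: size is what the ARC currently holds, c is the adaptive soft target, and c_max is the hard ceiling (as described earlier in the thread). Plugging in the hex values quoted above, which appear to come from a lightly loaded machine, hence the tiny size:

```python
# values copied from the arc::print output above
size, c, c_min, c_max = 0x3e2, 0x400, 0x400, 0x2f9aa800

print("size <= c <= c_max:", size <= c <= c_max)
print("c_max ~", round(c_max / 2**20), "MB")
```

If no_grow flips to 1 while size is pinned near c, the ARC is being squeezed by memory pressure rather than growing toward c_max.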