Re: [zfs-discuss] [storage-discuss] dos programs on a
Alan, I'm using Nexenta Core RC4, which is based on Nevada 81/82. ZFS casesensitivity is set to 'insensitive'.

Best regards,
Maurilio.
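(Hedged aside for anyone reproducing this setup: casesensitivity is a create-time-only property, so a case-insensitive dataset is typically made along these lines; the pool and dataset names here are just placeholders.)

  # create a dataset with case-insensitive name lookups
  zfs create -o casesensitivity=insensitive tank/dosfiles
  # confirm the property took effect
  zfs get casesensitivity tank/dosfiles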
[zfs-discuss] available space?
Maybe a basic zfs question ... I have a pool:

  # zpool status backup
    pool: backup
   state: ONLINE
   scrub: none requested
  config:

          NAME          STATE     READ WRITE CKSUM
          backup        ONLINE       0     0     0
            mirror      ONLINE       0     0     0
              c1t0d0s1  ONLINE       0     0     0
              c2t0d0s1  ONLINE       0     0     0
            raidz2      ONLINE       0     0     0
              c1t1d0    ONLINE       0     0     0
              c1t2d0    ONLINE       0     0     0
              c1t3d0    ONLINE       0     0     0
              c1t4d0    ONLINE       0     0     0
              c1t5d0    ONLINE       0     0     0
              c1t6d0    ONLINE       0     0     0
              c1t7d0    ONLINE       0     0     0
              c2t1d0    ONLINE       0     0     0
              c2t2d0    ONLINE       0     0     0
              c2t3d0    ONLINE       0     0     0
              c2t4d0    ONLINE       0     0     0
              c2t5d0    ONLINE       0     0     0
              c2t6d0    ONLINE       0     0     0
              c2t7d0    ONLINE       0     0     0

For which zpool list reports:

  # zpool list backup
  NAME     SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
  backup  13.5T   434K   13.5T    0%   ONLINE   -

Yet df and zfs list show something else:

  [EMAIL PROTECTED]:~# zfs list
  NAME            USED   AVAIL   REFER   MOUNTPOINT
  backup          262K   11.5T   1.78K   none
  backup/files   32.0K   11.5T   32.0K   /export/files
  ...
  # df -h
  Filesystem      size   used   avail   capacity   Mounted on
  ...
  backup/files     11T    32K     11T         1%   /export/files

Why does AVAIL differ by such a large amount? (NexentaOS_20080131)

--
Jure Pečar
http://jure.pecar.org
Re: [zfs-discuss] status of zfs boot netinstall kit
Hi, I would like to continue this (maybe a bit outdated) thread with two questions:

1. How do I create a netinstall image?
2. How do I write the netinstall image back as an ISO9660 DVD (after patching it for zfs boot)?

Roman
[zfs-discuss] zpool status -x strangeness on b78
We run a cron job that does a 'zpool status -x' to check for any degraded pools. We just happened to find a pool degraded this morning by running 'zpool status' by hand, and were surprised that it was degraded since we didn't get a notice from the cron job.

  # uname -srvp
  SunOS 5.11 snv_78 i386

  # zpool status -x
  all pools are healthy

  # zpool status pool1
    pool: pool1
   state: DEGRADED
   scrub: none requested
  config:

          NAME         STATE     READ WRITE CKSUM
          pool1        DEGRADED     0     0     0
            raidz1     DEGRADED     0     0     0
              c1t8d0   REMOVED      0     0     0
              c1t9d0   ONLINE       0     0     0
              c1t10d0  ONLINE       0     0     0
              c1t11d0  ONLINE       0     0     0

  errors: No known data errors

I'm going to look into why the disk is listed as removed. Does this look like a bug in 'zpool status -x'?

Ben
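(Hedged sketch of the kind of cron check being described; this is not the poster's actual script, and the mail recipient is a placeholder. It only sends mail when 'zpool status -x' reports something other than the all-healthy message.)

  #!/bin/sh
  # cron job: mail a full status report if any pool is unhealthy
  if zpool status -x | grep -v "all pools are healthy" > /dev/null; then
      zpool status | mailx -s "zpool problem on `hostname`" root
  fi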
Re: [zfs-discuss] mounting a copy of a zfs pool /file system while orginal is still active
While browsing the ZFS source code, I noticed that usr/src/cmd/ztest/ztest.c includes ztest_spa_rename(), a ZFS test which renames a ZFS storage pool to a different name, tests the pool under its new name, and then renames it back. I wonder why this functionality was not exposed as part of zpool support?

See 6280547 "want to rename pools". It just hasn't been high on the priority list.

eric
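(In the meantime, a pool can effectively be renamed by exporting it and importing it under a new name, at the cost of taking it offline briefly; 'tank' and 'newtank' below are placeholder names.)

  zpool export tank
  zpool import tank newtank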
Re: [zfs-discuss] ZFS number of file systems scalability
There is a write-up of similar findings, and more information about sharemgr, at http://developers.sun.com/solaris/articles/nfs_zfs.html. Unfortunately I don't see anything that says those changes will be in U5.

Shawn

On Feb 5, 2008, at 8:21 PM, Paul B. Henson wrote:

I was curious to see about how many filesystems one server could practically serve via NFS, and did a little empirical testing. Using an x4100M2 server running S10U4 x86, I created a pool from a slice of the hardware RAID array built from the two internal hard disks, and set sharenfs=on for the pool.

I then created filesystems, 1000 at a time, and timed how long it took to create each thousand filesystems, to set sharenfs=off for all filesystems created so far, and to set sharenfs=on again for all filesystems. I understand sharetab optimization is one of the features in the latest OpenSolaris, so just for fun I tried symlinking /etc/dfs/sharetab to an mfs file system to see if it made any difference. I also timed a complete boot cycle (from typing 'init 6' until the server was again remotely available) at 5000 and 10,000 filesystems.

Interestingly, filesystem creation itself scaled reasonably well. I recently read a thread where someone was complaining it took over eight minutes to create a filesystem at the 10,000 filesystem count. In my tests, while the first 1000 filesystems averaged only a little more than half a second each to create, filesystems 9000-10,000 only took roughly twice that, averaging about 1.2 seconds each to create. Unsharing scalability wasn't as good, time requirements increasing by a factor of six. Having sharetab in mfs made a slight difference, but nothing outstanding. Sharing (unsurprisingly) was the least scalable, increasing by a factor of eight.

Boot-wise, the system took about 10.5 minutes to reboot at 5000 filesystems. This increased to about 35 minutes at the 10,000 filesystem count. Based on these numbers, I don't think I'd want to run more than 5-7 thousand filesystems per server to avoid extended outages. Given our user count, that will probably be 6-10 servers 8-/. I suppose we could have a large number of smaller servers rather than a small number of beefier servers, although that seems less than efficient. It's too bad there's no way to fast track backporting of OpenSolaris improvements to production Solaris; from what I've heard there will be virtually no ZFS improvements in S10U5 :(.

Here are the raw numbers for anyone interested. The first column is the number of filesystems. The second column is total and average time in seconds to create that block of filesystems (e.g., the first 1000 took 589 seconds to create, the second 1000 took 709 seconds). The third column is the time in seconds to turn off NFS sharing for all filesystems created so far (e.g., 14 seconds for 1000 filesystems, 38 seconds for 2000 filesystems). The fourth is the same operation with sharetab in a memory filesystem (I stopped this measurement after 7000 because sharing was starting to take so long). The final column is how long it took to turn on NFS sharing for all filesystems created so far.

    #FS    create/avg   off/avg   off(mfs)/avg    on/avg
   1000      589/.59      14/.01       9/.01      32/.03
   2000      709/.71      38/.02      25/.01     107/.05
   3000      783/.78      70/.02      50/.02     226/.08
   4000      836/.84     112/.03      83/.02     388/.10
   5000      968/.97     178/.04     124/.02     590/.12
   6000      930/.93     245/.04     172/.03     861/.14
   7000      961/.96     319/.05     229/.03    1172/.17
   8000     1045/1.05    405/.05                1515/.19
   9000     1098/1.10    500/.06                1902/.21
  10000     1165/1.17    599/.06                2348/.23
--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768

--
Shawn Ferry    shawn.ferry at sun.com
Senior Primary Systems Engineer
Sun Managed Operations
571.291.4898
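(Hedged sketch of how the per-thousand creation timing above might be scripted; this is not Paul's actual test harness, and the pool and filesystem names are made up.)

  #!/bin/sh
  # time the creation of one block of 1000 filesystems under tank/users
  start=`perl -e 'print time'`
  i=1
  while [ $i -le 1000 ]; do
      zfs create tank/users/fs$i
      i=`expr $i + 1`
  done
  end=`perl -e 'print time'`
  echo "1000 creates took `expr $end - $start` seconds"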
[zfs-discuss] Did MDB Functionality Change?
On Solaris 10 U3 (11/06) I can execute the following:

  bash-3.00# mdb -k
  Loading modules: [ unix krtld genunix specfs dtrace ufs sd pcipsy ip sctp usba nca md zfs random ipc nfs crypto cpc fctl fcip logindmux ptm sppp ]
  > arc::print
  {
      anon = ARC_anon
      mru = ARC_mru
      mru_ghost = ARC_mru_ghost
      mfu = ARC_mfu
      mfu_ghost = ARC_mfu_ghost
      size = 0x6b800
      p = 0x3f83f80
      c = 0x7f07f00
      c_min = 0x7f07f00
      c_max = 0xbe8be800
      hits = 0x30291
      misses = 0x4f
      deleted = 0xe
      skipped = 0
      hash_elements = 0x3a
      hash_elements_max = 0x3a
      hash_collisions = 0x3
      hash_chains = 0x1
      hash_chain_max = 0x1
      no_grow = 0
  }

However, when I execute the same command on Solaris 10 U4 (8/07) I receive the following error:

  bash-3.00# mdb -k
  Loading modules: [ unix krtld genunix specfs dtrace ufs ssd fcp fctl qlc pcisch md ip hook neti sctp arp usba nca lofs logindmux ptm cpc fcip sppp random sd crypto zfs ipc nfs ]
  > arc::print
  mdb: failed to dereference symbol: unknown symbol name

In addition, U3 doesn't recognize ::arc, whereas U4 does. U3 displays memory locations with arc::print -a, whereas ::arc -a doesn't work on U4.

I posted this to the zfs discussion forum because this limited U4 functionality prevents you from dynamically changing the ZFS ARC by following the ZFS Tuning instructions.

Spencer
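(Hedged aside: both forms can also be driven non-interactively, which makes it easy to compare the two kernels side by side.)

  echo "arc::print -a" | mdb -k     # the form that works on the U3 kernel above
  echo "::arc" | mdb -k             # the dcmd form that U4 recognizes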
Re: [zfs-discuss] ZFS Performance Issue
I disabled file prefetch and there was no effect.

Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here?

Average application server client response time (1st run/2nd run):
  SVM - 12/18 seconds
  ZFS - 35/38 seconds

SVM Performance
---
  # iostat -xnz 5
                      extended device statistics
      r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    195.1  414.3  1465.9  1657.3   0.0   1.7     0.0     2.7   0  98  md/d100
     97.5  414.3   730.2  1657.3   0.0   1.0     0.0     1.9   0  74  md/d101
     97.7  414.1   735.8  1656.5   0.0   0.8     0.0     1.5   0  59  md/d102
     54.4  203.6   370.7   814.2   0.0   0.5     0.0     2.1   0  42  c0t2d0
     52.8  210.6   359.5   842.2   0.0   0.5     0.0     1.9   0  40  c0t3d0
     54.0  203.6   374.7   814.2   0.0   0.3     0.0     1.2   0  26  c0t4d0
     52.2  210.6   361.1   842.2   0.0   0.5     0.0     1.8   0  38  c0t5d0

ZFS Performance
---
  # iostat -xnz 5
                      extended device statistics
      r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
     23.2  148.8  1496.7  3806.8   0.0   2.5     0.0    14.7   0  21  c0t2d0
     22.8  148.8  1470.9  3806.8   0.0   2.4     0.0    13.9   0  22  c0t3d0
     24.2  149.0  1561.1  3805.0   0.0   1.5     0.0     8.6   0  18  c0t4d0
     23.4  149.4  1509.6  3805.0   0.0   2.5     0.0    14.7   0  25  c0t5d0

  # zpool iostat 5
             capacity     operations    bandwidth
  pool     used  avail   read  write   read  write
  ------  -----  -----  -----  -----  -----  -----
  pool1   5.69G   266G     12    243   775K  7.20M
  pool1   5.69G   266G     88    232  5.53M  7.12M
  pool1   5.69G   266G     78    216  4.87M  6.81M
Re: [zfs-discuss] ZFS configuration for a thumper
On Feb 4, 2008, at 5:10 PM, Marion Hakanson wrote:

[EMAIL PROTECTED] said: FYI, you can use the '-c' option to compare results from various runs and have one single report to look at.

That's a handy feature. I've added a couple of such comparisons:
http://acc.ohsu.edu/~hakansom/thumper_bench.html

Marion

Your finding for random reads with or without NCQ matches my findings:
http://blogs.sun.com/erickustarz/entry/ncq_performance_analysis

Disabling NCQ looks like a very tiny win for the multi-stream read case. I found a much bigger win, but I was doing RAID-0 instead of RAID-Z.

eric
Re: [zfs-discuss] available space?
Jure Pečar wrote:

Maybe a basic zfs question ... I have a pool:

  # zpool status backup
    pool: backup
   state: ONLINE
   scrub: none requested
  config:

          NAME          STATE     READ WRITE CKSUM
          backup        ONLINE       0     0     0
            mirror      ONLINE       0     0     0
              c1t0d0s1  ONLINE       0     0     0
              c2t0d0s1  ONLINE       0     0     0
            raidz2      ONLINE       0     0     0
              c1t1d0    ONLINE       0     0     0
              c1t2d0    ONLINE       0     0     0
              c1t3d0    ONLINE       0     0     0
              c1t4d0    ONLINE       0     0     0
              c1t5d0    ONLINE       0     0     0
              c1t6d0    ONLINE       0     0     0
              c1t7d0    ONLINE       0     0     0
              c2t1d0    ONLINE       0     0     0
              c2t2d0    ONLINE       0     0     0
              c2t3d0    ONLINE       0     0     0
              c2t4d0    ONLINE       0     0     0
              c2t5d0    ONLINE       0     0     0
              c2t6d0    ONLINE       0     0     0
              c2t7d0    ONLINE       0     0     0

For which zpool list reports:

  # zpool list backup
  NAME     SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
  backup  13.5T   434K   13.5T    0%   ONLINE   -

Yet df and zfs list show something else:

  [EMAIL PROTECTED]:~# zfs list
  NAME            USED   AVAIL   REFER   MOUNTPOINT
  backup          262K   11.5T   1.78K   none
  backup/files   32.0K   11.5T   32.0K   /export/files
  ...
  # df -h
  Filesystem      size   used   avail   capacity   Mounted on
  ...
  backup/files     11T    32K     11T         1%   /export/files

Why does AVAIL differ by such a large amount? (NexentaOS_20080131)

They represent two different things. See the man pages for zpool and zfs for a description of their meanings.

-- richard
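(A rough sanity check of the numbers above, assuming the 14 raidz2 members are roughly 1 TB each: zpool list reports raw vdev capacity with parity included, while zfs list reports space usable by datasets, so the two raidz2 parity drives account for most of the gap.)

  13.5T raw (zpool list) - ~2T raidz2 parity (2 of 14 drives) ~= 11.5T usable (zfs list)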
[zfs-discuss] zfs send / receive between different opensolaris versions?
Hello everybody,

I'm thinking of building out a second machine as a backup for our mail spool, where I push out regular filesystem snapshots, something like a warm/hot spare situation. Our mail spool is currently running snv_67 and the new machine would probably be running whatever the latest opensolaris version is (snv_77 or later).

My first question is whether or not zfs send / receive is portable between differing releases of opensolaris.

My second question (kind of off topic for this list) is about the difficulty involved in upgrading snv_67 to a later version of opensolaris, given that we're running a zfs root boot configuration.

--
Michael Hale  [EMAIL PROTECTED]
Manager of Engineering Support, Enterprise Engineering Group
Transcom Enhanced Services  http://www.transcomus.com
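(Hedged sketch of the kind of snapshot replication loop being described; the hostname, pool, and filesystem names are placeholders, and snapshot naming is up to you.)

  # initial full copy to the spare machine
  zfs snapshot mailpool/spool@base
  zfs send mailpool/spool@base | ssh spare zfs receive backuppool/spool

  # later, periodic incrementals relative to the previous snapshot
  zfs snapshot mailpool/spool@next
  zfs send -i @base mailpool/spool@next | ssh spare zfs receive backuppool/spool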
Re: [zfs-discuss] ZFS Performance Issue
On Feb 6, 2008 6:36 PM, William Fretts-Saxton [EMAIL PROTECTED] wrote:

Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here?

What is the disk layout of the zpool in question? Striped? Mirrored? Raidz? I would suggest either a simple stripe or striping+mirroring as the best-performing layout.
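(For illustration, a stripe of mirrors is built simply by listing more than one mirror vdev on the zpool create line; the device names below are placeholders.)

  zpool create pool1 mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0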
Re: [zfs-discuss] ZFS Performance Issue
It is a striped mirror:

  # zpool status
          NAME        STATE     READ WRITE CKSUM
          pool1       ONLINE       0     0     0
            mirror    ONLINE       0     0     0
              c0t2d0  ONLINE       0     0     0
              c0t3d0  ONLINE       0     0     0
            mirror    ONLINE       0     0     0
              c0t4d0  ONLINE       0     0     0
              c0t5d0  ONLINE       0     0     0
Re: [zfs-discuss] scrub halts
I now have improved sata and marvell88sx driver modules that deal with various error conditions in a much more solid way. Changes include reducing the number of required device resets, properly reporting media errors (rather than "no additional sense"), and clearing aborted packets more rapidly so that after a hardware error progress is again made much more quickly. Further, the driver is much quieter (far fewer messages in /var/adm/messages).

If there is still interest, I can make those binaries available for testing prior to their availability in Solaris Nevada (OpenSolaris). These changes will be checked in soon, but the process always inserts a significant delay, so if anyone would like them, please e-mail me and I will make the binaries available via e-mail.

Regards,
Lida
Re: [zfs-discuss] ZFS Performance Issue
Solaris 10u4, eh? Sounds a lot like fsync issues we ran into trying to run Cyrus mail-server spools on ZFS. This was highlighted for us by the filebench varmail test. OpenSolaris nv78, however, worked very well.
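(Hedged aside: the varmail test referred to is the stock filebench workload; a typical interactive run looks something like the following, with the target directory a placeholder.)

  filebench> load varmail
  filebench> set $dir=/pool1/mailtest
  filebench> run 60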
Re: [zfs-discuss] ZFS Performance Issue
William Fretts-Saxton william.fretts.saxton at sun.com writes:

I disabled file prefetch and there was no effect. Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here?

Possible explanation: the Glassfish applications are using synchronous writes, causing the ZIL (ZFS Intent Log) to be intensively used, which leads to a lot of extra I/O. Try to disable it:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

Since disabling it is not recommended, if you find out it is the cause of your perf problems, you should instead try to use a SLOG (separate intent log, see above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support SLOGs, they have only been added to OpenSolaris build snv_68:
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

-marc
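(On builds that do support it, attaching a separate intent log is a single command; the device name below is a placeholder. As noted above, this is not available on Solaris 10 8/07.)

  zpool add pool1 log c0t6d0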
[zfs-discuss] MySQL, Lustre and ZFS
Hi all,

Any thoughts on if and when ZFS, MySQL, and Lustre 1.8 (and beyond) will work together and be supported as such by Sun?

- Network Systems Architect, Advanced Digital Systems Internet
Re: [zfs-discuss] ZFS Performance Issue
Marc Bevand wrote:

William Fretts-Saxton william.fretts.saxton at sun.com writes: I disabled file prefetch and there was no effect. Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here?

Possible explanation: the Glassfish applications are using synchronous writes, causing the ZIL (ZFS Intent Log) to be intensively used, which leads to a lot of extra I/O.

The ZIL doesn't do a lot of extra I/O. It usually just does one write per synchronous request and will batch up multiple writes into the same log block if possible. However, it does need to wait for the writes to be on stable storage before returning to the application, which is what the application has requested. It does this by waiting for the write to complete and then flushing the disk write cache. If the write cache is battery backed for all zpool devices, then the global zfs_nocacheflush can be set to give dramatically better performance.

Try to disable it:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

Since disabling it is not recommended, if you find out it is the cause of your perf problems, you should instead try to use a SLOG (separate intent log, see above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support SLOGs, they have only been added to OpenSolaris build snv_68:
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

-marc
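(For reference, that global is normally set in /etc/system and takes effect at the next reboot; only consider it if every device backing every pool has nonvolatile or battery-backed write cache.)

  set zfs:zfs_nocacheflush = 1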
Re: [zfs-discuss] ZFS configuration for a thumper
[EMAIL PROTECTED] said:

Your finding for random reads with or without NCQ match my findings:
http://blogs.sun.com/erickustarz/entry/ncq_performance_analysis
Disabling NCQ looks like a very tiny win for the multi-stream read case. I found a much bigger win, but I was doing RAID-0 instead of RAID-Z.

I didn't set out to do the with/without NCQ comparisons. Rather, my first runs of filebench and bonnie++ triggered a number of I/O errors and controller timeout/resets on several different drives, so I disabled NCQ based on bug 6587133's workaround suggestion. No more errors during subsequent testing, so we're running with NCQ disabled until a patch comes along. It was useful, however, to see what effect disabling NCQ had.

I find filebench easier to use than bonnie++, mostly because filebench is automatically multithreaded, which is necessary to generate a heavy enough workload to exercise anything more than a few drives (esp. on machines like T2000's). The HTML output doesn't hurt, either.

Regards,
Marion
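(Hedged note: if memory serves, the 6587133 workaround amounts to an /etc/system entry along the lines below plus a reboot, but treat the tunable name as an assumption and check the bug report before relying on it.)

  set sata:sata_max_queue_depth = 0x1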
Re: [zfs-discuss] ZFS Performance Issue
[EMAIL PROTECTED] said:

Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here?

I'm not familiar with the application in use here, but your iostat numbers remind me of something I saw during small overwrite tests on ZFS. Even though the test was doing only writing, because it was writing over only a small part of existing blocks, ZFS had to read (the unchanged part of) each old block in before writing out the changed block to a new location (COW).

This is a case where you want to set the ZFS recordsize to match your application's typical write size, in order to avoid the read overhead inherent in partial-block updates. UFS by default has a smaller max blocksize than ZFS' default 128k, so in addition to the ZIL/fsync issue, UFS will also suffer less overhead from such partial-block updates.

Again, this may not be what's going on, but it's worth checking if you haven't already done so.

Regards,
Marion
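(If you do experiment with this, note that recordsize only affects blocks written after the property is changed; the dataset name below is a placeholder and the value should be matched to the application's write size.)

  zfs set recordsize=8k pool1/appdata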
Re: [zfs-discuss] ZFS Performance Issue
Neil Perrin Neil.Perrin at Sun.COM writes:

The ZIL doesn't do a lot of extra IO. It usually just does one write per synchronous request and will batch up multiple writes into the same log block if possible.

Ok, I was wrong then. Well, William, I think Marion Hakanson has the most plausible explanation. As he suggests, experiment with 'zfs set recordsize=XXX' to force the filesystem to use small records. See the zfs(1M) manpage.

-marc
[zfs-discuss] ZFS taking up to 80 seconds to flush a single 8KB O_SYNC block.
Hey all,

I'm working on an interesting issue where I'm seeing ZFS being quite cranky about writing O_SYNC blocks. Bottom line is that I have a small test case that does essentially this:

    open file for writing -- O_SYNC
    loop {
        write() 8KB of random data
        print time taken to write the data
    }

It's taking anywhere up to 80 seconds per 8KB block. When the 'problem' is not in evidence (and it's not always happening), I can do around 1200 O_SYNC writes per second.

It seems to be waiting here virtually all of the time:

  > 0t11021::pid2proc | ::print proc_t p_tlist | ::findstack -v
  stack pointer for thread 30171352960: 2a118052df1
  [ 02a118052df1 cv_wait+0x38() ]
    02a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570, 60005db6640)
    02a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840, 2000, 2a1180539d8)
    02a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10, 300a27a9e48, 0, 7b7462d0)
    02a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1)
    02a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80, ff3a00c0, ff3a0100)
    02a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0, ff3a0100)

This is also evident in a dtrace of it, following the write in:

  ...
   28      -> zil_commit
   28        -> cv_wait
   28          -> thread_lock
   28          <- thread_lock
   28          -> cv_block
   28            -> ts_sleep
   28            <- ts_sleep
   28            -> new_mstate
   28              -> cpu_update_pct
   28                -> cpu_grow
   28                  -> cpu_decay
   28                    -> exp_x
   28                    <- exp_x
   28                  <- cpu_decay
   28                <- cpu_grow
   28              <- cpu_update_pct
   28            <- new_mstate
   28            -> disp_lock_enter_high
   28            <- disp_lock_enter_high
   28            -> disp_lock_exit_high
   28            <- disp_lock_exit_high
   28          <- cv_block
   28          -> sleepq_insert
   28          <- sleepq_insert
   28          -> disp_lock_exit_nopreempt
   28          <- disp_lock_exit_nopreempt
   28          -> swtch
   28            -> disp
   28              -> disp_lock_enter
   28              <- disp_lock_enter
   28              -> disp_lock_exit
   28              <- disp_lock_exit
   28              -> disp_getwork
   28              <- disp_getwork
   28              -> restore_mstate
   28              <- restore_mstate
   28            <- disp
   28            -> pg_cmt_load
   28            <- pg_cmt_load
   28          <- swtch
   28          -> resume
   28            -> savectx
   28              -> schedctl_save
   28              <- schedctl_save
   28            <- savectx
  ...

At this point, it waits for up to 80 seconds. I'm also seeing zil_commit() being called around 7-15 times per second.

For kicks, I disabled the ZIL (zil_disable/W0t1), and that made not a pinch of difference. :)

For what it's worth, this is a T2000 running Oracle, connected to an HDS 9990 (using 2GB fibre), with 8KB record sizes for the Oracle filesystems, and I'm only seeing the issue on the ZFS filesystems that have the active Oracle tables on them. The O_SYNC test case is just trying to help me understand what's happening.

The *real* problem is that Oracle is running like rubbish when it's trying to roll forward archive logs from another server. It's an almost 100% write workload. At the moment it cannot even keep up with the other server's log creation rate, and it's barely doing anything. (The other box is quite different, so not really valid for direct comparison at this point.)

6513020 looked interesting for a while, but I already have 120011-14 and 127111-03 installed. I'm looking into the cache flush settings of the 9990 array to see if that's what's killing me, but I'm also looking for any other ideas on what might be hurting me. I also have set zfs:zfs_nocacheflush = 1 in /etc/system.

The Oracle logs are on a separate zpool and I'm not seeing the issue on those filesystems. The lockstats I have run are not yet all that interesting.

If anyone has ideas on specific incantations I should use, or some specific D, or anything else, I'd be most appreciative.

Cheers!
Nathan.
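(Hedged sketch for anyone chasing the same symptom: one way to quantify the stall is to quantize zil_commit() latency with dtrace, along these lines.)

  dtrace -n 'fbt:zfs:zil_commit:entry { self->ts = timestamp; }
             fbt:zfs:zil_commit:return /self->ts/ {
                 @lat = quantize(timestamp - self->ts); self->ts = 0; }'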