Re: [zfs-discuss] /bin/cp vs /usr/gnu/bin/cp
Hello, I came across this blog post:

http://kevinclosson.wordpress.com/2007/03/15/copying-files-on-solaris-slow-or-fast-its-your-choice/

and would like to hear from you performance gurus how this 2007 article relates to the 2010 ZFS implementation. What should I use, and why?

[ WARNING : red herring non sequitur follows ]

My PATH looks like so :

$ echo $PATH
/opt/SUNWspro/bin:/usr/xpg6/bin:/usr/xpg4/bin:/usr/ccs/bin:/usr/bin:/usr/sbin:/bin:/sbin

Thus I have no such issues with the GNU vs. OpenGroup/POSIX compliance tools.

Dennis
Re: [zfs-discuss] zfs iostat - which unit bit vs. byte
Hi--

ZFS command operations involving disk space take input and display using numeric values specified as exact values, or in a human-readable form with a suffix of B, K, M, G, T, P, E, Z for bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes, exabytes, or zettabytes.

Let's play a game here. :-)

Suppose you wanted a 1 PB zpool and you wanted dedup. How much memory would you need for that, and would you separate out ZIL cache etc.? I'm guessing ( total WAG ) that one would want at least 16 TB of memory. I know of no system out there that can take that many 8G ECC DIMMs ( 2048 of them ). But really .. what sort of theoretical machine would be needed to handle a single 1024 TB zpool ?

Dennis
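ps: as a rough back-of-the-envelope, assuming 128K average blocks and something on the order of 320 bytes of core per DDT entry ( both of those numbers are my guesses, not gospel ) :

$ echo ' 2^50 / ( 128 * 1024 ) * 320 / 2^30 ' | bc
2560

So call it roughly 2.5 TB of dedup table for 1 PB of unique 128K blocks, and several times that if the average block is small. Still not a machine I know how to buy, but a long way short of 16 TB.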
Re: [zfs-discuss] Dedup... still in beta status
I think, with current bits, it's not a simple matter of OK for enterprise, not OK for desktops. With an SSD for either main storage or L2ARC, and/or enough memory, and/or a not very demanding workload, it seems to be OK.

The main problem is not performance (for a home server that is not a problem)... but what really is a BIG PROBLEM is when you try to delete a snapshot that is a little big... (try it yourself... create a big random file with 90 GB of data... then snapshot... then delete the file and delete the snapshot... you will see)... and better... try removing the SSD disk.

Just out of curiosity... my test system (8 GB RAM)... takes over 30 hours to delete a dataset of 1.7 TB (still not finished...)... and the system does not respond (it is working but does not respond... not even to a simple ls command)

--

Hold on a sec. I have been lurking in this thread for a while for various reasons and only now does a thought cross my mind worth posting :

Are you saying that a reasonably fast computer with 8GB of memory is entirely non-responsive due to a ZFS related function? Does the machine respond to ping? If there is a GUI, does the mouse pointer move? Does the keyboard numlock key respond at all ?

I just find it very hard to believe that such a situation could exist, as I have done some *abusive* tests on a SunFire X4100 with Sun 6120 fibre arrays ( in HA config ) and I could not get it to become a warm brick like you describe.

How many processors does your machine have ?

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] Dedup performance hit
You are severely RAM limited. In order to do dedup, ZFS has to maintain a catalog of every single block it writes and the checksum for that block. This is called the Dedup Table (DDT for short). So, during the copy, ZFS has to (a) read a block from the old filesystem, (b) check the current DDT to see if that block exists and (c) either write the block to the new filesystem (and add an appropriate DDT entry for it), or write a metadata update with the dedup block reference.

Likely, you have two problems:

(1) I suspect your source filesystem has lots of blocks (that is, it's likely made up of smaller-sized files). Lots of blocks means lots of seeking back and forth to read all those blocks.

(2) Lots of blocks also means lots of entries in the DDT. It's trivial to overwhelm a 4GB system with a large DDT. If the DDT can't fit in RAM, then it has to get partially refreshed from disk.

Thus, here's what's likely going on:

(1) ZFS reads a block and its checksum from the old filesystem
(2) it checks the DDT to see if that checksum exists
(3) finding that the entire DDT isn't resident in RAM, it starts a cycle to read the rest of the (potential) entries from the new filesystem's metadata. That is, it tries to reconstruct the DDT from disk. Which involves a HUGE amount of random seek reads on the new filesystem.

In essence, since you likely can't fit the DDT in RAM, each block read from the old filesystem forces a flurry of reads from the new filesystem. Which eats up the IOPS that your single pool can provide. It thrashes the disks.

Your solution is to either buy more RAM, or find something you can use as an L2ARC cache device for your pool. Ideally, it would be an SSD. However, in this case, a plain hard drive would do OK (NOT one already in a pool). To add such a device, you would do: 'zpool add tank cache mycachedevice'

That was an awesome response! Thank you for that :-)

I tend to config my servers with 16G of RAM minimum these days and now I know why.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
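ps: for anyone who wants to try this at home, a minimal sketch ( the device name is a placeholder, and note the cache keyword .. without it you add a data vdev rather than an L2ARC ) :

# zdb -DD tank                   # DDT entry counts and in-core sizes
# zpool add tank cache c4t2d0    # attach the device as an L2ARC
# zpool iostat -v tank 5         # watch the cache device fill up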
Re: [zfs-discuss] swap - where is it coming from?
Re-read the section on Swap Space and Virtual Memory for particulars on how Solaris does virtual memory mapping, and the concept of Virtual Swap Space, which is what 'swap -s' is really reporting on.

The Solaris Internals book is awesome for this sort of thing. A bit over the top in detail, but awesome regardless.

--
Dennis
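ps: for the archives, the two views side by side .. the numbers will never agree because they measure different things :

# swap -l    # physical swap devices and their free 512-byte blocks
# swap -s    # virtual swap accounting, which includes physical memory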
Re: [zfs-discuss] one more time: pool size changes
            c4t2004CFA4D655d0s0  ONLINE       0     0     0
            c4t2004CF9B63D0d0s0  ONLINE       0     0     0

So the manner in which any given IO transaction gets to the ZFS filesystem just gets ever more complicated and convoluted, and it makes me wonder if I am tossing away performance to get higher levels of safety.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] zfs/lofi/share panic
FWIW (even on a freshly booted system after a panic)

# lofiadm zyzzy.iso
/dev/lofi/1
# mount -F hsfs /dev/lofi/1 /mnt
mount: /dev/lofi/1 is already mounted or /mnt is busy
# mount -O -F hsfs /dev/lofi/1 /mnt
# share /mnt
#

If you unshare /mnt and then do this again, it will panic. This has been a bug since before OpenSolaris came out.

I just tried this with a UFS based filesystem just for a lark.

r...@aequitas:/# mkdir /testfs
r...@aequitas:/# mount -F ufs -o noatime,nologging /dev/dsk/c0d1s0 /testfs
r...@aequitas:/# ls -l /testfs/sol\-nv\-b130\-x86\-dvd.iso
-rw-r--r--   1 root     root     3818782720 Feb  5 16:02 /testfs/sol-nv-b130-x86-dvd.iso
r...@aequitas:/# lofiadm -a /testfs/sol-nv-b130-x86-dvd.iso
May 27 21:08:58 aequitas pseudo: pseudo-device: lofi0
May 27 21:08:58 aequitas genunix: lofi0 is /pseudo/lofi@0
May 27 21:08:58 aequitas rootnex: xsvc0 at root: space 0 offset 0
May 27 21:08:58 aequitas genunix: xsvc0 is /xsvc@0,0
May 27 21:08:58 aequitas pseudo: pseudo-device: devinfo0
May 27 21:08:58 aequitas genunix: devinfo0 is /pseudo/devinfo@0
/dev/lofi/1
r...@aequitas:/# mount -F hsfs -o ro /dev/lofi/1 /mnt
r...@aequitas:/# share -F nfs -o nosub,nosuid,sec=sys,ro,anon=0 /mnt

Then at a Sol 10 server :

# uname -a
SunOS jupiter 5.10 Generic_142900-11 sun4u sparc SUNW,Sun-Fire-480R
# dfshares aequitas
RESOURCE           SERVER    ACCESS    TRANSPORT
aequitas:/mnt      aequitas  -         -
#
# mount -F nfs -o bg,intr,nosuid,ro,vers=4 aequitas:/mnt /mnt
# ls /mnt
Copyright                    autorun.inf
JDS-THIRDPARTYLICENSEREADME  autorun.sh
License                      boot
README.txt                   installer
Solaris_11                   sddtool
Sun_HPC_ClusterTools
# umount aequitas:/mnt
# dfshares aequitas
RESOURCE           SERVER    ACCESS    TRANSPORT
aequitas:/mnt      aequitas  -         -

Then back at the snv_138 box I unshare and re-share and ... nothing bad happens.

r...@aequitas:/# unshare /mnt
r...@aequitas:/# share -F nfs -o nosub,nosuid,sec=sys,ro,anon=0 /mnt
r...@aequitas:/# unshare /mnt
r...@aequitas:/#

Guess I must now try this with a ZFS fs under that iso file.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
On 05-17-10, Thomas Burgess wonsl...@gmail.com wrote:

psrinfo -pv shows:

The physical processor has 8 virtual processors (0-7)
  x86 (AuthenticAMD 100F91 family 16 model 9 step 1 clock 200 MHz)
        AMD Opteron(tm) Processor 6128    [ Socket: G34 ]

That's odd. Please try this :

# kstat -m cpu_info -c misc
module: cpu_info                        instance: 0
name:   cpu_info0                       class:    misc
        brand                           VIA Esther processor 1200MHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       1200
        clog_id                         0
        core_id                         0
        cpu_type                        i386
        crtime                          3288.24125364
        current_clock_Hz                1199974847
        current_cstate                  0
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (CentaurHauls 6A9 family 6 model 10 step 9 clock 1200 MHz)
        model                           10
        ncore_per_chip                  1
        ncpu_per_chip                   1
        pg_id                           -1
        pkg_core_id                     0
        snaptime                        1526742.97169617
        socket_type                     Unknown
        state                           on-line
        state_begin                     1272610247
        stepping                        9
        supported_frequencies_Hz        1199974847
        supported_max_cstates           0
        vendor_id                       CentaurHauls

You should get a LOT more data.

Dennis
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
- Original Message -
From: Thomas Burgess wonsl...@gmail.com
Date: Saturday, May 15, 2010 8:09 pm
Subject: Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
To: Orvar Korvar knatte_fnatte_tja...@yahoo.com
Cc: zfs-discuss@opensolaris.org

Well i just wanted to let everyone know that preliminary results are good. The livecd booted, all important things seem to be recognized. It sees all 16 gb of ram i installed and all 8 cores of my opteron 6128.

The only real shocker is how loud the norco RPC-4220 fans are (i have another machine with a norco 4020 case so i assumed the fans would be similar.....this was a BAD assumption). This thing sounds like a hair dryer.

Anyways, I'm running the install now so we'll see how that goes. It did take about 10 minutes to find a disk during the installer, but if i remember right, this happened on other machines as well.

Once you have the install done could you post ( somewhere ) what you see during a single user mode boot with options -srv ? I would like to see all the gory details.

Also, could you run cpustat -h ? At the bottom, according to usr/src/uts/intel/pcbe/opteron_pcbe.c you should see :

    See "BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
    Processors" (AMD publication 31116)

The following registers should be listed :

#define AMD_FAMILY_10h_generic_events \
        { "PAPI_tlb_dm", "DC_dtlb_L1_miss_L2_miss", 0x7 },      \
        { "PAPI_tlb_im", "IC_itlb_L1_miss_L2_miss", 0x3 },      \
        { "PAPI_l3_dcr", "L3_read_req", 0xf1 },                 \
        { "PAPI_l3_icr", "L3_read_req", 0xf2 },                 \
        { "PAPI_l3_tcr", "L3_read_req", 0xf7 },                 \
        { "PAPI_l3_stm", "L3_miss", 0xf4 },                     \
        { "PAPI_l3_ldm", "L3_miss", 0xf3 },                     \
        { "PAPI_l3_tcm", "L3_miss", 0xf7 }

You should NOT see anything like this :

r...@aequitas:/root# uname -a
SunOS aequitas 5.11 snv_139 i86pc i386 i86pc Solaris
r...@aequitas:/root# cpustat -h
cpustat: cannot access performance counters - Operation not applicable

... as well as psrinfo -pv please ?

When I get my HP Proliant with the 6174 procs I'll be sure to post whatever I see.

Dennis
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
Bit of a chicken and egg that, isn't it? You need to run the tool to see if the board's worth buying and you need to buy the board to run the tool!

*Somebody* has to be that first early adopter. After that, we all get to ride on their experience.

I am sure the Tier-1 stuff will work just fine. I have an HP unit on order, thus :

HP Proliant DL165G7 server, 1U Rack Server, 2 × AMD Opteron Processor Model 6172 ( 12 core, 2.1 GHz, 12MB Level 3 Cache, 80W ), dual socket configuration for 24 cores in total, 16GB (8 x 2GB) Advanced ECC PC3-10600R (RDIMM) memory, twenty-four DIMM slots, 2 PCI-E slots ( PCI Express expansion slot 1, low-profile, half-length, and PCI Express expansion slot 2, full height, full length, ×16 75W + EXT 75W, with optional PCI-X support ), 2x HP NC362i Integrated Dual Port Gigabit Server Adapter, Storage Controller (1) Smart Array P410i/256MB BBWC, single HP 500W CS HE Power Supply, no internal HDD, slim height 9.5mm DVD included, no OS - no Monitor, 3 year warranty.

So when it gets in I'll toss it into a rack, hook up a serial cable and then boot *whatever* as verbosely as possible. [1]

If you want you can ssh in to the blastwave server farm and jump on that also ... I'm always game to play with such things.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris

[1] ummm No, I won't be installing Microsoft Windows 7 64-bit Ultimate Edition. .. or maybe I will :-P
Re: [zfs-discuss] why both dedup and compression?
On 06/05/2010 21:07, Erik Trimble wrote:

VM images contain large quantities of executable files, most of which compress poorly, if at all.

What data are you basing that generalisation on ?

note : I can't believe someone said that.
warning : I just detected a fast rise time on my pedantic input line and I am in full geek herd mode :

http://www.blastwave.org/dclarke/blog/?q=node/160

The degree to which a file can be compressed is often related to the degree of randomness or entropy in the bit sequences in that file. We tend to look at files in chunks of bits called bytes or words or blocks of some given length, but the harsh reality is that it is just a sequence of one and zero values and nothing more. However, I can spot blocks or patterns in there and then create tokens that represent repeating blocks.

If you want a really random file that you are certain has nearly perfect high entropy, then just get a coin and flip it 1024 times while recording the heads and tails results. Then input that data into a file as a sequence of one and zero bits and you have a very neatly random chunk of data. Good luck trying to compress that thing.

Pardon me .. here it comes. I spent way too many years in labs doing work with RNG hardware and software to just look the other way. And I'm in a good mood.

Suppose that C is some discrete random variable. That means that C can have well defined values like HEAD or TAIL. You usually have a bunch ( n of them ) of possible values x1, x2, x3, ..., xn that C can be. Each of those shows up in the data set with specific probabilities p1, p2, p3, ..., pn where the sum of those adds up to exactly one. This means that x1 will appear in the dataset with an expected probability of p1. All of those probabilities are expressed as a value between 0 and 1. A value of 1 means certainty.

Okay, so in the case of a coin ( not the one in Batman The Dark Knight ) you have x1=TAIL and x2=HEAD with ( we hope ) p1=0.5=p2 such that p1+p2 = 1 exactly, unless the coin lands on its edge and the universe collapses due to entropy implosion. That is a joke. I used to teach this as a TA in university, so bear with me.

So go flip a coin a few thousand times and you will get fairly random data. That is a Random Number Generator that you have, and it's always kicking around your lab or in your pocket or on the street. Pretty cheap, but the baud rate is hellish low. If you get tired of flipping bits using a coin then you may have to just give up on that ( or buy a radioactive source where you can monitor particles emitted as it decays for input data ) OR be really cheap and look at /dev/urandom on a decent Solaris machine :

$ ls -lap /dev/urandom
lrwxrwxrwx 1 root root 34 Jul  3  2008 /dev/urandom -> ../devices/pseudo/random@0:urandom

That thing right there is a pseudo random number generator. It will make for really random data, but there is no promise that over a given number of bits the sum p1 + p2 will be precisely 1. It will be real real close, however, to a very random ( high entropy ) data source.

Need 1024 bits of random data ?

$ /usr/xpg4/bin/od -Ax -N 128 -t x1 /dev/urandom
000 ef c6 2b ba 29 eb dd ec 6d 73 36 06 58 33 c8 be
010 53 fa 90 a2 a2 70 25 5f 67 1b c3 72 4f 26 c6 54
020 e9 83 44 c6 b9 45 3f 88 25 0c 4d c7 bc d5 77 58
030 d3 94 8e 4e e1 dd 71 02 dc c2 d0 19 f6 f4 5c 44
040 ff 84 56 9f 29 2a e5 00 33 d2 10 a4 d2 8a 13 56
050 d1 ac 86 46 4d 1e 2f 10 d9 0b 33 d7 c2 d4 ef df
060 d9 a2 0b 7f 24 05 72 39 2d a6 75 25 01 bd 41 6c
070 eb d9 4f 23 d9 ee 05 67 61 7c 8a 3d 5f 3a 76 e3
080

There ya go.
That was faster than flipping a coin, eh? ( my Canadian bit just flipped )

So you were saying ( or someone somewhere had the crazy idea ) that ZFS with dedupe and compression enabled won't really be of great benefit because of all the binary files in the filesystem. Well, that's just nuts. Sorry, but it is. Those binary files are made up of ELF headers and opcodes from a specific set of opcodes for a given architecture, and that means the input set C consists of a discrete set of possible values and NOT pure random high entropy data.

Want a demo ? Here :

(1) take a nice big lib

$ uname -a
SunOS aequitas 5.11 snv_138 i86pc i386 i86pc
$ ls -lap /usr/lib | awk '{ print $5, $9 }' | sort -n | tail
4784548 libwx_gtk2u_core-2.8.so.0.6.0
4907156 libgtkmm-2.4.so.1.1.0
6403701 llib-lX11.ln
8939956 libicudata.so.2
9031420 libgs.so.8.64
9300228 libCg.so
9916268 libicudata.so.3
14046812 libicudata.so.40.1
21747700 libmlib.so.2
40736972 libwireshark.so.0.0.1

$ cp /usr/lib/libwireshark.so.0.0.1 /tmp
$ ls -l /tmp/libwireshark.so.0.0.1
-r-xr-xr-x 1 dclarke csw 40736972 May  7 14:20 /tmp/libwireshark.so.0.0.1

What is the SHA256 hash for that file ?

$ cd /tmp

Now compress it with gzip ( a good test case ) :

$ /opt/csw/bin/gzip -9v libwireshark.so.0.0.1
libwireshark.so.0.0.1:   76.1% -- replaced with libwireshark.so.0.0.1.gz

$ ls -l
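ps: and the flip side of the demo, with /dev/urandom as the source this time :

$ dd if=/dev/urandom of=/tmp/rand.bin bs=1024 count=1024
$ /opt/csw/bin/gzip -9v /tmp/rand.bin

The ELF library above gave up 76.1% of its size. The urandom file should give up essentially nothing, because there are no repeating blocks for the deflate dictionary to latch onto.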
Re: [zfs-discuss] ZFS kstat Stats
Do the following ZFS stats look ok?

::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     106619               832   28%
ZFS File Data               79817               623   21%
Anon                        28553               223    7%
Exec and libs                3055                23    1%
Page cache                  18024               140    5%
Free (cachelist)             2880                22    1%
Free (freelist)            146309              1143   38%
Total                      385257              3009
Physical                   367243              2869

Looks beautiful. Just for giggles try this :

r...@aequitas:/root# uname -a
SunOS aequitas 5.11 snv_136 i86pc i386 i86pc Solaris
r...@aequitas:/root#
r...@aequitas:/root# /bin/printf "::kmastat\n" | mdb -k
cache                        buf    buf    buf    memory     alloc alloc
name                        size in use  total    in use   succeed  fail
------------------------- ------ ------ ------ --------- --------- -----
kmem_magazine_1                8   8595   8736   212992B      8595     0
kmem_magazine_3               16   3697   3780   122880B      3697     0
kmem_magazine_7               32   7633   7686   499712B      7633     0
kmem_magazine_15              64  11642  11656  1540096B     11642     0
.
. etc etc
.
nfs4_access_cache             32      0      0        0B         0     0
client_handle4_cache          16      0      0        0B         0     0
nfs4_ace4vals_cache           36      0      0        0B         0     0
nfs4_ace4_list_cache         176      0      0        0B         0     0
NFS_idmap_cache               24      0      0        0B         0     0
pty_map                       48      0     64     4096B         1     0
------------------------- ------ ------ ------ --------- --------- -----
Total [hat_memload]                              974848B   1306984     0
Total [kmem_msb]                               56860672B    506215     0
Total [kmem_va]                                78249984B     12180     0
Total [kmem_default]                           76316672B   8546762     0
Total [kmem_io_1G]                             36712448B      8643     0
Total [bp_map]                                       0B       212     0
Total [segkp]                                   6356992B    186825     0
Total [umem_np]                                      0B       148     0
Total [ip_minor_arena_sa]                           64B       180     0
Total [spdsock]                                      0B         1     0
Total [namefs_inodes]                               64B        18     0
------------------------- ------ ------ ------ --------- --------- -----
.
. etc etc
.

Dennis
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
"ea" == erik ableson eable...@me.com writes:
"dc" == Dennis Clarke dcla...@blastwave.org writes:

    rw,ro...@100.198.100.0/24, it works fine, and the NFS client can do
    the write without error.

ea  I've found that the NFS host based settings required the
ea  FQDN, and that the reverse lookup must be available in your
ea  DNS.

I found, oddly, the @a.b.c.d/y syntax works only if the client's IP has reverse lookup. I had to add bogus hostnames to /etc/hosts for the whole /24 because if I didn't, for v3 it would reject mounts immediately, and for v4 mountd would core dump (and get restarted), which you see from the client as a mount that appears to hang. This is all using the @ip/mask syntax.

I have LDAP and DNS in place for name resolution and NFS v4 works fine with either format in the sharenfs parameter. Never seen a problem. The Solaris 8 and 9 NFS clients work fine also.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832

If you use hostnames instead, it makes sense that you would have to use FQDN's. If you want to rewrite mountd to allow using short hostnames, the access checking has to be done like this:

at export time:
  given hostname -> forward nss lookup -> list of IP's -> remember IP's
at mount time:
  client IP -> check against list of remembered IP's

but with FQDN's it can be:

at export time:
  given hostname -> remember it
at mount time:
  client IP -> reverse nss lookup -> check against remembered list
                                 \-> forward lookup -> verify client IP among results

The second way, all the lookups happen at mount time rather than export time. This way the data in the nameservice can change without forcing you to learn and then invoke some kind of ``rescan the exported filesystems'' command, or making mountd remember TTL's for its cached nss data, or any such complexity. Keep all the nameservice caching inside nscd so there is only one place to flush it!

However the forward lookup is mandatory for security, not optional OCDism. Without it, anyone from any IP can access your NFS server so long as he has control of his reverse lookup, which he probably does. I hope mountd is doing that forward lookup!

dc  Try to use a backslash to escape those special chars like so :
dc  zfs set
dc  sharenfs=nosub\,nosuid\,rw\=hostname1\:hostname2\,root\=hostname2
dc  zpoolname/zfsname/pathname

wth? Commas and colons are not special characters. This is silly.

Works real well.

--
Dennis
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
Hi All,

I had created a ZFS filesystem "test" and shared it with zfs set sharenfs=root=host1 test, and I checked the sharenfs option and it had already updated to root=host1:

Try to use a backslash to escape those special chars like so :

zfs set sharenfs=nosub\,nosuid\,rw\=hostname1\:hostname2\,root\=hostname2 zpoolname/zfsname/pathname

Dennis
Re: [zfs-discuss] getting drive serial number
On Sun, Mar 7, 2010 at 12:30 PM, Ethan notet...@gmail.com wrote:

I have a failing drive, and no way to correlate the device with errors in the zpool status with an actual physical drive. If I could get the device's serial number, I could use that, as it's printed on the drive.

I come from linux, so I tried dmesg, as that's what's familiar (I see that the man page for dmesg on opensolaris says that I should be using syslogd, but I haven't been able to figure out how to get the same output from syslogd). But, while I see at the top the serial numbers for some other drives, I don't see the one I want because it seems to have scrolled off the top.

Can anyone tell me how to get the serial number of my failing drive? Or some other way to correlate the device with the physical drive?
-Ethan

smartctl will do what you're looking for. I'm not sure if it's included by default or not with the latest builds. Here's the package if you need to build from source: http://smartmontools.sourceforge.net/

You can find it at http://blastwave.network.com/csw/unstable/

Just install it with pkgadd or use pkgtrans to extract it and then run the binary.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
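ps: if you would rather not install anything, a stock box can usually cough up the serials too ( YMMV with some SATA controllers ) :

# iostat -En

Each device block in that output includes Vendor, Product and Serial No. fields that you can match against the label on the drive.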
[zfs-discuss] false DEGRADED status based on cannot open device at boot.
I find that some servers display a DEGRADED zpool status at boot. More troubling is that this seems to be silent: no notice is given on the console or via an SNMP message or other notification process.

Let me demonstrate :

{0} ok boot -srv

Sun Blade 2500 (Silver), No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.17.3, 4096 MB memory installed, Serial #64510477.
Ethernet address 0:3:ba:d8:5a:d, Host ID: 83d85a0d.

Rebooting with command: boot -srv
Boot device: /p...@1d,70/s...@4,1/d...@0,0:a  File and args: -srv
module /platform/sun4u/kernel/sparcv9/unix: text at [0x100, 0x10a3695] data at 0x180
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10a3698, 0x126bbf7] data at 0x1866840
module /platform/SUNW,Sun-Blade-2500/kernel/misc/sparcv9/platmod: text at [0x126bbf8, 0x126c1e7] data at 0x18bc0c8
.
. . . many lines of verbose messages . . .
.
dump on /dev/zvol/dsk/mercury_rpool/swap size 0 MB
Loading smf(5) service descriptions: 2/2
Requesting System Maintenance Mode
SINGLE USER MODE

Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

# zpool list
NAME            SIZE   USED  AVAIL   CAP  HEALTH    ALTROOT
mercury_rpool    68G  27.4G  40.6G   40%  DEGRADED  -
# zpool status mercury_rpool
  pool: mercury_rpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  DEGRADED     0     0     0
          mirror       DEGRADED     0     0     0
            c3t0d0s0   ONLINE       0     0     0
            c1t2d0s0   UNAVAIL      0     0     0  cannot open

errors: No known data errors

This is trivial to remedy :

# zpool online mercury_rpool c1t2d0s0
# zpool list
NAME            SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
mercury_rpool    68G  27.4G  40.6G   40%  ONLINE  -
# zpool status mercury_rpool
  pool: mercury_rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h0m with 0 errors on Wed Feb 17 21:26:11 2010
config:

        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c3t0d0s0   ONLINE       0     0     0
            c1t2d0s0   ONLINE       0     0     0  14.5M resilvered

errors: No known data errors
#

I have many systems where I keep mirrors on multiple controllers, either fibre or SCSI. It seems that the SCSI devices don't get detected at boot on the Sparc systems. The x86/AMD64 systems do not seem to have this problem, but I may be wrong.

Is this a known bug or am I seeing something due to a missing line in /etc/system ?

Oh, also, I should point out that it does not matter if I boot with init S or 3 or 6.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
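ps: until there is a proper notification hook, a blunt bit of cron glue works .. adjust the schedule and mail address to taste :

0,30 * * * * [ "`/usr/sbin/zpool status -x`" != "all pools are healthy" ] && /usr/sbin/zpool status -x | mailx -s "zpool trouble" root

zpool status -x prints only pools with problems, so a healthy system stays silent.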
Re: [zfs-discuss] Detach ZFS Mirror
I have a 2-disk/2-way mirror and was wondering if I can remove 1/2 the mirror and plunk it in another system?

You can remove it fine. You can plunk it in another system fine. I think you will end up with the same zpool name and id number. Also, I do not know if that disk would be bootable. You probably have to go through the installboot procedure for that.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
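ps: the installboot step on SPARC looks like this, with a placeholder disk name ( on x86 it is installgrub instead ) :

# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t1d0s0

Without that, the detached half has all the data but no bootblock, so the OBP will not boot from it.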
[zfs-discuss] possible to remove a mirror pair from a zpool?
Suppose the requirements for storage shrink ( it can happen ); is it possible to remove a mirror set from a zpool?

Given this :

# zpool status array03
  pool: array03
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h41m with 0 errors on Sat Jan  9 22:54:11 2010
config:

        NAME         STATE     READ WRITE CKSUM
        array03      ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
            c5t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t19d0  ONLINE       0     0     0
            c5t5d0   ONLINE       0     0     0
        spares
          c2t22d0    AVAIL

errors: No known data errors

Suppose I want to power down the disks c2t19d0 and c5t5d0 because they are not needed. One can easily picture a thumper with many disks unused and see reasons why one would want to power off disks.

--
Dennis
Re: [zfs-discuss] possible to remove a mirror pair from a zpool?
No, sorry Dennis, this functionality doesn't exist yet. It is being worked on, but will take a while; there are lots of corner cases to handle.

James Dickens
uadmin.blogspot.com

1 ) dammit
2 ) looks like I need to do a full offline backup and then restore to shrink a zpool.

As usual, thanks for always being there James.

Dennis
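ps: for the record, the offline shuffle I have in mind is roughly this, pool names being placeholders :

# zfs snapshot -r array03@migrate
# zfs send -R array03@migrate | zfs recv -F -d newpool
# zpool destroy array03        # only after verifying newpool !

zfs send -R carries the snapshots and properties along for the ride, which beats tar for this job.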
Re: [zfs-discuss] Clearing a directory with more than 60 million files
On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and then destroying it? Assuming the directory in question is /opt/MYapp/data:

1. zfs create rpool/junk
2. mv /opt/MYapp/data /rpool/junk/
3. zfs destroy rpool/junk

The move will create and remove the files; the remove by mv will be just as inefficient, removing them one by one. rm -rf would be at least as quick.

Normally when you do a move within a 'regular' file system, all that's usually done is the directory pointer is shuffled around. This is not the case with ZFS data sets, even though they're on the same pool?

You can also use star, which may speed things up, safely.

star -copy -p -acl -sparse -dump -xdir -xdot -fs=96m -fifostats -time \
    -C source_dir . destination_dir

That will buffer the transport of the data from source to dest via memory and work to keep that buffer full as data is written on the output side. It's probably at least as fast as mv and probably safer, because you never delete the original until after the copy is complete.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
[zfs-discuss] invalid mountpoint 'mountpoint=legacy' ?
Anyone seen this odd message ? It seems a tad counter-intuitive.

# uname -a
SunOS gamma 5.11 snv_126 sun4u sparc SUNW,Sun-Fire-480R
# cat /etc/release
                  Solaris Express Community Edition snv_126 SPARC
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 19 October 2009

# ptime zpool create -f -o autoreplace=on -o version=10 \
    -m mountpoint=legacy \
    fibre01 mirror c2t0d0 c3t16d0 \
            mirror c2t1d0 c3t17d0 \
            mirror c2t2d0 c3t18d0 \
            mirror c2t3d0 c3t19d0 \
            mirror c2t4d0 c3t20d0 \
            mirror c2t5d0 c3t21d0 \
            mirror c2t6d0 c3t22d0 \
            mirror c2t7d0 c3t23d0 \
            mirror c2t8d0 c3t24d0 \
            spare c2t10d0
invalid mountpoint 'mountpoint=legacy': must be an absolute path, 'legacy', or 'none'

real       14.884950400
user        0.998020300
sys         3.334027400

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] invalid mountpoint 'mountpoint=legacy' ?
I hate it when I do that .. 30 secs later I see that -m takes the mountpoint value directly; it is a property, but it is not specified in the -o name=value format. erk

# ptime zpool create -f -o autoreplace=on -o version=10 \
    -m legacy \
    fibre01 mirror c2t0d0 c3t16d0 \
            mirror c2t1d0 c3t17d0 \
            mirror c2t2d0 c3t18d0 \
            mirror c2t3d0 c3t19d0 \
            mirror c2t4d0 c3t20d0 \
            mirror c2t5d0 c3t21d0 \
            mirror c2t6d0 c3t22d0 \
            mirror c2t7d0 c3t23d0 \
            mirror c2t8d0 c3t24d0 \
            spare c2t10d0

real       12.367204400
user        0.712670500
sys         2.022335900
#
# zpool status fibre01
  pool: fibre01
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        fibre01      ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            c2t0d0   ONLINE       0     0     0
            c3t16d0  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c3t17d0  ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
            c3t18d0  ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            c2t3d0   ONLINE       0     0     0
            c3t19d0  ONLINE       0     0     0
          mirror-4   ONLINE       0     0     0
            c2t4d0   ONLINE       0     0     0
            c3t20d0  ONLINE       0     0     0
          mirror-5   ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
            c3t21d0  ONLINE       0     0     0
          mirror-6   ONLINE       0     0     0
            c2t6d0   ONLINE       0     0     0
            c3t22d0  ONLINE       0     0     0
          mirror-7   ONLINE       0     0     0
            c2t7d0   ONLINE       0     0     0
            c3t23d0  ONLINE       0     0     0
          mirror-8   ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c3t24d0  ONLINE       0     0     0
        spares
          c2t10d0    AVAIL

errors: No known data errors

Dennis
Re: [zfs-discuss] quotas on zfs at solaris 10 update 9 (10/09)
We have just updated a major file server to Solaris 10 update 9 so that we can control user and group disk usage on a single filesystem. We were using QFS, and one nice thing about samquota was that it told you your soft limit, your hard limit and your usage, both for disk space and for the number of files. Is there, on Solaris 10 U9, a command which will report

A lot of folks will want similar functionality on everything from Sol 10 up to snv_129. I have a few experimental systems running thus :

Sun Microsystems Inc.   SunOS 5.11      snv_129 Dec. 01, 2009
SunOS Internal Development: root 2009-Nov-30 [onnv_129-tonic]
bfu'ed from /build/archives-nightly-osol/sparc on 2009-12-04
Sun Microsystems Inc.   SunOS 5.11      snv_126 November 2008

$ zpool upgrade
This system is currently running ZFS pool version 22.
All pools are formatted using this version.

When we take into consideration the effects of compression and dedupe, it can get difficult to answer the very basic question "How much space do I have left?" Perhaps a better question is "How much space do I have left given a worst case scenario?" I have pushed many copies of the exact same data into a ZFS filesystem with both compression and dedupe and watched as the actual space used was trivial. With a classic filesystem ( UFS ) we can generally answer the question quickly.

One blunt-object method would be to allocate a filesystem per user such that zfs list reports a long list of names under /export/home or similar. Then you can easily see the used space per filesystem. Allocating user quotas and then asking the simple questions seems mysterious to me also. I am looking into this for my own reasons and will stay in touch.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
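ps: on those snv_129 boxes the new per-user accounting is already in, e.g. ( dataset names from my own boxes, adjust to taste ) :

# zfs userspace fibre01/home
# zfs groupspace fibre01/home
# zfs get userused@dclarke,userquota@dclarke fibre01/home

That covers usage and the single hard limit, but nothing like samquota's soft limits or file-count limits as far as I can tell.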
Re: [zfs-discuss] b128a available w/deduplication
FYI, OpenSolaris b128a is available for download or image-update from the dev repository. Enjoy.

I thought that dedupe had been out for weeks now ?

Dennis
Re: [zfs-discuss] b128a available w/deduplication
Dennis Clarke wrote:

FYI, OpenSolaris b128a is available for download or image-update from the dev repository. Enjoy.

I thought that dedupe had been out for weeks now ?

The source has, yes. But what Richard was referring to was the respun build now available via IPS.

Oh, sorry. Thought I had missed something. I hadn't :-)

I'm now on version 22 for ZFS and am not even entirely sure what that is :

# uname -a
SunOS europa 5.11 snv_129 sun4u sparc SUNW,UltraAX-i2
# zpool upgrade -v
This system is currently running ZFS pool version 22.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Deduplication
 22  Received properties

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.

HOWEVER, that URL no longer works for N > 19, and in fact the entire URL has changed to :

http://hub.opensolaris.org/bin/view/Community+Group+zfs/22

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] dedupe question
On Sat, 7 Nov 2009, Dennis Clarke wrote:

Now the first test I did was to write 26^2 files [a-z][a-z].dat in 26^2 directories named [a-z][a-z] where each file is 64K of random non-compressible data and then some english text.

What method did you use to produce this random data?

I'm using the tt800 method from Makoto Matsumoto described here :

see http://random.mat.sbg.ac.at/generators/

and then here :

    /*
     * Generate the random text before we need it and also
     * outside of the area that measures the IO time.
     * We could have just read bytes from /dev/urandom but
     * you would be *amazed* how slow that is.
     */
    random_buffer_start_hrt = gethrtime();
    if ( random_buffer_start_hrt == -1 ) {
        perror("Could not get random_buffer high res start time");
        exit(EXIT_FAILURE);
    }

    for ( char_count = 0; char_count < 65535; ++char_count ) {
        k_index = (int) ( genrand() * (double) 62 );
        buffer_64k_rand_text[char_count] = alph[k_index];
    }

    /* would be nice to break this into 0x40h char lines */
    for ( p = 0x03fu; p < 65535; p = p + 0x040u )
        buffer_64k_rand_text[p] = '\n';

    buffer_64k_rand_text[65535] = '\n';
    buffer_64k_rand_text[65536] = '\0';

    random_buffer_end_hrt = gethrtime();

That works well.

You know what ... I'm a schmuck. I didn't grab a time based seed first. All those files with random text .. have identical twins on the filesystem somewhere. :-P

damn

I'll go fix that.

The dedupe ratio has climbed to 1.95x with all those unique files that are less than %recordsize% bytes.

Perhaps there are other types of blocks besides user data blocks (e.g. metadata blocks) which become subject to deduplication? Presumably 'dedupratio' is based on a count of blocks rather than percentage of total data.

I have no idea .. yet. I figure I'll try a few more experiments to see what it does and maybe, dare I say it, look at the source :-)

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] dedupe question
You can get more dedup information by running 'zdb -DD zp_dd'. This should show you how we break things down. Add more 'D' options and get even more detail.

- George

Okay .. thank you. Looks like I have piles of numbers here :

# zdb -DDD zp_dd
DDT-sha256-zap-duplicate: 37317 entries, size 342 on disk, 210 in core

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2    18.4K    763M    355M    355M    37.9K   1.52G    727M    727M
     4    18.0K   1.16G   1.15G   1.15G    72.4K   4.67G   4.61G   4.61G
     8       70   1.47M    849K    849K      657   12.0M   6.78M   6.78M
    16       27   39.5K   31.5K   31.5K      535    747K    598K    598K
    32        6      4K      4K      4K      276    180K    180K    180K
    64        4   9.00K   6.50K   6.50K      340    680K    481K    481K
   128        1      2K   1.50K   1.50K      170    340K    255K    255K
   256        1      1K      1K      1K      313    313K    313K    313K
   512        1     512     512     512      522    261K    261K    261K
 Total    36.4K   1.91G   1.50G   1.50G     113K   6.21G   5.33G   5.33G

DDT-sha256-zap-unique: 154826 entries, size 335 on disk, 196 in core

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     151K   5.61G   2.52G   2.52G     151K   5.61G   2.52G   2.52G
 Total     151K   5.61G   2.52G   2.52G     151K   5.61G   2.52G   2.52G

DDT histogram (aggregated over all DDTs):

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     151K   5.61G   2.52G   2.52G     151K   5.61G   2.52G   2.52G
     2    18.4K    763M    355M    355M    37.9K   1.52G    727M    727M
     4    18.0K   1.16G   1.15G   1.15G    72.4K   4.67G   4.61G   4.61G
     8       70   1.47M    849K    849K      657   12.0M   6.78M   6.78M
    16       27   39.5K   31.5K   31.5K      535    747K    598K    598K
    32        6      4K      4K      4K      276    180K    180K    180K
    64        4   9.00K   6.50K   6.50K      340    680K    481K    481K
   128        1      2K   1.50K   1.50K      170    340K    255K    255K
   256        1      1K      1K      1K      313    313K    313K    313K
   512        1     512     512     512      522    261K    261K    261K
 Total     188K   7.52G   4.01G   4.01G     264K   11.8G   7.85G   7.85G

dedup = 1.96, compress = 1.51, copies = 1.00, dedup * compress / copies = 2.95
#

I have no idea what any of that means, yet :-)

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
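ps: if I read the Totals right, the bottom line is just the referenced size over the allocated size from the aggregated table :

$ echo 'scale=2; 7.85 / 4.01' | bc
1.95

.. which the tool rounds to the 1.96 it prints. So dedupratio is what the DDT thinks is referenced versus what actually landed on disk.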
Re: [zfs-discuss] Quick dedup question
18                    local
neptune_rpool  dedupratio  1.00x  -
neptune_rpool  free        12.5G  -
neptune_rpool  allocated   21.3G  -

I'm currently running tests with this :

http://www.blastwave.org/dclarke/crucible_source.txt

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
[zfs-discuss] dedupe question
Does the dedupe functionality happen at the file level or a lower block level?

I am writing a large number of files that have the following structure :

-- file begins
1024 lines of random ASCII chars 64 chars long
some tilde chars .. about 1000 of them
some text ( english ) for 2K
more text ( english ) for 700 bytes or so
--

Each file has the same tilde chars and then english text at the end of 64K of random character data.

Before writing the data I see :

# zpool get size,capacity,version,dedupratio,free,allocated zp_dd
NAME   PROPERTY    VALUE    SOURCE
zp_dd  size        67.5G    -
zp_dd  capacity    6%       -
zp_dd  version     21       default
zp_dd  dedupratio  1.16x    -
zp_dd  free        63.3G    -
zp_dd  allocated   4.19G    -

After, I see this :

# zpool get size,capacity,version,dedupratio,free,allocated zp_dd
NAME   PROPERTY    VALUE    SOURCE
zp_dd  size        67.5G    -
zp_dd  capacity    6%       -
zp_dd  version     21       default
zp_dd  dedupratio  1.11x    -
zp_dd  free        63.1G    -
zp_dd  allocated   4.36G    -

Note the drop in dedup ratio from 1.16x to 1.11x, which seems to indicate that dedupe does not detect that the english text is identical in every file.

--
Dennis
Re: [zfs-discuss] dedupe question
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:

Does the dedupe functionality happen at the file level or a lower block level?

it occurs at the block allocation level.

I am writing a large number of files that have the following structure :

-- file begins
1024 lines of random ASCII chars 64 chars long
some tilde chars .. about 1000 of them
some text ( english ) for 2K
more text ( english ) for 700 bytes or so
--

ZFS's default block size is 128K and is controlled by the recordsize filesystem property. Unless you changed recordsize, each of the files above would be a single block, distinct from the others.

you may or may not get better dedup ratios with a smaller recordsize, depending on how the common parts of the file line up with block boundaries. the cost of additional indirect blocks might overwhelm the savings from deduping a small common piece of the file.

- Bill

Well, I was curious about these sorts of things and figured that a simple test would show me the behavior.

Now the first test I did was to write 26^2 files [a-z][a-z].dat in 26^2 directories named [a-z][a-z] where each file is 64K of random non-compressible data and then some english text. I guess I was wrong about the 64K random text chunk also .. because I wrote out that data as chars from the set { [A-Z][a-z][0-9] } and thus .. compressible ASCII data as opposed to random binary data.

So ... after doing that a few times I now see something fascinating :

$ ls -lo /tester/foo/*/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:38 /tester/foo/1/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:45 /tester/foo/2/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:43 /tester/foo/3/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:43 /tester/foo/4/aa/aa.dat

$ ls -lo /tester/foo/*/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:39 /tester/foo/1/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:47 /tester/foo/2/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:45 /tester/foo/3/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:47 /tester/foo/4/zz/az.dat

$ find /tester/foo -type f | wc -l
   70304

Those files, all 70,000+ of them, are unique and smaller than the filesystem blocksize. However :

$ zfs get used,available,referenced,compressratio,recordsize,compression,dedup zp_dd/tester
NAME          PROPERTY       VALUE    SOURCE
zp_dd/tester  used           4.51G    -
zp_dd/tester  available      3.49G    -
zp_dd/tester  referenced     4.51G    -
zp_dd/tester  compressratio  1.00x    -
zp_dd/tester  recordsize     128K     default
zp_dd/tester  compression    off      local
zp_dd/tester  dedup          on       local

Compression factors don't interest me at the moment .. but see this :

$ zpool get all zp_dd
NAME   PROPERTY       VALUE                 SOURCE
zp_dd  size           67.5G                 -
zp_dd  capacity       6%                    -
zp_dd  altroot        -                     default
zp_dd  health         ONLINE                -
zp_dd  guid           14649016030066358451  default
zp_dd  version        21                    default
zp_dd  bootfs         -                     default
zp_dd  delegation     on                    default
zp_dd  autoreplace    off                   default
zp_dd  cachefile      -                     default
zp_dd  failmode       wait                  default
zp_dd  listsnapshots  off                   default
zp_dd  autoexpand     off                   default
zp_dd  dedupratio     1.95x                 -
zp_dd  free           63.3G                 -
zp_dd  allocated      4.22G                 -

The dedupe ratio has climbed to 1.95x with all those unique files that are less than %recordsize% bytes.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
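ps: next experiment, to see how block alignment changes things ( the dataset name is arbitrary ) :

# zfs create -o recordsize=8k -o dedup=on zp_dd/tester8k

then re-run the same file generator into it and compare zpool get dedupratio zp_dd before and after. Smaller records should let the common english-text tail land in its own blocks more often, at the cost of a lot more DDT entries.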
[zfs-discuss] SunOS neptune 5.11 snv_127 sun4u sparc SUNW,Sun-Fire-880
I just went through a BFU update to snv_127 on a V880 :

neptune console login: root
Password:
Nov  3 08:19:12 neptune login: ROOT LOGIN /dev/console
Last login: Mon Nov  2 16:40:36 on console
Sun Microsystems Inc.   SunOS 5.11      snv_127 Nov. 02, 2009
SunOS Internal Development: root 2009-Nov-02 [onnv_127-tonic]
bfu'ed from /build/archives-nightly-osol/sparc on 2009-11-03

I have [ high ] hopes that there was a small tarball somewhere which contained the sources listed in :

http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html

Is there such a tarball anywhere at all, or shall I just wait for the putback to hit the mercurial repo ?

Yes .. this is sort of begging .. but I call it enthusiasm :-)

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] SunOS neptune 5.11 snv_127 sun4u sparc SUNW,Sun-Fire-880
Dennis Clarke wrote:

I just went through a BFU update to snv_127 on a V880 :

neptune console login: root
Password:
Nov  3 08:19:12 neptune login: ROOT LOGIN /dev/console
Last login: Mon Nov  2 16:40:36 on console
Sun Microsystems Inc.   SunOS 5.11      snv_127 Nov. 02, 2009
SunOS Internal Development: root 2009-Nov-02 [onnv_127-tonic]
bfu'ed from /build/archives-nightly-osol/sparc on 2009-11-03

I have [ high ] hopes that there was a small tarball somewhere which contained the sources listed in :

http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html

Is there such a tarball anywhere at all, or shall I just wait for the putback to hit the mercurial repo ?

Yes .. this is sort of begging .. but I call it enthusiasm :-)

Hi Dennis, we haven't done source tarballs or Mercurial bundles in quite some time, since it's more efficient for you to pull from the Mercurial repo and build it yourself :)

Well, funny you should mention it. I was this close ( --|.|-- ) to running a nightly build and then I had a minor brainwave .. why bother? Because the sparc archive bits were there already.

Also, the build 127 tonic bits that I generated today (and which you appear to be using) won't contain Jeff's push from yesterday, because that changeset is part of build 128 - and I haven't closed the build yet. The push is in the repo, btw:

changeset:   10922:e2081f502306
user:        Jeff Bonwick jeff.bonw...@sun.com
date:        Sun Nov 01 14:14:46 2009 -0800
comments:
        PSARC 2009/571 ZFS Deduplication Properties
        6677093 zfs should have dedup capability

Funny .. I didn't see it last night. :-\

I'll blame the coffee and go get a nightly happening right away :-)

Thanks for the reply!

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] dedupe is in
Terrific! Can't wait to read the man pages / blogs about how to use it...

Just posted one:

http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

Enjoy, and let me know if you have any questions or suggestions for follow-on posts.

Looking at FIPS-180-3 in sections 4.1.2 and 4.1.3, I was thinking that the major leap from SHA256 to SHA512 is a 32-bit to 64-bit step. If the implementation of the SHA256 ( or possibly SHA512 at some point ) algorithm is well threaded, then one would be able to leverage those massively multi-core Niagara T2 servers.

The SHA256 hash is based on six 32-bit functions, whereas SHA512 is based on six 64-bit functions. The CMT Niagara T2 can easily process those 64-bit hash functions, and the multi-core CMT trend is well established. So long as context switch times are very low, one would think that IO with a SHA512 based de-dupe implementation would be possible and even realistic. That would solve the hash collision concern, I would think.

Merely thinking out loud here ...

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
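ps: from the blog post, the knobs as they exist today .. one would hope a SHA512 option could slot in the same way later :

# zfs set dedup=on tank                # sha256, trust the hash
# zfs set dedup=verify tank            # hash match, then byte compare
# zfs set dedup=sha256,verify tank     # the same thing spelled out

The verify variants are the belt-and-braces answer for anyone losing sleep over collisions.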
[zfs-discuss] root pool can not have multiple vdevs ?
This seems like a bit of a restriction ... is this intended ?

# cat /etc/release
                  Solaris Express Community Edition snv_125 SPARC
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 05 October 2009
# uname -a
SunOS neptune 5.11 snv_125 sun4u sparc SUNW,Sun-Fire-880
# zpool status
  pool: neptune_rpool
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        neptune_rpool    ONLINE       0     0     0
          mirror-0       ONLINE       0     0     0
            c1t0d0s0     ONLINE       0     0     0
            c1t3d0s0     ONLINE       0     0     0

errors: No known data errors

Now I want to add two more mirrors to that pool because the V880 has more drives to offer that are not used at the moment. So I'd like to add in a mirror of c1t1d0 and c1t4d0 :

# zpool add -f neptune_rpool c1t1d0
cannot label 'c1t1d0': EFI labeled devices are not supported on root pools.

Okay .. I can live with that.

# prtvtoc -h /dev/rdsk/c1t0d0s0 | fmthard -s - /dev/rdsk/c1t1d0s0
fmthard: New volume table of contents now in place.
# prtvtoc -h /dev/rdsk/c1t0d0s0 | fmthard -s - /dev/rdsk/c1t4d0s0
fmthard: New volume table of contents now in place.
# zpool add -f neptune_rpool c1t1d0s0
cannot add to 'neptune_rpool': root pool can not have multiple vdevs or separate logs

So essentially there is no way to grow that zpool. Is this the case?

--
Dennis
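ps: the only growth path I can see for an rpool today is bigger disks, not more vdevs. Something like this, with placeholder device names and SMI labels on the new disks :

# zpool attach neptune_rpool c1t0d0s0 c1t5d0s0   # bigger disk joins the mirror
  ... wait for the resilver, repeat on the other side, detach the small disks ...
# zpool set autoexpand=on neptune_rpool          # on newer bits, to pick up the new size

I have not actually tested that last step on snv_125, so treat it as a guess.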
Re: [zfs-discuss] You really do need ECC RAM
You really do need ECC RAM, but for the naysayers:

http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf

There are people that still question that? Really ?

From section 3.2 "Errors per DIMM" in that paper :

    "The mean number of correctable errors per DIMM are more comparable,
    ranging from 3351-4530 correctable errors per year."

B. Schroeder, E. Pinheiro, W.-D. Weber. "DRAM errors in the wild: A Large-Scale Field Study." Sigmetrics/Performance 2009

see http://www.cs.toronto.edu/~bianca/

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] True in U4? Tar and cpio...save and restore ZFS File attributes and ACLs
/libiberty/objalloc.o is sparse
gcc-4.3.4_SunOS_5.10-release/libiberty/cplus-dem.o is sparse
gcc-4.3.4_SunOS_5.10-release/libiberty/cp-demint.o is sparse
.
.
.
gcc-4.3.4_SunOS_5.10-release/prev-libiberty/vasprintf.o is sparse
/home/dclarke/bin/star_1.5a89: fifo had 57001 puts 55368 gets.
/home/dclarke/bin/star_1.5a89: fifo was 2 times empty and 33 times full.
/home/dclarke/bin/star_1.5a89: fifo held 100669440 bytes max, size was 100669440 bytes
/home/dclarke/bin/star_1.5a89: 0 blocks + 1593968128 bytes (total of 1593968128 bytes = 1556609.50k).
/home/dclarke/bin/star_1.5a89: Total time 1735.341sec (897 kBytes/sec)
$

I should mention really poor performance also.

Look in the output dir and see the ACL indicator ?

$ ls -la /home/dclarke/test/destination
total 11
drwxr-xr-x   3 dclarke  csw    3 Oct  1 08:17 .
drwxr-xr-x   3 dclarke  csw    5 Oct  1 08:16 ..
drwxr-xr-x+ 22 dclarke  csw   31 Sep 25 11:40 gcc-4.3.4_SunOS_5.10-release

That should not be there.

$ cd /home/dclarke/test/destination
$ ls -lVdE gcc-4.3.4_SunOS_5.10-release
drwxr-xr-x+ 22 dclarke  csw   31 2009-09-25 11:40:09.491951000 +0000 gcc-4.3.4_SunOS_5.10-release
            owner@:-DaA--cC-s:--:allow
            owner@:--:--:deny
            group@:--a---c--s:--:allow
            group@:-D-A---C--:--:deny
         everyone@:--a---c--s:--:allow
         everyone@:-D-A---C--:--:deny
            owner@:--:--:deny
            owner@:rwxp---A-W-Co-:--:allow
            group@:-w-p--:--:deny
            group@:r-x---:--:allow
         everyone@:-w-p---A-W-Co-:--:deny
         everyone@:r-x---a-R-c--s:--:allow

If I look down into that dir I see ACL's on all the dir entries :

$ cd gcc-4.3.4_SunOS_5.10-release
$ ls -l
total 1119
-rw-r--r--   1 dclarke  csw   577976 Sep 24 02:59 Makefile
drwxr-xr-x+  4 dclarke  csw        5 Sep 24 03:11 build-i386-pc-solaris2.10
-rw-r--r--   1 dclarke  csw       10 Sep 24 20:58 compare
-rw-r--r--   1 dclarke  csw    30323 Sep 24 02:59 config.log
-rwxr-xr-x   1 dclarke  csw    31724 Sep 24 02:59 config.status
-rwxr-xr-x   1 dclarke  csw   400174 Sep 24 02:59 configure.lineno
drwxr-xr-x+  2 dclarke  csw       20 Sep 24 20:58 fixincludes
drwxr-xr-x+ 15 dclarke  csw      535 Sep 25 03:55 gcc
drwxr-xr-x+ 10 dclarke  csw       10 Sep 24 21:23 i386-pc-solaris2.10
drwxr-xr-x+  2 dclarke  csw       32 Sep 25 10:39 intl
drwxr-xr-x+  3 dclarke  csw       29 Sep 24 20:44 libcpp
drwxr-xr-x+  2 dclarke  csw       15 Sep 24 20:44 libdecnumber
drwxr-xr-x+  4 dclarke  csw       72 Sep 24 20:43 libiberty
drwxr-xr-x+ 14 dclarke  csw      532 Sep 24 20:41 prev-gcc
drwxr-xr-x+  4 dclarke  csw        4 Sep 24 20:40 prev-i386-pc-solaris2.10
drwxr-xr-x+  2 dclarke  csw       32 Sep 24 19:56 prev-intl
drwxr-xr-x+  3 dclarke  csw       29 Sep 24 19:57 prev-libcpp
drwxr-xr-x+  2 dclarke  csw       15 Sep 24 19:58 prev-libdecnumber
drwxr-xr-x+  4 dclarke  csw       72 Sep 24 19:56 prev-libiberty
-rw-r--r--   1 dclarke  csw       13 Sep 24 02:59 serdep.tmp
drwxr-xr-x+ 14 dclarke  csw      520 Sep 24 19:53 stage1-gcc
drwxr-xr-x+  4 dclarke  csw        4 Sep 24 19:51 stage1-i386-pc-solaris2.10
drwxr-xr-x+  2 dclarke  csw       32 Sep 24 03:09 stage1-intl
drwxr-xr-x+  3 dclarke  csw       29 Sep 24 03:15 stage1-libcpp
drwxr-xr-x+  2 dclarke  csw       15 Sep 24 03:16 stage1-libdecnumber
drwxr-xr-x+  4 dclarke  csw       72 Sep 24 03:08 stage1-libiberty
-rw-r--r--   1 dclarke  csw        7 Sep 24 20:58 stage_current
-rw-r--r--   1 dclarke  csw        7 Sep 24 03:01 stage_final
-rw-r--r--   1 dclarke  csw        7 Sep 24 20:58 stage_last

I'll delete this now .. I can't use it with those strange ACL's there.

$ cd /home/dclarke/test
$ rm -rf destination

I'll do some more testing with star 1.5a89 and let you know what I see.
-- Dennis Clarke dcla...@opensolaris.ca - Email related to the open source Solaris dcla...@blastwave.org - Email related to open source for Solaris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?
Chris Murray wrote:

Accidentally posted the below earlier against ZFS Code, rather than ZFS Discuss. My ESXi box now uses ZFS filesystems which have been shared over NFS. Spotted something odd this afternoon - a filesystem which I thought didn't have any files in it weighs in at 14GB. Before I start deleting the empty folders to see what happens, any ideas what's happened here?

# zfs list | grep temp
zp/nfs/esx_temp  14.0G  225G  14.0G  /zp/nfs/esx_temp
# ls -la /zp/nfs/esx_temp
total 20
drwxr-xr-x 5 root root 5 Aug 13 12:54 .
drwxr-xr-x 7 root root 7 Aug 13 12:40 ..
drwxr-xr-x 2 root root 2 Aug 13 12:53 iguana
drwxr-xr-x 2 root root 2 Aug 13 12:54 meerkat
drwxr-xr-x 2 root root 2 Aug 16 19:39 panda
# ls -la /zp/nfs/esx_temp/iguana/
total 8
drwxr-xr-x 2 root root 2 Aug 13 12:53 .
drwxr-xr-x 5 root root 5 Aug 13 12:54 ..
# ls -la /zp/nfs/esx_temp/meerkat/
total 8
drwxr-xr-x 2 root root 2 Aug 13 12:54 .
drwxr-xr-x 5 root root 5 Aug 13 12:54 ..
# ls -la /zp/nfs/esx_temp/panda/
total 8
drwxr-xr-x 2 root root 2 Aug 16 19:39 .
drwxr-xr-x 5 root root 5 Aug 13 12:54 ..
#

Could there be something super-hidden which I can't see here? There don't appear to be any snapshots relating to zp/nfs/esx_temp. On a suggestion, I have run the following:

# zfs list -r zp/nfs/esx_temp
NAME             USED   AVAIL  REFER  MOUNTPOINT
zp/nfs/esx_temp  14.0G  225G   14.0G  /zp/nfs/esx_temp
# du -sh /zp/nfs/esx_temp
8K /zp/nfs/esx_temp
#

Does "zfs list -t snapshot -r zp/nfs/esx_temp" show anything? What about "zfs get refquota,refreservation,quota,reservation zp/fs/esx_tmp"?

pardon me for butting in .. but I thought that was a spelling error. It wasn't :

# zfs get refquota,refreservation,quota,reservation fibre0
NAME    PROPERTY        VALUE  SOURCE
fibre0  refquota        none   default
fibre0  refreservation  none   default
fibre0  quota           none   default
fibre0  reservation     none   default

what the heck is refreservation ?? 8-)

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org  - Email related to open source for Solaris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
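A hedged debugging sketch for mystery space like this, assuming a build new enough to carry the usedby* breakdown properties (the dataset name is the one from the thread):

$ zfs list -t all -r zp/nfs/esx_temp
$ zfs get -r usedbydataset,usedbysnapshots,usedbyrefreservation,usedbychildren zp/nfs/esx_temp

That splits USED into the four possible hiding places in one shot, instead of guessing at snapshots, reservations and children one property at a time.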
Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?
what the heck is refreservation ?? 8-)

PSARC/2009/204 ZFS user/group quotas & space accounting [1]
Integrated in build 114

[1] http://arc.opensolaris.org/caselog/PSARC/2009/204/
[2] http://mountall.blogspot.com/2009/05/sxce-build-114-is-out.html

that was fast. Cyril, long time no hear. :-( How's life, the universe and RISC processors for you these days?

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org  - Email related to open source for Solaris

ps: I have been busy porting as per usual. New 64-bit ready Tk/Tcl released today.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool iostat reports seem odd. bug ?
- - - - - -
^C#

Non-verbose iostat data shows no (near zero) write bandwidth :

# zpool iostat phobos_rpool 5
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
phobos_rpool  16.2G  17.5G    233     36  6.98M  51.0K
phobos_rpool  16.2G  17.5G    202     10  18.6M  12.2K
phobos_rpool  16.2G  17.5G    212     15  15.5M  14.6K
phobos_rpool  16.2G  17.5G    274     43  15.5M  36.9K
phobos_rpool  16.2G  17.5G    250     24  21.1M  22.7K
phobos_rpool  16.2G  17.5G    189     15  16.8M  14.9K
phobos_rpool  16.2G  17.5G    205     21  16.8M  18.5K
^C#

I also note that the verbose output reports often show no units for read bandwidth on the new device :

# zpool iostat -v phobos_rpool 5
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
phobos_rpool  16.2G  17.5G    375     52  8.60M  74.7K
  mirror      16.2G  17.5G    375     52  8.60M  74.7K
    c1t0d0s0      -      -    112     29  6.21M  75.5K
    c1t1d0s0      -      -     59     32  3.10M  75.5K
  c0t2d0          -      -      0    343    104  13.3M
------------  -----  -----  -----  -----  -----  -----

See the 104 in the last row. That may be bytes, KB, or MB. That may be documented somewhere but I suspect it is not just bytes. Sorry if I am being nit-picky but I thought that this data would be in the kstat chain and the per-device data would be summed up for the non-verbose report. It looks like the write traffic to the new device is being ignored in the non-verbose output data.

--
Dennis Clarke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
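On the kstat point, a hedged sketch: the per-device counters are exposed raw (cumulative bytes and operations, no K/M rounding) through the sd disk IO kstats, so the summing can be checked by hand; mapping sd instances back to cXtYdZ names is left to iostat -xn:

$ kstat -p 'sd:::nread' 'sd:::nwritten'    # cumulative bytes read / written per device
$ kstat -p 'sd:::reads' 'sd:::writes'      # cumulative operation counts per device

Sampling those twice and differencing gives exact per-interval figures to compare against what zpool iostat rolls up.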
Re: [zfs-discuss] The zfs performance decrease when enable the MPxIO round-robin
To enable mpxio, you need to have mpxio-disable="no"; in your fp.conf file. You should run /usr/sbin/stmsboot -e to make this happen. If you *must* edit that file by hand, always run /usr/sbin/stmsboot -u afterwards to ensure that your system's MPxIO config is correctly updated.

I thought stmsboot was mildly broken on the latest releases of S10 and possibly confused in the ZFS world. I am going by memory here of course and this may have been fixed since I looked at it 6 months ago or so. I also feel that editing the fp.conf file manually and entering the paths chosen is perhaps the best way to go.

Also, since I'm up late and posting a comment anyways, I set up a V890 with mpxio and ZFS with every ZFS mirror being composed of an mpxio enabled device. The whole process took some time but the level of redundancy and throughput was worth it. Who knows, maybe someday mpxio will be default from install on FCAL enabled machines.

Dennis

ps: I'm going to go search for those bugids if they ever existed
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
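For the archives, a minimal sketch of the hand-edit route being debated, assuming Solaris 10 with FC HBAs attached through the fp(7D) driver:

# in /kernel/drv/fp.conf :
mpxio-disable="no";

# then let stmsboot rewrite vfstab and the dump config for the
# new /scsi_vhci device paths, and reboot:
# /usr/sbin/stmsboot -u
# init 6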
[zfs-discuss] ZFS Zpool lazy mirror ?
Pardon me but I had to change subject lines just to get out of that other thread. In that other thread .. you were saying :

dick hoogendijk uttered: true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non ECC memory should work fine!) / mirroring is a -must- !

Gavin correctly revealed: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.

Well here I run into a small issue. And timing is everything in life and this small issue is happening right in front of me as I write this.

I have a Sun Blade 2500 with 4GB of genuine Sun ECC memory ( 370-6203 [1] ) and internally there are dual Sun 72GB Ultra 320 disks ( 390-0106 ). I like to have mirrors everywhere and I also like safety. I had the brilliant idea of pulling the secondary disk in slot 1 out and installing some more ethernet and SCSI paths. So I popped in a 501-5727 ( Dual FastEthernet / Dual SCSI Ultra-2 PCI Adapter ) and then moved the internal disk out to an external disk pack. So now I still have a mirror but with dual SCSI controllers involved. When the machine boots I see this :

Rebooting with command: boot
Boot device: /p...@1d,70/s...@4/d...@0,0:a  File and args:
SunOS Release 5.10 Version Generic_141414-02 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: mercury
Loading smf(5) service descriptions: 1/1
Reading ZFS config: done.
Mounting ZFS filesystems: (5/5)

mercury console login: root
Password:
Jul 20 00:13:06 mercury login: ROOT LOGIN /dev/console
Last login: Sun Jul 19 23:41:22 on console
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
# zpool status
  pool: mercury_rpool
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  DEGRADED     0     0     0
          mirror       DEGRADED     0     0     0
            c0t0d0s0   ONLINE       0     0     0
            c1t2d0s0   UNAVAIL      0     0     0  cannot open

So I have to manually intervene and do this :

# zpool online mercury_rpool c1t2d0s0
# zpool status
  pool: mercury_rpool
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Mon Jul 20 00:13:28 2009
config:

        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c0t0d0s0   ONLINE       0     0     0
            c1t2d0s0   ONLINE       0     0     0

errors: No known data errors

This means that I do have a Zpool with mirrored ZFS boot and root and all that goodness but not unless I *know* to look at the state of the mirror after boot. The system seems to be lazy in that it does not report the DEGRADED state on the console or via syslogd.

Now I caught this, just now ( see date and kernel rev above ) and wonder .. is this not a bug ?

--
Dennis

[1] DDR266, PC2100, CL2, ECC Serial Presence Detect 1.0 1GB Registered DIMM
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
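A hedged mitigation sketch until the reporting question is answered: zpool status -x prints exactly one line when everything is fine, so it crons well (the mail wrapper below is hypothetical, plain sh):

# /usr/sbin/zpool status -x
all pools are healthy

# e.g. hourly from root's crontab:
status=`/usr/sbin/zpool status -x`
[ "$status" = "all pools are healthy" ] || echo "$status" | mailx -s "zpool degraded" root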
Re: [zfs-discuss] ZFS Zpool lazy mirror ?
self replies are so degrading ( pun intended )

I see this patch :

Document Audience: PUBLIC
Document ID: 139555-08
Title: SunOS 5.10: Kernel Patch
Copyright Notice: Copyright © 2009 Sun Microsystems, Inc. All Rights Reserved
Update Date: Fri Jul 10 04:29:40 MDT 2009

I have a sneaky feeling .. the issue was fixed in a kernel patch released *this* past week. We shall see ... I'll patch now.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Thank you.
I want to express my thanks. My gratitude. I am not easily impressed by technology anymore and ZFS impressed me this morning.

Sometime late last night a primary server of mine had a critical fault. One of the PCI cards in a V480 was the cause and for whatever reasons this destroyed the DC-DC power convertors that powered the primary internal disks. It also dropped the whole machine and 12 zones. I feared the worst and made the call for service at about midnight last night. A Sun service tech said he could be there in 2 hours or so but he asked me to check this and check that. The people at the datacenter were happy to tell me there was a wrench light on but other than that, they knew nothing.

This machine, like all critical systems I have, uses mirrored disks in ZPools with multiple links of fibre to arrays. I dreaded what would happen when we tried to boot this box after all the dust was blown out and hardware swapped. Early this morning ... I watched the detailed diags run and finally a nice clean ok prompt.

* Hardware Power On
@(#)OBP 4.22.34 2007/07/23 13:01 Sun Fire 4XX
System is initializing with diag-switch? overrides.
Online: CPU0 CPU1 CPU2 CPU3
* Validating JTAG integrity...Done
. . .
CPU0: System POST Completed
Pass/Fail Status = ...
ESB Overall Status = ...
* POST Reset
. . .
{3} ok show-post-results

System POST Results
Component:     Results
CPU/Memory:    Passed
IO-Bridge8:    Passed
IO-Bridge9:    Passed
GPTwo Slots:   Passed
Onboard FCAL:  Passed
Onboard Net1:  Passed
Onboard Net0:  Passed
Onboard IDE:   Passed
PCI Slots:     Passed
BBC0:          Passed
RIO:           Passed
USB:           Passed
RSC:           Passed
POST Message: POST PASS
{3} ok boot -s

Eventually I saw my login prompt. There were no warnings about data corruption. No data loss. No noise at all in fact. :-O

# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
fibre0   680G   654G  25.8G    96%  ONLINE  -
z0      40.2G   103K  40.2G     0%  ONLINE  -
# zpool status fibre0
  pool: fibre0
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        fibre0       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
            c5t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
        spares
          c2t22d0    AVAIL

errors: No known data errors
#

Not one error. No message about resilver this or inode that. Everything booted flawlessly and I was able to see all my zones :

# bin/lz
-----------------------------------------------------------------------
 NAME   ID  STATUS     PATH         HOSTNAME   BRAND     IP
-----------------------------------------------------------------------
z_001    4  running    /zone/z_001  pluto      solaris8  excl
z_002    -  installed  /zone/z_002  ldap01     native    shared
z_003    -  installed  /zone/z_003  openfor    solaris9  shared
z_004    6  running    /zone/z_004  gaspra     native    shared
z_005    5  running    /zone/z_005  ibisprd    native    shared
z_006    7  running    /zone/z_006  io         native    shared
z_007    1  running    /zone/z_007  nis        native    shared
z_008    3  running    /zone/z_008  callistoz  native    shared
z_009    2  running    /zone/z_009  loginz     native    shared
z_010    -  installed  /zone/z_010  venus      solaris8  shared
z_011    -  installed  /zone/z_011  adbs       solaris9  shared
z_012    -  installed  /zone/z_012  auroraux   native    shared
z_013    8  running    /zone/z_013  osiris     native    excl
z_014    -  installed  /zone/z_014  jira       native    shared

People love to complain. I see it all the time.
I downloaded this OS for free and I run it in production. I have support and I am fine with paying for support contracts. But someone somewhere needs to buy the ZFS guys some keg(s) of whatever beer they want. Or maybe new Porsche Cayman S toys. That would be gratitude as something more than just words. Thank you. -- Dennis Clarke ps: the one funny thing
Re: [zfs-discuss] first use send/receive... somewhat confused.
Richard Elling richard.ell...@gmail.com writes: You can only send/receive snapshots. However, on the receiving end, there will also be a dataset of the name you choose. Since you didn't share what commands you used, it is pretty impossible for us to speculate what you might have tried.

I thought I made it clear I had not used any commands but gave two detailed examples of different ways to attempt the move. I see now the main thing that confused me is that sending a z1/proje...@something to a new z2/proje...@something would also result in z2/projects being created. That part was not at all clear to me from the man page.

This will probably get me bombed with napalm but I often just use star from Jörg Schilling because it's dead easy :

star -copy -p -acl -sparse -dump -C old_dir . new_dir

and you're done.[1] So long as you have both the new and the old zfs/ufs/whatever[2] filesystems mounted. It doesn't matter if they are static or not. If anything changes on the filesystem then star will tell you about it.

--
Dennis

[1] -p means preserve meta-properties of the files/dirs etc.
    -acl means what it says. Grabs ACL data also.
    -sparse means what it says. Handles files with holes in them.
    -dump means be super careful about everything ( read the manpage )
[2] star doesn't care if it's zfs or ufs or a CDROM or a floppy.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Dennis Clarke dcla...@blastwave.org writes: This will probably get me bombed with napalm but I often just use star from Jörg Schilling because its dead easy : star -copy -p -acl -sparse -dump -C old_dir . new_dir and you're done.[1] So long as you have both the new and the old zfs/ufs/whatever[2] filesystems mounted. It doesn't matter if they are static or not. If anything changes on the filesystem then star will tell you about it. I'm not sure I see how that is easier. The command itself may be but it requires other moves not shown in your command. 1) zfs create z2/projects 2) star -copy -p -acl -sparse -dump -C old_dir . new_dir As a bare minimum would be required. whereas zfs send z1/proje...@snap |zfs receive z2/proje...@snap Is all that is necessary using zfs send receive, and the new filesystem z2/projects is created and populated with data from z1/projects, not to mention a snapshot at z2/projects/.zfs/snapshot sort of depends on what you want to get done and both work. dc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
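One more point in favour of send/receive, a hedged sketch assuming the same dataset names (spelling out the obfuscated snapshot names as z1/projects@snap): once the first full stream has landed, later syncs only move the delta, which the star approach cannot do:

$ zfs snapshot z1/projects@snap2
$ zfs send -i z1/projects@snap z1/projects@snap2 | zfs receive z2/projects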
Re: [zfs-discuss] zfs on 32 bit?
On Tue, 16 Jun 2009, roland wrote: so, we have a 128bit fs, but only support for 1tb on 32bit? i`d call that a bug, isn`t it ? is there a bugid for this? ;) I'd say the bug in this instance is using a 32-bit platform in 2009! :-) Rich, a lot of embedded industrial solutions are 32-bit and very up to date in terms of features. Thus : $ uname -a SunOS aequitas 5.11 snv_115 i86pc i386 i86pc $ isainfo -v 32-bit i386 applications ahf sse2 sse fxsr mmx cmov sep cx8 tsc fpu $ isalist -v i486 i386 i86 $ psrinfo -pv The physical processor has 1 virtual processor (0) x86 (CentaurHauls 6A9 family 6 model 10 step 9 clock 1200 MHz) VIA Esther processor 1200MHz Also, some of the very very small little PC units out there, those things called eePC ( or whatever ) are probably 32-bit only. Dennis ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] compression at zfs filesystem creation
On Mon, 15 Jun 2009, dick hoogendijk wrote: IF at all, it certainly should not be the DEFAULT. Compression is a choice, nothing more.

I respectfully disagree somewhat. Yes, compression should be a choice, but I think the default should be for it to be enabled.

I agree that Compression is a choice and would add : Compression is a choice and it is the default.

Just my feelings on the issue.

Dennis Clarke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
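Whichever default wins, the knob is per-dataset, so a short sketch (dataset names hypothetical) of how both camps get their way regardless of the shipped default:

$ zfs create -o compression=on tank/builds     # opt in explicitly
$ zfs set compression=off tank/media          # opt out where it cannot help
$ zfs get compression,compressratio tank/builds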
Re: [zfs-discuss] Quick adding devices question
zpool create dpool c1t0d0 c1t1d0 c1t2d0 yep And then later, when the other cable is installed: zpool attach dpool c1t0d0 c2t0d0 zpool attach dpool c1t1d0 c2t1d0 zpool attach dpool c1t2d0 c2t2d0 That is sort of the way I do things also : # zpool status pool: fibre0 state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions. scrub: resilver completed after 1h35m with 0 errors on Tue Mar 24 18:23:20 2009 config: NAME STATE READ WRITE CKSUM fibre0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c2t16d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c2t17d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c2t18d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c2t20d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c2t21d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 spares c2t22d0AVAIL errors: No known data errors You noticed that the man page is not too clear on that eh? zpool attach [-f] pool device new_device Attaches new_device to an existing zpool device. The existing device cannot be part of a raidz configuration. If device is not currently part of a mirrored configura- tion, device automatically transforms into a two-way mirror of device and new_device. If device is part of a two-way mirror, attaching new_device creates a three-way mirror, and so on. In either case, new_device begins to resilver immediately. so yeah, you have it. Want to go for bonus points? Try to read into that man page to figure out how to add a hot spare *after* you are all mirrored up. -- Dennis Clarke ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
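For the archives, the answer to that bonus question is the same verb used for adding a vdev, with the spare keyword -- exactly the invocation that appears elsewhere in this digest:

$ zpool add fibre0 spare c2t22d0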
Re: [zfs-discuss] is zpool import unSIGKILLable ?
It may be because it is blocked in kernel. Can you do something like this:

echo "0t<pid of zpool import>::pid2proc | ::walk thread | ::findstack -v" | mdb -k

So we see that it cannot complete import here and is waiting for transaction group to sync. So probably spa_sync thread is stuck, and we need to find out why.

Well, the details are going to change, I had to reboot. :-( I'll start up the stuck thread bug again here by simply starting over. I'll bet you would be able to learn a few things if you were to ssh into this machine, no? Regardless, let's start over.

dcla...@neptune:~$ uname -a
SunOS neptune 5.11 snv_111 i86pc i386 i86pc
dcla...@neptune:~$ uptime
 2:04pm up 10:13, 1 user, load average: 0.17, 0.16, 0.15
dcla...@neptune:~$ su -
Password:
Sun Microsystems Inc.  SunOS 5.11  snv_111  November 2008
#
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo       ONLINE
          c0d0p0  ONLINE
#

please see ALL the details at :
http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README
also see output from fmdump -eV
http://www.blastwave.org/dclarke/blog/files/fmdump_e.log

Please let me know what else you may need.

--
Dennis Clarke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is zpool import unSIGKILLable ?
Dennis Clarke wrote:

It may be because it is blocked in kernel. Can you do something like this:

echo "0t<pid of zpool import>::pid2proc | ::walk thread | ::findstack -v" | mdb -k

So we see that it cannot complete import here and is waiting for transaction group to sync. So probably spa_sync thread is stuck, and we need to find out why.

Well, the details are going to change, I had to reboot. :-( I'll start up the stuck thread bug again here by simply starting over. I'll bet you would be able to learn a few things if you were to ssh into this machine, no? Regardless, let's start over.

dcla...@neptune:~$ uname -a
SunOS neptune 5.11 snv_111 i86pc i386 i86pc
dcla...@neptune:~$ uptime
 2:04pm up 10:13, 1 user, load average: 0.17, 0.16, 0.15
dcla...@neptune:~$ su -
Password:
Sun Microsystems Inc.  SunOS 5.11  snv_111  November 2008
#
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo       ONLINE
          c0d0p0  ONLINE
#

please see ALL the details at :
http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README

There's a corrupted space map which is being updated as part of the txg sync; in order to update it (add a few free ops to the last block), we need to read in current content of the last block from disk first, and that fails because it is corrupted (as indicated by checksum errors in the fmdump output):

eb5c9dc0 fec1f3980 0 60 d38e1828
  PC: _resume_from_idle+0xb1  THREAD: txg_sync_thread()
  stack pointer for thread eb5c9dc0: eb5c9a28
    swtch+0x188()
    cv_wait+0x53()
    zio_wait+0x55()
    dbuf_read+0x201()
    dbuf_will_dirty+0x30()
    dmu_write+0xd7()
    space_map_sync+0x304()
    metaslab_sync+0x284()
    vdev_sync+0xc6()
    spa_sync+0x3d0()
    txg_sync_thread+0x308()
    thread_start+8()

Victor

I had to cc that back onto the ZFS list, it may be of value here.

I agree that there is something wrong, no doubt, however we should not see zpool import simply hang and become unresponsive nor should that pid be unresponsive to a SIGKILL. Good behaviour should be the norm and that is not what we see with a stuck kernel thread. Really, we should get some response to the effect that a device is corrupt or similar. Right now, what the user gets is very little information other than a non-responsive command. CTRL+C does nothing and kill -9 pid does nothing to this command.

feels like a bug to me

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is zpool import unSIGKILLable ?
Dennis Clarke wrote:

Dennis Clarke wrote:

It may be because it is blocked in kernel. Can you do something like this:

echo "0t<pid of zpool import>::pid2proc | ::walk thread | ::findstack -v" | mdb -k

So we see that it cannot complete import here and is waiting for transaction group to sync. So probably spa_sync thread is stuck, and we need to find out why.

Well, the details are going to change, I had to reboot. :-( I'll start up the stuck thread bug again here by simply starting over. I'll bet you would be able to learn a few things if you were to ssh into this machine, no? Regardless, let's start over.

dcla...@neptune:~$ uname -a
SunOS neptune 5.11 snv_111 i86pc i386 i86pc
dcla...@neptune:~$ uptime
 2:04pm up 10:13, 1 user, load average: 0.17, 0.16, 0.15
dcla...@neptune:~$ su -
Password:
Sun Microsystems Inc.  SunOS 5.11  snv_111  November 2008
#
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo       ONLINE
          c0d0p0  ONLINE
#

please see ALL the details at :
http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README

There's a corrupted space map which is being updated as part of the txg sync; in order to update it (add a few free ops to the last block), we need to read in current content of the last block from disk first, and that fails because it is corrupted (as indicated by checksum errors in the fmdump output):

eb5c9dc0 fec1f3980 0 60 d38e1828
  PC: _resume_from_idle+0xb1  THREAD: txg_sync_thread()
  stack pointer for thread eb5c9dc0: eb5c9a28
    swtch+0x188()
    cv_wait+0x53()
    zio_wait+0x55()
    dbuf_read+0x201()
    dbuf_will_dirty+0x30()
    dmu_write+0xd7()
    space_map_sync+0x304()
    metaslab_sync+0x284()
    vdev_sync+0xc6()
    spa_sync+0x3d0()
    txg_sync_thread+0x308()
    thread_start+8()

Victor

I had to cc that back onto the ZFS list, it may be of value here.

Sorry for that, I've just hit wrong button ;-)

I agree that there is something wrong, no doubt, however we should not see zpool import simply hang and become unresponsive nor should that pid be unresponsive to a SIGKILL. Good behaviour should be the norm and that is not what we see with a stuck kernel thread. Really, we should get some response to the effect that a device is corrupt or similar. Right now, what the user gets is very little information other than a non-responsive command. CTRL+C does nothing and kill -9 pid does nothing to this command.

feels like a bug to me

Yes, it is: http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

oh drat, I thought I hit something new :-\ Not very likely with ZFS, it is pretty well fleshed out all the way into the dark corners I guess.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is zpool import unSIGKILLable ?
CTRL+C does nothing and kill -9 pid does nothing to this command. feels like a bug to me

Yes, it is: http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Now I recall why I had to reboot. Seems as if a lot of commands hang now. Things like :

df -ak
zfs list
zpool list

they all just hang.

Dennis

ps: this machine is really just an embedded device based on the VIA chipset. Not too sure if that matters.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] using zdb -e -bbcsL to debug that hung thread issue
Original Message
Subject: Re: I see you're running zdb -e -bbcsL
From: Victor Latushkin victor.latush...@sun.com
Date: Sun, May 10, 2009 11:17
To: dcla...@blastwave.org

Dennis Clarke wrote:

# w
 3:14pm up 11:24, 3 users, load average: 0.46, 0.29, 0.23
User     tty      login@  idle  JCPU  PCPU  what
dclarke  console  1:22pm  1:52  2:02  1:31  /usr/lib/nwam-manager
dclarke  pts/4    1:44pm  1:10              zpool import -f -R /mnt/foo 1598
dclarke  pts/7    1:49pm     9              ssh -2 -4 -e^ -l dclarke loginz.
dclarke  pts/8    1:51pm     3              ssh -2 -4 -e^ -l dclarke mail.li
dclarke  pts/10   2:07pm    20              w
iktorn   pts/11   3:06pm     4              zpool import
iktorn   pts/12   3:13pm   1  1             zdb -e -bbcsL 159890708868077350

Now I need to go read the manual to see what zdb is :-)

thus far I see some output from that :

dcla...@neptune:~$ cat ../iktorn/zdb/zdb-ebbcsL.out

I know that will wrap all wrong for people to see. see :
http://www.blastwave.org/dclarke/blog/files/zdb-ebbcsL.README

Traversing all blocks to verify metadata checksums ...

zdb_blkptr_cb: Got error 50 reading 0, 34, 0, 0 [L0 SPA space map] 0x1000L/0x200P DVA[0]=0:0x6091e400:0x200 DVA[1]=0:0x3e091e400:0x200 DVA[2]=0:0x78091e400:0x200 fletcher4 lzjb LE contiguous birth=1461 fill=1 cksum=0x1f7bc0ee12:0x6fcfd90640d:0x10787c83addaf:0x1f3ef97a921b6f -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 35, 0, 0 [L0 SPA space map] 0x1000L/0x200P DVA[0]=0:0x6091e600:0x200 DVA[1]=0:0x3e091e600:0x200 DVA[2]=0:0x78091e600:0x200 fletcher4 lzjb LE contiguous birth=1461 fill=1 cksum=0x1f7bc0ee12:0x6fcfd90640d:0x10787c83addaf:0x1f3ef97a921b6f -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 36, 0, 0 [L0 SPA space map] 0x1000L/0x200P DVA[0]=0:0x6091e200:0x200 DVA[1]=0:0x3e091e200:0x200 DVA[2]=0:0x78091e200:0x200 fletcher4 lzjb LE contiguous birth=1461 fill=1 cksum=0x1f522c92e2:0x6ae50e6dbad:0xf0a944e70790:0x1b6468e6c6f56a -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 37, 0, 0 [L0 SPA space map] 0x1000L/0x200P DVA[0]=0:0x803d9800:0x200 DVA[1]=0:0x4003d9800:0x200 DVA[2]=0:0x7a03d9800:0x200 fletcher4 lzjb LE contiguous birth=1509 fill=1 cksum=0x1f2b0a539f:0x763a6e219f4:0x1200601439c63:0x22f1a766cefc7c -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 38, 0, 0 [L0 SPA space map] 0x1000L/0x200P DVA[0]=0:0x803d9a00:0x200 DVA[1]=0:0x4003d9a00:0x200 DVA[2]=0:0x7a03d9a00:0x200 fletcher4 lzjb LE contiguous birth=1509 fill=1 cksum=0x1f2b0a539f:0x763a6e219f4:0x1200601439c63:0x22f1a766cefc7c -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 39, 0, 0 [L0 SPA space map] 0x1000L/0x200P DVA[0]=0:0x803d9600:0x200 DVA[1]=0:0x4003d9600:0x200 DVA[2]=0:0x7a03d9600:0x200 fletcher4 lzjb LE contiguous birth=1509 fill=1 cksum=0x1f2b0a539f:0x763a6e219f4:0x1200601439c63:0x22f1a766cefc7c -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 48, 0, 0 [L0 SPA space map] 0x1000L/0x400P DVA[0]=0:0xc1263c00:0x400 DVA[1]=0:0x441263c00:0x400 DVA[2]=0:0x7c1263c00:0x400 fletcher4 lzjb LE contiguous birth=648 fill=1 cksum=0x22b93a8434:0x190afe8456c3:0x9632f68e6719b:0x2703bc59856dd31 -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 49, 0, 0 [L0 SPA space map] 0x1000L/0x400P DVA[0]=0:0xc1264000:0x400 DVA[1]=0:0x441264000:0x400 DVA[2]=0:0x7c1264000:0x400 fletcher4 lzjb LE contiguous birth=648 fill=1 cksum=0x24c7dc289e:0x1a54426b8513:0x9cb262a2e8e04:0x286474a6a40f0ea -- skipping
zdb_blkptr_cb: Got error 50 reading 0, 50, 0, 0 [L0 SPA space map] 0x1000L/0x400P DVA[0]=0:0xc1264400:0x400 DVA[1]=0:0x441264400:0x400 DVA[2]=0:0x7c1264400:0x400 fletcher4 lzjb LE contiguous birth=648 fill=1 cksum=0x24c7dc289e:0x1a54426b8513:0x9cb262a2e8e04:0x286474a6a40f0ea -- skipping

Error counts:
        errno  count
           50      9

block traversal size 1561281536 != alloc 20934112256 (unreachable 19372830720)

        bp count:                4121
        bp logical:         521589760   avg: 126568
        bp physical:        520441856   avg: 126290   compression: 1.00
        bp allocated:      1561281536   avg: 378859   compression: 0.33
        SPA allocated:    20934112256   used: 26.17%

Blocks  LSIZE  PSIZE  ASIZE    avg   comp  %Total  Type
     8  56.0K  10.0K  30.0K  3.75K   5.60    0.00  deferred free
     1    512    512  1.50K  1.50K   1.00    0.00  object directory
     2     1K     1K  3.00K  1.50K   1.00    0.00  object array
     1    16K  1.50K  4.50K  4.50K  10.67    0.00  packed nvlist
     -      -      -      -      -      -       -  packed nvlist size
     1    16K     1K  3.00K  3.00K  16.00    0.00  bplist
     -      -      -      -      -      -       -  bplist header
     -      -      -      -      -      -       -  SPA space map header
    48   192K  37.0K   111K  2.31K   5.19
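A hedged decoding of that repeated "error 50", assuming the OpenSolaris source of this era: ZFS reports checksum failures as ECKSUM, which the ZFS headers define as EBADE, and EBADE is errno 50 on Solaris. So these nine errors are checksum failures on SPA space map blocks, which lines up with the corrupted-space-map diagnosis earlier in this thread:

$ grep EBADE /usr/include/sys/errno.h
#define EBADE   50      /* invalid exchange */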
[zfs-discuss] is zpool import unSIGKILLable ?
I tried to import a zpool and the process just hung there, doing nothing. It has been ten minutes now so I tried to hit CTRL-C. That did nothing. So then I tried :

Sun Microsystems Inc.  SunOS 5.11  snv_110  November 2008
r...@opensolaris:~# ps -efl
 F S  UID  PID  PPID  C PRI NI     ADDR    SZ    WCHAN    STIME TTY     TIME CMD
 1 T root    0     0  0   0 SY fec1f318     0          10:02:47 ?       0:01 sched
 0 S root    1     0  0  40 20 d3a62448   683 d3291d32 10:02:50 ?       0:00 /sbin/init
 1 S root    2     0  0   0 SY d3a61bc0     0 fec776b0 10:02:50 ?       0:00 pageout
. . .
 0 S root 1185  1014  0  40 20 d74fd040  1943 d7f15c66 11:19:04 pts/2   0:00 zpool import -f -R /a/foo 159890708

r...@opensolaris:~# kill -9 1185
r...@opensolaris:~# ps -efl | grep root
. . .
 0 S root 1014  1008  0  50 20 d74ff260  1470 d74ff2cc 10:16:23 pts/2   0:00 -bash
 0 S root 1185  1014  0  40 20 d74fd040  1943 d7f15c66 11:19:04 pts/2   0:00 zpool import -f -R /a/foo 159890708
. . .

OKay, I'll kill the shell.

r...@opensolaris:~# kill -9 1014
r...@opensolaris:~# ps -efl | grep root
 0 S root 1185     1  0  50 20 d74fd040  1943 d7f15c66 11:19:04 pts/2   0:00 zpool import -f -R /a/foo 159890708
r...@opensolaris:~# kill -9 1185
r...@opensolaris:~# ps -efl | grep root | grep import
 0 S root 1185     1  0  50 20 d74fd040  1943 d7f15c66 11:19:04 pts/2   0:00 zpool import -f -R /a/foo 159890708
r...@opensolaris:~# kill -9 1185
r...@opensolaris:~# ps -efl | grep root | grep import
 0 S root 1185     1  0  50 20 d74fd040  1943 d7f15c66 11:19:04 pts/2   0:00 zpool import -f -R /a/foo 159890708
r...@opensolaris:~#
r...@opensolaris:~# date
Sat May  9 11:29:37 PDT 2009
r...@opensolaris:~# ps -efl | grep root | grep import
 0 S root 1185     1  0  50 20 d74fd040  1943 d7f15c66 11:19:04 pts/2   0:00 zpool import -f -R /a/foo 159890708
r...@opensolaris:~#

Seems to be permanently wedged in there.

r...@opensolaris:~# truss -faeild -p 1185
truss: unanticipated system error: 1185

So what is the trick to killing this ?

--
Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is zpool import unSIGKILLable ?
Dennis Clarke wrote:

I tried to import a zpool and the process just hung there, doing nothing. It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

This symptom is consistent with a process blocked waiting on disk I/O. Are the disks functional?

Totally. I'm running with the machine right now.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is zpool import unSIGKILLable ?
Dennis Clarke wrote:

I tried to import a zpool and the process just hung there, doing nothing. It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

This symptom is consistent with a process blocked waiting on disk I/O. Are the disks functional?

dcla...@neptune:~$ zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c0d0s0  ONLINE       0     0     0

errors: No known data errors
dcla...@neptune:~$ zpool get all rpool
NAME   PROPERTY       VALUE                SOURCE
rpool  size           74G                  -
rpool  used           11.3G                -
rpool  available      62.7G                -
rpool  capacity       15%                  -
rpool  altroot        -                    default
rpool  health         ONLINE               -
rpool  guid           3386894308818650832  default
rpool  version        14                   default
rpool  bootfs         rpool/ROOT/snv_111   local
rpool  delegation     on                   default
rpool  autoreplace    off                  default
rpool  cachefile      -                    default
rpool  failmode       continue             local
rpool  listsnapshots  off                  default
dcla...@neptune:~$ su -
Password:
Sun Microsystems Inc.  SunOS 5.11  snv_111  November 2008
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo       ONLINE
          c0d0p0  ONLINE

If I try this again .. it may just hang again. But here goes.

# mkdir /mnt/foo
# zpool import -f -R /mnt/foo 15989070886807735056

and then ... nothing happens. Not too sure what is going on here. In another window I do this and see the same thing as before :

dcla...@neptune:~$ date;ps -efl | grep root | grep import
Sat May  9 20:42:11 GMT 2009
 0 S root 1096  1088  0  50 20 df81e378  1327 d8274526 20:40:38 pts/5   0:00 zpool import -f -R /mnt/foo 1598907

I have to look into this a bit and try to figure out why I am seeing this thing foo and why can I not import it.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is zpool import unSIGKILLable ?
Dennis Clarke wrote:

I tried to import a zpool and the process just hung there, doing nothing. It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

It may be because it is blocked in kernel. Can you do something like this:

echo "0t<pid of zpool import>::pid2proc | ::walk thread | ::findstack -v" | mdb -k

dcla...@neptune:~$ date;ps -efl | grep root | grep import
Sat May  9 22:54:45 GMT 2009
 0 S root 1096  1088  0  50 20 df81e378  1327 d8274526 20:40:38 pts/5   0:00 zpool import -f -R /mnt/foo 1598907
dcla...@neptune:~$ su -
Password:
Sun Microsystems Inc.  SunOS 5.11  snv_111  November 2008
# /bin/echo "0t1096::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread e0156100: d575fc54
  d575fc94 swtch+0x188()
  d575fca4 cv_wait+0x53(d8274526, d82744e8, , 0)
  d575fce4 txg_wait_synced+0x90(d8274380, 65a, 0, 0)
  d575fd34 spa_config_update_common+0x88(e600fd40, 0, 0, d575fd68)
  d575fd84 spa_import_common+0x3cf()
  d575fdb4 spa_import+0x18(ecee4000, dfa040b0, d75b04e0, febd9444)
  d575fde4 zfs_ioc_pool_import+0xcd(ecee4000, 0, 0)
  d575fe14 zfsdev_ioctl+0xe0()
  d575fe44 cdev_ioctl+0x31(2d8, 5a02, 80424a0, 13, daf0b0b0, d575ff00)
  d575fe74 spec_ioctl+0x6b(d83b1d80, 5a02, 80424a0, 13, daf0b0b0, d575ff00)
  d575fec4 fop_ioctl+0x49(d83b1d80, 5a02, 80424a0, 13, daf0b0b0, d575ff00)
  d575ff84 ioctl+0x171()
  d575ffac sys_sysenter+0x106()

# echo ::threadlist | mdb -k

# /bin/echo ::threadlist | mdb -k | wc -l
     542
# /bin/echo ::threadlist | mdb -k > kern.thread.list
# wc -l kern.thread.list
     541 kern.thread.list

Output of the second command may be rather big, so it would be better to post it somewhere.

see http://www.blastwave.org/dclarke/blog/files/kern.thread.list

I see this line :

e0156100 df81e378 ed00fe70   zpool/1

which seems consistent with what ps says :

 F S  UID  PID  PPID  C PRI NI     ADDR    SZ    WCHAN    STIME TTY     TIME CMD
 0 S root 1096  1088  0  50 20 df81e378  1327 d8274526 20:40:38 pts/5   0:00 zpool import -f -R /mnt/foo 1598907

which is not telling me much at the moment. I'm game to play, what is next here ?

By the way, brace yourself, this is a 32-bit system and even worse than that, take a look at this isalist :

dcla...@neptune:~$ isainfo -v
32-bit i386 applications
        ahf sse2 sse fxsr mmx cmov sep cx8 tsc fpu
dcla...@neptune:~$ isalist
i486 i386 i86

Dennis Clarke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [on-discuss] Reliability at power failure?
And after some 4 days without any CKSUM error, how can yanking the power cord mess boot-stuff? Maybe because on the fifth day some hardware failure occurred? ;-) ha ha ! sorry .. that was pretty funny. -- Dennis ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Honesty after a power failure
               0     0
            c5t1d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0

errors: No known data errors

# zpool attach fibre0 c5t1d0 c2t17d0
# zpool add fibre0 spare c2t22d0
# zpool status fibre0
  pool: fibre0
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h2m, 2.86% done, 1h18m to go
config:

        NAME         STATE     READ WRITE CKSUM
        fibre0       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
            c5t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
        spares
          c2t22d0    AVAIL

errors: No known data errors
#

I have also learned that you cannot trust that resilver progress report either. It will not take 1h18m to complete. If I wait 20 minutes I'll get *nearly* the same estimate. The process must not be deterministic in nature.

# zpool status
  pool: fibre0
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h39m, 34.24% done, 1h15m to go
config:

        NAME         STATE     READ WRITE CKSUM
        fibre0       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
            c5t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
        spares
          c2t22d0    AVAIL

errors: No known data errors

  pool: z0
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        z0            ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s7  ONLINE       0     0     0
            c1t1d0s7  ONLINE       0     0     0

errors: No known data errors
# fmadm faulty -afg
#

I do TOTALLY trust that last line that says No known data errors which makes me wonder if the Severe FAULTs are for unknown data errors :-)

--
Dennis Clarke

sig du jour : "An appeaser is one who feeds a crocodile, hoping it will eat him last.", Winston Churchill

[1] I really want to know where PowerChute for Solaris went to.
[2] I would create a ZPool of striped mirrors based on multiple USB keys and on disks on IDE/SATA with or without compression and with copies={1|2|3} and while running a ON compile I'd pull the USB keys out and yank the power on the IDE/SATA or fibre disks. ZFS would not throw a fatal error nor drop a bit of data. Performance suffered but data did not.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
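Since the estimate keeps moving, a hedged sketch for actually measuring it -- log the progress line on an interval and watch whether "to go" ever converges (plain sh, pool name from the message above):

$ while :; do date; zpool status fibre0 | grep "in progress"; sleep 300; done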
Re: [zfs-discuss] ZFS Honesty after a power failure
On Tue, 24 Mar 2009, Dennis Clarke wrote: However, I have repeatedly run into problems when I need to boot after a power failure. I see vdevs being marked as FAULTED regardless if there are actually any hard errors reported by the on disk SMART Firmware. I am able to remove these FAULTed devices temporarily and then re-insert the same disk again and then run fine for months. Until the next long power failure.

In spite of huge detail, you failed to describe to us the technology used to communicate with these disks. The interface adaptors, switches, and wiring topology could make a difference.

Nothing fancy. Dual QLogic ( Sun ) fibre cards directly connected to the back of A5200's. Simple really.

Is there *really* a severe fault in that disk ?

# luxadm -v display 2118625d599d

This sounds like some sort of fiber channel.

Transport protocol: IEEE 1394 (SBP-2)

Interesting that it mentions the protocol used by FireWire. I have no idea where that is coming from.

If you are using fiber channel, the device names in the pool specification suggest that Solaris multipathing is not being used (I would expect something long like c4t600A0B800039C9B50A9C47B4522Dd0). If multipathing is not used, then you either have simplex connectivity, or two competing simplex paths to each device. Multipathing is recommended if you have redundant paths available.

Yes, I have another machine that has mpxio in place. However a power failure also trips phantom faults.

If the disk itself is not aware of its severe faults then that suggests that there is a transient problem with communicating with the disk.

You would think so eh? But a transient problem that only occurs after a power failure?

The problem could be in a device driver, adaptor card, FC switch, or cable. If the disk drive also lost power, perhaps the disk is unusually slow at spinning up.

All disks were up at boot, you can see that when I ask for a zpool status at boot time in single user mode. No errors and no faults. The issue seems to be when fmadm starts up or perhaps some other service that can throw a fault. I'm not sure.

It is easy to blame ZFS for problems.

It is easy to blame a power failure for problems as well as a nice shiny new APC Smart-UPS XL 3000VA RM 3U unit with external extended run time battery that doesn't signal a power failure. I never blame ZFS for anything.

On my system I was experiencing system crashes overnight while running 'zfs scrub' via cron job. The fiber channel card was locking up. Eventually I learned that it was due to a bug in VirtualBox's device driver. If VirtualBox was not left running overnight, then the system would not crash.

VirtualBox ? This is a Solaris 10 machine. Nothing fancy. OKay, sorry, nothing way out in the field fancy like VirtualBox.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Honesty after a power failure
On Tue, 24 Mar 2009, Dennis Clarke wrote: You would think so eh? But a transient problem that only occurs after a power failure?

Transient problems are most common after a power failure or during initialization.

Well the issue here is that power was on for ten minutes before I tried to do a boot from the ok prompt. Regardless, the point is that the ZPool shows no faults at boot time and then shows phantom faults *after* I go to init 3. That does seem odd.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Honesty after a power failure
Hey, Dennis - I can't help but wonder if the failure is a result of zfs itself finding some problems post restart...

Yes, yes, this is what I am feeling also, but I need to find the data also and then I can sleep at night. I am certain that ZFS does not just toss out faults on a whim because there must be a deterministic, logical and code based reason for those faults that occur *after* I go to init 3.

Is there anything in your FMA logs?

Oh God yes, brace yourself :-)

http://www.blastwave.org/dclarke/zfs/fmstat.txt

[ I edit the whitespace here for clarity ]

# fmstat
module              ev_recv ev_acpt wait  svc_t     %w %b open solve memsz bufsz
cpumem-diagnosis          0       0  0.0       2.7   0  0    3     0  4.2K  1.1K
cpumem-retire             0       0  0.0       0.2   0  0    0     0     0     0
disk-transport            0       0  0.0      45.7   0  0    0     0   40b     0
eft                       0       0  0.0       0.7   0  0    0     0  1.2M     0
fabric-xlate              0       0  0.0       0.7   0  0    0     0     0     0
fmd-self-diagnosis        3       0  0.0       0.2   0  0    0     0     0     0
io-retire                 0       0  0.0       0.2   0  0    0     0     0     0
snmp-trapgen              2       0  0.0       1.7   0  0    0     0   32b     0
sysevent-transport        0       0  0.0      75.4   0  0    0     0     0     0
syslog-msgs               2       0  0.0       1.4   0  0    0     0     0     0
zfs-diagnosis           296     252  2.0  236719.7  98  0    1     2  176b  144b
zfs-retire                4       0  0.0      27.4   0  0    0     0     0     0

zfs-diagnosis svc_t=236719.7 ?

for a summary and fmdump for a summary of the related errors

http://www.blastwave.org/dclarke/zfs/fmdump.txt

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Dec 05 21:31:46.1069 aa3bfcfa-3261-cde4-d381-dae8abf296de ZFS-8000-D3
Mar 07 08:46:43.6238 4c8b199b-add1-c3fe-c8d6-9deeff91d9de ZFS-8000-FD
Mar 07 19:37:27.9819 b4824ce2-8f42-4392-c7bc-ab2e9d14b3b7 ZFS-8000-FD
Mar 07 19:37:29.8712 af726218-f1dc-6447-f581-cc6bb1411aa4 ZFS-8000-FD
Mar 07 19:37:30.2302 58c9e01f-8a80-61b0-ffea-ded63a9b076d ZFS-8000-FD
Mar 07 19:37:31.6410 3b0bfd9d-fc39-e7c2-c8bd-879cad9e5149 ZFS-8000-FD
Mar 10 19:37:08.8289 aa3bfcfa-3261-cde4-d381-dae8abf296de FMD-8000-4M Repaired
Mar 23 23:47:36.9701 2b1aa4ae-60e4-c8ef-8eec-d92a18193e7a ZFS-8000-FD
Mar 24 01:29:00.1981 3780a2dd-7381-c053-e186-8112b463c2b7 ZFS-8000-FD
Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD

# fmdump -vu 3780a2dd-7381-c053-e186-8112b463c2b7
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 01:29:00.1981 3780a2dd-7381-c053-e186-8112b463c2b7 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io
        Problem in: zfs://pool=fibre0/vdev=444604062b426970
           Affects: zfs://pool=fibre0/vdev=444604062b426970
               FRU: -
          Location: -

# fmdump -vu 146dad1d-f195-c2d6-c630-c1adcd58b288
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io
        Problem in: zfs://pool=fibre0/vdev=23e4d7426f941f52
           Affects: zfs://pool=fibre0/vdev=23e4d7426f941f52
               FRU: -
          Location: -

will show more and more information about the error. Note that some of it might seem like rubbish. The important bits should be obvious though - things like what the SUNW error message is (like ZFS-8000-D3), which can be pumped into sun.com/msg like so :

http://www.sun.com/msg/ZFS-8000-FD

or see http://www.blastwave.org/dclarke/zfs/ZFS-8000-FD.txt

Article for Message ID: ZFS-8000-FD
Too many I/O errors on ZFS device
Type: Fault
Severity: Major
Description: The number of I/O errors associated with a ZFS device exceeded acceptable levels.
Automated Response: The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.
Impact: The fault tolerance of the pool may be affected.

Yep, I agree, that is what I saw.

Note also that there should also be something interesting in the /var/adm/messages log to match any 'faulted' devices.
You might also find an 'fmdump -e' spooky.

long list of events :

TIME                 CLASS
Mar 23 23:47:28.5586 ereport.fs.zfs.io
Mar 23 23:47:28.5594 ereport.fs.zfs.io
Mar 23 23:47:28.5588 ereport.fs.zfs.io
Mar 23 23:47:28.5592 ereport.fs.zfs.io
Mar 23 23:47:28.5593 ereport.fs.zfs.io
. . .
Mar 23 23:47:28.5622 ereport.fs.zfs.io
Mar 23 23:47:28.5560 ereport.fs.zfs.io
Mar 23 23:47:28.5658 ereport.fs.zfs.io
Mar 23 23:48:41.5957 ereport.fs.zfs.io

http://www.blastwave.org/dclarke/zfs/fmdump_e.txt

ouch, that is a nasty long list all in a few seconds.

and fmdump -eV a very detailed verbose long list with such entries as

Mar 23 2009 23:48:41.595757900 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena =
[zfs-discuss] Question about zpool create parameter version
version 10. The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices

For more information on a particular version, including supported releases, see: http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number. For example, version 4 :

# zpool destroy fibre00
# zpool create -o autoreplace=on -o version=4 -m legacy \
    fibre00 \
    mirror c8t2004CFAC0E97d0 c8t202037F859F1d0 \
    mirror c8t2004CFB53F97d0 c8t202037F84044d0 \
    mirror c8t2004CFA3C3F2d0 c8t2004CF2FCE99d0 \
    mirror c8t2004CF9645A8d0 c8t2004CFA3F328d0 \
    mirror c8t202037F812EAd0 c8t2004CF96FF00d0 \
    mirror c8t2004CFAC489Fd0 c8t2004CF961853d0

Does the keyword "current" work in some other fashion ?

--
Dennis Clarke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
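A minimal sanity check on the version actually negotiated, assuming the pool created above -- version is an ordinary pool property, and simply omitting -o version at create time should yield the newest version the build supports:

# zpool get version fibre00     # expect VALUE 4 here after the create above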
[zfs-discuss] zpool import minor bug in snv_64a
Not sure if this has been reported or not. This is fairly minor but slightly annoying.

After a fresh install of snv_64a I run zpool import to find this :

# zpool import
  pool: zfs0
    id: 13628474126490956011
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'.
config:

        zfs0         ONLINE
          mirror     ONLINE
            c1t9d0   ONLINE
            c0t9d0   ONLINE
          mirror     ONLINE
            c1t10d0  ONLINE
            c0t10d0  ONLINE
          mirror     ONLINE
            c1t11d0  ONLINE
            c0t11d0  ONLINE
          mirror     ONLINE
            c1t12d0  ONLINE
            c0t12d0  ONLINE
          mirror     ONLINE
            c1t13d0  ONLINE
            c0t13d0  ONLINE
          mirror     ONLINE
            c1t14d0  ONLINE
            c0t14d0  ONLINE

So I then run a zpool import but I add in the -R option and specify root thus :

# zpool import -f -R / 13628474126490956011

One would think that the -R / would not result in any damage but this is the result :

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zfs0               191G  8.23G  24.5K  legacy
zfs0/SUNWspro      567M   201M   567M  //opt/SUNWspro
zfs0/backup        190G  8.23G   189G  //export/zfs/backup
zfs0/backup/qemu  1.09G   934M  1.09G  //export/zfs/qemu
zfs0/csw           124M  3.88G   124M  //opt/csw
zfs0/home          239M  7.77G   239M  //export/home
zfs0/titan        24.5K  8.23G  24.5K  //export/zfs/titan

Note the extra / there that should not be there. Not a simple thing to fix either :

# zfs set mountpoint=/opt/SUNWspro zfs0/SUNWspro
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zfs0               191G  8.23G  24.5K  legacy
zfs0/SUNWspro      567M   201M   567M  //opt/SUNWspro
zfs0/backup        190G  8.23G   189G  //export/zfs/backup
zfs0/backup/qemu  1.09G   934M  1.09G  //export/zfs/qemu
zfs0/csw           124M  3.88G   124M  //opt/csw
zfs0/home          239M  7.77G   239M  //export/home
zfs0/titan        24.5K  8.23G  24.5K  //export/zfs/titan

relatively harmless. Looks like altroot should be assumed to be / unless otherwise specified and if it is specified to be / then the altroot can be ignored. I don't know if that is clear but I think you know what I mean : in /usr/src/cmd/zpool/zpool_main.c :

static int
do_import(nvlist_t *config, const char *newname, const char *mntopts,
    const char *altroot, int force, int argc, char **argv)

if that const char *altroot happens to be nothing more than a forward slash char ( nul terminated ) then I think it should be ignored. What say you ?

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import minor bug in snv_64a
in /usr/src/cmd/zpool/zpool_main.c : at line 680 forwards we can probably check for this scenario :

if ( ( altroot != NULL ) && ( altroot[0] != '/') ) {
        (void) fprintf(stderr, gettext("invalid alternate root '%s': "
            "must be an absolute path\n"), altroot);
        nvlist_free(nvroot);
        return (1);
}
/* some altroot has been specified
 *
 * thus altroot[0] and altroot[1] exist */
else if ( ( altroot[0] = '/') && ( altroot[1] = '\0') ) {
        (void) fprintf(stderr, "Do not specify / as alternate root.\n");
        nvlist_free(nvroot);
        return (1);
}

not perfect .. but something along those lines.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import minor bug in snv_64a
On Mon, Jun 25, 2007 at 02:34:21AM -0400, Dennis Clarke wrote:

note that it was well after 2 AM for me .. half blind asleep

that's my excuse .. I'm sticking to it. :-)

in /usr/src/cmd/zpool/zpool_main.c : at line 680 forwards we can probably check for this scenario :

if ( ( altroot != NULL ) && ( altroot[0] != '/') ) {
        (void) fprintf(stderr, gettext("invalid alternate root '%s': "
            "must be an absolute path\n"), altroot);
        nvlist_free(nvroot);
        return (1);
}
/* some altroot has been specified
 *
 * thus altroot[0] and altroot[1] exist */
else if ( ( altroot[0] = '/') && ( altroot[1] = '\0') ) {

s/=/==/

yep ... that's what I intended. The above would bork royally.

        (void) fprintf(stderr, "Do not specify / as alternate root.\n");

You need gettext() here.

why ?

        nvlist_free(nvroot);
        return (1);
}

not perfect .. but something along those lines.

even worse .. I was looking in the wrong section of the code or zpool_main.c

if I get coffee and wake up .. maybe I can take another kick at that eh?

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import minor bug in snv_64a
You've tripped over a variant of: 6335095 Double-slash on /. pool mount points - Eric oh well .. no points for originality then I guess :-) Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: ZFS disables nfs/server on a host
On 4/27/07, Ben Miller [EMAIL PROTECTED] wrote: I just threw in a truss in the SMF script and rebooted the test system and it failed again. The truss output is at http://www.eecis.udel.edu/~bmiller/zfs.truss-Apr27-2007

324:    read(7, 0x000CA00C, 5120)                       = 0
324:    llseek(7, 0, SEEK_CUR)                          Err#29 ESPIPE
324:    close(7)                                        = 0
324:    waitid(P_PID, 331, 0xFFBFE740, WEXITED|WTRAPPED) = 0

llseek(7, 0, SEEK_CUR) returns Err#29 ESPIPE . so then .. what's that mean?

ERRORS
    The llseek() function will fail if:
    ESPIPE    The fildes argument is associated with a pipe or FIFO.

dunno if that helps

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FYI: X4500 (aka thumper) sale
On 4/23/07, Richard Elling [EMAIL PROTECTED] wrote: FYI, Sun is having a big, 25th Anniversary sale. X4500s are half price -- 24 TBytes for $24k. ZFS runs really well on a X4500. http://www.sun.com/emrkt/25sale/index.jsp?intcmp=tfa5101 I apologize to those not in the US or UK who can't take advantage of the sale.

I really don't think that advertisements are the right thing to drop into these mailing lists.

Dennis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] *** High Praise for ZFS and NFS services ***
Dear ZFS and OpenSolaris people :

I recently upgraded a large NFS server upwards from Solaris 8. This is a production manufacturing facility with football-field-sized factory floors and 25 tonne steel products. Many on-site engineers on AIX and CATIA as well as Solaris users and Windows and everything you can shake a stick at. Everything in this place must rest in central storage that is bulletproof and fast. The NFS server is like the company's vault for its valuables and its future.

After serious consideration looking at a number of products I can say that *nothing* comes close to the value of ZFS and Solaris. Nothing comes close to the speed either. With instant quota control, like turning a dial, we were able to deliver terabytes of storage to all users but only expose as much or as little as we want. On the fly. That impressed everyone involved.

I received this yesterday.

:: Verbatim eMail

Dennis: I have all the data transferred over and reconfigured an AIX machine for testing tomorrow. I was surprised that nobody came to me and noticed there was 200 Gig of disk space available. Or that performance was much faster. Our specs showed a 5x performance increase. I also did a seat of the pants test ... Much faster as well. Windows drive mappings were almost instant upon login. Large CAD models and assemblies came up in seconds compared to minutes ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Preferred backup mechanism for ZFS?
On 4/18/07, Nicolas Williams [EMAIL PROTECTED] wrote: On Wed, Apr 18, 2007 at 03:47:55PM -0400, Dennis Clarke wrote: Maybe with a definition of what a backup is and then some way to achieve it. As far as I know the only real backup is one that can be tossed into a vault and locked away for seven years. Or any arbitrary amount of time within reason. Like a decade or a century. But perhaps a backup today will have as much meaning as paper tape over time. Can we discuss this with a few objectives ? Like define backup and then describe mechanisms that may achieve one? Or a really big question that I guess I have to ask, do we even care anymore? As far as ZFS is concerned any discussion of how you'll read today's media a decade into the future is completely OT :) probably. Media should have a shelf life of seven years minimum and probably a whole lot longer. The technology ( QIC, 4mm DAT, DLT etc etc ) should be available and around for a long long time. zfs send as backup is probably not generally acceptable: you can't expect to extract a single file out of it (at least not out of an incremental zfs send), but that's certainly done routinely with ufsdump, tar, cpio, ... Also, why not just punt to NDMP? .. let's look at it. http://www.ndmp.org/products/ That's a fair list of companies there. The SDK looks to be alpha stage or maybe beta : Q: What good is the NDMP SDK? The NDMP software developers kit was developed to prototype new NDMP functionality and provides a functional (although fairly basic) implementation of an NDMP client and NDMP server. The objective of the SDK is to facilitate rapid development of NDMP compliant clients and servers on a variety of platforms. Third parties are welcome to download and make use of the provided source code within your products (subject to copyright notices supplied) or as example/reference code. Okay .. that's a good candidate to look at. dc
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
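[ As a concrete example of the "zfs send as backup" approach being debated, here is a minimal sketch; the pool, snapshot, and tape device names are assumptions, and note the caveat above that single-file restore from such a stream is not possible : ]

# zfs snapshot tank/home@level0                       # point-in-time image to dump
# zfs send tank/home@level0 | dd obs=1048576 of=/dev/rmt/0mbn
# ... and later, to restore the whole dataset from that tape :
# dd if=/dev/rmt/0mbn ibs=1048576 | zfs receive tank/home_restored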
[zfs-discuss] modify zpool_main.c for raw iostat data
WARNING : long and verbose .. sorry. ALL : While doing various performance tests and measurements of IO rates with a zpool I found that the output from zpool iostat poolname was not really ready for plotting by gnuplot. The output from zpool iostat poolname looks like so :

# zpool iostat zpool0 15
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zpool0       292K  1.63T     48     91  2.96M  6.10M
zpool0       292K  1.63T      0      0      0      0
zpool0      10.0M  1.63T      0    373      0   878K
zpool0      27.4M  1.63T      0    567      0  1.36M
zpool0      43.9M  1.63T      0    612      0  1.44M
zpool0      72.1M  1.63T      0    566      0  1.32M
zpool0       101M  1.63T      0  1.16K      0  2.80M
zpool0       156M  1.63T      0  1.17K      0  2.81M
zpool0       183M  1.63T      0  1.13K      0  2.73M
zpool0       237M  1.63T      0  1.14K      0  2.74M
. . .

I note that the headers there are hard coded into zpool_main.c and there appears to be no option to disable them, thus :

static void
print_iostat_header(iostat_cbdata_t *cb)
{
	(void) printf("%*s     capacity     operations    bandwidth\n",
	    cb->cb_namewidth, "");
	(void) printf("%-*s   used  avail   read  write   read  write\n",
	    cb->cb_namewidth, "pool");
	print_iostat_separator(cb);
}

and here we see :

static void
print_iostat_separator(iostat_cbdata_t *cb)
{
	int i = 0;

	for (i = 0; i < cb->cb_namewidth; i++)
		(void) printf("-");
	(void) printf("  -----  -----  -----  -----  -----  -----\n");
}

However these headers are only ever printed once while the verbose option -v is NOT in effect. We do see them repeated over and over when the verbose flag is engaged :

# zpool iostat -v zpool0 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zpool0       148K  1.63T      0     21    635  2.64M
  mirror    14.5K   278G      0      3    112   449K
    c2t8d0      -      -      0      3     52   449K
    c3t8d0      -      -      0      3     95   449K
  mirror      47K   278G      0      3     77   449K
    c2t9d0      -      -      0      3     52   449K
    c3t9d0      -      -      0      3     63   449K
  mirror    47.5K   278G      0      3     69   449K
    c2t10d0     -      -      0      3     52   449K
    c3t10d0     -      -      0      3     42   449K
  mirror      14K   278G      0      3     97   451K
    c2t11d0     -      -      0      3     74   451K
    c3t11d0     -      -      0      3     52   451K
  mirror       6K   278G      0      3    125   451K
    c2t12d0     -      -      0      3     73   451K
    c3t12d0     -      -      0      3     94   451K
  mirror    19.5K   278G      0      3    154   451K
    c2t13d0     -      -      0      3    147   451K
    c3t13d0     -      -      0      3     31   451K
----------  -----  -----  -----  -----  -----  -----

It seems perfectly reasonable to have those headers repeated over and over when we expect verbose reports. The issue that I have is with the summary reports where we get the headers only once. I wanted to take iostat data from zpool iostat poolname and be able to directly pass it through awk and into a datafile for processing into nice graphs. I find that gnuplot does this nicely. However the units of the data may vary from Kilobytes ( K ) to Megabytes ( M ) and perhaps a simple digit zero with no suffix at all. I can only presume that we may also see Gigabytes ( G ) and perhaps some day even Terabytes ( T ). I was thinking that we could add in an option for the iostat subcommand for raw output as simple integers. Here we could have a -r option which would work only when the verbose option is NOT present. At least for now. So then we see this :

static const char *
get_usage(zpool_help_t idx)
{
	switch (idx) {
	. . .
	/*
	 * I may have the option syntax wrong here but the intent is that
	 * one may specify the -r OR the -v but not both at the same time.
	 */
	case HELP_IOSTAT:
		return (gettext("\tiostat {[-r]|[-v]} [pool] ... "
		    "[interval [count]]\n"));
	. . .
	}
	abort();
	/* NOTREACHED */
}

The iostat_cbdata struct would need a new int element also :

typedef struct iostat_cbdata {
	zpool_list_t *cb_list;
	/*
	 * The cb_raw int is added here by Dennis Clarke
	 */
	int cb_raw;
	int cb_verbose;
	int cb_iteration;
	int cb_namewidth;
} iostat_cbdata_t;

I don't think that any change to print_vdev_stats is required because the creation of the suffixes seems to occur with print_one_stat :

/*
 * Display a single statistic.
 */
void
print_one_stat(uint64_t value)
{
	char buf[64];

	zfs_nicenum(value, buf, sizeof (buf));
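[ Until a -r flag exists, the suffix problem can be worked around after the fact in awk; a rough sketch, assuming the plain "zpool iostat zpool0 15" output shown above, handling only the K/M/G/T suffixes, and multiplying by powers of 1024 to match what zfs_nicenum() does when it formats the numbers : ]

zpool iostat zpool0 15 | awk '
    function raw(s) {
        n = s + 0                          # awk stops converting at the suffix
        if (s ~ /K$/) n *= 1024
        if (s ~ /M$/) n *= 1048576
        if (s ~ /G$/) n *= 1073741824
        if (s ~ /T$/) n *= 1099511627776
        return n
    }
    $1 == "zpool0" {                       # skip the repeated header lines
        print raw($4), raw($5), raw($6), raw($7)   # ops and bandwidth columns
    }' > iostat.dat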
[zfs-discuss] zpool iostat : This command can be tricky ...
I really need to take a longer look here.

/*
 * zpool iostat [-v] [pool] ... [interval [count]]
 *
 *	-v	Display statistics for individual vdevs
 *
 * This command can be tricky because we want to be able to deal with pool
 . . .

I think I may need to deal with a raw option here ?

	/*
	 * Enter the main iostat loop.
	 */
	cb.cb_list = list;
	cb.cb_verbose = verbose;
	cb.cb_iteration = 0;
	cb.cb_namewidth = 0;

Hopefully you can see what I am trying to do ( see previous post ) is just get the raw data, and I may do a quick hack to look at it. So until I get a clean compile that dumps out the data .. it may be best to ignore me :-) Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: update on zfs boot support
Robert Milkowski wrote: Hello Ivan, Sunday, March 11, 2007, 12:01:28 PM, you wrote: IW Got it, thanks, and a more general question: in a single disk IW root pool scenario, what advantage will zfs provide over ufs w/ IW logging? And when zfs boot is integrated in Nevada, will live upgrade work with zfs root? Snapshots/clones + live upgrade or standard patching. Additionally no more hassle with separate /opt /var ... Potentially also compression turned on for /var - just to add to Robert's list, here are other advantages ZFS on root has over UFS, even on a single disk:

* knowing when your data starts getting corrupted (if your disk starts failing, and what data is being lost)
* ditto blocks to take care of filesystem metadata consistency
* performance improvements over UFS
* ability to add disks to mirror the root filesystem at any time, should they become available
* ability to use free space on the root pool, making it available for other uses (by setting a reservation on the root filesystem, you can ensure that / always has sufficient available space)

- am I missing any others ?

* ability to show off to your geeky friends who will all say neato!

Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
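[ To make the snapshots-plus-patching advantage concrete, a sketch of the sort of workflow a ZFS root would allow; the dataset name is invented here, since zfs boot was not yet integrated at the time, and rolling back a live root would in practice involve a reboot : ]

# zfs snapshot rpool/ROOT@prepatch     # cheap safety net before patching
# patchadd 118855-36                   # apply the patch as usual
# ... and if the patch misbehaves, roll the whole root back :
# zfs rollback rpool/ROOT@prepatch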
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
You don't honestly, really, reasonably, expect someone, anyone, to look at the stack well of course he does :-) and I looked at it .. all of it and I can tell exactly what the problem is but I'm not gonna say, because it's a trick question. so there. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [osol-help] How to recover from rm *?
On Sun, 18 Feb 2007, Calvin Liu wrote: I wanted to run the command rm Dis* in a folder but mis-typed a space in it so it became rm Dis *. Unfortunately I had pressed the return key before I noticed the mistake. So you all know what happened... :( :( :( Ouch! How can I get the files back in this case? You restore them from your backups. I haven't backed them up. This is one ( of many ) reasons why ZFS just rocks. A snapshot would have saved you. I don't consider a snapshot to be an actual backup however. I define a backup as something that you can actually restore to bare metal when your entire datacenter has vanished into a black hole. That means a tape, generally. In the Lotus Notes/Domino world there is a very nice feature where you can have soft-deletions. Essentially you can delete a record from a database and then still do a recovery if needed within a given retention time period. Perhaps a soft-deletion feature in ZFS would be nice. It would allow a sysadmin or maybe even a user to delete something and then come back later, check a deletion log and possibly just unrm the file. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
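[ To spell out the "a snapshot would have saved you" point, a minimal sketch; the dataset name and snapshot schedule are assumptions, and the hidden .zfs directory is reachable even with snapdir=hidden : ]

# zfs snapshot tank/home@nightly                   # taken before the accident, e.g. from cron
$ rm Dis *                                         # the fatal typo
$ cp /tank/home/.zfs/snapshot/nightly/Dis* .       # copy the lost files straight back out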
Re: Re[2]: [zfs-discuss] 118855-36 ZFS
/* Warning : soapbox speech ahead */ Something here is broken. As a rule, don't trust smpatch. Don't trust the freeware pca either. Either one may or may not include patches that you don't need, or they may list patches you do need or seem to need, but once you apply them you find your system buggered up in some way. So, in my opinion, patches are like Russian roulette. So very carefully apply what you know you *need* based on actually looking in the patch readme files. The recommended pile of patches is 99.9% safe and then outside of that you have to pick and choose. Since I am on a soapbox here, I may as well be in for a pound as well as the penny. I like to install what I call a reference edition of Solaris. An update release like Solaris 10 Update 3 or Solaris 9 Update 8. These releases are generally very well tested and you can install them and run them in a very stable fashion long term. Once you add a single patch to that system you have wandered away from "this is shipped on media" to somewhere else. -- Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] impressive
boldly plowing forwards I request a few disks/vdevs to be mirrored all at the same time :

bash-3.2# zpool status zfs0
  pool: zfs0
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Feb  1 04:17:58 2007
config:

        NAME         STATE     READ WRITE CKSUM
        zfs0         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
          c1t11d0    ONLINE       0     0     0
          c1t12d0    ONLINE       0     0     0
          c1t13d0    ONLINE       0     0     0
          c1t14d0    ONLINE       0     0     0

errors: No known data errors

bash-3.2# zpool attach -f zfs0 c1t11d0 c0t11d0
bash-3.2# zpool attach -f zfs0 c1t12d0 c0t12d0
bash-3.2# zpool attach -f zfs0 c1t13d0 c0t13d0
bash-3.2# zpool attach -f zfs0 c1t14d0 c0t14d0

needless to say there is some thrashing going on

bash-3.2# zpool status zfs0
  pool: zfs0
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.00% done, 45h14m to go
config:

        NAME         STATE     READ WRITE CKSUM
        zfs0         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t14d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0

errors: No known data errors
bash-3.2#

moments later I see :

bash-3.2# zpool status zfs0
  pool: zfs0
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 1.59% done, 2h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        zfs0         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t14d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0

errors: No known data errors
bash-3.2#

bash-3.2# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs scsi_vhci sd ip hook neti sctp arp usba nca zfs random audiosup sppp crypto ptm md logindmux cpc wrsmd fcip fctl fcp nfs ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      79986               624   71%
Anon                        16131               126   14%
Exec and libs                1830                14    2%
Page cache                    533                 4    0%
Free (cachelist)              934                 7    1%
Free (freelist)             13662               106   12%

Total                      113076               883
Physical                   111514               871
bash-3.2#

so in a few hours I will have decent redundancy, all on snv_55b ... looking very very fine -- Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] panic with zfs
Hello, We're setting up a new mailserver infrastructure and decided to run it on zfs. On an E220R with a D1000, I've set up a storage pool with four mirrors: Good morning Ihsan ... I see that you have everything mirrored here, that's excellent. When you pulled a disk, was it a disk that was containing a metadevice or was it a disk in the zpool ? In the case of a metadevice, as you know, the system should have kept running fine. We have probably both done this over and over at various sites to demonstrate SVM to people. If you pulled out a device in the zpool, well, now we are in a whole new world, and I had heard that there was some *feature* in Solaris now that will protect the ZFS file system integrity by simply causing a system to panic if the last device in some redundant component was compromised. I think you hit a major bug in ZFS, personally. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] panic with zfs
Hello Michael, Am 24.1.2007 14:36 Uhr, Michael Schuster schrieb: -- [EMAIL PROTECTED] # zpool status
  pool: pool0
 state: ONLINE
 scrub: none requested
config: [...]

Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=3e81600:
Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to lack of DiskSuite state
Jan 23 18:51:38 newponit database replicas. Fewer than 50% of the total were available,
Jan 23 18:51:38 newponit so panic to ensure data integrity.

this message shows (and the rest of the stack proves) that your panic happened in SVM. It has NOTHING to do with zfs. So either you pulled the wrong disk, or the disk you pulled also contained SVM volumes (next to ZFS). I noticed that the panic was in SVM and I'm wondering why the machine was hanging. SVM is only running on the internal disks (c0) and I pulled a disk from the D1000: so the device that was affected had nothing to do with SVM at all. fine ... I have the exact same config here. Internal SVM and then external ZFS on two disk arrays on two controllers.

Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd50):
Jan 23 17:24:14 newponit 	SCSI transport failed: reason 'incomplete': retrying command
Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd50):
Jan 23 17:24:14 newponit 	disk not responding to selection
Jan 23 17:24:18 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd50):
Jan 23 17:24:18 newponit 	disk not responding to selection

This is clearly the disk with ZFS on it: SVM has nothing to do with this disk. A minute later, the troubles started with the internal disks: Okay .. so are we back to looking at ZFS, or ZFS and the SVM components, or some interaction between these kernel modules? At this point I have to be careful not to fall into a pit of blind ignorance as I grope for the answer. Perhaps some data would help. Was there a core file in /var/crash/newponit ?

Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
Jan 23 17:25:26 newponit 	Cmd (0x6a3ed10) dump for Target 0 Lun 0:
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
Jan 23 17:25:26 newponit 	cdb=[ 0x28 0x0 0x0 0x78 0x6 0x30 0x0 0x0 0x10 0x0 ]
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
Jan 23 17:25:26 newponit 	pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
Jan 23 17:25:26 newponit 	pkt_scbp=0x0 cmd_flags=0x860
Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
Jan 23 17:25:26 newponit 	Disconnected tagged cmd(s) (1) timeout for Target 0.0

so a pile of scsi noise above there .. one would expect that from a suddenly missing scsi device.

Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: Disconnected tagged cmd(s) (1) timeout for Target 0.0

NCR scsi controllers .. what OS revision is this ? Solaris 10 u3 ? Solaris Nevada snv_55b ?
Jan 23 17:25:26 newponit glm: [ID 401478 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6018]
Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
Jan 23 17:25:26 newponit 	got SCSI bus reset
Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available

SVM and ZFS disks are on separate SCSI buses, so theoretically there shouldn't be any impact on the SVM disks when I pull out a ZFS disk. I still feel that you hit a bug in ZFS somewhere. Under no circumstances should a Solaris server panic and crash simply because you pulled out a single disk that was totally mirrored. In fact .. I will reproduce those conditions here and then see what happens for me. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] panic with zfs
Am 24.1.2007 14:59 Uhr, Dennis Clarke schrieb: Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: Disconnected tagged cmd(s) (1) timeout for Target 0.0 NCR scsi controllers .. what OS revision is this ? Solaris 10 u3 ? Solaris Nevada snv_55b ?

[EMAIL PROTECTED] # cat /etc/release
                       Solaris 10 11/06 s10s_u3wos_10 SPARC
           Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 14 November 2006
[EMAIL PROTECTED] # uname -a
SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

oh dear. that's not Solaris Nevada at all. That is production Solaris 10. SVM and ZFS disks are on separate SCSI buses, so theoretically there shouldn't be any impact on the SVM disks when I pull out a ZFS disk. I still feel that you hit a bug in ZFS somewhere. Under no circumstances should a Solaris server panic and crash simply because you pulled out a single disk that was totally mirrored. In fact .. I will reproduce those conditions here and then see what happens for me. And Solaris should not hang at all. I agree. We both know this. You just recently patched a blastwave server that was running for over 700 days in production and *this* sort of behavior just does not happen in Solaris. Let me see if I can reproduce your config here :

bash-3.2# metastat -p
d0 -m /dev/md/rdsk/d10 /dev/md/rdsk/d20 1
d10 1 1 /dev/rdsk/c0t1d0s0
d20 1 1 /dev/rdsk/c0t0d0s0
d1 -m /dev/md/rdsk/d11 1
d11 1 1 /dev/rdsk/c0t1d0s1
d4 -m /dev/md/rdsk/d14 1
d14 1 1 /dev/rdsk/c0t1d0s7
d5 -m /dev/md/rdsk/d15 1
d15 1 1 /dev/rdsk/c0t1d0s5
d21 1 1 /dev/rdsk/c0t0d0s1
d24 1 1 /dev/rdsk/c0t0d0s7
d25 1 1 /dev/rdsk/c0t0d0s5
bash-3.2# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s4
       a  p  luo        8208            8192            /dev/dsk/c0t0d0s4
       a  p  luo        16              8192            /dev/dsk/c0t1d0s4
       a  p  luo        8208            8192            /dev/dsk/c0t1d0s4
bash-3.2# zpool status -v zfs0
  pool: zfs0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfs0        ONLINE       0     0     0
          c1t9d0    ONLINE       0     0     0
          c1t10d0   ONLINE       0     0     0
          c1t11d0   ONLINE       0     0     0
          c1t12d0   ONLINE       0     0     0
          c1t13d0   ONLINE       0     0     0
          c1t14d0   ONLINE       0     0     0

errors: No known data errors
bash-3.2#

I will add in mirrors to that zpool from another array on another controller and then yank a disk. However this machine is on snv_52 at the moment. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Heavy writes freezing system
What do you mean by UFS wasn't an option due to the number of files? Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle Financials environment well exceeds this limitation. what ?

$ uname -a
SunOS core 5.10 Generic_118833-17 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
$ df -F ufs -t
/             (/dev/md/dsk/d0 ):   5367776 blocks    616328 files
     total:                       13145340 blocks    792064 files
/export/nfs   (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
     total:                      404209452 blocks 100534720 files
/export/home  (/dev/md/dsk/d7 ):    980894 blocks    260691 files
     total:                         986496 blocks    260736 files
$

I think that I am 95,621,651 files over your 1 million limit right there! Should I place a support call and file a bug report ? Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS and ZFS, a fine combination
On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine So just to confirm; disabling the zil *ONLY* breaks the semantics of fsync() and synchronous writes from the application perspective; it will do *NOTHING* to lessen the correctness guarantee of ZFS itself, including in the case of a power outage? That is correct. ZFS, with or without the ZIL, will *always* maintain consistent on-disk state and will *always* preserve the ordering of events on-disk. That is, if an application makes two changes to the filesystem, first A, then B, ZFS will *never* show B on-disk without also showing A. So then, this begs the question: why do I want this ZIL animal at all? This makes it more reasonable to actually disable the zil. But still, personally I would like to be able to tell the NFS server to simply not be standards compliant, so that I can keep the correct semantics on the lower layer (ZFS), and disable the behavior at the level where I actually want it disabled (the NFS server). This would be nice, simply to make it easier to do apples-to-apples comparisons with other NFS server implementations that don't honor the correct semantics (Linux, I'm looking at you). is that a glare or a leer or a sneer ? :-) dc
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
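[ For completeness, the way people disabled the ZIL at the time was a global tunable, not a per-NFS-server switch, which is exactly the granularity complaint above; this is a sketch from memory, so treat the tunable name as an assumption and never use it casually on production data : ]

* in /etc/system, then reboot :
set zfs:zil_disable = 1

* or on a live system, taking effect for filesystems mounted afterwards :
# echo zil_disable/W0t1 | mdb -kw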
[zfs-discuss] HOWTO make a mirror after the fact
From the zpool attach section of the manpage: if device is not currently part of a mirrored configuration, device automatically transforms into a two-way mirror of device and new_device. If device is part of a two-way mirror, attaching new_device creates a three-way mirror, and so on. In either case, new_device begins to resilver immediately. -f Forces use of new_device, even if it appears to be in use. Not all devices can be overridden in this manner. Note that attach has no option for -n which would just show me the damage I am about to do :-( So I am making a best guess here that what I need is something like this :

# zpool attach zfs0 c1t9d0 c0t9d0

which would mean that the first disk in my zpool would be mirrored and nothing else. A weird config to be sure but .. is this what will happen? I ask all this in painful boring detail because I have no way to back up this zpool other than tar to a DLT. The last thing I want to do is destroy my data when I am trying to add redundancy. Any thoughts ? -- Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HOWTO make a mirror after the fact
Note that attach has no option for -n which would just show me the damage I am about to do :-( In general, ZFS does a lot of checking before committing a change to the configuration. We make sure that you don't do things like use disks that are already in use, that partitions aren't overlapping, etc. All of the data integrity features in ZFS wouldn't be worth much if we allowed an administrator to unintentionally destroy data. which is why I am beginning to think of ZFS as the last filesystem I will need. But the head space transition is not easy for a guy that thrives on super stable technology. Like Solaris 8 :-) So I am making a best guess here that what I need is something like this : # zpool attach zfs0 c1t9d0 c0t9d0 which would mean that the first disk in my zpool would be mirrored and nothing else. A weird config to be sure but .. is this what will happen? Yep, that's exactly what will happen. Lather, rinse, repeat for the other disks in the pool, and you should be exactly where you want to be. Okay .. phasers on stun and in I go. I ask all this in painful boring detail because I have no way to back up this zpool other than tar to a DLT. The last thing I want to do is destroy my data when I am trying to add redundancy. Any thoughts ? What you figured out is exactly the right thing. If you decide you want to undo it, just use zpool detach. The only reason that I asked is that there is no explicit EXAMPLE in the manpage that says HOW TO UPGRADE FROM STRIPE TO MIRRORED STRIPES or maybe something that says RAID 0+1 or RAID 1+0. Just a bit more info in the ZFS manpages, please, because that is the first place any admin will look. Not an online PDF file somewhere. Often all I have to see what is going on in my server is a DEC VT220 terminal. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
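[ For the archives, the full "lather, rinse, repeat" stripe-to-mirror conversion for the pool discussed in this thread would look something like this; the device names follow the thread and each attach kicks off a resilver : ]

# for t in 9 10 11 12 13 14
> do
>   zpool attach zfs0 c1t${t}d0 c0t${t}d0   # mirror each top-level disk in turn
> done
# zpool status zfs0                         # watch the resilver progress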
Re: [zfs-discuss] ZFS over NFS extra slow?
Another thing to keep an eye out for is disk caching. With ZFS, whenever the NFS server tells us to make sure something is on disk, we actually make sure it's on disk by asking the drive to flush dirty data in its write cache out to the media. Needless to say, this takes a while. UFS isn't aware of the extra level of caching, and happily pretends it's in a world where once the drive ACKs a write, it's on stable storage. If you use format(1M) and take a look at whether or not the drive's write cache is enabled, that should shed some light on this. If it's on, try turning it off and re-run your NFS tests on ZFS vs. UFS. Either way, let us know what you find out. Slightly OT, but you just reminded me of why I like disks that have Sun firmware on them. They never have write cache on. At least I have never seen it. Read cache yes, but write cache never. At least in the Seagate and Fujitsu Ultra320 SCSI/FCAL disks that have a Sun logo on them. I have no idea what else that Sun firmware does on a SCSI disk but I'd love to know :-) Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
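[ The format(1M) dance being suggested goes roughly like this; the menu labels here are from memory, so verify them on your own build before flipping anything : ]

# format -e
( select the disk from the menu )
format> cache
cache> write_cache
write_cache> display       # show whether the drive's write cache is on
write_cache> disable       # then re-run the NFS tests on ZFS vs. UFS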
Re: [zfs-discuss] Re: Re[2]: ZFS in a SAN environment
no no .. it's a feature. :-P If it walks like a duck and quacks like a duck then it's a duck. a kernel panic that brings down a system is a bug. Plain and simple. I disagree (nit). A hardware fault can also cause a panic. Faults != bugs. ha ha .. yeah. If the sysadmin walks over to a machine and pours coffee in it then I guess it will fault all over the place. No appreciation for coffee I guess. however ... when it comes to storage I expect that a disk failure or hot swap will not cause a fault, if and only if there still remains some other storage device that holds the bits in a redundant fashion. so .. disks can fail. That should be okay. even memory and processors can fail. within reason. I do agree in principle, though. Panics should be avoided whenever possible. coffee spillage also .. Incidentally, we do track the panic rate and collect panic strings. The last detailed analysis I saw on the data showed that the vast majority were hardware induced. This was a bit of a bummer because we were hoping that the tracking data would lead to identifying software bugs. but it does imply that the software is way better than the hardware eh ? -- Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re[2]: ZFS in a SAN environment
Anton B. Rang wrote: INFORMATION: If a member of this striped zpool becomes unavailable or develops corruption, Solaris will kernel panic and reboot to protect your data. OK, I'm puzzled. Am I the only one on this list who believes that a kernel panic, instead of EIO, represents a bug? Nope. I'm with you. no no .. it's a feature. :-P If it walks like a duck and quacks like a duck then it's a duck. a kernel panic that brings down a system is a bug. Plain and simple. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: bare metal ZFS ? How To ?
Excuse me if I'm mistaken, but I think the question is along the lines of how to access and, more importantly, back up zfs pools/filesystems present on a system by just booting from a CD/DVD. I think the answer would be along the lines of (forced?) importing of the zfs pools present on the system and then using zfs send /foo | star The OP might be looking at something convenient along the lines of ufsdump. I think there is a need for a zfsdump tool (script?) or even better - zfs integration in star. Maybe Jörg should chip in :-) As a matter of fact you nailed down exactly what I was doing. Except star had a problem with locking an object, or some similar error message. I simply tried the following : power up the Sun machine, look at the ok prompt, type 'boot net -srv', wait for a while until I get a SINGLE USER MODE hash prompt, type zpool import, thus . . .

Requesting System Maintenance Mode
SINGLE USER MODE
# zpool import
  pool: zfs0
    id: 13628474126490156099
 state: ONLINE
action: The pool can be imported using its name or numeric identifier. The
        pool may be active on on another system, but can be imported using
        the '-f' flag.
config:

        zfs0        ONLINE
          c1t9d0    ONLINE
          c1t10d0   ONLINE
          c1t11d0   ONLINE
          c1t12d0   ONLINE
          c1t13d0   ONLINE
          c1t14d0   ONLINE

then I did this

# mkdir /tmp/root/foo
# zpool import -f -R /tmp/root/foo 13628474126490156099

Then I could cd to various places in /tmp/root/foo and attempt to run star to do a backup to tape. That didn't go so well, as I got an error about not being able to lock an object in memory. Also, you can't get star unless you ftp it in from somewhere or have it on floppy/CDROM etc etc. I reverted to good old tar like so :

# tar -cvfPE /dev/rmt/0mbn .

then that blew up ( after three hours or more ) because I hit the end of the tape and the process died. So the long and short of it is that you can't drop a ZFS filesystem to tape easily with any built-in tools in the SXCR these days. There is already an RFE filed on that but I think it's low priority. You can recover a zpool easily enough with zpool import, but if you ever lose a few disks or some disaster hits then you had better have Veritas NetBackup or similar in place. Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
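[ The send-stream variant of the same exercise would look like this from that single-user shell; the snapshot name is invented, it still will not span multiple tapes, and the earlier caveat applies that a send stream gives you no single-file restore : ]

# zpool import -f -R /tmp/root/foo 13628474126490156099
# zfs snapshot zfs0/backup@bare0                        # freeze a consistent image
# zfs send zfs0/backup@bare0 | dd obs=1048576 of=/dev/rmt/0mbn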
[zfs-discuss] bare metal ZFS ? How To ?
One of the things that I have taken for granted is that I can *always* boot a Sun server with a CDROM or DVD or jumpstart boot net -srv and get to a prompt. That allows me to fsck filesystems and ufsdump to tape if needed. In fact, I have generally done obscure things like fully install a server and then, with everything working fine, booted with CDROM and dumped the system to tape, and I call that my ground zero backup tape. Everything that follows after that is incremental for a while until the next level 0 dump. [1] Well I just did a boot net -sv on my server here and have snv-b52 at the prompt. I was able to fsck the basic UFS filesystems on this machine as it was running snv-b46. I attached a tape drive before booting and am therefore able to ufsdump and verify the contents of the snv-b46 UFS filesystems to tape. This is all a good thing. My problem is that the snv-46 machine also had a zpool and multiple ZFS filesystems. Simply running zfs list here at the prompt gets me nothing, of course. Can I create or otherwise recover some XML file from the snv-46 server filesystems in order to gain access to those ZFS filesystems from this net boot shell? Is there any way to back up those ZFS filesystems while booted from CDROM/DVD or boot net ? Essentially, if I had nothing but bare metal here and a tape drive, can I access the zpool that resides on six 36GB disks on controller 2 or am I dead in the water ? -- Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bare metal ZFS ? How To ?
On 11/23/06, James Dickens [EMAIL PROTECTED] wrote: On 11/23/06, Dennis Clarke [EMAIL PROTECTED] wrote: assume worst case someone walks up to you and drops an array on you. They say its ZFS an' I need that der stuff 'k? all while chewing on a cig. what do you do ? besides run ? same thing.. plug it in... run zpool import and get a list of pools... and import, renaming the pool if necessary... well golly gee .. that works real slick . . .

Requesting System Maintenance Mode
SINGLE USER MODE
# zpool import
  pool: zfs0
    id: 13628474126490156099
 state: ONLINE
action: The pool can be imported using its name or numeric identifier. The
        pool may be active on on another system, but can be imported using
        the '-f' flag.
config:

        zfs0        ONLINE
          c1t9d0    ONLINE
          c1t10d0   ONLINE
          c1t11d0   ONLINE
          c1t12d0   ONLINE
          c1t13d0   ONLINE
          c1t14d0   ONLINE

#

besides the grammar error above it all looks perfect. I can search the source code to find the double "on on" error there and then someone else can file a bug report. Right now I think I'll see if I can import this puppy. and do zpool import -R /test foreignarray; zpool status foreignarray; zfs list foreignarray okay .. that comes next. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
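[ The "renaming the pool if necessary" part James mentions is just an extra argument to import; a quick sketch using the id from the output above, with the new pool name invented : ]

# zpool import -f 13628474126490156099 zfs0_foreign   # import the pool under a new name
# zpool status zfs0_foreign                           # confirm it came in cleanly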
Re: Re: [zfs-discuss] poor NFS/ZFS performance
Have a gander below : Agreed - it sucks - especially for small file use. Here's a 5,000 ft view of the performance while unzipping and extracting a tar archive. First the test is run on a SPARC 280R running Build 51a with dual 900MHz USIII CPUs and 4Gb of RAM:

$ cp emacs-21.4a.tar.gz /tmp
$ ptime gunzip -c /tmp/emacs-21.4a.tar.gz | tar xf -

real       13.092
user        2.083
sys         0.183

here is my machine ( Solaris 8, Ultra 2, 200MHz ) :

# cd /tmp
# ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz
/export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes = 74570.00k).
/export/home/dclarke/star: Total time 11.057sec (6744 kBytes/sec)

real       11.146
user        0.300
sys         1.762

and the same test on the same machine with a local UFS filesystem :

# cd /mnt/test
# ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz
/export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes = 74570.00k).
/export/home/dclarke/star: Total time 92.378sec (807 kBytes/sec)

real     1:32.463
user        0.351
sys         3.658

Pretty much what I expect for an old old Solaris 8 box. Then I try using a mounted NFS filesystem shared from ZFS on snv_46 :

# cat /etc/release
                       Solaris Nevada snv_46 SPARC
           Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 14 August 2006
# zfs set sharenfs=nosub,nosuid,rw=pluto,root=pluto zfs0/backup
# zfs get sharenfs zfs0/backup
NAME         PROPERTY  VALUE                             SOURCE
zfs0/backup  sharenfs  nosub,nosuid,rw=pluto,root=pluto  local
#
# tip hardwire
connected
pluto console login: root
Password:
Nov 22 18:41:50 pluto login: ROOT LOGIN /dev/console
Last login: Tue Nov 21 02:07:39 on console
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   February 2004
# cat /etc/release
                  Solaris 8 2/04 s28s_hw4wos_05a SPARC
           Copyright 2004 Sun Microsystems, Inc. All Rights Reserved.
                         Assembled 08 January 2004
# dfshares mars
RESOURCE                   SERVER  ACCESS  TRANSPORT
mars:/export/zfs/backup    mars    -       -
mars:/export/zfs/qemu      mars    -       -
#
# mkdir /export/nfs
# mount -F nfs -o bg,intr,nosuid mars:/export/zfs/backup /export/nfs
#
# cd /export/nfs/titan
# ls -lap
total 142780
drwxr-xr-x   3 dclarke  other          8 Nov 22 19:08 ./
drwxr-xr-x   9 root     sys           12 Nov 15 20:14 ../
-rw-r--r--   1 phil     csw        13102 Jul 12 12:32 README.csw
-rw-r--r--   1 dclarke  csw       189389 Sep 14 19:33 ae-2.2.0.tar.gz
-rw-r--r--   1 dclarke  csw     91965440 Jul 25 12:56 dclarke.tar
-rw-r--r--   1 dclarke  csw     20403483 Nov 22 19:07 emacs-21.4a.tar.gz
-rw-r--r--   1 dclarke  csw      5468160 Jul 25 12:57 root.tar
drwxr-xr-x   5 dclarke  csw            5 May 24  2006 schily/
#

Now that my Solaris 8 box has a mounted ZFS/NFS filesystem, I test again :

# ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz
/export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes = 74570.00k).
/export/home/dclarke/star: Total time 215.958sec (345 kBytes/sec)

real     3:36.048
user        0.397
sys         5.961
#

That is based on the ZFS/NFS mounted filesystem. What if I run the same test on my server locally? On ZFS ?

# ptime /root/bin/star -x -time -z file=/tmp/emacs-21.4a.tar.gz
/root/bin/star: 7457 blocks + 0 bytes (total of 76359680 bytes = 74570.00k).
/root/bin/star: Total time 32.238sec (2313 kBytes/sec)

real       32.680
user        6.973
sys         9.945
#

So gee ... that's all pretty slow, but really really slow with ZFS shared out via NFS. wow .. good to know. I *never* would have seen that coming. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RAID-10
On Sun, 22 Oct 2006, Stephen Le wrote: Is it possible to construct a RAID-10 array with ZFS? I've read through the ZFS documentation, and it appears that the only way to create a RAID-10 array would be to create two mirrored (RAID-1) emulated volumes in ZFS and combine those to create the outer RAID-0 volume. Am I approaching this in the wrong way? Should I be using SVM to create my RAID-1 volumes and then create a ZFS filesystem from those volumes? No - don't do that. Here is a ZFS version of a RAID-10 config with 4 disks - from 817.2271.pdf : Creating a Mirrored Storage Pool. To create a mirrored pool, use the mirror keyword, followed by any number of storage devices that will comprise the mirror. Multiple mirrors can be specified by repeating the mirror keyword on the command line. The following command creates a pool with two, two-way mirrors:

# zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0

The second mirror keyword indicates that a new top-level virtual device is being specified. Data is dynamically striped across both mirrors, with data being replicated between each disk appropriately. We need to keep in mind that the exact same result may be achieved with simple SVM :

d1 1 2 /dev/dsk/c1d0s0 /dev/dsk/c3d0s0 -i 512b
d2 1 2 /dev/dsk/c2d0s0 /dev/dsk/c4d0s0 -i 512b
d3 -m d1

metainit d1
metainit d2
metainit d3
metattach d3 d2

At this point, if and only if all stripe components come from exactly identical geometry disks or slices, you get a stripe of mirrors and not just a mirror of stripes. While ZFS may do a similar thing, *I don't know* if there is a published document yet that shows conclusively that ZFS will survive multiple disk failures. However ZFS brings a lot of other great features. Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RAID-10
Dennis Clarke wrote: While ZFS may do a similar thing *I don't know* if there is a published document yet that shows conclusively that ZFS will survive multiple disk failures. ?? why not? Perhaps this is just too simple and therefore doesn't get explained well. That is not what I wrote. Once again, for the sake of clarity: I don't know if there is a published document, anywhere, that shows by way of a concise experiment that ZFS will actually perform RAID 1+0 and survive multiple disk failures gracefully. I do not see why it would not. But there is no conclusive proof that it will. Note that SVM (nee Solstice DiskSuite) did not always do RAID-1+0; for many years it would do RAID-0+1. However, the data availability for RAID-1+0 is better than for an equivalently sized RAID-0+1, so it is just as well that ZFS does stripes of mirrors. -- richard My understanding is that SVM will do stripes of mirrors if all of the disk or stripe components have the same geometry. This has been documented, well described and laid out bare for years. One may easily create two identical stripes and then mirror them. Then pull out multiple disks on both sides of the mirror and life goes on. So long as one does not remove identical mirror components on both sides at the same time. Common sense really. Anyways, the point is that SVM does do RAID 1+0 and has for years. ZFS probably does the same thing, but it adds in a boatload of new features that leaves SVM light-years behind. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ?: ZFS and POSIX
Steffen Weiberle wrote: Customer asks whether ZFS is fully POSIX compliant, such as flock? ZFS is not currently fully POSIX compliant. Making ZFS fully POSIX compliant is still planned and we are currently addressing bugs in this area. Interfaces such as flock() should work just fine now. The flock interface is implemented in both the VFS and in ZFS. As far as I know we have no known issues with flock. Most of the POSIX related issues are in edge conditions, such as removing directories when the file system is 100% full. Sometimes I think people roll out the old "is it POSIX compliant?" question for the sake of argument. I think that the standards manpage has a LOT to say on the matter *but* it does mention Solaris 10 specifically. No mention of Solaris Nevada or Solaris 11 or ZFS in there. http://www.blastwave.org/man/standards_5.html So, while it's nice that the manpage is there, it's not so nice that it makes no mention of the OS rev on which we found it. Just a thought. Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] no tool to get expected disk usage reports
- Original Message - Subject: no tool to get expected disk usage reports From: Dennis Clarke [EMAIL PROTECTED] Date: Fri, October 13, 2006 14:29 To: zfs-discuss@opensolaris.org given :

bash-3.1# uname -a
SunOS mars 5.11 snv_46 sun4u sparc SUNW,Ultra-2
bash-3.1# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
zfs0                        89.4G   110G  24.5K  legacy
zfs0/backup                 65.8G  6.19G  65.8G  /export/zfs/backup
zfs0/kayak                  23.3G  8.69G  23.3G  /export/zfs/kayak
zfs0/zoner                   279M  63.7G  24.5K  legacy
zfs0/zoner/common             53K  16.0G  24.5K  legacy
zfs0/zoner/common/postgres  28.5K  4.00G  28.5K  /export/zfs/postgres
zfs0/zoner/postgres          279M  7.73G   279M  /export/zfs/zone/postgres
bash-3.1#
bash-3.1# zfs get all zfs0/kayak
NAME        PROPERTY       VALUE                  SOURCE
zfs0/kayak  type           filesystem             -
zfs0/kayak  creation       Sun Oct  1 23:42 2006  -
zfs0/kayak  used           23.3G                  -
zfs0/kayak  available      8.69G                  -
zfs0/kayak  referenced     23.3G                  -
zfs0/kayak  compressratio  1.19x                  -
zfs0/kayak  mounted        yes                    -
zfs0/kayak  quota          32G                    local
zfs0/kayak  reservation    none                   default
zfs0/kayak  recordsize     128K                   default
zfs0/kayak  mountpoint     /export/zfs/kayak      local
zfs0/kayak  sharenfs       off                    default
zfs0/kayak  checksum       on                     default
zfs0/kayak  compression    on                     inherited from zfs0
zfs0/kayak  atime          on                     default
zfs0/kayak  devices        on                     default
zfs0/kayak  exec           on                     default
zfs0/kayak  setuid         on                     default
zfs0/kayak  readonly       off                    default
zfs0/kayak  zoned          off                    default
zfs0/kayak  snapdir        hidden                 default
zfs0/kayak  aclmode        groupmask              default
zfs0/kayak  aclinherit     secure                 default
bash-3.1# pwd
/export/zfs/kayak
bash-3.1# ls
c  d  e  f  g
bash-3.1# du -sk c
1246404 c
bash-3.1# find c -type f -ls | awk 'BEGIN{ ttl=0 }{ ttl+=$7 }END{ print "Total size " ttl }'
Total size 1752184261

Due to compression there is no easy way to get the expected total size of a tree of files and directories. Worse, there may be various ways to get a sum total of files in a tree, but the results may be wildly different from what du reports, thus :

bash-3.1# find f -type f -ls | awk 'BEGIN{ ttl=0 }{ ttl+=$7 }END{ print "Total size " ttl }'
Total size 3387278008853146
bash-3.1# du -sk f
22672288 f
bash-3.1#

Is there a way to modify du or perhaps create a new tool ? Dennis
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
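[ Short of a new tool, the logical (uncompressed) size can be approximated from properties ZFS already tracks, since used times compressratio is roughly the pre-compression byte count; a sketch reusing the numbers from the dataset above : ]

bash-3.1# zfs get used,compressratio zfs0/kayak
NAME        PROPERTY       VALUE   SOURCE
zfs0/kayak  used           23.3G   -
zfs0/kayak  compressratio  1.19x   -
# 23.3G * 1.19 = roughly 27.7G of logical data before compression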
Re: [zfs-discuss] Excuses; I did indeed overlook the obvious
Yes, before the flames come in, I finally realize where I went wrong last night. I mistook the discussion lists for a mere forum [i]and[/i] also assumed that by participating with a new discussion I could automatically participate in full. I'll keep that in mind for a possible next time but for now I think I'd better keep to the common forums. Sorry for causing any possible inconvenience for people only following this through e-mail. I had no problem with your email thread at all. No worries, and I don't see any cause for concern. my 0.02 $ -- Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] system unresponsive after issuing a zpool attach
Who hoo! It looks like the resilver completed sometime overnight. The system appears to be running normally ( after one final reboot ):

[EMAIL PROTECTED]: zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t2d0s4  ONLINE       0     0     0
            c1t1d0s4  ONLINE       0     0     0

errors: No known data errors

looks nice :-) I took a poke at the zfs bugs on SunSolve again, and found one that is the likely culprit: 6355416 zpool scrubbing consumes all memory, system hung Appears that a fix is in Nevada 36, hopefully it'll be back ported to a patch for 10. whoa whoa ... just one bloody second .. whoa .. That looks like a real nasty bug description there. What are the details on that? Is this particular to a given system or controller config or something like that, or are we talking global to Solaris 10 Update 2 everywhere ?? :-(

Bug ID: 6355416
Synopsis: zpool scrubbing consumes all memory, system hung
Category: kernel
Subcategory: zfs
State: 10-Fix Delivered -- in a patch somewhere ?

Description: On a 6800 domain with 8G of RAM I created a zpool using a single 18G drive and on that pool created a file system and a zvol. The zvol was filled with data.

# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
pool                  11.0G  5.58G  9.00K  /pool
pool/fs                  8K  5.58G     8K  /pool/fs
pool/[EMAIL PROTECTED]     0      -     8K  -
pool/root             11.0G  5.58G  11.0G  -
pool/[EMAIL PROTECTED]  783K      -  11.0G  -
#

I then attached a second 18G drive to the pool and all seemed well. After a few minutes however the system ground to a halt. No response from the keyboard. Aborting the system, it failed to dump due to the dump device being too small. On rebooting it did not make it into multi user. Booting milestone=none and then bringing it up by hand I could see it hung doing zfs mount -a. Booting milestone=none again, I was able to export the pool and then the system would come up into multiuser. Any attempt to import the pool would hang the system, vmstat showing it consumed all available memory. With the pool exported I reinstalled the system with a larger dump device and then imported the pool. The same hang occurred, however this time I got the crash dump. Dumps can be found here: /net/enospc.uk/export/esc/pts-crashdumps/zfs_nomemory Dump 0 is from stock build 72a, dump 1 from my workspace and had KMF_AUDIT set. The only change in my workspace is to the isp driver.
::kmausers gives:

> ::kmausers
365010944 bytes for 44557 allocations with data size 8192:
	kmem_cache_alloc+0x148
	segkmem_xalloc+0x40
	segkmem_alloc+0x9c
	vmem_xalloc+0x554
	vmem_alloc+0x214
	kmem_slab_create+0x44
	kmem_slab_alloc+0x3c
	kmem_cache_alloc+0x148
	kmem_zalloc+0x28
	zio_create+0x3c
	zio_vdev_child_io+0xc4
	vdev_mirror_io_start+0x1ac
	spa_scrub_cb+0xe4
	traverse_segment+0x2e8
	traverse_more+0x7c
362520576 bytes for 44253 allocations with data size 8192:
	kmem_cache_alloc+0x148
	segkmem_xalloc+0x40
	segkmem_alloc+0x9c
	vmem_xalloc+0x554
	vmem_alloc+0x214
	kmem_slab_create+0x44
	kmem_slab_alloc+0x3c
	kmem_cache_alloc+0x148
	kmem_zalloc+0x28
	zio_create+0x3c
	zio_read+0x54
	spa_scrub_io_start+0x88
	spa_scrub_cb+0xe4
	traverse_segment+0x2e8
	traverse_more+0x7c
241177600 bytes for 376840 allocations with data size 640:
	kmem_cache_alloc+0x88
	kmem_zalloc+0x28
	zio_create+0x3c
	zio_vdev_child_io+0xc4
	vdev_mirror_io_done+0x254
	taskq_thread+0x1a0
209665920 bytes for 327603 allocations with data size 640:
	kmem_cache_alloc+0x88
	kmem_zalloc+0x28
	zio_create+0x3c
	zio_read+0x54
	spa_scrub_io_start+0x88
	spa_scrub_cb+0xe4
	traverse_segment+0x2e8
	traverse_more+0x7c

I have attached the full output. If I am quick I can detach the disk and then export the pool before the system grinds to a halt. Then reimporting the pool I can access the data. Attaching the disk again results in the system using all the memory again.

Date Modified: 2005-11-25 09:03:07 GMT+00:00
Work Around:
Suggested Fix:
Evaluation:
Fixed by patch:
Integrated in Build: snv_36
Duplicate of:
Related Change Request(s): 6352306 6384439 6385428
Date Modified: 2006-03-23 23:58:15 GMT+00:00
Public Summary:
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] system unresponsive after issuing a zpool attach
Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM partitions to ZFS. I used Live Upgrade to migrate from U1 to U2 and that went without a hitch on my SunBlade 2000. And the initial conversion of one side of the UFS mirrors to a ZFS pool and subsequent data migration went fine. However, when I attempted to attach the second side mirrors as a mirror of the ZFS pool, all hell broke loose. The system more or less became unresponsive after a few minutes. It appeared that ZFS had taken all available memory because I saw tons of errors on the console about failed memory allocations. Any thoughts/suggestions? The data I migrated consisted of about 80GB. Here's the general flow of what I did:

1. break the SVM mirrors: metadetach d5 d51; metadetach d6 d61; metadetach d7 d71
2. remove the SVM mirrors: metaclear d51; metaclear d61; metaclear d71
3. combine the partitions with format. They were contiguous partitions on s4, s5 and s6 of the disk; I just made a single partition on s4 and cleared s5 and s6.
4. create the pool: zpool create storage cXtXdXs4
5. create three filesystems: zfs create storage/app; zfs create storage/work; zfs create storage/extra
6. migrate the data: cd /app; find . -depth -print | cpio -pdmv /storage/app; cd /work; find . -depth -print | cpio -pdmv /storage/work; cd /extra; find . -depth -print | cpio -pdmv /storage/extra
7. remove the other SVM mirrors: umount /app; metaclear d5 d50; umount /work; metaclear d6 d60; umount /extra; metaclear d7 d70

before you went any further here, did you issue a metastat command, and also, did you have any metadbs on that other disk before you nuked those slices? just asking here. I am hoping that you did a metaclear d5 and then metaclear d50 in order to clear out both the one-sided mirror as well as its component. I'm just fishing around here ..

8. combine the partitions with format. They were contiguous partitions on s4, s5 and s6 of the disk; I just made a single partition on s4 and cleared s5 and s6.

okay .. I hope that SVM was not looking for them. I guess you would get a nasty stack of errors in that case.

9. attach the partition to the pool as a mirror: zpool attach storage cXtXdXs4 cYtYdYs4

So you wanted a mirror ? Like :

# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s4  ONLINE       0     0     0
            c0t1d0s4  ONLINE       0     0     0

errors: No known data errors

that sort of deal ? Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
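[ For the record, the checks being asked about would go roughly like this, run before clearing anything or repartitioning with format; output elided : ]

# metastat -p     # confirm d5/d6/d7 are down to a single submirror before metaclear
# metadb -i       # make sure no state database replica lives on a slice about to be repartitioned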
[zfs-discuss] ZFS needs a viable backup mechanism
             0     0     0
          mirror     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0

errors: No known data errors

... with no way to back it up to tape ? Someone please enlighten me. Dennis Clarke
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss