[zfs-discuss] Re: Kernel panic at zpool import
On 7-08-2008 at 13:20, Borys Saulyak wrote:

Hi, I have a problem with Solaris 10. I know that this forum is for OpenSolaris, but maybe someone will have an idea. My box is crashing on any attempt to import a zfs pool. The first crash happened on an export operation, and since then I cannot import the pool anymore due to kernel panics. Is there any way of getting it imported or fixed? Removal of zpool.cache did not help. Here are the details:

SunOS omases11 5.10 Generic_137112-02 i86pc i386 i86pc

[EMAIL PROTECTED]:~[8]# zpool import
  pool: public
    id: 10521132528798740070
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
        public                                   ONLINE
          c7t60060160CBA21000A5D22553CA91DC11d0  ONLINE

  pool: private
    id: 3180576189687249855
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
        private                                  ONLINE
          c7t60060160CBA21000A6D22553CA91DC11d0  ONLINE

Try to change the uberblock: http://www.opensolaris.org/jive/thread.jspa?messageID=217097 - this might help.

--Lukas Karwacki

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS data recovery
I have a problem with zpool import after having problems with 2 disks in a RAID 5 (hardware raid). There are some bad blocks on those disks.

#zpool import
..
 state: FAULTED
status: The pool metadata is corrupted.
..

#zdb -l /dev/rdsk/c4t600C0FF009258F4855B59001d0s0 is OK. I managed to find that the uberblock is ok, but import fails on reading the first dataset. All 3 blocks got checksum errors (output from mdb):

0x495ef80::print -a -t mirror_map_t mm_child[0]
{
    495ef90 vdev_t *mm_child[0].mc_vd = 0x49b20c0
    495ef98 uint64_t mm_child[0].mc_offset = 0x2023216000
    495efa0 int mm_child[0].mc_error = 0x32
    495efa4 short mm_child[0].mc_tried = 0x1
    495efa6 short mm_child[0].mc_skipped = 0
}
0x495ef80::print -a -t mirror_map_t mm_child[1]
{
    495efa8 vdev_t *mm_child[1].mc_vd = 0x49b20c0
    495efb0 uint64_t mm_child[1].mc_offset = 0x166234f3a00
    495efb8 int mm_child[1].mc_error = 0x32
    495efbc short mm_child[1].mc_tried = 0x1
    495efbe short mm_child[1].mc_skipped = 0
}
0x495ef80::print -a -t mirror_map_t mm_child[2]
{
    495efc0 vdev_t *mm_child[2].mc_vd = 0x49b20c0
    495efc8 uint64_t mm_child[2].mc_offset = 0x2ba4ac88a00
    495efd0 int mm_child[2].mc_error = 0x32
    495efd4 short mm_child[2].mc_tried = 0x1
    495efd6 short mm_child[2].mc_skipped = 0
}

What can I do to get my data back? Is there a way to import the zpool using another uberblock (from a previous txg)?

--Lukas Karwacki

This message posted from opensolaris.org
[zfs-discuss] Backup/replication system
Hi, I'm using ZFS on a few X4500s and I need to back them up. The data on the source pool keeps changing, so online replication would be the best solution. As far as I know, AVS doesn't support ZFS - there is a problem with mounting the backup pool. Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting a ZFS pool. I hope I'm wrong? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution?

--Lukas
Re: [zfs-discuss] Backup/replication system
On 10-01-2008 at 16:11, Jim Dunham wrote:

Łukasz K wrote:
Hi, I'm using ZFS on a few X4500s and I need to back them up. The data on the source pool keeps changing, so online replication would be the best solution. As far as I know, AVS doesn't support ZFS - there is a problem with mounting the backup pool.

This is not true, if replication is configured correctly. Where are you getting information about the aforementioned problem?

I read it on zfs-discuss.

Have you looked at the following?
http://blogs.sun.com/avs
http://www.opensolaris.org/os/project/avs/

I have seen that. I want to configure x4500 A to replicate data to x4500 B - asynchronous replication, since synchronous would block I/O on A. Let's say I have a crash on A and I want to use backup pool B. The B pool can be mounted with the force option. How much data will I lose, and is there a guarantee that pool B is consistent?

Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting a ZFS pool. I hope I'm wrong? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution?
Re: [zfs-discuss] Backup/replication system
On 10-01-2008 at 17:45, eric kustarz wrote:

On Jan 10, 2008, at 4:50 AM, Łukasz K wrote:
Hi, I'm using ZFS on a few X4500s and I need to back them up. The data on the source pool keeps changing, so online replication would be the best solution. As far as I know, AVS doesn't support ZFS - there is a problem with mounting the backup pool. Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting a ZFS pool. I hope I'm wrong? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution?

If it doesn't need to be synchronous, then you can use 'zfs send -R'.

I need an automatic system. Now I'm using zfs send, but it takes too much human effort to control it.

eric
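The kind of automation being asked for - a periodic incremental zfs send around a rotating snapshot pair - can be sketched in userland. This is a dry-run sketch only: it prints the commands one replication pass would run instead of executing them, and the pool, dataset, host, and snapshot-naming scheme are illustrative assumptions, not names from the thread.

```shell
#!/bin/sh
# Dry-run sketch of one incremental replication pass.
# Assumes a "repl-prev" snapshot exists from the previous pass;
# the first pass would need a full (non-incremental) send instead.
replicate_once() {
    pool=$1 ds=$2 dest=$3 stamp=$4
    new="repl-$stamp"
    echo "zfs snapshot $pool/$ds@$new"
    # send only the blocks changed since the previous pass
    echo "zfs send -i $pool/$ds@repl-prev $pool/$ds@$new | ssh $dest zfs recv -F $pool/$ds"
    # rotate: the new snapshot becomes the base for the next pass
    echo "zfs destroy $pool/$ds@repl-prev"
    echo "zfs rename $pool/$ds@$new $pool/$ds@repl-prev"
}

replicate_once tank data backuphost 20080110
```

Run from cron (or a sleep loop), this removes the manual bookkeeping while still using plain zfs send under the hood.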
Re: [zfs-discuss] Slow file system access on zfs
On 8-11-2007 at 7:58, Walter Faleiro wrote:

Hi Lukasz, the output of the first script gives:

bash-3.00# ./test.sh
dtrace: script './test.sh' matched 4 probes
CPU     ID      FUNCTION:NAME
  0  42681      :tick-10s
  0  42681      :tick-10s
  0  42681      :tick-10s
  0  42681      :tick-10s
  0  42681      :tick-10s
  0  42681      :tick-10s
  0  42681      :tick-10s

and it goes on.

It means that you have free blocks :), or you do not have any I/O writes. Run:

#zpool iostat 1
and
#iostat -zxc 1

The second script gives:

checking pool map size [B]: filer
mdb: failed to dereference symbol: unknown symbol name
423917216903435

Which Solaris version do you use? Maybe you should patch the kernel. Also you can check if there are problems with the zfs sync phase. Run:

#dtrace -n fbt::txg_wait_open:entry'{ stack(); ustack(); }'

and wait 10 minutes. Also give more information about the pool:

#zfs get all filer

I assume 'filer' is your pool name.

Regards
Lukas

On 11/7/07, Łukasz K [EMAIL PROTECTED] wrote:
Hi, I think your problem is filesystem fragmentation. When available space is less than 40%, ZFS might have problems with finding free blocks.
Use this script to check it:

#!/usr/sbin/dtrace -s
fbt::space_map_alloc:entry
{
    self->s = arg1;
}
fbt::space_map_alloc:return
/arg1 != -1/
{
    self->s = 0;
}
fbt::space_map_alloc:return
/self->s && (arg1 == -1)/
{
    @s = quantize(self->s);
    self->s = 0;
}
tick-10s
{
    printa(@s);
}

Run the script for a few minutes.

You might also have problems with space map size. This script will show you the size of the space map on disk:

#!/bin/sh
echo '::spa' | mdb -k | grep ACTIVE \
| while read pool_ptr state pool_name
do
    echo "checking pool map size [B]: $pool_name"
    echo "${pool_ptr}::walk metaslab|::print -d struct metaslab ms_smo.smo_objsize" \
    | mdb -k \
    | nawk '{sub("^0t","",$3);sum+=$3}END{print sum}'
done

In memory the space map takes 5 times more. Not all of the space map is loaded into memory all the time, but for example during snapshot removal the whole space map might be loaded, so check that you have enough RAM available on the machine. Check ::kmastat in mdb. The space map uses kmem_alloc_40 (on thumpers this is a real problem).

Workaround:
1. First you can change the pool recordsize:
   zfs set recordsize=64K POOL
   Maybe you will have to use 32K or even 16K.
2. You will have to disable the ZIL, because the ZIL always takes 128kB blocks.
3. Try to disable cache, tune the vdev cache. Check:
   http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

Lukas Karwacki

On 7-11-2007 at 1:49, Walter Faleiro wrote:

Hi, we have a zfs file system configured using a Sunfire 280R with a 10T Raidweb array:

bash-3.00# zpool list
NAME    SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
filer  9.44T  6.97T   2.47T   73%   ONLINE   -

bash-3.00# zpool status
  pool: backup
 state: ONLINE
 scrub: none requested
config:
        NAME     STATE   READ WRITE CKSUM
        filer    ONLINE     0     0     0
        c1t2d1   ONLINE     0     0     0
        c1t2d2   ONLINE     0     0     0
        c1t2d3   ONLINE     0     0     0
        c1t2d4   ONLINE     0     0     0
        c1t2d5   ONLINE     0     0     0

The file system is shared via nfs. Of late we have seen that file system access slows down considerably. Running commands like find and du on the zfs system did slow it down, but the intermittent slowdowns cannot be explained.
Is there a way to trace the I/O on the zfs so that we can list out the heavy reads/writes to the file system that may be responsible for the slowness?

Thanks,
--Walter
Re: [zfs-discuss] ZFS Space Map optimization
Now space maps, intent log, spa history are compressed.

All normal metadata (including space maps and spa history) is always compressed. The intent log is never compressed.

Can you tell me where the space map is compressed? The buffer is filled with:

*entry++ = SM_OFFSET_ENCODE(start) |
    SM_TYPE_ENCODE(maptype) |
    SM_RUN_ENCODE(run_len);

and later dmu_write is called.

I want to propose a few optimizations here:
- The space map block size should be dynamic (the 4KB buffer is a bug). My space map on a thumper takes over 3.5 GB / 4kB = 855k blocks.
- The space map should be compressed before dividing:
  1. fill a larger block with data
  2. compress it
  3. divide into blocks and then write
- The other thing is memory usage: the space map uses kmem_alloc_40 for allocating the space map in memory. During the sync phase after removing a snapshot, kmem_alloc_40 takes over 13GB RAM and the system is swapping.

My question is: when are you going to optimize the space map? We are having big problems here with ZFS due to the space map and fragmentation. We have to lower recordsize and disable the zil.
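The block-count figure in the first bullet can be sanity-checked with trivial arithmetic; this sketch assumes decimal gigabytes and a 4 KiB block, matching the numbers quoted above:

```shell
#!/bin/sh
# Sanity-check the post's figure: a 3.5 GB (decimal) space map
# split into 4 KiB blocks is roughly 855k blocks.
map_bytes=3500000000
block_bytes=4096
blocks=$((map_bytes / block_bytes))
echo "$blocks"   # prints 854492, i.e. ~855k blocks
```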
Re: [zfs-discuss] ZFS Space Map optimization
On Sep 14, 2007, at 8:16 AM, Łukasz wrote:
I have a huge problem with space maps on a thumper. Space maps take over 3GB and write operations generate massive read operations. Before every spa sync phase zfs reads space maps from disk. I decided to turn on compression for the pool (only for the pool, not filesystems) and it helps. Now space maps, intent log, spa history are compressed.

How did you do that?

# zfs list
NAME         USED   AVAIL  REFER  MOUNTPOINT
zpool       7.99G   59.0G    19K  /zpool
zpool/data  7.98G   59.0G  6.16G  /zpool/data

Then:
# zfs set compression=off zpool/data
# zfs set compression=on zpool

If you do not keep any files in /zpool, then only metadata blocks will be compressed.

# zdb -bbb zpool
Blocks  LSIZE  PSIZE  ASIZE    avg   comp  %Total  Type
     1    16K     1K  3.00K  3.00K  16.00    0.00  L1 deferred free
     3  12.0K     2K  6.00K     2K   6.00    0.00  L0 deferred free
     4  28.0K  3.00K  9.00K  2.25K   9.33    0.00  deferred free
     -      -      -      -      -      -       -  SPA space map header
     5  80.0K  6.00K  18.0K  3.60K  13.33    0.00  L1 SPA space map
    56   224K   158K   473K  8.44K   1.42    0.01  L0 SPA space map
    61   304K   164K   491K  8.04K   1.86    0.01  SPA space map

Now I'm thinking about disabling checksums. All metadata are written in 2 copies, so when I have compression=on do I need checksums?

They are separate things. If you want data integrity, then you need to leave checksums enabled.

Why not keep the checksum in the compressed block, after the compressed data? We would not have to use 2 blocks then. Will zfs try to read the second block when zio_decompress_data returns an error?

Is there another way to check the space map compression ratio? Now I'm using #zdb -bb pool but it takes hours.

#zdb -v pool
...
Traversing all blocks to verify checksums and verify nothing leaked
...
I don't want to traverse all blocks.

This message posted from opensolaris.org
[zfs-discuss] ZFS Space Map optimization
I have a huge problem with space maps on a thumper. Space maps take over 3GB and write operations generate massive read operations. Before every spa sync phase zfs reads space maps from disk. I decided to turn on compression for the pool (only for the pool, not filesystems) and it helps. Now space maps, intent log, spa history are compressed.

Now I'm thinking about disabling checksums. All metadata are written in 2 copies, so when I have compression=on do I need checksums? Will zfs try to read the second block when zio_decompress_data returns an error?

Is there another way to check the space map compression ratio? Now I'm using #zdb -bb pool but it takes hours.

This message posted from opensolaris.org
[zfs-discuss] Re: zfs destroy takes a long time
On 23-08-2007 at 22:15, Igor Brezac wrote:
We are on Solaris 10 U3 with relatively recent recommended patches applied. zfs destroy of a filesystem takes a very long time; 20GB usage and about 5 million objects takes about 10 minutes to destroy. The zfs pool is a 2 drive stripe, nothing too fancy. We do not have any snapshots. Any ideas?

Maybe your pool is fragmented and the pool space map is very big. Run this script:

#!/bin/sh
echo '::spa' | mdb -k | grep ACTIVE \
| while read pool_ptr state pool_name
do
    echo "checking pool map size [B]: $pool_name"
    echo "${pool_ptr}::walk metaslab|::print -d struct metaslab ms_smo.smo_objsize" \
    | mdb -k \
    | nawk '{sub("^0t","",$3);sum+=$3}END{print sum}'
done

This will show the size of the pool space map on disk (in bytes). When destroying a filesystem or snapshot on a fragmented pool, the kernel will have to:
1. read the space map (in memory the space map will take 4x more RAM)
2. do the changes
3. write the space map (the space map is kept on disk in 2 copies)

I don't know any workaround for this bug.

Lukas
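The nawk stage of the script above just strips mdb's 0t decimal prefix from the third field and sums the per-metaslab sizes. Fed some made-up `::print -d` output lines (the byte values are invented for illustration; awk stands in for Solaris nawk here), it behaves like this:

```shell
#!/bin/sh
# Demonstrate the summing stage on fabricated mdb-style output lines.
# Field 3 is the value, printed by mdb with a "0t" decimal prefix.
printf '%s\n' \
  'ms_smo.smo_objsize = 0t4096' \
  'ms_smo.smo_objsize = 0t8192' \
  'ms_smo.smo_objsize = 0t1024' \
| awk '{sub("^0t","",$3); sum += $3} END {print sum}'
# prints 13312 (4096 + 8192 + 1024)
```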
Re: [zfs-discuss] ZFS+NFS on storedge 6120 (sun t4)
I think you have a problem with pool fragmentation. We have the same problem and changing the recordsize will help. You have to set a smaller recordsize for the pool (all filesystems must have the same size or a smaller size). First check if you have problems with finding blocks, with this dtrace script:

#!/usr/sbin/dtrace -s
fbt::space_map_alloc:entry
{
    self->s = arg1;
}
fbt::space_map_alloc:return
/arg1 != -1/
{
    self->s = 0;
}
fbt::space_map_alloc:return
/self->s && (arg1 == -1)/
{
    @s = quantize(self->s);
    self->s = 0;
}
tick-10s
{
    printa(@s);
}

This message posted from opensolaris.org
[zfs-discuss] Re: Snapshots impact on performance
On 26-07-2007 at 13:31, Robert Milkowski wrote:
Hello Victor,
Wednesday, June 27, 2007, 1:19:44 PM, you wrote:
VL Gino wrote: Same problem here (snv_60). Robert, did you find any solutions?
VL A couple of weeks ago I put together an implementation of space maps which completely eliminates loops and recursion from the space map alloc operation, and allows implementing different allocation strategies quite easily (of which I put together 3 more). It looks like it works for me on a thumper and my notebook with ZFS Root, though I have almost no time to test it more these days due to year end. I haven't done a SPARC build yet and I do not have a test case to test against.
VL Also, it comes at a price - I have to spend some more time (logarithmic, though) during all other operations on space maps, and it is not optimized now.
Lukasz (cc) - maybe you can test it and even help on tuning it?

Yes, I can test it. I'm building an environment to compile opensolaris and test zfs. I will be ready next week.

Victor, can you tell me where to look for your changes? How do I change the allocation strategy? I can see that by changing space_map_ops_t I can declare different callback functions.

Lukas
Re: [zfs-discuss] Snapshots impact on performance
Same problem here (snv_60). Robert, did you find any solutions? gino

Check this: http://www.opensolaris.org/jive/thread.jspa?threadID=34423tstart=0

Check the spa_sync function time (remember to change POOL_NAME!):

dtrace -q -n fbt::spa_sync:entry'/(char *)(((spa_t*)arg0)->spa_name) == POOL_NAME/{ self->t = timestamp; }' -n fbt::spa_sync:return'/self->t/{ @m = max((timestamp - self->t)/100); self->t = 0; }' -n tick-10m'{ printa([EMAIL PROTECTED],@m); exit(0); }'

If you have long spa_sync times, try to check if you have problems with finding new blocks in the space map, with this script:

#!/usr/sbin/dtrace -s
fbt::space_map_alloc:entry
{
    self->s = arg1;
}
fbt::space_map_alloc:return
/arg1 != -1/
{
    self->s = 0;
}
fbt::space_map_alloc:return
/self->s && (arg1 == -1)/
{
    @s = quantize(self->s);
    self->s = 0;
}
tick-10s
{
    printa(@s);
}

Then change zfs set recordsize=XX POOL_NAME. Make sure that all filesystems inherit the recordsize:
#zfs get -r recordsize POOL_NAME

The other thing is space map size. Check the map size:

echo '::spa' | mdb -k | grep 'f[0-9]*-[0-9]*' \
| while read pool_ptr state pool_name
do
    echo "${pool_ptr}::walk metaslab|::print -d struct metaslab ms_smo.smo_objsize" \
    | mdb -k \
    | nawk '{sub("^0t","",$3);sum+=$3}END{print sum}'
done

The value you will get is the space map size on disk. In memory the space map will take about 4 * size_on_disk. Sometimes during snapshot removal the kernel will have to load all space maps into memory. For example, if the space map on disk takes 1GB then the kernel will:
- in the spa_sync function, read 1GB from disk (or from cache)
- allocate 4GB for avl trees
- do all operations on the avl trees
- save the maps

It is good to have enough free memory for these operations. You can reduce the space map by copying all filesystems to another pool. I recommend zfs send.

regards
Lukas

This message posted from opensolaris.org
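The "about 4x on disk" rule of thumb above makes the RAM requirement easy to estimate from the script's output. A trivial sketch, using the 1GB example figure from the post (the 4x factor is the post's own approximation, not an exact constant):

```shell
#!/bin/sh
# Estimate in-memory space map footprint from the on-disk size
# reported by the mdb script, using the post's ~4x rule of thumb.
on_disk_gb=1
factor=4
in_memory_gb=$((on_disk_gb * factor))
echo "space map: ${on_disk_gb}GB on disk -> ~${in_memory_gb}GB in RAM"
```

If the estimate approaches the machine's free memory, a snapshot destroy can push the box into swapping, which matches the symptoms reported earlier in the thread.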
Re: [zfs-discuss] ZFS send needs optimization
Ł I want to parallelize zfs send to make it faster.
Ł dmu_sendbackup could allocate a buffer that will be used for buffering output.
Ł A few threads could traverse the dataset, and a few threads would be used for async read operations.
Ł I think it could speed up the zfs send operation 10x.
Ł What do you think about it?

You're right that we need to issue more i/os in parallel -- see
6333409 traversal code should be able to issue multiple reads in parallel

When do you think it will be available?

However, it may be much more straightforward to just issue prefetches appropriately, rather than attempt to coordinate multiple threads. That said, feel free to experiment.

How can I prefetch data? Traverse the dataset in a second thread? Correct me if I'm wrong: adding simple buffering could speed up the send operation. Now for each packet we are calling the [b]vn_rdwr[/b] function.

What do you think about a smaller dmu_replay_record_t struct? Remove char drr_toname[MAXNAMELEN]; from the drr_begin struct, and for the DRR_BEGIN command add a read/write of MAXNAMELEN bytes.

This message posted from opensolaris.org
[zfs-discuss] ZFS pool fragmentation
I have a huge problem with ZFS pool fragmentation. I started investigating the problem about 2 weeks ago: http://www.opensolaris.org/jive/thread.jspa?threadID=34423tstart=0

I found a workaround for now - changing recordsize - but I want a better solution. The best solution would be a defragmentation tool, but I can see that it is not easy. When a ZFS pool is fragmented:
1. the spa_sync function executes for a very long time (5 seconds)
2. the spa_sync thread often takes 100% CPU
3. the metaslab space map is very big

There are some changes hiding the problem, like this one: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6512391 and I hope they will be available in Solaris 10 update 4.

But I suggest that:
1. In the sync phase, when for the first time we do not find a block of the size we need (for example 128k), the pool should remember this for some time (5 minutes) and stop asking for this kind of block.
2. We should be more careful with unloading space maps. At the end of the sync phase, space maps for metaslabs without the active flag are unloaded. On my fragmented pool, a space map with 800MB available (of 2GB) is unloaded because there were no 128K blocks.

This message posted from opensolaris.org
Re: [zfs-discuss] ZFS performance and memory consumption
When tuning recordsize for things like databases, we try to recommend that the customer's recordsize match the I/O size of the database record.

On this filesystem I have:
- file links, and they are rather static
- small files (about 8kB) that keep changing
- big files (1MB - 20MB) used as temporary files (create, write, read, unlink); operations on these files are about 50% of all I/O

I think I need a defragmentation tool. Do you think there will be one? Now all I can do is copy the filesystems from this zpool to another. After this operation the new zpool will not be fragmented, but it takes time and I have several zpools like this.

This message posted from opensolaris.org
Re: [zfs-discuss] ZFS performance and memory consumption
The field ms_smo.smo_objsize in the metaslab struct is the size of the data on disk. I checked the size of the metaslabs in memory:

::walk spa | ::walk metaslab | ::print struct metaslab ms_map.sm_root.avl_numnodes

and I got 1GB. But only some metaslabs are loaded:

::walk spa | ::walk metaslab | ::print struct metaslab ms_map.sm_root.avl_numnodes ! grep 0x | wc -l
231

of 664 metaslabs. And the number of loaded metaslabs is changing very fast.

Is there a way to keep all metaslabs in RAM? Is there any limit? I encourage other administrators to check their free space map size.

This message posted from opensolaris.org
Re: [zfs-discuss] ZFS performance and memory consumption
After a few hours with dtrace and source code browsing I found that in my space map there are no 128K blocks left. Try this on your ZFS:

dtrace -n fbt::metaslab_group_alloc:return'/arg1 == -1/{}'

If you get probes, then you also have the same problem.

Allocating from the space map works like this:
1. metaslab_group_alloc wants to allocate a 128K block
2. for (all metaslabs) {
       read the space map and look for a 128K block;
       if there is no such block, remove the flag METASLAB_ACTIVE_MASK
   }
3. unload the maps of all metaslabs without METASLAB_ACTIVE_MASK

That is why spa_sync takes so much time.

Now the workaround:
zfs set recordsize=8K pool

Now the spa_sync function takes 1-2 seconds, the processor is idle, and only a few metaslab space maps are loaded:

0600103ee500::walk metaslab |::print struct metaslab ms_map.sm_loaded ! grep -c 0x
3

But now I have another question: how will 8k blocks impact performance?

This message posted from opensolaris.org
Re: [zfs-discuss] ZFS performance and memory consumption
If you want to know which block sizes you cannot allocate:

dtrace -n fbt::metaslab_group_alloc:entry'{ self->s = arg1; }' -n fbt::metaslab_group_alloc:return'/arg1 != -1/{ self->s = 0 }' -n fbt::metaslab_group_alloc:return'/self->s && (arg1 == -1)/{ @s = quantize(self->s); self->s = 0; }' -n tick-10s'{ printa(@s); }'

and which blocks you do not have in some metaslabs:

dtrace -n fbt::space_map_alloc:entry'{ self->s = arg1; }' -n fbt::space_map_alloc:return'/arg1 != -1/{ self->s = 0 }' -n fbt::space_map_alloc:return'/self->s && (arg1 == -1)/{ @s = quantize(self->s); self->s = 0; }' -n tick-10s'{ printa(@s); }'

If the metaslab_group_alloc output looks like this:

           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@                                       9065
          262144 |                                         0

then you can set the zfs recordsize to 64k.

This message posted from opensolaris.org
[zfs-discuss] ZFS performance and memory consumption
Hello, I'm investigating a problem with ZFS over NFS. The problems started about 2 weeks ago; most nfs threads are hanging in txg_wait_open. The sync thread is consuming one processor all the time, and the average spa_sync function time from entry to return is 2 minutes.

I can't use dtrace to examine the problem, because I keep getting:
dtrace: processing aborted: Abort due to systemic unresponsiveness

Using mdb and examining tx_sync_thread with ::findstack I keep getting this stack:

fe8002da1410 _resume_from_idle+0xf8() ]
fe8002da1570 avl_walk+0x39()
fe8002da15a0 space_map_alloc+0x21()
fe8002da1620 metaslab_group_alloc+0x1a2()
fe8002da16b0 metaslab_alloc_dva+0xab()
fe8002da1700 metaslab_alloc+0x51()
fe8002da1720 zio_dva_allocate+0x3f()
fe8002da1730 zio_next_stage+0x72()
fe8002da1750 zio_checksum_generate+0x5f()
fe8002da1760 zio_next_stage+0x72()
fe8002da17b0 zio_write_compress+0x136()
fe8002da17c0 zio_next_stage+0x72()
fe8002da17f0 zio_wait_for_children+0x49()
fe8002da1800 zio_wait_children_ready+0x15()
fe8002da1810 zio_next_stage_async+0xae()
fe8002da1820 zio_nowait+9()
fe8002da18b0 arc_write+0xe7()
fe8002da19a0 dbuf_sync+0x274()
fe8002da1a10 dnode_sync+0x2e3()
fe8002da1a60 dmu_objset_sync_dnodes+0x7b()
fe8002da1af0 dmu_objset_sync+0x6a()
fe8002da1b10 dsl_dataset_sync+0x23()
fe8002da1b60 dsl_pool_sync+0x7b()
fe8002da1bd0 spa_sync+0x116()

I also managed to sum the metaslab space maps:
::walk spa | ::walk metaslab | ::print struct metaslab ms_smo.smo_objsize
and I got 1GB.

I have a 1.3T pool with 500G of available space. The pool was created about 3 months ago. I'm using Solaris 10 u3. Do you think changing the system to nevada will help? I read that there are some changes that can help:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6512391
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6532056

This message posted from opensolaris.org
[zfs-discuss] ZFS filesystem online backup question
I have to back up many filesystems, which are changing, and the machines are heavily loaded. The idea is to back up online - this should avoid I/O read operations from disks; data should come from the cache. Now I'm using a script that does a snapshot and zfs send. I want to automate this operation and add a new option to zfs send:

zfs send [-w sec] [-i snapshot] snapshot

for example:
zfs send -w 10 pool/[EMAIL PROTECTED]

zfs send would then:
1. create the replicate snapshot if it does not exist
2. send data
3. wait 10 seconds
4. rename the snapshot to replicate_previous (destroy previous if it exists)
5. goto 1

All snapshot operations are done in the kernel - it works faster then. I have implemented this mechanism and it works. Do you think this change will be integrated into opensolaris? Is there a chance this option will be available in Solaris update 4? Maybe there is another way to back up a filesystem online? I tried to traverse a changing filesystem, but it does not work.

This message posted from opensolaris.org
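For comparison with the proposed in-kernel option, the 5-step loop can be emulated from userland. This is a dry-run sketch: it prints the commands one iteration would run rather than executing them, the dataset name is a placeholder, and the snapshot names and 10-second wait mirror the example above:

```shell
#!/bin/sh
# One userland iteration of the proposed "zfs send -w" loop, as echoed
# commands. Step numbers refer to the 5-step list in the post.
replicate_iteration() {
    ds=$1
    echo "zfs send -i $ds@replicate_previous $ds@replicate | ..."  # step 2: send data
    echo "sleep 10"                                                # step 3: wait
    echo "zfs destroy $ds@replicate_previous"                      # step 4: drop old base
    echo "zfs rename $ds@replicate $ds@replicate_previous"         # step 4: rotate
    echo "zfs snapshot $ds@replicate"                              # step 1 of the next pass
}

replicate_iteration pool/fs
```

The userland version pays a fork/exec and ioctl round-trip per zfs command, which is part of why the post argues for doing the rotation inside one kernel call.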
[zfs-discuss] Re: Re: asize is 300MB smaller than lsize - why?
I have another question about replication in this thread: http://www.opensolaris.org/jive/thread.jspa?threadID=27082tstart=0

This message posted from opensolaris.org
[zfs-discuss] Re: ZFS filesystem online backup question
Out of curiosity, what is the timing difference between a userland script and performing the operations in the kernel?

[EMAIL PROTECTED] ~]# time zfs destroy solaris/[EMAIL PROTECTED] ; time zfs rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED]; time zfs snapshot solaris/[EMAIL PROTECTED]

real 0m5.220s
user 0m0.010s
sys  0m0.023s

real 0m5.856s
user 0m0.010s
sys  0m0.023s

real 0m7.620s
user 0m0.009s
sys  0m0.029s

[EMAIL PROTECTED] ~]# time zfs destroy solaris/[EMAIL PROTECTED] ; time zfs rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED]; time zfs snapshot solaris/[EMAIL PROTECTED]

real 0m7.363s
user 0m0.010s
sys  0m0.031s

real 0m5.107s
user 0m0.010s
sys  0m0.022s

real 0m7.888s
user 0m0.009s
sys  0m0.024s

The operation takes 15 - 20 seconds. In the kernel it takes (time in ms):

0 42867 dmu_objset_snapshot:return time 2471
1 42867 dmu_objset_snapshot:return time 10803
1 42867 dmu_objset_snapshot:return time 7968
0 42867 dmu_objset_snapshot:return time 14139
0 42867 dmu_objset_snapshot:return time 14405
1 42867 dmu_objset_snapshot:return time 8883
0 42867 dmu_objset_snapshot:return time 4960

Now the code in the kernel is without optimization:

zfs_unmount_snap(snap_previous, NULL);
dmu_objset_destroy(snap_previous);
zfs_unmount_snap(zc->zc_value, NULL);
dmu_objset_rename(zc->zc_value, snap_previous);
error = dmu_objset_snapshot(zc->zc_name, REPLICATE_SNAPSHOT_LATEST, 0);

In the kernel the operation can be optimized and done in one dsl_sync_task_do call.

This message posted from opensolaris.org
[zfs-discuss] crash during snapshot operations
When I'm trying to do the following in the kernel, in a zfs ioctl:
1. snapshot destroy PREVIOUS
2. snapshot rename LATEST -> PREVIOUS
3. snapshot create LATEST

the code is:

/* delete previous snapshot */
zfs_unmount_snap(snap_previous, NULL);
dmu_objset_destroy(snap_previous);

/* rename snapshot */
zfs_unmount_snap(snap_latest, NULL);
dmu_objset_rename(snap_latest, snap_previous);

/* create snapshot */
dmu_objset_snapshot(zc->zc_name, REPLICATE_SNAPSHOT_LATEST, 0);

I get a kernel panic.

MDB ::status
debugging crash dump vmcore.3 (32-bit) from zfs.dev
operating system: 5.11 snv_56 (i86pc)
panic message: BAD TRAP: type=8 (#df Double fault) rp=fec244f8 addr=d5904ffc
dump content: kernel pages only

This happens only when the ZFS filesystem is loaded with I/O operations. (I copy a studio11 folder onto this filesystem.)

MDB ::stack shows nothing, but walking the threads I found:

stack pointer for thread d8ff9e00: d421b028
d421b04c zio_pop_transform+0x45(d9aba380, d421b090, d421b070, d421b078)
d421b094 zio_clear_transform_stack+0x23(d9aba380)
d421b200 zio_done+0x12b(d9aba380)
d421b21c zio_next_stage+0x66(d9aba380)
d421b230 zio_checksum_verify+0x17(d9aba380)
d421b24c zio_next_stage+0x66(d9aba380)
d421b26c zio_wait_for_children+0x46(d9aba380, 11, d9aba570)
d421b280 zio_wait_children_done+0x18(d9aba380)
d421b298 zio_next_stage+0x66(d9aba380)
d421b2d0 zio_vdev_io_assess+0x11a(d9aba380)
d421b2e8 zio_next_stage+0x66(d9aba380)
d421b368 vdev_cache_read+0x157(d9aba380)
d421b394 vdev_disk_io_start+0x35(d9aba380)
d421b3a4 vdev_io_start+0x18(d9aba380)
d421b3d0 zio_vdev_io_start+0x142(d9aba380)
d421b3e4 zio_next_stage_async+0xac(d9aba380)
d421b3f4 zio_nowait+0xe(d9aba380)
d421b424 vdev_mirror_io_start+0x151(deab5cc0)
d421b450 zio_vdev_io_start+0x14f(deab5cc0)
d421b460 zio_next_stage+0x66(deab5cc0)
d421b470 zio_ready+0x124(deab5cc0)
d421b48c zio_next_stage+0x66(deab5cc0)
d421b4ac zio_wait_for_children+0x46(deab5cc0, 1, deab5ea8)
d421b4c0 zio_wait_children_ready+0x18(deab5cc0)
d421b4d4 zio_next_stage_async+0xac(deab5cc0)
    d421b4e4 zio_nowait+0xe(deab5cc0)
    d421b520 arc_read+0x3cc(d8a2cd00, da9f6ac0, d418e840, f9e55e5c, f9e249b0, d515c010)
    d421b590 dbuf_read_impl+0x11b(d515c010, d8a2cd00, d421b5cc)
    d421b5bc dbuf_read+0xa5(d515c010, d8a2cd00, 2)
    d421b5fc dmu_buf_hold+0x7c(d47cb854, 4, 0, 0, 0, 0)
    d421b654 zap_lockdir+0x38(d47cb854, 4, 0, 0, 1, 1)
    d421b690 zap_lookup+0x23(d47cb854, 4, 0, d421b6e0, 8, 0)
    d421b804 dsl_dir_open_spa+0x10a(da9f6ac0, d8fde000, f9e7378f, d421b85c, d421b860)
    d421b864 dsl_dataset_open_spa+0x2c(0, d8fde000, 1, debe83c0, d421b938)
    d421b88c dsl_dataset_open+0x19(d8fde000, 1, debe83c0, d421b938)
    d421b940 dmu_objset_open+0x2e(d8fde000, 5, 1, d421b970)
    d421b974 dmu_objset_snapshot_one+0x2c(d8fde000, d421b998)
    d421bdb0 dmu_objset_snapshot+0xaf(d8fde000, d4c6a3e8, 0)
    d421c9e8 zfs_ioc_replicate_send+0x1ab(d8fde000)
    d421ce18 zfs_ioc_sendbackup+0x126()
    d421ce40 zfsdev_ioctl+0x100(2d8, 5a1e, 8046cac, 13, d5938650, d421cf78)
    d421ce6c cdev_ioctl+0x2e(2d8, 5a1e, 8046cac, 13, d5938650, d421cf78)
    d421ce94 spec_ioctl+0x65(d6591780, 5a1e, 8046cac, 13, d5938650, d421cf78)
    d421ced4 fop_ioctl+0x27(d6591780, 5a1e, 8046cac, 13, d5938650, d421cf78)
    d421cf84 ioctl+0x151()
    d421cfac sys_sysenter+0x101()

MDB $r:

    %cs  = 0x0158    %eax = 0x
    %ds  = 0x0160    %ebx = 0xe58abac0
    %ss  = 0x0160    %ecx = 0x
    %es  = 0x0160    %edx = 0x0018
    %fs  = 0x        %esi = 0x
    %gs  = 0x01b0    %edi = 0x
    %eip = 0xfe8ebd71 kmem_free+0x111
    %ebp = 0x
    %esp = 0xfec24530
    %eflags = 0x00010246
      id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
      status=of,df,IF,tf,sf,ZF,af,PF,cf
    %uesp = 0xd5905000
    %trapno = 0x8
    %err = 0x0

I tried to trigger the error from the command line:

    [EMAIL PROTECTED] ~]# zfs destroy solaris/[EMAIL PROTECTED] ; zfs rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED] ; zfs snapshot solaris/[EMAIL PROTECTED]

but without success. Any ideas?

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
> How it got that way, I couldn't really say without looking at your code.

It works like this: in the new ioctl operation zfs_ioc_replicate_send(zfs_cmd_t *zc) we open the filesystem (not a snapshot):

    dmu_objset_open(zc->zc_name, DMU_OST_ANY,
        DS_MODE_STANDARD | DS_MODE_READONLY, &filesystem);

then call the dmu replicate send function (txg is the transaction group number):

    dmu_replicate_send(filesystem, txg, ...);

We set max_txg:

    ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

and call traverse_dsl_dataset:

    traverse_dsl_dataset(filesystem->os->os_dsl_dataset, *txg,
        ADVANCE_PRE | ADVANCE_HOLES | ADVANCE_DATA | ADVANCE_NOLOCK,
        replicate_cb, &ba);

After traversing, the next txg is returned:

    if (ba.got_data != 0)
        *txg = ba.max_txg + 1;

In replicate_cb we do the same as backup_cb does, but at the beginning we check the txg:

    /* remember last txg */
    if (bc->bc_blkptr.blk_birth) {
        if (bc->bc_blkptr.blk_birth > ba->max_txg)
            return;
        ba->got_data = 1;
    }

After a 5 second delay we call the ioctl again with the txg returned from the last operation.
[zfs-discuss] Re: crash during snapshot operations
Thanks for the advice. I removed my buffers snap_previous and snap_latest and that helped. I'm now using zc->zc_value as the buffer.