[zfs-discuss] Re: Kernel panic at zpool import

2008-08-11 Thread Łukasz K
On 7-08-2008 at 13:20, Borys Saulyak wrote:
 Hi,
 
 I have a problem with Solaris 10. I know that this forum is for
 OpenSolaris, but maybe someone will have an idea.
 My box is crashing on any attempt to import a zfs pool. The first crash
 happened on an export operation, and since then I cannot import the pool anymore
 due to kernel panics. Is there any way of getting it imported or fixed?
 Removal of zpool.cache did not help.
 
 Here are details:
 SunOS omases11 5.10 Generic_137112-02 i86pc i386 i86pc
 
 [EMAIL PROTECTED]:~[8]#zpool import
 pool: public
 id: 10521132528798740070
 state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:
 
 public ONLINE
 c7t60060160CBA21000A5D22553CA91DC11d0 ONLINE
 
 pool: private
 id: 3180576189687249855
 state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:
 
 private ONLINE
 c7t60060160CBA21000A6D22553CA91DC11d0 ONLINE
 


Try changing the uberblock:

http://www.opensolaris.org/jive/thread.jspa?messageID=217097

This might help.
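Before attempting the uberblock procedure from that thread, it may help to confirm that the vdev labels on the device are still readable. A minimal check, assuming the device path from the zpool import output above with the s0 slice appended (as in the zdb example later in this digest):

  # dump the ZFS vdev labels (pool/vdev config nvlists); the uberblock array
  # that the linked procedure rewrites lives in these same on-disk labels
  zdb -l /dev/rdsk/c7t60060160CBA21000A5D22553CA91DC11d0s0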


--Lukas Karwacki




[zfs-discuss] ZFS data recovery

2008-03-14 Thread Łukasz
I have a problem with zpool import after having
problems with 2 disks in RAID 5 (hardware raid). There are some bad blocks on
those disks.

#zpool import
..
state: FAULTED
status: The pool metadata is corrupted.
..

#zdb -l /dev/rdsk/c4t600C0FF009258F4855B59001d0s0
   The label output is OK.

I managed to find that the uberblock is OK, but the import fails on reading the first
dataset.

All 3 blocks got a checksum error ( mc_error = 0x32, i.e. ECKSUM; output from mdb ):

> 0x495ef80::print -a -t mirror_map_t mm_child[0]
{
495ef90 vdev_t *mm_child[0].mc_vd = 0x49b20c0
495ef98 uint64_t mm_child[0].mc_offset = 0x2023216000
495efa0 int mm_child[0].mc_error = 0x32
495efa4 short mm_child[0].mc_tried = 0x1
495efa6 short mm_child[0].mc_skipped = 0
}

> 0x495ef80::print -a -t mirror_map_t mm_child[1]
{
495efa8 vdev_t *mm_child[1].mc_vd = 0x49b20c0
495efb0 uint64_t mm_child[1].mc_offset = 0x166234f3a00
495efb8 int mm_child[1].mc_error = 0x32
495efbc short mm_child[1].mc_tried = 0x1
495efbe short mm_child[1].mc_skipped = 0
}

> 0x495ef80::print -a -t mirror_map_t mm_child[2]
{
495efc0 vdev_t *mm_child[2].mc_vd = 0x49b20c0
495efc8 uint64_t mm_child[2].mc_offset = 0x2ba4ac88a00
495efd0 int mm_child[2].mc_error = 0x32
495efd4 short mm_child[2].mc_tried = 0x1
495efd6 short mm_child[2].mc_skipped = 0
}

What can I do to get my data back?

Is there a way to import the zpool using another uberblock ( from a previous txg )?

--Lukas Karwacki
 
 


[zfs-discuss] Backup/replication system

2008-01-10 Thread Łukasz K
Hi
I'm using ZFS on a few X4500s and I need to back them up.
The data on the source pool keeps changing, so online replication
would be the best solution.

As far as I know, AVS doesn't support ZFS - there is a problem with
mounting the backup pool.
Other backup systems (disk-to-disk or block-to-block) have the
same problem with mounting a ZFS pool.
I hope I'm wrong?

In case of any problem I want the backup pool to be operational
within 1 hour.

Do you know any solution ?

--Lukas




Re: [zfs-discuss] Backup/replication system

2008-01-10 Thread Łukasz K
On 10-01-2008 at 16:11, Jim Dunham wrote:
 Łukasz K wrote:
 
  Hi
 I'm using ZFS on few X4500 and I need to backup them.
  The data on source pool keeps changing so the online replication
  would be the best solution.
 
 As I know AVS doesn't support ZFS - there is a problem with
  mounting backup pool.
 
 This is not true, if replication is configured correctly.
 Where are you getting information about the aforementioned problem?

I read it on zfs-discuss.

 
 Have you looked at the following?
 
 http://blogs.sun.com/avs
 http://www.opensolaris.org/os/project/avs/

I have seen that.

I want to configure X4500 A to replicate data to X4500 B
 - asynchronous replication - synchronous would block I/O on A.

Let's say I have a crash on A and I want to use backup pool B.
The B pool can be mounted with the force option.
How much data will I lose, and
is there a guarantee that pool B is consistent?
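For reference, a minimal sketch of the failover step being described, assuming the replicated devices are visible on host B and that 'tank' is a placeholder pool name:

  # on host B, after stopping replication: import the replicated pool even
  # though it was never cleanly exported from host A (the "force option" above)
  zpool import -f tank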

 
 
 Other backup systems (disk-to-disk or block-to-block) have the
  same problem with mounting ZFS pool.
 I hope I'm wrong ?
 
 In case of any problem I want the backup pool to be operational
  within 1 hour.
 
  Do you know any solution ?
 




Re: [zfs-discuss] Backup/replication system

2008-01-10 Thread Łukasz K
On 10-01-2008 at 17:45, eric kustarz wrote:
 On Jan 10, 2008, at 4:50 AM, Łukasz K wrote:
 
  Hi
  I'm using ZFS on few X4500 and I need to backup them.
  The data on source pool keeps changing so the online replication
  would be the best solution.
 
  As I know AVS doesn't support ZFS - there is a problem with
  mounting backup pool.
  Other backup systems (disk-to-disk or block-to-block) have the
  same problem with mounting ZFS pool.
  I hope I'm wrong ?
 
  In case of any problem I want the backup pool to be operational
  within 1 hour.
 
  Do you know any solution ?
 
 If it doesn't need to be synchronous, then you can use 'zfs send -R'.

I need an automated system. Now I'm using zfs send, but it
takes too much human effort to control it.
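A minimal sketch of how 'zfs send -R' could be automated from cron, assuming the pool names, host name and snapshot labels below are placeholders and that the installed zfs supports the -R/-I send options and recursive snapshot rename:

  #!/bin/sh
  # incremental replication of a whole pool to a backup host;
  # one "prev" snapshot is kept as the base for the next increment
  POOL=tank                  # source pool (placeholder)
  DEST=backup                # pool on the backup host (placeholder)
  HOST=backuphost            # backup host (placeholder)
  NOW=repl-`date +%Y%m%d%H%M`

  zfs snapshot -r $POOL@$NOW || exit 1

  if zfs list -t snapshot $POOL@prev >/dev/null 2>&1; then
          # send only the changes since the last replicated snapshot
          zfs send -R -I $POOL@prev $POOL@$NOW | ssh $HOST zfs receive -d -F $DEST || exit 1
          zfs destroy -r $POOL@prev
  else
          # first run: full send
          zfs send -R $POOL@$NOW | ssh $HOST zfs receive -d -F $DEST || exit 1
  fi

  # remember the snapshot we just replicated as the new base
  zfs rename -r $POOL@$NOW $POOL@prev

Driven from cron every few minutes, zfs receive -F keeps the backup pool rolled back to the last fully received snapshot, so it stays consistent and usable on the backup host.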

 
 eric




Re: [zfs-discuss] Slow file system access on zfs

2007-11-08 Thread Łukasz K


On 8-11-2007 at 7:58, Walter Faleiro wrote:
 Hi Lukasz,
 The output of the first script gives:

 bash-3.00# ./test.sh
 dtrace: script './test.sh' matched 4 probes
 CPU     ID                    FUNCTION:NAME
   0  42681                         :tick-10s
   0  42681                         :tick-10s
   0  42681                         :tick-10s
   0  42681                         :tick-10s
   0  42681                         :tick-10s
   0  42681                         :tick-10s
   0  42681                         :tick-10s

 and it goes on.

It means that you have free blocks :), or you do not have any I/O writes.
Run:
 #zpool iostat 1
and
 #iostat -zxc 1

 The second script gives:

 checking pool map size [B]: filer
 mdb: failed to dereference symbol: unknown symbol name
 423917216903435

Which Solaris version do you use? Maybe you should patch the kernel.

Also you can check if there are problems with the zfs sync phase. Run

 #dtrace -n fbt::txg_wait_open:entry'{ stack(); ustack(); }'

and wait 10 minutes.

Also give more information about the pool:

 #zfs get all filer

I assume 'filer' is your pool name.

Regards
Lukas

On 11/7/07, Łukasz K [EMAIL PROTECTED] wrote:
 Hi,
 I think your problem is filesystem fragmentation. When available space
 is less than 40%, ZFS might have problems with finding free blocks.
 Use this script to check it:

 #!/usr/sbin/dtrace -s

 fbt::space_map_alloc:entry
 {
    self->s = arg1;
 }

 fbt::space_map_alloc:return
 /arg1 != -1/
 {
    self->s = 0;
 }

 fbt::space_map_alloc:return
 /self->s && (arg1 == -1)/
 {
    @s = quantize(self->s);
    self->s = 0;
 }

 tick-10s
 {
    printa(@s);
 }

 Run the script for a few minutes.

 You might also have problems with space map size.
 This script will show you the size of the space map on disk:

 #!/bin/sh
 echo '::spa' | mdb -k | grep ACTIVE \
   | while read pool_ptr state pool_name
 do
   echo "checking pool map size [B]: $pool_name"
   echo "${pool_ptr}::walk metaslab|::print -d struct metaslab ms_smo.smo_objsize" \
     | mdb -k \
     | nawk '{sub("^0t","",$3);sum+=$3}END{print sum}'
 done

 In memory the space map takes 5 times more.
 Not all of the space map is loaded into memory all the time, but for example
 during snapshot removal all the space maps might be loaded, so check
 if you have enough RAM available on the machine.
 Check ::kmastat in mdb. The space map uses kmem_alloc_40
 ( on thumpers this is a real problem ).

 Workaround:
 1. first you can change the pool recordsize:
      zfs set recordsize=64K POOL
    Maybe you will have to use 32K or even 16K.
 2. You will have to disable the ZIL, because the ZIL always takes 128kB
    blocks.
 3. Try to disable the cache and tune the vdev cache. Check:
    http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

 Lukas Karwacki

 On 7-11-2007 at 1:49, Walter Faleiro wrote:
  Hi,
  We have a zfs file system configured using a Sunfire 280R with a 10T
  Raidweb array

  bash-3.00# zpool list
  NAME    SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
  filer  9.44T   6.97T   2.47T    73%  ONLINE  -

  bash-3.00# zpool status
    pool: backup
   state: ONLINE
   scrub: none requested
  config:

          NAME        STATE     READ WRITE CKSUM
          filer       ONLINE       0     0     0
            c1t2d1    ONLINE       0     0     0
            c1t2d2    ONLINE       0     0     0
            c1t2d3    ONLINE       0     0     0
            c1t2d4    ONLINE       0     0     0
            c1t2d5    ONLINE       0     0     0

  The file system is shared via nfs. Of late we have seen that the file
  system access slows down considerably. Running commands like find and du
  on the zfs system did slow it down, but the intermittent slowdowns
  cannot be explained. Is there a way to trace the I/O on the zfs so that
  we can list out the heavy reads/writes to the file system that could be
  responsible for the slowness.

  Thanks,
  --Walter



Re: [zfs-discuss] ZFS Space Map optimization

2007-10-11 Thread Łukasz K
  Now space maps, intent log, spa history are compressed.
 
 All normal metadata (including space maps and spa history) is always
 compressed.  The intent log is never compressed.

Can you tell me where the space map is compressed?

The buffer is filled with:

    *entry++ = SM_OFFSET_ENCODE(start) |
        SM_TYPE_ENCODE(maptype) |
        SM_RUN_ENCODE(run_len);

and later dmu_write is called.

I want to propose a few optimizations here:
 - the space map block size should be dynamic ( the 4KB buffer is a bug ).
   My space map on a thumper takes over 3.5 GB / 4kB = 855k blocks

 - the space map should be compressed before dividing:
   1. fill a larger block with data
   2. compress it
   3. divide it into blocks and then write

 - the other thing is memory usage: the space map uses kmem_alloc_40
   to allocate the space map in memory. During the sync phase after
   removing a snapshot, kmem_alloc_40 takes over 13GB of RAM and the system
   is swapping.

My question is: when are you going to optimize the space map?
We are having big problems here with ZFS due to space map size and
fragmentation. We have to lower the recordsize and disable the ZIL.
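For context, a hedged sketch of that workaround (pool name and recordsize are examples; zil_disable was the pre-'sync'-property tunable described in the ZFS Evil Tuning Guide linked earlier in this digest, and disabling the ZIL gives up synchronous-write guarantees):

  # lower the recordsize so allocations no longer need 128K segments;
  # only newly written blocks use the smaller size
  zfs set recordsize=64K pool
  zfs get -r recordsize pool        # verify every filesystem inherits it

  # disable the ZIL: add to /etc/system and reboot
  #   set zfs:zil_disable = 1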




Re: [zfs-discuss] ZFS Space Map optimization

2007-09-24 Thread Łukasz
 
 On Sep 14, 2007, at 8:16 AM, Łukasz wrote:
 
  I have a huge problem with space maps on thumper.
 Space maps takes  
  over 3GB
  and write operations generates massive read
 operations.
  Before every spa sync phase zfs reads space maps
 from disk.
 
  I decided to turn on compression for pool ( only
 for pool, not  
  filesystems ) and it helps.
  Now space maps, intent log, spa history are
 compressed.
 
 How did you do that?

# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
zpool       7.99G  59.0G    19K  /zpool
zpool/data  7.98G  59.0G  6.16G  /zpool/data

Then:
# zfs set compress=off zpool/data
# zfs set compress=on zpool

If you do not keep any files in /zpool,
then only metadata blocks will be compressed.

# zdb -bbb zpool
Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     1    16K      1K   3.00K   3.00K   16.00     0.00  L1 deferred free
     3  12.0K      2K   6.00K      2K    6.00     0.00  L0 deferred free
     4  28.0K   3.00K   9.00K   2.25K    9.33     0.00  deferred free
     -      -       -       -       -       -        -  SPA space map header
     5  80.0K   6.00K   18.0K   3.60K   13.33     0.00  L1 SPA space map
    56   224K    158K    473K   8.44K    1.42     0.01  L0 SPA space map
    61   304K    164K    491K   8.04K    1.86     0.01  SPA space map

 
 
  Now I'm thinking about disabling checksums. All metadata are
  written in 2 copies, so when I have compression=on do I need checksums?
 
 They are separate things.  If you want data integrity, then you need
 to leave checksums enabled.

Why not keep the checksum in the compressed block, after the compressed data?
Then we would not have to use 2 blocks.

 
  Will zfs try to read the second block when zio_decompress_data
  returns an error?
 
  Is there another way to check the space map compression ratio?
  Now I'm using #zdb -bb pool but it takes hours.
 
 #zdb -v pool
 ...
 Traversing all blocks to verify checksums and verify
 nothing leaked ...

I don't want to traverse all blocks.
 
 


[zfs-discuss] ZFS Space Map optimization

2007-09-14 Thread Łukasz
I have a huge problem with space maps on a thumper. The space maps take over 3GB,
and write operations generate massive read operations.
Before every spa sync phase zfs reads the space maps from disk.

I decided to turn on compression for the pool ( only for the pool, not the filesystems )
and it helps.
Now the space maps, intent log and spa history are compressed.

Now I'm thinking about disabling checksums. All metadata are written in 2 copies,
so when I have compression=on do I need checksums?
Will zfs try to read the second block when zio_decompress_data returns an error?

Is there another way to check the space map compression ratio?
Now I'm using #zdb -bb pool but it takes hours.
 
 


[zfs-discuss] Re: zfs destroy takes long time

2007-08-24 Thread Łukasz K
On 23-08-2007 at 22:15, Igor Brezac wrote:
 We are on Solaris 10 U3 with relatively recent recommended patches
 applied.  zfs destroy of a filesystem takes a very long time; 20GB usage
 and about 5 million objects takes about 10 minutes to destroy.  zfs pool
 is a 2 drive stripe, nothing too fancy.  We do not have any snapshots.
 
 Any ideas?

Maybe your pool is fragmented and the pool space map is very big.

Run this script:

#!/bin/sh

echo '::spa' | mdb -k | grep ACTIVE \
  | while read pool_ptr state pool_name
do
  echo "checking pool map size [B]: $pool_name"

  echo "${pool_ptr}::walk metaslab|::print -d struct metaslab ms_smo.smo_objsize" \
    | mdb -k \
    | nawk '{sub("^0t","",$3);sum+=$3}END{print sum}'
done

This will show the size of the pool space map on disk ( in bytes ).
When destroying a filesystem or snapshot on a fragmented pool, the kernel
will have to:
1. read the space map ( in memory the space map will take 4x more RAM )
2. make the changes
3. write the space map ( the space map is kept on disk in 2 copies )

I don't know any workaround for this bug.


Lukas




Re: [zfs-discuss] ZFS+NFS on storedge 6120 (sun t4)

2007-08-06 Thread Łukasz
I think you have a problem with pool fragmentation. We have the same problem,
and changing the recordsize will help. You have to set a smaller recordsize for the
pool ( all filesystems must have the same or a smaller recordsize ). First check whether
you have problems with finding free blocks with this dtrace script:

#!/usr/sbin/dtrace -s


fbt::space_map_alloc:entry
{
   self->s = arg1;
}

fbt::space_map_alloc:return
/arg1 != -1/
{
  self->s = 0;
}

fbt::space_map_alloc:return
/self->s && (arg1 == -1)/
{
  @s = quantize(self->s);
  self->s = 0;
}

tick-10s
{
  printa(@s);
}
 
 


[zfs-discuss] Re: Snapshots impact on performance

2007-07-27 Thread Łukasz K
On 26-07-2007 at 13:31, Robert Milkowski wrote:
 Hello Victor,
 
 Wednesday, June 27, 2007, 1:19:44 PM, you wrote:
 
 VL Gino wrote:
  Same problem here (snv_60).
  Robert, did you find any solutions?
 
 VL Couple of week ago I put together an implementation of space maps
 which
 VL completely eliminates loops and recursion from space map alloc
 VL operation, and allows to implement different allocation strategies
 quite
 VL easily (of which I put together 3 more). It looks like it works for me
 VL on thumper and my notebook with ZFS Root though I have almost no
 time to
 VL test it more these days due to year end. I haven't done SPARC build
 yet
 VL and I do not have test case to test against.
 
 VL Also, it comes at a price - I have to spend some more time
 (logarithmic,
 VL though) during all other operations on space maps and is not
 optimized now.
 
 Lukasz (cc) - maybe you can test it and even help on tuning it?
 
Yes, I can test it. I'm building an environment to compile OpenSolaris
and test zfs. I will be ready next week.

Victor, can you tell me where to look for your changes?
How do I change the allocation strategy?
I can see that by changing space_map_ops_t
I can declare different callback functions.

Lukas




Re: [zfs-discuss] Snapshots impact on performance

2007-07-27 Thread Łukasz
 Same problem here (snv_60).
 Robert, did you find any solutions?
 
 gino

Check this: http://www.opensolaris.org/jive/thread.jspa?threadID=34423&tstart=0

Check the spa_sync function time ( remember to change POOL_NAME! ):

dtrace -q -n fbt::spa_sync:entry'/(char *)(((spa_t*)arg0)->spa_name) == "POOL_NAME"/{ self->t = timestamp; }' \
  -n fbt::spa_sync:return'/self->t/{ @m = max((timestamp - self->t)/100); self->t = 0; }' \
  -n tick-10m'{ printa([EMAIL PROTECTED],@m); exit(0); }'

If you have long spa_sync times, check whether you have
problems with finding new blocks in the space map with this script:

#!/usr/sbin/dtrace -s

fbt::space_map_alloc:entry
{
   self->s = arg1;
}

fbt::space_map_alloc:return
/arg1 != -1/
{
  self->s = 0;
}

fbt::space_map_alloc:return
/self->s && (arg1 == -1)/
{
  @s = quantize(self->s);
  self->s = 0;
}

tick-10s
{
  printa(@s);
}

Then change the recordsize: zfs set recordsize=XX POOL_NAME. Make sure that all
filesystems inherit the recordsize:
  #zfs get -r recordsize POOL_NAME

The other thing is space map size.

Check the map size:

echo '::spa' | mdb -k | grep 'f[0-9]*-[0-9]*' \
  | while read pool_ptr state pool_name
do
  echo "${pool_ptr}::walk metaslab|::print -d struct metaslab ms_smo.smo_objsize" \
    | mdb -k \
    | nawk '{sub("^0t","",$3);sum+=$3}END{print sum}'
done

The value you get is the space map size on disk. In memory the space map will take
about 4 * size_on_disk. Sometimes during snapshot removal the kernel will have to load
all space maps into memory. For example,
if the space map on disk takes 1GB, then the kernel will:
 - read 1GB from disk ( or from cache ) in the spa_sync function
 - allocate 4GB for the avl trees
 - do all operations on the avl trees
 - save the maps

It is good to have enough free memory for these operations.
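A quick, hedged way to see how much memory that currently is, based on the ::kmastat / kmem_alloc_40 hints used elsewhere in these threads:

  # the kmem_alloc_40 cache backs the in-memory space map entries
  echo '::kmastat' | mdb -k | grep 'kmem_alloc_40 '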

You can reduce the space map by copying all filesystems to another pool. I recommend
zfs send.

regards

Lukas
 
 


Re: [zfs-discuss] ZFS send needs optimization

2007-07-24 Thread Łukasz
  Ł I want to parallelize zfs send.
  Ł dmu_sendbackup could allocate a buffer that will be used for buffering output.
  Ł A few threads can traverse the dataset, a few threads would be used
  Ł for async read operations.
 
  Ł I think it could speed up the zfs send operation 10x.
 
  Ł What do you think about it?
 
 You're right that we need to issue more i/os in parallel -- see 6333409
 traversal code should be able to issue multiple reads in parallel

When do you think it will be available ?

 However, it may be much more straightforward to just
 issue prefetches 
 appropriately, rather than attempt to coordinate
 multiple threads.  That 
 said, feel free to experiment.

How can I prefetch data? Traverse the dataset in a second thread?

Correct me if I'm wrong.
Adding simple buffering could speed up the send operation. Now for each packet
we call the vn_rdwr function.

What do you think about a smaller dmu_replay_record_t struct?
Remove
  char drr_toname[MAXNAMELEN];
from the drr_begin struct, and for the DRR_BEGIN record add a read/write of MAXNAMELEN bytes.
 
 


[zfs-discuss] ZFS pool fragmentation

2007-07-10 Thread Łukasz
I have a huge problem with ZFS pool fragmentation.
I started investigating the problem about 2 weeks ago:
http://www.opensolaris.org/jive/thread.jspa?threadID=34423&tstart=0

I found a workaround for now - changing the recordsize - but I want a better solution.
The best solution would be a defragmentation tool, but I can see that it is not
easy.

When a ZFS pool is fragmented:
 1. the spa_sync function executes for a very long time ( > 5 seconds )
 2. the spa_sync thread often takes 100% CPU
 3. the metaslab space map is very big

There are some changes hiding the problem, like this one:
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6512391
and I hope they will be available in Solaris 10 update 4.

But I suggest that:
1. in the sync phase, when for the first time we do not find a block we need
( for example 128k ), the pool should remember this for some time ( 5 minutes )
and stop asking for this kind of block.

2. We should be more careful with unloading space maps.
At the end of the sync phase, space maps for metaslabs without the active flag are
unloaded.
On my fragmented pool a space map with 800MB of space available ( out of 2GB )
is unloaded because there were no 128K blocks.
 
 


Re: [zfs-discuss] ZFS performance and memory consumption

2007-07-07 Thread Łukasz
 When tuning recordsize for things like databases, we
 try to recommend
 that the customer's recordsize match the I/O size of
 the database
 record.

On this filesystem I have:
 - file links, and they are rather static
 - small files ( about 8kB ) that keep changing
 - big files ( 1MB - 20MB ) used as temporary files ( create, write, read, unlink ),
   and operations on these files are about 50% of all I/O

I think I need a defragmentation tool. Do you think there will be one?

Now all I can do is copy the filesystems from this zpool to another.
After this operation the new zpool will not be fragmented.
But it takes time and I have several zpools like this.
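A minimal sketch of that copy for a single filesystem, with placeholder pool and filesystem names; on releases without zfs send -R it has to be repeated for each filesystem (and for each incremental snapshot):

  # snapshot the source filesystem and stream it into the new pool;
  # the received copy is written out contiguously, so it is not fragmented
  zfs snapshot tank/data@move
  zfs send tank/data@move | zfs receive newpool/data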
 
 


Re: [zfs-discuss] ZFS performance and memory consumption

2007-07-06 Thread Łukasz
The field ms_smo.smo_objsize in the metaslab struct is the size of the data on disk.
I checked the size of the metaslabs in memory:

  ::walk spa | ::walk metaslab | ::print struct metaslab ms_map.sm_root.avl_numnodes

I got 1GB.

But only some metaslabs are loaded:

  ::walk spa | ::walk metaslab | ::print struct metaslab ms_map.sm_root.avl_numnodes ! grep 0x | wc -l
  231

out of 664 metaslabs. And the number of loaded metaslabs is changing very fast.

Is there a way to keep all metaslabs in RAM? Is there any limit?
I encourage other administrators to check their space map size.
 
 


Re: [zfs-discuss] ZFS performance and memory consumption

2007-07-06 Thread Łukasz
After a few hours with dtrace and source code browsing I found that in my space
map there are no 128K blocks left.
Try this on your ZFS:
  dtrace -n fbt::metaslab_group_alloc:return'/arg1 == -1/{}'

If you get probe firings, then you also have the same problem.
Allocating from the space map works like this:
1. metaslab_group_alloc wants to allocate a 128K block
2. for (all metaslabs) {
       read the space map and look for a 128K block
       if there is no such block, remove the METASLAB_ACTIVE_MASK flag
   }
3. unload the maps of all metaslabs without METASLAB_ACTIVE_MASK

That is why spa_sync takes so much time.

Now the workaround:
  zfs set recordsize=8K pool

Now the spa_sync function takes 1-2 seconds, the processor is idle,
and only a few metaslab space maps are loaded:

  > 0600103ee500::walk metaslab |::print struct metaslab ms_map.sm_loaded ! grep -c 0x
  3

But now I have another question.
How will 8K blocks impact performance?
 
 


Re: [zfs-discuss] ZFS performance and memory consumption

2007-07-06 Thread Łukasz
If you want to know which blocks you do not have:

  dtrace -n fbt::metaslab_group_alloc:entry'{ self->s = arg1; }' \
    -n fbt::metaslab_group_alloc:return'/arg1 != -1/{ self->s = 0 }' \
    -n fbt::metaslab_group_alloc:return'/self->s && (arg1 == -1)/{ @s = quantize(self->s); self->s = 0; }' \
    -n tick-10s'{ printa(@s); }'

and which blocks you do not have in some metaslabs:

  dtrace -n fbt::space_map_alloc:entry'{ self->s = arg1; }' \
    -n fbt::space_map_alloc:return'/arg1 != -1/{ self->s = 0 }' \
    -n fbt::space_map_alloc:return'/self->s && (arg1 == -1)/{ @s = quantize(self->s); self->s = 0; }' \
    -n tick-10s'{ printa(@s); }'

If the metaslab_group_alloc output looks like this:

           value  ------------- Distribution ------------- count
           65536 |                                          0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  9065
          262144 |                                          0

then you can set the zfs recordsize to 64k.
 
 


[zfs-discuss] ZFS performance and memory consumption

2007-07-05 Thread Łukasz
Hello,
I'm investigating a problem with ZFS over NFS.
The problems started about 2 weeks ago; most nfs threads are hanging in
txg_wait_open.
The sync thread is consuming one processor all the time, and the
average spa_sync time from entry to return is 2 minutes.
I can't use dtrace to examine the problem, because I keep getting:
 dtrace: processing aborted: Abort due to systemic unresponsiveness

Using mdb and examining tx_sync_thread with ::findstack I keep getting this
stack:
 [ fe8002da1410 _resume_from_idle+0xf8() ]
  fe8002da1570 avl_walk+0x39()
  fe8002da15a0 space_map_alloc+0x21()
  fe8002da1620 metaslab_group_alloc+0x1a2()
  fe8002da16b0 metaslab_alloc_dva+0xab()
  fe8002da1700 metaslab_alloc+0x51()
  fe8002da1720 zio_dva_allocate+0x3f()
  fe8002da1730 zio_next_stage+0x72()
  fe8002da1750 zio_checksum_generate+0x5f()
  fe8002da1760 zio_next_stage+0x72()
  fe8002da17b0 zio_write_compress+0x136()
  fe8002da17c0 zio_next_stage+0x72()
  fe8002da17f0 zio_wait_for_children+0x49()
  fe8002da1800 zio_wait_children_ready+0x15()
  fe8002da1810 zio_next_stage_async+0xae()
  fe8002da1820 zio_nowait+9()
  fe8002da18b0 arc_write+0xe7()
  fe8002da19a0 dbuf_sync+0x274()
  fe8002da1a10 dnode_sync+0x2e3()
  fe8002da1a60 dmu_objset_sync_dnodes+0x7b()
  fe8002da1af0 dmu_objset_sync+0x6a()
  fe8002da1b10 dsl_dataset_sync+0x23()
  fe8002da1b60 dsl_pool_sync+0x7b()
  fe8002da1bd0 spa_sync+0x116()

I also managed to sum the metaslab space maps:
  ::walk spa | ::walk metaslab | ::print struct metaslab ms_smo.smo_objsize 
and I got 1GB.

I have a 1.3T pool with 500G of available space.
The pool was created about 3 months ago.
I'm using Solaris 10 U3.

Do you think changing the system to Nevada will help?
I read that there are some changes that can help:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6512391
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6532056
 
 


[zfs-discuss] ZFS filesystem online backup question

2007-03-27 Thread Łukasz
I have to back up many filesystems which keep changing, and the machines are heavily
loaded.
The idea is to back up online - this should avoid I/O read operations from
disks;
the data should come from the cache.

Now I'm using a script that does a snapshot and zfs send.
I want to automate this operation and add a new option to zfs send:

 zfs send [-w sec] [-i snapshot] snapshot

for example

 zfs send -w 10 pool/[EMAIL PROTECTED]

zfs send then would:
1. create the replicate snapshot if it does not exist
2. send the data
3. wait 10 seconds
4. rename the snapshot to replicate_previous ( destroying the previous one if it exists )
5. go to 1

All snapshot operations are done in the kernel - it works faster that way.
I have implemented this mechanism and it works.
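For comparison, a rough userland sketch of the same loop (pool/fs, backuphost and backup/fs are placeholders; the in-kernel version avoids repeatedly forking these commands, which a later message in this digest times at several seconds each):

  #!/bin/sh
  FS=pool/fs
  while true
  do
          # 1. create the "latest" snapshot if it does not exist yet
          zfs list -t snapshot $FS@replicate_latest >/dev/null 2>&1 \
                  || zfs snapshot $FS@replicate_latest

          # 2. send the data (incremental when a previous snapshot exists)
          if zfs list -t snapshot $FS@replicate_previous >/dev/null 2>&1; then
                  zfs send -i $FS@replicate_previous $FS@replicate_latest \
                          | ssh backuphost zfs receive -F backup/fs
          else
                  zfs send $FS@replicate_latest | ssh backuphost zfs receive -F backup/fs
          fi

          # 3. wait
          sleep 10

          # 4. rename "latest" to "previous", destroying the old "previous"
          zfs destroy $FS@replicate_previous 2>/dev/null
          zfs rename $FS@replicate_latest $FS@replicate_previous
  done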

Do you think this change will be integrated into OpenSolaris?
Is there a chance this option will be available in Solaris 10 update 4?

Maybe there is another way to back up a filesystem online?
I tried to traverse a changing filesystem, but it does not work.
 
 


[zfs-discuss] Re: Re: asize is 300MB smaller than lsize - why?

2007-03-27 Thread Łukasz
I have another question about replication in this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=27082&tstart=0
 
 


[zfs-discuss] Re: ZFS filesystem online backup question

2007-03-27 Thread Łukasz
Out of curiosity, what is the timing difference between a userland script
and performing the operations in the kernel?

[EMAIL PROTECTED] ~]# time zfs destroy solaris/[EMAIL PROTECTED] ; time zfs 
rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED]; time zfs snapshot 
solaris/[EMAIL PROTECTED]

real0m5.220s
user0m0.010s
sys 0m0.023s

real0m5.856s
user0m0.010s
sys 0m0.023s

real0m7.620s
user0m0.009s
sys 0m0.029s
[EMAIL PROTECTED] ~]# time zfs destroy solaris/[EMAIL PROTECTED] ; time zfs 
rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED]; time zfs snapshot 
solaris/[EMAIL PROTECTED]

real0m7.363s
user0m0.010s
sys 0m0.031s

real0m5.107s
user0m0.010s
sys 0m0.022s

real0m7.888s
user0m0.009s
sys 0m0.024s

The operation takes 15 - 20 seconds.

In the kernel it takes ( time in ms ):
  0  42867   dmu_objset_snapshot:return time 2471
  1  42867   dmu_objset_snapshot:return time 10803
  1  42867   dmu_objset_snapshot:return time 7968
  0  42867   dmu_objset_snapshot:return time 14139
  0  42867   dmu_objset_snapshot:return time 14405
  1  42867   dmu_objset_snapshot:return time 8883
  0  42867   dmu_objset_snapshot:return time 4960
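The probe used is not shown; a hedged guess at the kind of one-liner that produces timing output like the above:

  # time each dmu_objset_snapshot() call in milliseconds
  dtrace -n 'fbt::dmu_objset_snapshot:entry{ self->t = timestamp; }' \
         -n 'fbt::dmu_objset_snapshot:return/self->t/{
                 printf("time %d", (timestamp - self->t) / 1000000);
                 self->t = 0;
             }'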

For now the code in the kernel is without optimization:

zfs_unmount_snap(snap_previous, NULL);
dmu_objset_destroy(snap_previous);
zfs_unmount_snap(zc->zc_value, NULL);
dmu_objset_rename(zc->zc_value, snap_previous);
error = dmu_objset_snapshot(zc->zc_name,
    REPLICATE_SNAPSHOT_LATEST, 0);

In the kernel the operation can be optimized and done in one dsl_sync_task_do call.
 
 


[zfs-discuss] crash during snapshot operations

2007-03-23 Thread Łukasz
When I try to do the following in the kernel, in a zfs ioctl:
 1. snapshot destroy PREVIOUS
 2. snapshot rename LATEST -> PREVIOUS
 3. snapshot create LATEST

code is:
/* delete previous snapshot */
zfs_unmount_snap(snap_previous, NULL);
dmu_objset_destroy(snap_previous);

/* rename snapshot */
zfs_unmount_snap(snap_latest, NULL);
dmu_objset_rename(snap_latest, snap_previous);

/* create snapshot */
dmu_objset_snapshot(zc->zc_name,
    REPLICATE_SNAPSHOT_LATEST, 0);

I get a kernel panic.

MDB:
 > ::status
debugging crash dump vmcore.3 (32-bit) from zfs.dev
operating system: 5.11 snv_56 (i86pc)
panic message: BAD TRAP: type=8 (#df Double fault) rp=fec244f8 addr=d5904ffc
dump content: kernel pages only

This happens only when the ZFS filesystem is loaded with I/O operations.
( I copy the studio11 folder onto this filesystem. )

MDB ::stack shows nothing, but walking the threads I found:

stack pointer for thread d8ff9e00: d421b028
  d421b04c zio_pop_transform+0x45(d9aba380, d421b090, d421b070, d421b078)
  d421b094 zio_clear_transform_stack+0x23(d9aba380)
  d421b200 zio_done+0x12b(d9aba380)
  d421b21c zio_next_stage+0x66(d9aba380)
  d421b230 zio_checksum_verify+0x17(d9aba380)
  d421b24c zio_next_stage+0x66(d9aba380)
  d421b26c zio_wait_for_children+0x46(d9aba380, 11, d9aba570)
  d421b280 zio_wait_children_done+0x18(d9aba380)
  d421b298 zio_next_stage+0x66(d9aba380)
  d421b2d0 zio_vdev_io_assess+0x11a(d9aba380)
  d421b2e8 zio_next_stage+0x66(d9aba380)
  d421b368 vdev_cache_read+0x157(d9aba380)
  d421b394 vdev_disk_io_start+0x35(d9aba380)
  d421b3a4 vdev_io_start+0x18(d9aba380)
  d421b3d0 zio_vdev_io_start+0x142(d9aba380)
  d421b3e4 zio_next_stage_async+0xac(d9aba380)
  d421b3f4 zio_nowait+0xe(d9aba380)
  d421b424 vdev_mirror_io_start+0x151(deab5cc0)
  d421b450 zio_vdev_io_start+0x14f(deab5cc0)
  d421b460 zio_next_stage+0x66(deab5cc0)
  d421b470 zio_ready+0x124(deab5cc0)
  d421b48c zio_next_stage+0x66(deab5cc0)
  d421b4ac zio_wait_for_children+0x46(deab5cc0, 1, deab5ea8)
  d421b4c0 zio_wait_children_ready+0x18(deab5cc0)
  d421b4d4 zio_next_stage_async+0xac(deab5cc0)
  d421b4e4 zio_nowait+0xe(deab5cc0)
  d421b520 arc_read+0x3cc(d8a2cd00, da9f6ac0, d418e840, f9e55e5c, f9e249b0, 
d515c010)
  d421b590 dbuf_read_impl+0x11b(d515c010, d8a2cd00, d421b5cc)
  d421b5bc dbuf_read+0xa5(d515c010, d8a2cd00, 2)
  d421b5fc dmu_buf_hold+0x7c(d47cb854, 4, 0, 0, 0, 0)
  d421b654 zap_lockdir+0x38(d47cb854, 4, 0, 0, 1, 1)
  d421b690 zap_lookup+0x23(d47cb854, 4, 0, d421b6e0, 8, 0)
  d421b804 dsl_dir_open_spa+0x10a(da9f6ac0, d8fde000, f9e7378f, d421b85c, 
d421b860)
  d421b864 dsl_dataset_open_spa+0x2c(0, d8fde000, 1, debe83c0, d421b938)
  d421b88c dsl_dataset_open+0x19(d8fde000, 1, debe83c0, d421b938)
  d421b940 dmu_objset_open+0x2e(d8fde000, 5, 1, d421b970)
  d421b974 dmu_objset_snapshot_one+0x2c(d8fde000, d421b998)
  d421bdb0 dmu_objset_snapshot+0xaf(d8fde000, d4c6a3e8, 0)
  d421c9e8 zfs_ioc_replicate_send+0x1ab(d8fde000)
  d421ce18 zfs_ioc_sendbackup+0x126()
  d421ce40 zfsdev_ioctl+0x100(2d8, 5a1e, 8046cac, 13, d5938650, 
d421cf78)
  d421ce6c cdev_ioctl+0x2e(2d8, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421ce94 spec_ioctl+0x65(d6591780, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421ced4 fop_ioctl+0x27(d6591780, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421cf84 ioctl+0x151()
  d421cfac sys_sysenter+0x101()

 > $r
%cs = 0x0158%eax = 0x
%ds = 0x0160%ebx = 0xe58abac0
%ss = 0x0160%ecx = 0x
%es = 0x0160%edx = 0x0018
%fs = 0x%esi = 0x
%gs = 0x01b0%edi = 0x

%eip = 0xfe8ebd71 kmem_free+0x111
%ebp = 0x
%esp = 0xfec24530

%eflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=of,df,IF,tf,sf,ZF,af,PF,cf

  %uesp = 0xd5905000
%trapno = 0x8
   %err = 0x0

I was trying to cause the error from the command line:
  [EMAIL PROTECTED] ~]# zfs destroy solaris/[EMAIL PROTECTED] ; zfs rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED]; zfs snapshot solaris/[EMAIL PROTECTED]

but without success.
Any ideas?
Any idea ?
 
 


[zfs-discuss] Re: asize is 300MB smaller than lsize - why?

2007-03-23 Thread Łukasz
 How it got that way, I couldn't really say without looking at your code. 

It works like this:

In the new ioctl operation

   zfs_ioc_replicate_send(zfs_cmd_t *zc)

we open the filesystem ( not a snapshot ):

   dmu_objset_open(zc->zc_name, DMU_OST_ANY,
       DS_MODE_STANDARD | DS_MODE_READONLY, &filesystem);

call the dmu replicate send function

   dmu_replicate_send(filesystem, &txg, ...);
   ( txg is the transaction group number )

and set max_txg:

   ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

then call traverse_dsl_dataset:

   traverse_dsl_dataset(filesystem->os->os_dsl_dataset, *txg,
       ADVANCE_PRE | ADVANCE_HOLES | ADVANCE_DATA | ADVANCE_NOLOCK,
       replicate_cb, &ba);

After traversing, the next txg is returned:

   if (ba.got_data != 0)
       *txg = ba.max_txg + 1;

In replicate_cb we do the same thing backup_cb does, but at the beginning we
check the txg:

   /* remember the last txg */
   if (bc->bc_blkptr.blk_birth) {

       if (bc->bc_blkptr.blk_birth > ba->max_txg)
           return;

       ba->got_data = 1;
   }

After a 5 second delay we call the ioctl again with the txg returned from the last operation.
 
 


[zfs-discuss] Re: crash during snapshot operations

2007-03-23 Thread Łukasz
Thanks for the advice.
I removed my buffers snap_previous and snap_latest and it helped.
I'm using zc->zc_value as the buffer.
 
 