[zfs-discuss] A simple script to measure SYNC writes

2009-02-10 Thread Sanjeev Bagewadi

Hi,

There was a requirement to measure all the O_SYNC/O_DSYNC writes.
Attached is a simple DTrace script which does this using the
fsinfo provider and fbt::fop_write.

I was wondering if this is accurate enough or if I missed any other cases.
I am sure this can be improved in many ways.

Thanks and regards,
Sanjeev

#!/usr/sbin/dtrace -Cs

/* CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License, Version 1.0 only
 * (the License).  You may not use this file except in compliance
 * with the License.
 *
 * You can obtain a copy of the license at Docs/cddl1.txt
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * CDDL HEADER END
 *
 * Author: Sanjeev Bagewadi [Bangalore, India]
 */

#pragma D option quiet

#include <sys/file.h>

BEGIN
{
	secs = 10;
}

fbt::fop_write:entry
/arg2 & (FSYNC | FDSYNC)/
{
	self->trace = 1;
}

fbt::fop_write:return
/self->trace/
{
	self->trace = 0;
}

fsinfo::fop_write:write
/self->trace/
{
	vp = (vnode_t *) arg0;
	vfs = (vfs_t *) vp->v_vfsp;
	mnt_pt = (char *)((refstr_t *)vfs->vfs_mntpt->rs_string);
	uio = (uio_t *) arg1;
	/* @writes[stringof(mnt_pt)] = sum(uio->uio_resid); */
	@writes[args[0]->fi_mount] = sum(args[1]);
}

tick-1s
/secs != 0/
{
	secs--;
}

tick-1s
/secs == 0/
{
	exit(0);
}

END
{
	printa("%s %@8d\n", @writes);
}

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot remove a file on a GOOD ZFS filesystem

2008-12-30 Thread Sanjeev Bagewadi
Marcelo,

Thanks for the details ! This rules out a bug that I was suspecting :
http://bugs.opensolaris.org/view_bug.do?bug_id=6664765

This needs more analysis.
What does the rm command fail with ?
We could probably run truss on the rm command like this :

# truss -o /tmp/rm.truss rm filename

You can then pass on the file /tmp/rm.truss to us.
This would show us which system call is failing and why. That would give
us a good idea of what is going wrong.

Thanks and regards,
Sanjeev.

Marcelo Leal wrote:
 Hello all,

 # zpool status
   pool: mypool
  state: ONLINE
  scrub: scrub completed after 0h2m with 0 errors on Fri Dec 19 09:32:42 2008
 config:

 NAME STATE READ WRITE CKSUM
 storage  ONLINE   0 0 0
   mirror ONLINE   0 0 0
 c0t2d0   ONLINE   0 0 0
 c0t3d0   ONLINE   0 0 0
   mirror ONLINE   0 0 0
 c0t4d0   ONLINE   0 0 0
 c0t5d0   ONLINE   0 0 0
   mirror ONLINE   0 0 0
 c0t6d0   ONLINE   0 0 0
 c0t7d0   ONLINE   0 0 0
   mirror ONLINE   0 0 0
 c0t8d0   ONLINE   0 0 0
 c0t9d0   ONLINE   0 0 0
   mirror ONLINE   0 0 0
 c0t10d0  ONLINE   0 0 0
 c0t11d0  ONLINE   0 0 0
   mirror ONLINE   0 0 0
 c0t12d0  ONLINE   0 0 0
 c0t13d0  ONLINE   0 0 0
 logs ONLINE   0 0 0
   c0t1d0 ONLINE   0 0 0

 errors: No known data errors

 -  zfs list -r  shows eight filesystems, and nine snapshots per filesystem.
 ...
 mypool/colorado 1.83G  4.00T  1.13G  
 /mypool/colorado
 mypool/colorado@centenario-2008-12-28-01:00:00   40.3M  -  1.46G  -
 mypool/colorado@centenario-2008-12-29-01:00:00   30.0M  -  1.54G  -
 mypool/colorado@campeao-2008-12-29-09:00:00  10.4M  -  1.24G  -
 mypool/colorado@campeao-2008-12-29-13:00:00  31.5M  -  1.29G  -
 mypool/colorado@campeao-2008-12-29-17:00:00  5.46M  -  1.10G  -
 mypool/colorado@campeao-2008-12-29-21:00:00  4.23M  -  1.13G  -
 mypool/colorado@centenario-2008-12-30-01:00:00   0  -  1.16G  -
 mypool/colorado@campeao-2008-12-30-01:00:00  0  -  1.16G  -
 mypool/colorado@campeao-2008-12-30-05:00:00  6.24M  -  1.16G  -
 ...
  
  - How many entries does it have ?
  Now there is just one file, the problematic one... but before the whole 
 problem, four or five small files (the whole pool is pretty empty).
 - Which filesystem (of the zpool) does it belong to ?
  See above...

  Thanks a lot!
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot remove a file on a GOOD ZFS filesystem

2008-12-30 Thread Sanjeev Bagewadi
Marcelo,

Thanks for the details.
Comments inline...

Marcelo Leal wrote:
 execve(/usr/bin/rm, 0x08047DBC, 0x08047DC8)  argc = 2
 mmap(0x, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, 
 -1, 0) = 0xFEFF
 resolvepath(/usr/lib/ld.so.1, /lib/ld.so.1, 1023) = 12
 resolvepath(/usr/bin/rm, /usr/bin/rm, 1023) = 11
 sysconfig(_CONFIG_PAGESIZE) = 4096
 xstat(2, /usr/bin/rm, 0x08047A68) = 0
 open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
 xstat(2, /lib/libc.so.1, 0x080471C8)  = 0
 resolvepath(/lib/libc.so.1, /lib/libc.so.1, 1023) = 14
 open(/lib/libc.so.1, O_RDONLY)= 3
   
 fstatat64(AT_FDCWD, Arquivos.file, 0x08047C80, 0x1000) Err#2 ENOENT
   
This is interesting !
Note that the fstatat64() call is failing with ENOENT. So, there is
something we are missing.
I assume you are able to list the directory contents and ascertain that
the file exists.
Can you please provide the directory listing (ls -l) of the directory
in question ?
Note that an ls -l would use fstat64 to get the stats of the files, so a
truss of ls -l would also help.
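
For example (the directory path is just a placeholder for wherever
Arquivos.file lives) :

# ls -l /path/to/directory
# truss -o /tmp/ls.truss ls -l /path/to/directory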

Thanks and regards,
Sanjeev.
 fstat64(2, 0x08046CE0)  = 0
 write(2,  r m :  , 4) = 4
 write(2,  Arquivos . fil.., 13)  = 13
 write(2,  :  , 2) = 2
 write(2,  N o   s u c h   f i l e.., 25)  = 25
 write(2, \n, 1)   = 1
 _exit(2)
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot remove a file on a GOOD ZFS filesystem

2008-12-29 Thread Sanjeev Bagewadi
Marcelo,

Marcelo Leal wrote:
 Hello all...
  Can that be caused by some cache on the LSI controller? 
  Some flush that the controller or disk did not honour?
   
More details on the problem would help. Can you please give the 
following details :
- zpool status
- zfs list -r
- The details of the directory :
- How many entries does it have ?
- Which filesystem (of the zpool) does it belong to ?

Thanks and regards,
Sanjeev.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] `zfs list` doesn't show my snapshot

2008-11-23 Thread Sanjeev Bagewadi
Jel,

Jens Elkner wrote:
 On Fri, Nov 21, 2008 at 03:42:17PM -0800, David Pacheco wrote:
   
 Pawel Tecza wrote:
 
 But I still don't understand why `zfs list` doesn't display snapshots
 by default. I saw it in the Net many times at the examples of zfs usage.
   
 This was PSARC/2008/469 - excluding snapshot info from 'zfs list'

 http://opensolaris.org/os/community/on/flag-days/pages/2008091003/
 

 The incomplete one - where is the '-t all' option? It's really annoying,
 error prone, and time consuming to type stories on the command line ...
 Does anybody remember the keep it small and simple thing?
   
This change was made because there were a lot of users who had a large
number of snapshots. This caused 2 problems :
- The listing of all the snapshots and filesystems would be really long.
- Also, this would take a rather long time...

Those who want the older behaviour can still set the listsnapshots
property accordingly.
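
For example (assuming a pool called mypool ; please double-check the exact
property name with zpool get all on your build, since this is from memory) :

# zpool set listsnapshots=on mypool
# zfs list -r mypool
# zfs list -t snapshot -r mypool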

Hope that helps.

Regards,
Sanjeev.

 Regards,
 jel.
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How can i make my zpool as faulted.

2008-10-20 Thread Sanjeev Bagewadi
Yuvraj,

I see that you are using files as disks.
You could write a few random bytes to one of the files and that would
induce corruption.
To make a particular disk faulty you could mv the file to a new name.

Also, you can explore zinject from the ZFS test suite. It probably has a
way to inject faults.
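
For example, something along these lines (the file names are taken from
your zpool status output below ; the dd offsets/counts are arbitrary, so
please try this only on a throwaway pool) :

# dd if=/dev/urandom of=/disk1 bs=512 count=32 seek=10000 conv=notrunc
# mv /disk2 /disk2.hidden
# zpool scrub mypool1
# zpool status -v mypool1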

Thanks and regards,
Sanjeev

yuvraj wrote:
 Hi Sanjeev,
 I am herewith giving all the details of my zpool by
 firing the #zpool status command on the command line. Please go through the
 same and help me out.

  Thanks in advance.

   
 Regards,
   Yuvraj 
 Balkrishna Jadhav.

 ==

 # zpool status
   pool: mypool1
  state: ONLINE
  scrub: none requested
 config:

 NAMESTATE READ WRITE CKSUM
 mypool1 ONLINE   0 0 0
   /disk1ONLINE   0 0 0
   /disk2ONLINE   0 0 0

 errors: No known data errors

   pool: zpool21
  state: ONLINE
  scrub: scrub completed with 0 errors on Sat Oct 18 13:01:52 2008
 config:

 NAMESTATE READ WRITE CKSUM
 zpool21 ONLINE   0 0 0
   /disk3ONLINE   0 0 0
   /disk4ONLINE   0 0 0

 errors: No known data errors
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver running for 35 trillion years

2008-06-24 Thread Sanjeev Bagewadi
Mike,

Indeed an interesting result :) !
This is a known problem with VirtualBox :)
They have fixed it in the latest release
-- snip --

#1639: Solaris Virtual box guest keeps getting its time reset after resuming VM
from suspend
+---
  Reporter:  nagki  |  Owner:  
  Type:  defect | Status:  closed  
  Priority:  major  |Version:  VirtualBox 1.6.0
Resolution:  duplicate  |   Keywords:  time reset guest
+---
Changes (by sandervl73):

  * status:  new = closed
  * resolution:  = duplicate

Comment:

 Duplicate and fixed in 1.6.2 (due out in a day or two)
-- snip --

Cheers,
Sanjeev.


Mike Gerdts wrote:
 This is good for a chuckle.

 # zpool status
   pool: rpool
  state: ONLINE
 status: One or more devices is currently being resilvered.  The pool will
 continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
  scrub: resilver in progress for 307445734561488536h47m, 19.31% done,
 307445734560416371h48m to go
 config:

 NAMESTATE READ WRITE CKSUM
 rpool   ONLINE   0 0 0
   mirrorONLINE   0 0 0
 c7d0s0  ONLINE   0 0 0
 c7d1s0  ONLINE   0 0 0

 errors: No known data errors


 I'm all for the 128 bit file system being able to use every atom in
 the universe for storage, but I doubt that this pool has been
 resilvering for over 35 trillion years.  If it has, I'm certainly not
 staying up to wait for it to finish...

 How did this happen?  According to the timestamps in my prompt, I'm
 thinking that virtualbox reset the time to zero while the command was
 running.  This seems to happen from time to time, but this is the most
 entertaining result I have seen.

   


-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation

2008-06-18 Thread Sanjeev Bagewadi
Lance,

This could be bug 6596237 "Stop looking and start ganging"
(http://monaco.sfbay/detail.jsf?cr=6596237).
The fix is in progress and Victor Latushkin is working on it.

We have an IDR based on the patches 127127-11/127128-11 which has the 
first cut of the fix.
You could raise an escalation and get these IDRs.
Thanks and regards,
Sanjeev.

Lance wrote:
 Any progress on a defragmentation utility?  We appear to be having a severe 
 fragmentation problem on an X4500, vanilla S10U4, no additional patches.  
 500GB disks in 4 x 11 disk RAIDZ2 vdevs.  It hit 97% full and fell off a 
 cliff...about 50KB/sec on writes.  Deleting files so the zpool is at 92% has 
 not helped.  I rebooted the host...no difference.  I lowered the recordsize 
 from 128KB to 8KB.  That has boosted performance to 250-500KB/sec on writes 
 (still 10x-100x too slow).  Reads have been fine all along.

 This is one big zpool and one file system of 16TB.  Approximately 25-30M 
 files, some of which change often.  Lots of small, changing files, which are 
 probably aggravating the problem.  Due to the Marvell driver bug, I have SATA 
 NCQ turned off in /etc/system via set sata:sata_func_enable=0x5.  We plan 
 to go to the most recent patch set so I can remove that, but I'm not 
 convinced patching will fix the slowness we're seeing.

 We'll try to delete more files, but having a defragmentation utility might 
 help in this case.  It seems a shame to waste 10-20% of your disk space to 
 maintain moderate performance, though I guess that's what we'll have to do.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS space map causing slow performance

2008-06-09 Thread Sanjeev Bagewadi
Scott,

This looks more like bug 6596237 "Stop looking and start ganging"
(http://monaco.sfbay/detail.jsf?cr=6596237).
What version of Solaris are the production servers running (S10 or 
Opensolaris) ?

Thanks and regards,
Sanjeev.

Scott wrote:
 Hello,

 I have several ~12TB storage servers using Solaris with ZFS.  Two of them 
 have recently developed performance issues where the majority of time in an 
 spa_sync() will be spent in the space_map_*() functions.  During this time, 
 zpool iostat will show 0 writes to disk, while it does hundreds or 
 thousands of small (~3KB) reads each second, presumably reading space map 
 data from disk to find places to put the new blocks.  The result is that it 
 can take several minutes for an spa_sync() to complete, even if I'm only 
 writing a single 128KB block.

 Using DTrace, I can see that space_map_alloc() frequently returns -1 for 
 128KB blocks.  From my understanding of the ZFS code, that means that one or 
 more metaslabs has no 128KB blocks available.  Because of that, it seems to 
 be spending a lot of time going through different space maps which aren't 
 able to all be cached in RAM at the same time, thus causing bad performance 
 as it has to read from the disks.  The on-disk space map size seems to be 
 about 500MB.

 I assume the simple solution is to leave enough free space available so that 
 the space map functions don't have to hunt around so much.  This problem 
 starts happening when there's about 1TB free out of the 12TB.  It seems like 
 such a shame to waste that much space, so if anyone has any suggestions, I'd 
 be glad to hear them.

 1) Is there anything I can do to temporarily fix the servers that are having 
 this problem? They are production servers, and I have customers complaining, 
 so a temporary fix is needed.

 2) Is there any sort of tuning I can do with future servers to prevent this 
 from becoming a problem?  Perhaps a way to make sure all the space maps are 
 always in RAM?

 3) I set recordsize=32K and turned off compression, thinking that should fix 
 the performance problem for now.  However, using a DTrace script to watch 
 calls to space_map_alloc(), I see that it's still looking for 128KB blocks 
 (!!!) for reasons that are unclear to me, thus it hasn't helped the problem.

 Thanks,
 Scott
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   


-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread Sanjeev Bagewadi
William,

It should be fairly easy to find the record size using DTrace. Take an
aggregation of the writes happening (aggregate on size for all the
write(2) system calls).

This would give a fair idea of the I/O size pattern.

Does RRD4J have a record size mentioned ? Usually if it is a database
application there is a record-size option when the DB is created (based
on my limited knowledge about DBs).

Thanks and regards,
Sanjeev.

PS : Here is a simple script which just aggregates on the write size and 
executable name :
-- snip --
#!/usr/sbin/dtrace -s


syscall::write:entry
{
wsize = (size_t) arg2;
@write[wsize, execname] = count();
}
-- snip --
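
A variant that shows the distribution rather than just the counts (the
execname filter is only an example -- replace "java" with whatever process
is doing the writes) :

# dtrace -n 'syscall::write:entry /execname == "java"/ { @["write size (bytes)"] = quantize(arg2); }'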

William Fretts-Saxton wrote:
 Unfortunately, I don't know the record size of the writes.  Is it as simple 
 as looking @ the size of a file, before and after a client request, and 
 noting the difference in size?  This is binary data, so I don't know if that 
 makes a difference, but the average write size is a lot smaller than the file 
 size.  

 Should the recordsize be in place BEFORE data is written to the file system, 
 or can it be changed after the fact?  I might try a bunch of different 
 settings for trial and error.

 The I/O is actually done by RRD4J, which is a round-robin database library.  
 It is a Java version of 'rrdtool' which saves data into a binary format, but 
 also cleans up the data according to its age, saving less of the older data 
 as time goes on.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   


-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs device busy

2008-01-04 Thread Sanjeev Bagewadi
Carol,

Probably /mnt is already in use, i.e. some other filesystem is mounted
there.
Can you please verify ?

What is the original mountpoint of pool/zfs1 ?
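
For example, something like this should show what is currently mounted at
/mnt and what ZFS thinks about pool/zfs1 :

# df -h /mnt
# mount -v | grep /mnt
# zfs get mountpoint,mounted pool/zfs1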

Regards,
Sanjeev.

Caroline Carol wrote:
 Hi all,
  
 When i modify zfs FS propreties I get device busy
  
 -bash-3.00# zfs set mountpoint=/mnt1 pool/zfs1
 cannot unmount '/mnt': Device busy
  
  
 Do you know how to identify the process accessing this FS ?
 fuser doesn't work with zfs!
  
  
 Thanks a lot
 regards
  
 Carol

 
 

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   


-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs won't import a pool automatically at boot

2007-10-16 Thread Sanjeev Bagewadi
Michael,

If you don't call zpool export -f tank it should work.
However, it would be necessary to understand why you are using the above 
command after creation of the zpool.

Can you avoid exporting after the creation ?

Regards,
Sanjeev


Michael Goff wrote:
 Hi,

 When jumpstarting s10x_u4_fcs onto a machine, I have a postinstall script 
 which does:

 zpool create tank c1d0s7 c2d0s7 c3d0s7 c4d0s7
 zfs create tank/data
 zfs set mountpoint=/data tank/data
 zpool export -f tank

 When jumpstart finishes and the node reboots, the pool is not imported 
 automatically. I have to do:

 zpool import tank

 for it to show up. Then on subsequent reboots it imports and mounts 
 automatically. What I can I do to get it to mount automatically the first 
 time? When I didn't have the zpool export I would get an message that I 
 needed to use zpool import -f because it wasn't exported properly from 
 another machine. So it looks like the state of the pool created during the 
 jumpstart install was lost.

 BTW, I love using zfs commands to manage filesystems. They are so easy and 
 intuitive!

 thanks,
 Mike
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   


-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs won't import a pool automatically at boot

2007-10-16 Thread Sanjeev Bagewadi
Thanks Robert ! I missed that part.
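
For the archives, here is a consolidated finish-script sketch along the
lines Robert suggested (untested ; the devices are the ones from Mike's
original post) :

-- snip --
#!/bin/sh
# create the pool and filesystems from the jumpstart finish script
zpool create tank c1d0s7 c2d0s7 c3d0s7 c4d0s7
zfs create tank/data
zfs set mountpoint=/data tank/data
# preserve the pool state for the installed system before exporting,
# so the pool is imported automatically on first boot
cp -p /etc/zfs/zpool.cache /a/etc/zfs/zpool.cache
zpool export -f tank
-- snip --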

-- Sanjeev.

Michael Goff wrote:
 Great, thanks Robert. That's what I was looking for. I was thinking 
 that I would have to transfer the state somehow from the temporary 
 jumpstart environment to /a so that it would be persistent. I'll test 
 it out tomorrow.

 Sanjeev, when I did not have the zpool export, it still did not import 
 automatically upon reboot after the jumpstart. And when I imported it 
 manually, if gave an error. So that's why I added the export.

 Mike

 Robert Milkowski wrote:
 Hello Sanjeev,

 Tuesday, October 16, 2007, 10:14:01 AM, you wrote:

 SB Michael,

 SB If you don't call zpool export -f tank it should work.
 SB However, it would be necessary to understand why you are using 
 the above
 SB command after creation of the zpool.

 SB Can you avoid exporting after the creation ?


 It won't help during jumpstart as /etc is not the same one as after he
 will boot.

 Before you export a pool put in your finish script:

 cp -p /etc/zfs/zpool.cache /a/etc/zfs/

 Then export a pool. It should do the trick.



-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Directory

2007-05-31 Thread Sanjeev Bagewadi

Kanishk,

Directories are implemented as ZAP objects.

Look at the routines in that order :
- zfs_lookup()
- zfs_dirlook()
- zfs_dirent_lock()
- zap_lookup

Hope that helps.

Regards,
Sanjeev.

kanishk wrote:

I wanted to know how ZFS finds the entry of a file in its
directory object.

Any links to the code will be highly appreciated.

Thanks and regards,
kanishk

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-30 Thread Sanjeev Bagewadi

Nathan,

Some answers inline...

Nathan Huisman wrote:


= PROBLEM

To create a disk storage system that will act as an archive point for
user data (Non-recoverable data), and also act as a back end storage
unit for virtual machines at a block level.

= BUDGET

Currently I have about 25-30k to start the project, more could be
allocated in the next fiscal year for perhaps a backup solution.

= TIMEFRAME

I have 8 days to cut a P.O. before our fiscal year ends.

= STORAGE REQUIREMENTS

5-10tb of redundant fairly high speed storage


= QUESTION #1

What is the best way to mirror two zfs pools in order to achieve a sort
of HA storage system? I don't want to have to physically swap my disks
into another system if any of the hardware on the ZFS server dies. If I
have the following configuration what is the best way to mirror these in
near real time?

BOX 1 (JBOD-ZFS) BOX 2 (JBOD-ZFS)

I've seen the zfs send and recieve commands but I'm not sure how well
that would work with a close to real time mirror.


If you want close to realtime mirroring (across pools in this case), AVS
would be a better option in my opinion.
Refer to : http://www.opensolaris.org/os/project/avs/Demos/AVS-ZFS-Demo-V1/




= QUESTION #2

Can ZFS be exported via iscsi and then imported as a disk to a linux
system and then be formated with another file system. I wish to use ZFS
as a block level file systems for my virtual machines. Specifically
using xen. If this is possible, how stable is this? How is error
checking handled if the zfs is exported via iscsi and then the block
device formated to ext3? Will zfs still be able to check for errors?
If this is possible and this all works, then are there ways to expand a
zfs iscsi exported volume and then expand the ext3 file system on the
remote host?


Yes, you can create volumes (ZVOLs) in a zpool and export them over iSCSI.
The ZVOL would guarantee the data consistency at the block level.

Expanding the ZVOL should be possible. However, I am not sure if/how
iSCSI behaves here.
You might need to try it out.
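
A rough sketch of the ZVOL + iSCSI part (pool and volume names are
placeholders, and this assumes the iSCSI target packages are installed ;
the shareiscsi property and iscsitadm are what I would start with, but do
verify them on your build) :

# zfs create -V 200g tank/vm01
# zfs set shareiscsi=on tank/vm01
# iscsitadm list target -v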



= QUESTION #3

How does zfs handle a bad drive? What process must I go through in
order to take out a bad drive and replace it with a good one?


# zpool replace poolname bad-drive new-good-drive

The other option would be to configure hot spares ; they will kick in
automatically when a bad drive is detected.
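
For example (pool and device names are placeholders) :

# zpool replace tank c1t3d0 c1t9d0
# zpool add tank spare c1t10d0 c1t11d0
# zpool status tank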



= QUESTION #4

What is a good way to back up this HA storage unit? Snapshots will
provide an easy way to do it live, but should it be dumped into a tape
library, or an third offsite zfs pool using zfs send/recieve or ?

= QUESTION #5

Does the following setup work?

BOX 1 (JBOD) - iscsi export - BOX 2 ZFS.

In other words, can I setup a bunch of thin storage boxes with low cpu
and ram instead of using sas or fc to supply the jbod to the zfs server?


Should be feasible. Just note that you would then need a robust LAN, and
it could get flooded.


Thanks and regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Contents of transaction group?

2007-04-09 Thread Sanjeev Bagewadi

Atul,

Atul Vidwansa wrote:

Hi,
   I have few questions about the way a transaction group is created.

1. Is it possible to group transactions related to multiple operations
in same group? For example, an rmdir foo followed by mkdir bar,
can these end up in same transaction group?
Each TXG is 5 seconds long (in normal cases, unless some operation
forcefully closes it).
So, it is quite possible that the 2 syscalls end up in the same TXG,
but it is not guaranteed.

If it has to be guaranteed then this logic will have to be built into
the VNODE ops code, i.e. the ZPL code. However, that would be tricky as
rmdir and mkdir are 2 different syscalls and I am not sure what locking
issues you would need to take care of.
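
If you want to watch the TXG boundaries on a live system, a rough DTrace
one-liner like this should do (assuming spa_sync()'s second argument is
the txg number, as in the current bits) :

# dtrace -n 'fbt::spa_sync:entry { printf("%Y  syncing txg %d", walltimestamp, arg1); }'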


2. Is it possible for an operation (say write()) to occupie multiple
transaction groups?

Yes.


3. Is it possible to know the thread id(s) for every commited txg_id?

The TXG is always synced by the txg threads. I am not sure why you would
want that.

Regards,
Sanjeev.



Regards,
-Atul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: simple Raid-Z question

2007-04-08 Thread Sanjeev Bagewadi

MC,

If you originally had 4 * 500 GB disks configured in RAID-Z, you cannot
add a single disk and grow the capacity of the pool (with the same
protection). This is not allowed.
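
What you can do instead (keeping RAID-Z protection) is add another
complete RAID-Z vdev to the pool, for example (device and pool names are
placeholders) :

# zpool add mypool raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0

That grows the pool, but as a second RAID-Z set rather than by widening
the existing one.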

Regards,
Sanjeev.

MC wrote:

Two conflicting answers to the same question?  I guess we need someone to break 
the tie :)

  

Hello,

I have been reading alot of good things about Raid-z,
but before I jump into it I have one unanswered
question i can't find a clear answer for.

Is it possible to enlarge the initial RAID size by
adding single drives later on?

If i start off with 4*500gb (should give me 1.5tb),
can I add one 500gb to the raid later, and will the
total size then grow 500gb and still have the same
protection?

 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  



--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Kstats

2007-03-27 Thread Sanjeev Bagewadi

Atul,

libkstat(3LIB) is the library.
man -s 3KSTAT kstat should give a good start.
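
From a script, the kstat(1M) command is the quickest way to pull those
counters (the vopstats_zfs name comes from Peter's mail ; the
per-filesystem kstats use the device id instead) :

# kstat -m unix -n vopstats_zfs
# kstat -p -m unix -n vopstats_zfs

For C code, kstat_open(), kstat_lookup() and kstat_data_lookup() from
libkstat are the corresponding entry points.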

Regards,
Sanjeev.

Atul Vidwansa wrote:

Peter,
   How do I get those stats programatically? Any clues?
Regards,
_Atul

On 3/27/07, Peter Tribble [EMAIL PROTECTED] wrote:

On 3/27/07, Atul Vidwansa [EMAIL PROTECTED] wrote:
 Hi,
Does ZFS has support for kstats? If I want to extract information
 like no of files commited to disk during an interval, no of
 transactions performed, I/O bandwidth etc, how can I get that
 information?

From the command line, look at the fsstat utility.

If you want the raw kstats then you need to look for ones
of the form 'unix:0:vopstats_*' where there are two forms:
with the name of the filesystem type (eg zfs or ufs) on the
end, or the device id of the individual filesystem.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Firewire/USB enclosures

2007-03-20 Thread Sanjeev Bagewadi

Mike,

We have used 4 disks (2 x 80GB disks and 2 x 250GB disks) over USB and
things worked well.

Hot plugging the disks was not all that smooth for us.

Other than that we had no issues using the disks. We used this setup for
demos at the FOSS 2007 conference at Bangalore, and it went through
several destructive tests over a period of 3 days and survived well.

(It never let us down in front of the customers :-)

The disks we used had individual enclosures, which was a bit clunky.
It would be nice to have a single enclosure for all the disks (which can 
power the disks).



Thanks and regards,
Sanjeev.

Bev Crair wrote:


Mike,
Take a look at
http://video.google.com/videoplay?docid=8100808442979626078&q=CSI%3Amunich



Granted, this was for demo purposes, but the team in Munich is clearly 
leveraging USB sticks for their purposes.

HTH,
Bev.

mike wrote:


I still haven't got any warm and fuzzy responses yet solidifying ZFS
in combination with Firewire or USB enclosures.

I am looking for 4-10 drive enclosures for quiet SOHO desktop-ish use.
I am trying to confirm that OpenSolaris+ZFS would be stable with this,
if exported out as JBOD and allow ZFS to manage each disk
individually.

Enclosure idea (choose one): 
http://fwdepot.com/thestore/default.php/cPath/1_88

Would be looking to use 750GB SATA2 drives, or IDE is fine too.

Would anyone be willing to speak up and give me some faith in this
before I invest money into a solution that won't work? I don't intend
on hot-plugging any of these devices, just using Firewire (or USB, if
I can find a big enclosure) since it is a cheap and reliable
interconnect (eSATA seems to be a little too new for use with
OpenSolaris unless I have some PCI-X slots)

Any help is appreciated. I'd most likely use a Shuttle XPC as the
head unit for all of this - it is quiet and small. (I'm looking to
downsize my beefy huge noisy heavy tower with limited space
availability) - obviously bandwidth on the bus would be limited the
more drives sharing the same cable. That would be my only design
constraint.

Thanks a ton. Again, any input (good, bad, ugly, personal experiences
or opinions) is appreciated A LOT!

- mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DMU interfaces

2007-03-05 Thread Sanjeev Bagewadi

Manoj,

Welcome back on the alias :-)

I don't think the interfaces are documented. However, referring to the
ZPL should be a good place to start.
The ZPL code interacts with the DMU, and obviously it is using the DMU
interfaces.

However, I am not sure whether there is any guarantee that they will not
change.


Thanks and regards,
Sanjeev.

Manoj Joseph wrote:


Hi,

I believe, ZFS, at least in the design ;) , provides APIs other than 
POSIX (for databases and other applications) to directly talk to the DMU.


Are such interfaces ready/documented? If this is documented somewhere, 
could you point me to it?


Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] suggestion: directory promotion to filesystem

2007-02-21 Thread Sanjeev Bagewadi

Adrian,

Seems like a cool idea to me :-) Not sure if there is anything of this 
kind being thought about...

Would be a good idea to file an RFE.

Regards,
Sanjeev

Adrian Saul wrote:

Not sure how technically feasible it is, but something I thought of while 
shuffling some files around my home server.  My poor understanding of ZFS 
internals is that the entire pool is effectively a tree structure, with nodes 
either being data or metadata.  Given that, couldn't ZFS just change a directory 
node to a filesystem with little effort, allowing me to do everything ZFS does 
with filesystems on a subset of my filesystem :)

Say you have some filesystems you created early on before you had a good idea 
of usage.  Say for example I made a large share filesystem and started filling 
it up with photos and movies and some assorted downloads.  A few months later I 
realise it would be so much nicer to be able to snapshot my movies and photos 
separately for backups, instead of doing the whole share.

Not hard to work around - zfs create and a mv/tar command and it is done... 
some time later.  If there was say a zfs graft directory newfs command, you 
could just break off the directory as a new filesystem and away you go - no 
copying, no risking cleaning up the wrong files etc.

Corollary - zfs merge - take a filesystem and merge it into an existing 
filesystem.

Just a thought - any comments welcome.
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Limit ZFS Memory Utilization

2007-02-06 Thread Sanjeev Bagewadi

Richard,

Richard L. Hamilton wrote:


If I understand correctly, at least some systems claim not to guarantee
consistency between changes to a file via write(2) and changes via mmap(2).
But historically, at least in the case of regular files on local UFS, since 
Solaris
used the page cache for both cases, the results should have been consistent.

Since zfs uses somewhat different mechanisms, does it still have the same
consistency between write(2) and mmap(2) that was historically present
(whether or not guaranteed) when using UFS on Solaris?
 


Yes, it does have the consistency. There is specific code to keep
the page cache (needed in the case of mmapped files) and the ARC caches
consistent.


Thanks and regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Meta data corruptions on ZFS.

2007-02-06 Thread Sanjeev Bagewadi

Masthan,



dudekula mastan [EMAIL PROTECTED] wrote:


Hi All,
 
In my test set up, I have one zpool of size 1000M bytes.
 

Is this the size given by zfs list ? Or is it the amount of disk space
that you had ?
The reason I ask is that ZFS/zpool takes up some amount of space for its
housekeeping.
So, if you add 1G worth of disk space to the pool, the effective space
available is a little less (a few MBs) than 1G.


On this zpool, my application writes 100 files each of size 10 MB.
 
First 96 files were written successfully with out any problem.


Here you are filling the FS to the brim. This is a border case, and the
copy-on-write nature of ZFS could lead to the behaviour that you are
seeing.

 
But the 97 file is not written successfully , it written only 5 MB

(the return value of write() call ).
 
Since it is short write my application tried to truncate it to

5MB. But ftruncate is failing with an erroe message saying that No
space on the devices.

This is expected because of the copy-on-write nature of ZFS. During the
truncate it is trying to allocate new disk blocks, probably to write the
new metadata, and fails to find them.

 
Have you people ever seen these kind of error message ?



Yes, there are others who have seen these errors.

 
After ftruncate failure I checked the size of 97 th file, it is

strange. The size is 7 MB but the expected size is only 5 MB.



Is there any particular reason that you are pushing the filesystem to
the brim ?
Is this part of some test ? Please help us understand what you are
trying to test.


Thanks and regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-11 Thread Sanjeev Bagewadi

Robert,

Comments inline...
Robert Milkowski wrote:

Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW Sanjeev  Robert,

JJWW Thanks guys. We put that in place last night and it seems to be doing
JJWW a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW of 4GB on the Thumper is enough. I would be interested in what the
JJWW other ZFS modules memory behaviors are. I'll take a perusal through
JJWW the archives. In general it seems to me that a max cap for ZFS whether
JJWW set through a series of individual tunables or a single root tunable
JJWW would be very helpful.

Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.

I guess (and it's really a guess only based on some emails here) that
in worst case scenario ZFS caches would consume about:

  arc_max + 3*arc_max + memory lost for fragmentation
  

This is not true from what I know :-) How did you get to this number ?

From my knowledge it uses :
c_max + (some memory for other caches)

NOTE : (some memory for other caches) is not as large as c_max. It is
probably just x% of it and not multiples of c_max.

So I guess with arc_max set to 1GB you can lose even 5GB (or more) and
currently only that first 1GB can be reclaimed automatically.
  

This doesn't seem right based on my knowledge of ZFS.

Regards,
Sanjeev.



  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-11 Thread Sanjeev Bagewadi

Jason,

Jason J. W. Williams wrote:

Hi Robert,

We've got the default ncsize. I didn't see any advantage to increasing
it outside of NFS serving...which this server is not. For speed the
X4500 is showing to be a killer MySQL platform. Between the blazing
fast procs and the sheer number of spindles, its performance is
tremendous. If MySQL cluster had full disk-based support, scale-out
with X4500s a-la Greenplum would be terrific solution.

At this point, the ZFS memory gobbling is the main roadblock to being
a good database platform.

Regarding the paging activity, we too saw tremendous paging of up to
24% of the X4500s CPU being used for that with the default arc_max.
After changing it to 4GB, we haven't seen anything much over 5-10%.
Remember that ZFS does not use the standard Solaris paging architecture
for caching.
Instead it uses the ARC for all its caching. And that is the reason
tuning the ARC should help in your case.

The zio_bufs that you referred to in the previous mail are the caches
used by the ARC for caching various things (including the metadata and
the data).

Thanks and regards,
Sanjeev.


Best Regards,
Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:

Hello Jason,

Thursday, January 11, 2007, 12:36:46 AM, you wrote:

JJWW Hi Robert,

JJWW Thank you! Holy mackerel! That's a lot of memory. With that 
type of a
JJWW calculation my 4GB arc_max setting is still in the danger zone 
on a
JJWW Thumper. I wonder if any of the ZFS developers could shed some 
light

JJWW on the calculation?

JJWW That kind of memory loss makes ZFS almost unusable for a 
database system.



If you leave ncsize at the default value then I believe it won't consume
that much memory.


JJWW I agree that a page cache similar to UFS would be much better.  
Linux
JJWW works similarly to free pages, and it has been effective enough 
in the
JJWW past. Though I'm equally unhappy about Linux's tendency to grab 
every

JJWW bit of free RAM available for filesystem caching, and then cause
JJWW massive memory thrashing as it frees it for applications.

Page cache won't be better - just better memory control for ZFS caches
is strongly desired. Unfortunately from time to time ZFS makes servers
to page enormously :(


--
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-09 Thread Sanjeev Bagewadi

Jason,

Apologies.. I missed this mail yesterday...

I am not too familiar with the options. Someone else will have to answer
this.


Thanks and regards,
Sanjeev.

Jason J. W. Williams wrote:


Sanjeev,

Could you point me in the right direction as to how to convert the
following GCC compile flags to Studio 11 compile flags? Any help is
greatly appreciated. We're trying to recompile MySQL to give a
stacktrace and core file to track down exactly why its
crashing...hopefully it will illuminate if memory truly is the issue.
Thank you very much in advance!

-felide-constructors
-fno-exceptions -fno-rtti

Best Regards,
Jason

On 1/7/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote:


Jason,

There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the 
system.

However, in your case probably it is not quick enough I guess.

One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).

You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.

Thanks and regards,
Sanjeev.

Jason J. W. Williams wrote:

 Hello,

 Is there a way to set a max memory utilization for ZFS? We're trying
 to debug an issue where the ZFS is sucking all the RAM out of the box,
 and its crashing MySQL as a result we think. Will ZFS reduce its cache
 size if it feels memory pressure? Any help is greatly appreciated.

 Best Regards,
 Jason
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hot Spare Behavior

2007-01-09 Thread Sanjeev Bagewadi

Rob,

It (the hot spare) should have kicked in. How long did you wait for it ?
Was there any IO happening on the pool ? Try doing some IO to the disk
and see if it kicks in.

Also, another point to note is the size of the hot spares. Please ensure
that the hot spares are of the same size as the mirrors. I think the hot
spares don't kick in if there is a size mismatch.

If none of the above works then we will have to take a closer look at
the details :-)


Regards,
Sanjeev.
Rob wrote:


I physically removed a disk (c3t8d0 used by ZFS 'pool01') from a 3310 JBOD 
connected to a V210 running s10u3 (11/06) and 'zpool status' reported this:

# zpool status
 pool: pool01
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
   the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
  see: http://www.sun.com/msg/ZFS-8000-D3
scrub: resilver completed with 0 errors on Mon Jan  8 15:56:20 2007
config:

   NAMESTATE READ WRITE CKSUM
   pool01  DEGRADED 0 0 0
 mirrorDEGRADED 0 0 0
   c2t4d0  ONLINE   0 0 0
   c3t8d0  UNAVAIL  0 0 0  cannot open
 mirrorONLINE   0 0 0
   c2t5d0  ONLINE   0 0 0
   c3t9d0  ONLINE   0 0 0
   spares
 c2t8d0AVAIL   
 c3t10d0   AVAIL 


Why doesn't ZFS automatically use one of the hot spares? Is this expected 
behavior or a bug?

Rob


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-07 Thread Sanjeev Bagewadi

Jason,

There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the
system.
However, in your case it is probably not quick enough, I guess.

One way of limiting the memory consumption would be to limit arc.c_max.
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
the memory available).
This is done when ZFS is loaded (arc_init()).

You should be able to change the value of arc.c_max through mdb and set
it to the value you want. Exercise caution while setting it. Make sure
you don't have active zpools during this operation.
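
For example, something along these lines (the address and the new value
below are only illustrative -- use the address that ::print -a reports on
your system) :

# mdb -kw
> arc::print -a struct arc c_max
fffffffffbc1a5b8 c_max = 0x2f9aa800
> fffffffffbc1a5b8/Z 0x40000000
arc+0x38:       0x2f9aa800      =       0x40000000
> $q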


Thanks and regards,
Sanjeev.

Jason J. W. Williams wrote:


Hello,

Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.

Best Regards,
Jason
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: zfs hot spare not automatically getting used

2006-11-29 Thread Sanjeev Bagewadi

Jim,

That is good news !! Let us know how it goes.


Regards,
Sanjeev.
PS : I am out of the office for a couple of days.

Jim Hranicky wrote:


OK, spun down the drives again. Here's that output:

 http://www.cise.ufl.edu/~jfh/zfs/threads
   



I just realized that I changed the configuration, so that doesn't reflect 
a system with spares, sorry. 

However, I reinitialized the pool and spun down one of the drives and 
everything is working as it should:


pool: zmir
state: DEGRADED
   status: One or more devices could not be opened.  Sufficient replicas exist 
for
   the pool to continue functioning in a degraded state.
   action: Attach the missing device and online it using 'zpool online'.
  see: http://www.sun.com/msg/ZFS-8000-D3
scrub: resilver completed with 0 errors on Wed Nov 29 16:29:53 2006
   config:

   NAME  STATE READ WRITE CKSUM
   zmir  DEGRADED 0 0 0
 mirror  DEGRADED 0 0 0
   c0t0d0ONLINE   0 0 0
   spare DEGRADED 0 0 0
 c3t1d0  UNAVAIL 10 28.88 0  cannot open
 c3t3d0  ONLINE   0 0 0
   spares
 c3t3d0  INUSE currently in use
 c3t4d0  AVAIL

   errors: No known data errors

I'm just not sure if it will always work. 


I'll try a few different configs and see what happens.


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs hot spare not automatically getting used

2006-11-28 Thread Sanjeev Bagewadi

Jim,

James F. Hranicky wrote:


Sanjeev Bagewadi wrote:
 


Jim,

We did hit similar issue yesterday on build 50 and build 45 although the
node did not hang.
In one of the cases we saw that the hot spare was not of the same
size... can you check
if this true ?
   



It looks like they're all slightly different sizes.
 

Interestingly, during our demo runs at the recent FOSS event
(http://foss.in) we had no issues with this (snv build 45). We had a
RAIDZ config of 3 disks and 1 spare disk.
And what we found was that the spare kicked in.

Here is how we tried it :
- Plugged out one of the 3 disks.
- Kicked off a write to the FS on the pool (i.e. dd to a new file in the FS).
- The spare kicked in after a while. I guess there is some delay in the
detection. I am not sure if there is some threshold beyond which it
kicks in. Need to check the code for this.


 


Do you have a threadlist from the node when it was hung ? That would
reveal some info.
   



Unfortunately I don't. Do you mean the output of

::threadlist -v
 


Yes. That would be useful. Also, check the zpool status output.


from

mdb -k
 


Run the following :
# echo "::threadlist -v" | mdb -k > /var/tmp/threadlist.out

Regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs hot spare not automatically getting used

2006-11-21 Thread Sanjeev Bagewadi

Jim,

We did hit a similar issue yesterday on build 50 and build 45, although
the node did not hang.
In one of the cases we saw that the hot spare was not of the same
size... can you check if this is true ?

Do you have a threadlist from the node when it was hung ? That would
reveal some info.


Thanks and regards,
Sanjeev.

Jim Hranicky wrote:


OS: Nevada build 51 x86

I recently upgraded Sol10x86 6/6 to Nevada build 51. I'm testing out zfs
on a machine and set up a pool with a mirror of two drives and two hot
spares. I then spun down a drive in the mirror which caused the machine 
to hang, so I rebooted the host. After a reboot, the mirror came up in 
degraded mode but neither of the spares were automatically used. 


Is there something I need to tweak to get this to work?


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48

2006-11-15 Thread Sanjeev Bagewadi

Tomas,

Apologies for the delayed response...

Tomas Ögren wrote:


Interesting ! So, it is not the ARC which is consuming too much memory.
It is some other piece (not sure if it belongs to ZFS) which is causing
the crunch...

Or the other possibility is that the ARC ate up too much and caused a
near-crunch situation, the kmem layer hit back, and that caused the ARC
to free up its buffers (hence the no_grow flag being enabled).
So, it (the ARC) could be oscillating between caching a lot and then
purging the caches.

You might want to keep track of these values (ARC size and the no_grow
flag) and see how they change over a period of time. This would help us
understand the pattern.
   



I would guess it grows after boot until it hits some max and then stays
there.. but I can check it out..
 

No, that is not true. It shrinks when there is memory pressure. The
values of 'c' and 'p' are adjusted accordingly.

And if we know it is the ARC which is causing the crunch, we could
manually change the value of c_max to a comfortable value and that would
limit the size of the ARC.
   



But in the ZFS world, DNLC is part of the ARC, right?
 

Not really... ZFS uses the regular DNLC for lookup optimization. 
However, the metadata/data

is cached in the ARC.


My original question was how to get rid of data cache, but keep
metadata cache (such as DNLC)...
 

This is a good question. AFAIK the ARC does not really differentiate
between metadata and data.
So, I am not sure if we can control it. However, as I mentioned above,
ZFS still uses DNLC caching.


 


However, I would suggest that you try it out on a non-production
machine first.

By default, c_max is set to 75% of physmem and that is the hard limit.
c is the soft limit, and the ARC will try to grow up to c. The value of
c is adjusted when there is a need to cache more, but it will never
exceed c_max.

Regarding the huge number of reads, I am sure you have already tried
disabling the VDEV prefetch.
If not, it is worth a try.
   



That was part of my original question, how? :)
 

Apologies :-) I was digging around the code and I find that
zfs_vdev_cache_bshift is the one which controls the amount that is read.
Currently it is set to 16. So, we should be able to modify this and
reduce the prefetch.
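
Something along these lines should show/change it on a live system
(untested, and 13 here -- i.e. 8K reads -- is just an example value) :

# echo "zfs_vdev_cache_bshift/D" | mdb -k
zfs_vdev_cache_bshift:          16
# echo "zfs_vdev_cache_bshift/W 0t13" | mdb -kw
zfs_vdev_cache_bshift:          0x10            =       0xd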

However, I will have to double check with more people and get back to you.

Thanks and regards,
Sanjeev.


/Tomas
 




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48

2006-11-12 Thread Sanjeev Bagewadi

Tomas,

comments inline...


Tomas Ögren wrote:


On 10 November, 2006 - Sanjeev Bagewadi sent me these 3,5K bytes:

 


1. DNLC-through-ZFS doesn't seem to listen to ncsize.

The filesystem currently has ~550k inodes and large portions of it is
frequently looked over with rsync (over nfs). mdb said ncsize was 
about

68k and vmstat -s  said we had a hitrate of ~30%, so I set ncsize to
600k and rebooted.. Didn't seem to change much, still seeing 
hitrates at

about the same and manual find(1) doesn't seem to be that cached
(according to vmstat and dnlcsnoop.d).
When booting, the following message came up, not sure if it matters 
or not:

NOTICE: setting nrnode to max value of 351642
NOTICE: setting nrnode to max value of 235577

Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is 
that

it has its own implementation which is integrated with the rest of the
ZFS cache which throws out metadata cache in favour of data cache.. or
something.. 
   


Current memory usage (for some values of usage ;):
# echo ::memstat|mdb -k
Page SummaryPagesMB  %Tot
     
Kernel  95584   746   75%
Anon20868   163   16%
Exec and libs1703131%
Page cache   1007 71%
Free (cachelist)   97 00%
Free (freelist)  7745606%

Total  127004   992
Physical   125192   978


/Tomas


   


This memory usage shows nearly all of memory consumed by the kernel
and probably by ZFS.  ZFS can't add any more DNLC entries due to lack of
memory without purging others. This can be seen from  the number of
dnlc_nentries being way less than ncsize.
I don't know if there's a DMU or ARC bug to reduce the memory footprint
of their internal structures for situations like this, but we are 
aware of the

issue.
 


Can you please check the zio buffers and the arc status ?

Here is how you can do it :
- Start mdb : ie. mdb -k

   


::kmem_cache
 

- In the output generated above check the amount consumed by the 
zio_buf_*, arc_buf_t and

arc_buf_hdr_t.
   



ADDR NAME  FLAG  CFLAG  BUFSIZE  BUFTOTL

030002640a08 zio_buf_512    02  512   102675
030002640c88 zio_buf_1024  0200 02 1024   48
030002640f08 zio_buf_1536  0200 02 1536   70
030002641188 zio_buf_2048  0200 02 2048   16
030002641408 zio_buf_2560  0200 02 25609
030002641688 zio_buf_3072  0200 02 3072   16
030002641908 zio_buf_3584  0200 02 3584   18
030002641b88 zio_buf_4096  0200 02 4096   12
030002668008 zio_buf_5120  0200 02 5120   32
030002668288 zio_buf_6144  0200 02 61448
030002668508 zio_buf_7168  0200 02 7168 1032
030002668788 zio_buf_8192  0200 02 81928
030002668a08 zio_buf_10240 0200 02102408
030002668c88 zio_buf_12288 0200 02122884
030002668f08 zio_buf_14336 0200 0214336  468
030002669188 zio_buf_16384 0200 0216384 3326
030002669408 zio_buf_20480 0200 0220480   16
030002669688 zio_buf_24576 0200 02245763
030002669908 zio_buf_28672 0200 0228672   12
030002669b88 zio_buf_32768 0200 0232768 1935
03000266c008 zio_buf_40960 0200 0240960   13
03000266c288 zio_buf_49152 0200 02491529
03000266c508 zio_buf_57344 0200 02573447
03000266c788 zio_buf_65536 0200 0265536 3272
03000266ca08 zio_buf_73728 0200 0273728   10
03000266cc88 zio_buf_81920 0200 02819207
03000266cf08 zio_buf_90112 0200 02901125
03000266d188 zio_buf_98304 0200 02983047
03000266d408 zio_buf_1064960200 02   106496   12
03000266d688 zio_buf_1146880200 02   1146886
03000266d908 zio_buf_1228800200 02   1228805
03000266db88 zio_buf_1310720200 02   131072   92

030002670508 arc_buf_hdr_t  00  12811970
030002670788 arc_buf_t  00   40 7308

 


- Dump the values of arc

   


arc::print struct arc
 



 

arc::print struct arc

Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48

2006-11-10 Thread Sanjeev Bagewadi

Comments inline...

Neil Perrin wrote:


1. DNLC-through-ZFS doesn't seem to listen to ncsize.

The filesystem currently has ~550k inodes and large portions of it is
frequently looked over with rsync (over nfs). mdb said ncsize was 
about

68k and vmstat -s  said we had a hitrate of ~30%, so I set ncsize to
600k and rebooted.. Didn't seem to change much, still seeing 
hitrates at

about the same and manual find(1) doesn't seem to be that cached
(according to vmstat and dnlcsnoop.d).
When booting, the following message came up, not sure if it matters 
or not:

NOTICE: setting nrnode to max value of 351642
NOTICE: setting nrnode to max value of 235577

Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is 
that

it has its own implementation which is integrated with the rest of the
ZFS cache which throws out metadata cache in favour of data cache.. or
something.. 



Current memory usage (for some values of usage ;):
# echo ::memstat|mdb -k
Page SummaryPagesMB  %Tot
     
Kernel  95584   746   75%
Anon20868   163   16%
Exec and libs1703131%
Page cache   1007 71%
Free (cachelist)   97 00%
Free (freelist)  7745606%

Total  127004   992
Physical   125192   978


/Tomas
 


This memory usage shows nearly all of memory consumed by the kernel
and probably by ZFS.  ZFS can't add any more DNLC entries due to lack of
memory without purging others. This can be seen from  the number of
dnlc_nentries being way less than ncsize.
I don't know if there's a DMU or ARC bug to reduce the memory footprint
of their internal structures for situations like this, but we are 
aware of the

issue.


Can you please check the zio buffers and the arc status ?

Here is how you can do it :
- Start mdb : ie. mdb -k

 ::kmem_cache

- In the output generated above check the amount consumed by the 
zio_buf_*, arc_buf_t and

 arc_buf_hdr_t.

- Dump the values of arc

 arc::print struct arc

- This should give you something like the output below.
-- snip--
 arc::print struct arc
{
   anon = ARC_anon
   mru = ARC_mru
   mru_ghost = ARC_mru_ghost
   mfu = ARC_mfu
   mfu_ghost = ARC_mfu_ghost
   size = 0x3e2          -- tells you the current memory consumed by the
                            ARC (including the memory consumed for the
                            cached data, i.e. the zio_buf_* buffers)

   p = 0x1d06a06
   c = 0x400
   c_min = 0x400
   c_max = 0x2f9aa800
   hits = 0x2fd2
   misses = 0xd1c
   deleted = 0x296
   skipped = 0
   hash_elements = 0xa85
   hash_elements_max = 0xcc0
   hash_collisions = 0x173
   hash_chains = 0xbe
   hash_chain_max = 0x2
   no_grow = 0           -- this would be set to 1 if we have a memory crunch

}
-- snip --

And as Neil pointed out, we would probably need some way of limiting
the ARC consumption.


Regards,
Sanjeev.



Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss