[zfs-discuss] Re: Re: ZFS Support for remote mirroring

2007-05-10 Thread Anantha N. Srirama
To clarify further: the EMC note "EMC Host Connectivity Guide for Solaris" 
indicates that ZFS is supported on 11/06 (aka Update 3) and onwards. However, 
they sneak in a cautionary disclaimer that the snapshot and clone features are 
supported by Sun. If one reads it carefully it appears that they do support ZFS 
(not that they should care which filesystem sits on their disks), but they want 
to make a big deal about ZFS by inserting this superfluous comment. Jealousy, I 
suppose; I don't see a comparable disclaimer against VxFS, for example.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Extremely long ZFS destroy operations

2007-05-09 Thread Anantha N. Srirama
We're running Solaris 10 Update 3 (aka 11/06) on an E2900 (24 x 96). On this 
server we've been running a large SAS environment totaling well over 2TB. We 
also take daily snapshots of the filesystems and clone them for use by a local 
zone. This setup has been in use for well over 6 months.

Starting Monday I began making a second clone from the same snapshot to 
facilitate quick access to a day-old image of the data in the global zone. I've 
noticed that my ZFS destroy operations are inordinately long with the 
second clone in place (I'm using 'zfs destroy -Rf snapname'). The degradation 
is close to an order of magnitude; my destroys now take 6-7 minutes where they 
took under a minute in the past.
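
For reference, the daily sequence looks roughly like this (the pool, snapshot, 
and clone names below are illustrative, not our actual names):

# nightly snapshot of the SAS data filesystem
zfs snapshot mtdc/u001@nightly

# clone for the local zone (the original setup)
zfs clone mtdc/u001@nightly mtdc/bluenile/cloneu001

# second clone for day-old access from the global zone (the new step)
zfs clone mtdc/u001@nightly mtdc/global/cloneu001

# next day: tear everything down before re-snapshotting; with the second
# clone in place this destroy is what now takes 6-7 minutes
zfs destroy -Rf mtdc/u001@nightly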

Any thoughts? Thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS Support for remote mirroring

2007-05-09 Thread Anantha N. Srirama
For whatever reason the EMC notes (on PowerLink) suggest that ZFS is not supported 
on their arrays. If you are going to use a ZFS filesystem on top of an EMC array, 
be warned about support issues.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Extremely long ZFS destroy operations

2007-05-09 Thread Anantha N. Srirama
I've since stopped making the second clone, having realized that 
.zfs/snapshot/snapname still exists after the clone operation completes. 
So my need for the local clone is met by direct access to the snapshot.

However, the question about the poor destroy performance still stands. It is quite 
possible that we will create another clone for reasons beyond my original one.

Why is the destroy so slow with the second clone in play? Thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS performance with Oracle

2007-03-18 Thread Anantha N. Srirama
I'm sorry dude, I can't make head or tail of your post. What is your point?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Need help making lsof work with ZFS

2007-02-18 Thread Anantha N. Srirama
I think so. After all, there are features shipped which are not fully 
baked/guaranteed, like send/receive. Isn't shipping the header files better 
than letting developers guess their structure and possibly make mistakes? Of 
course a developer can compile against the OpenSolaris source, but it is far easier to 
compile against the version shipped by Sun. In this age of FOSS a lot of people 
expect to download and compile the source, and that won't be possible with tools 
that interact with ZFS, right?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Need help making lsof work with ZFS

2007-02-13 Thread Anantha N. Srirama
I contacted the author of 'lsof' regarding the missing ZFS support. The command 
works but fails to display any files opened by a process in a ZFS 
filesystem. He indicates that the required ZFS kernel structure definitions 
(header files) are not shipped with the OS. He further indicated that he 
rummaged through the OpenSolaris source tree and the files there don't match either 
Solaris 10 Update 2 or Update 3.

Can one of the ZFS maestros point me in the direction of where these files can 
be found? I find it hard to believe that the header files are not shipped with 
the OS.

Thanks; any help will be appreciated, since I think y'all agree that 'lsof' is an 
invaluable tool. The sooner it is available, the better for ZFS users.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Need help making lsof work with ZFS

2007-02-13 Thread Anantha N. Srirama
I did find zfs.h and libzfs.h (thanks Eric). However, when I try to compile the 
latest version (4.87C) of lsof it reports the following files missing: dmu.h, 
zfs_acl.h, zfs_debug.h, zfs_rlock.h, zil.h, spa.h, zfs_context.h, zfs_dir.h, 
zfs_vfsops.h, zio.h, txg.h, zfs_ctldir.h, zfs_ioctl.h, zfs_znode.h, zio_impl.h.

I looked on my server, which has the full cluster of Solaris 10 Update 2 
installed, and can't find these files. Thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Disk Failure Rates and Error Rates -- ( Off topic: Jim Gray lost at sea)

2007-02-12 Thread Anantha N. Srirama
Here's another website working on his rescue; my prayers are for a safe return 
of this CS icon.

http://www.helpfindjim.com/
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Re: ZFS or UFS - what to do?

2007-01-28 Thread Anantha N. Srirama
You're right that storage-level snapshots are filesystem agnostic. I'm not sure 
why you believe you won't be able to restore individual files from a NetApp 
snapshot. In the case of ZFS you'd take a periodic snapshot and use it to 
restore files; in the case of NetApp you can do the same (of course you have 
the additional step of mounting the new snapshot volume). Is this convenience 
tipping the scales for you to pursue ZFS?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: ZFS or UFS - what to do?

2007-01-28 Thread Anantha N. Srirama
Agreed; I guess I didn't articulate my point/thought very well. The best config 
is to present JBODs and let ZFS provide the data protection. This has been a 
very stimulating conversation thread; it is shedding new light on how best to 
use ZFS.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Anantha N. Srirama
I've used ZFS since July/August 2006 when Solaris 10 Update 2 came out (the first 
release to integrate ZFS). I've used it extensively on three servers (an E25K domain and two 
E2900s); two of them are in production. I've had over 3TB of storage from 
an EMC SAN under ZFS management for no less than 6 months. Like your 
configuration, we've deferred data redundancy to the SAN. My observations are:

1. ZFS is stable to a very large extent. There are two known issues that I'm 
aware of:
  a. You can end up in an endless 'reboot' cycle when you've a corrupt zpool. I 
came across this when I had data corruption due to an HBA mismatch with the EMC SAN. 
This mismatch injected corruption in transit and the EMC faithfully wrote the 
bad data; upon reading that bad data ZFS threw up all over the floor for that 
pool. There is a documented workaround to snap out of the 'reboot' cycle; I've 
not checked whether this is fixed in the 11/06 Update 3.
  b. Your server will hang when one of the underlying disks disappears. In our 
case we had a T2000 running 11/06 with a mirrored zpool across two internal 
drives. When we pulled one of the drives abruptly the server simply hung. I 
believe this is a known bug; is there a workaround?

2. When you've I/O operations that either request fsync or open files with the 
O_DSYNC option, coupled with high I/O, ZFS will choke. It won't crash, but 
filesystem I/O runs like molasses on a cold morning.

All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no 
comments on NFS. I strongly recommend that you use ZFS data redundancy (raidz, raidz2, 
or mirror) and simply delegate to the Engenio the job of striping the data for performance.
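
To illustrate what I mean (the LUN names below are placeholders, not our devices), 
something along these lines lets the array stripe for performance while ZFS owns 
the redundancy:

# let ZFS own the redundancy: mirror pairs of SAN LUNs
zpool create datapool \
    mirror c6t0d0 c6t1d0 \
    mirror c6t2d0 c6t3d0

# or trade capacity differently with raidz/raidz2
zpool create datapool raidz2 c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0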
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Can you turn on zfs compression when the fs is already populated?

2007-01-24 Thread Anantha N. Srirama
I've used the compression feature for quite a while and you can flip it back and 
forth without any problem. When you turn compression on, nothing happens to the 
existing data. However, when you start updating your files all new blocks will 
be compressed, so it is possible for a file to be composed of both 
compressed and uncompressed blocks!
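
For example, roughly (the dataset name is just an example):

# turn compression on for an already-populated filesystem
zfs set compression=on mtdc/u001

# existing blocks stay as they are; only newly written blocks are compressed
zfs get compression,compressratio mtdc/u001

# turning it back off is just as safe; blocks that were written compressed
# remain readable
zfs set compression=off mtdc/u001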
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Converting home directory from ufs to zfs

2007-01-24 Thread Anantha N. Srirama
No such facility exists to automagically convert an existing UFS filesystem to 
ZFS. You have to create a new ZFS pool/filesystem and then move your data.
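
Something along these lines should work (the device name and mountpoints below 
are placeholders, and assume you have a spare disk for the new pool):

# create the new pool and a filesystem for the home directories
zpool create homepool c2t0d0
zfs create homepool/home
zfs set mountpoint=/export/home2 homepool/home

# copy the data over from the old UFS home directory, preserving
# permissions and sparse files
ufsdump 0f - /export/home | (cd /export/home2 && ufsrestore rf -)

# once verified, retire the UFS filesystem and move the ZFS mountpoint
umount /export/home
zfs set mountpoint=/export/home homepool/home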
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: How much do we really want zpool remove?

2007-01-18 Thread Anantha N. Srirama
I can vouch for this situation. I had to go through a long maintenance window to 
accomplish the following:

- 50 x 64GB drives in a zpool; needed to separate 15 of them out due to 
performance issues. There was no need to increase storage capacity.

Because I couldn't yank 15 drives from the existing pool to create a UFS 
filesystem, I had to evacuate the entire 50-disk pool, recreate a new pool 
and the UFS filesystem, and then repopulate the filesystems.

I think this feature will add to the adoption rate of ZFS. However, I feel that 
it shouldn't be at the top of the 'to-do' list. I'll trade this feature for 
some of the performance enhancements that've been discussed on this group.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Heavy writes freezing system

2007-01-17 Thread Anantha N. Srirama
Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an 
incorrect bug.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Heavy writes freezing system

2007-01-17 Thread Anantha N. Srirama
Bag-o-tricks-r-us, I suggest the following in such a case:

- Two ZFS pools
  - One for production
  - One for education
  - Isolate the LUNs feeding the pools if possible; don't share spindles. 
Remember that on EMC/Hitachi you've logical LUNs created by striping/concatenating 
carved-up physical disks, so you could have two LUNs that share the same 
spindle. Don't believe one word from your storage admin about "we've got lots of 
cache to abstract the physical structure"; Oracle can push any storage 
sub-system over the edge. Almost all of the storage vendors prevent one LUN 
from flooding the cache with writes; EMC gives no more than 8x the initial 
allocation of cache (total cache/total disk space) and after that it'll stall 
your writes until destage is complete.

- At least two ZFS filesystems under the production pool
  - One for online redo logs and control files. If need be you can further 
segregate them onto two separate ZFS filesystems.
  - One for db files. If need be you can isolate further by data, index, temp, 
archived redo, ...
  - Don't host the 'temp' on ZFS; just feed it plain old UFS or raw disk.
  - Match your ZFS recordsize to your DB blocksize * multiblock read 
count (a sketch follows this list). Don't do this for the index filesystem, just the filesystem hosting data.
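
Here is a rough sketch of the recordsize piece (pool/filesystem names and the 
8K-block/16-multiblock numbers are illustrative):

# data files: recordsize = db_block_size * db_file_multiblock_read_count,
# e.g. 8K blocks * 16 = 128K (which happens to be the ZFS default)
zfs create prodpool/oradata
zfs set recordsize=128k prodpool/oradata

# online redo logs and control files on their own filesystem(s)
zfs create prodpool/oraredo

# leave the index filesystem at the default recordsize
zfs create prodpool/oraindex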

Rinse and repeat for your education ZFS pool. This will give you substantial 
isolation and improvement, sufficient to buy you time to plan out a 
better deployment strategy given that you're under the gun now.

Another thought: while ZFS works out its kinks, why not use BCV or 
ShadowCopy or whatever IBM calls it to create the education instance? This would 
reduce a tremendous amount of I/O.

Just this past weekend I re-did our SAS server to relocate [b]just[/b] the SAS 
work area to good ol' UFS and the payback is tremendous; not one complaint 
about performance 3 days in a row (we used to hear daily complaints). By taking 
care of your online redo logs and control files (maybe skipping ZFS for them 
altogether and running them on UFS) you'll breathe easier.

BTW, I'm curious: what application using Oracle is creating more than a million 
files?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Heavy writes freezing system

2007-01-17 Thread Anantha N. Srirama
I did some straight up Oracle/ZFS testing but not on Zvols. I'll give it a shot 
and report back, next week is the earliest.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Extremely poor ZFS perf and other observations

2007-01-13 Thread Anantha N. Srirama
I'm observing the following behavior in our environment (Sol10U2, E2900, 24x96, 
2x2Gbps, ...):

- I've a compressed ZFS filesystem where I'm creating a large tar file. I 
notice that the tar process is running fine (accumulating CPU, truss shows 
writes, ...) but for whatever reason the timestamp on the file doesn't change, 
nor does the file size. The same is true for the 'zpool list' output; the 
usage numbers don't change for minutes at a time.

- I started a tar job writing to the compressed ZFS filesystem, reading from another 
compressed ZFS filesystem. At the same time I started copying files from 
another ZFS filesystem (same pool & same attributes) to a remote server (GigE 
connection) using scp, writing to a UFS filesystem. [b]Guess what? My scp over 
the wire beat the pants off the local ZFS tar session writing to a 2x2Gbps 
SAN and EMC disks![/b]

[b]I'm beginning to develop serious reservations about ZFS performance, 
especially with the compress feature turned on.[/b]
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option

2007-01-09 Thread Anantha N. Srirama
I'll see if I can confirm what you are suggesting. Thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option

2007-01-09 Thread Anantha N. Srirama
I've some important information that should shed some light on this behavior.

This evening I created a new filesystem across the very same 50 disks, including 
the COMPRESS attribute. My goal was to isolate some workload to the new 
filesystem, and I started moving a 100GB directory tree over to the new FS. While 
I was copying I was averaging around 25MB/S read and 25MB/S write, as expected. 
[b]Then I opened 'vi' and wanted to write out a new file in the new filesystem, 
and what I saw was shocking: my reads remained the same but my writes shot up to 
the 150+MB/S range. This abnormal I/O pattern continued until 'vi' returned 
from the write request.[/b] Here is the 'zpool iostat mtdc 30' output:

          capacity     operations    bandwidth
pool    used  avail   read  write   read  write
------  ----  -----  -----  -----  -----  -----
mtdc    806G  2.48T     38    173  1.93M  7.52M
mtdc    806G  2.48T    188    228  15.0M  8.78M
mtdc    807G  2.48T    266    624  14.0M  16.5M
mtdc    807G  2.48T    286    670  17.1M  14.5M
mtdc    807G  2.48T    293  1.21K  18.2M  98.4M  -- vi activity, note mismatch in r/w rates
mtdc    808G  2.48T    457    560  35.5M  24.2M
mtdc    809G  2.48T    405    504  31.7M  26.3M
mtdc    809G  2.48T    328  1.37K  25.2M   152M  -- vi activity, note mismatch in r/w rates
mtdc    810G  2.48T    428    671  33.0M  48.0M
mtdc    811G  2.48T    463    500  35.9M  26.4M
mtdc    811G  2.48T    207  1.39K  16.5M   154M  -- vi activity, note mismatch in r/w rates
mtdc    812G  2.48T    310    878  23.9M  77.7M
mtdc    813G  2.48T    362    494  26.1M  25.3M
mtdc    813G  2.48T    381  1.05K  26.8M   103M
mtdc    814G  2.48T    347  1.33K  25.0M   135M
mtdc    815G  2.48T    288  1.38K  21.7M   150M
mtdc    815G  2.48T    425    513  32.7M  25.8M
mtdc    816G  2.47T    413    515  30.2M  25.1M
mtdc    817G  2.47T    341    512  21.9M  25.1M
mtdc    818G  2.47T    293    529  18.5M  25.5M
mtdc    818G  2.47T    344    508  23.4M  24.7M
mtdc    819G  2.47T    442    512  33.4M  24.1M
mtdc    820G  2.47T    385    483  28.3M  24.4M
mtdc    820G  2.47T    372    483  24.7M  24.7M
mtdc    821G  2.47T    347    535  23.0M  24.2M
mtdc    821G  2.47T    290    497  17.9M  24.9M
mtdc    823G  2.47T    349    517  20.0M  24.1M
mtdc    823G  2.47T    399    512  21.2M  24.5M
mtdc    824G  2.47T    383    612  19.3M  17.7M
mtdc    824G  2.47T    390    614  14.2M  17.5M
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Puzzling ZFS behavior with COMPRESS option

2007-01-08 Thread Anantha N. Srirama
Our setup:

- E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06)
- 2 x 2Gbps FC HBAs
- EMC DMX storage
- 50 x 64GB LUNs configured in 1 ZFS pool
- Many filesystems created with compression enabled; specifically, I've one that is 
768GB

I'm observing the following puzzling behavior:

- We are currently creating a large (1.4TB) and sparse dataset; most of the 
dataset contains repeating blanks (default/standard SAS dataset behavior).
- 'ls -l' reports the file size as 1.4+TB and 'du -sk' reports the actual on-disk 
usage at around 65GB (see the sketch below).
- My I/O on the system is pegged at 150+MB/S as reported by zpool iostat, and 
I've confirmed the same with iostat.
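
For anyone who wants to look at the same numbers, this is roughly what I'm 
comparing (file and dataset names are examples):

# apparent size as the application sees it
ls -lh /u099/sasdata/large_dataset.sas7bdat

# actual on-disk usage after compression
du -sk /u099/sasdata/large_dataset.sas7bdat

# compression ratio for the whole filesystem as ZFS reports it
zfs get compressratio mtdc/u099

# aggregate pool I/O while the dataset is being written
zpool iostat mtdc 5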

This is very confusing:
 
- ZFS is doing very good compression, as reported by the ratio of on-disk versus 
reported size of the file (1.4TB vs 65GB).
- [b]Why on God's green earth am I observing such high I/O when ZFS is indeed 
compressing?[/b] I can't believe that the program is actually generating I/O at 
the rate of (150MB/S * compressratio).

Any thoughts?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option

2007-01-08 Thread Anantha N. Srirama
Quick update: since my original post I've confirmed via DTrace (the rwtop script in 
the DTraceToolkit) that the application is not generating 150MB/S * compressratio of I/O. 
What then is causing this much I/O on our system?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS behavior under heavy load (I/O that is)

2006-12-13 Thread Anantha N. Srirama
Thanks, I just downloaded Update 3 and hopefully the problem will go away.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Performance problems during 'destroy' (and bizzare Zone problem as well)

2006-12-12 Thread Anantha N. Srirama
[b]Setting:[/b]
  We've been operating in the following setup for well over 60 days:

 - E2900 (24 x 96)
 - 2 x 2Gbps FC to EMC SAN
 - Solaris 10 Update 2 (06/06)
 - ZFS with compression turned on
 - Global zone + 1 local zone (sparse)
 - Local zone is fed ZFS clones from the global zone

[b]Daily Routine[/b]
 - Shutdown the local zone
 - Recreate the ZFS clones
 - Restart the local zone
 - End-to-end timing for this refresh is anywhere between 5 and 30 minutes. The bulk 
of the time is spent in the ZFS 'destroy' phase (a sketch of the refresh follows this list).
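
In script form the refresh amounts to roughly the following (the dataset and 
snapshot names are illustrative):

#!/bin/sh
# nightly clone refresh for the local zone
zoneadm -z bluenile halt

# tear down yesterday's clone and snapshot; this is the slow phase
zfs destroy -Rf mtdc/u001@daily

# take a fresh snapshot and clone it back for the zone
zfs snapshot mtdc/u001@daily
zfs clone mtdc/u001@daily mtdc/bluenile/cloneu001

zoneadm -z bluenile boot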

[b]Problem[/b]
 - We had extensive read/write activity in the global and local zones 
yesterday. I estimate that we wrote 1/4 of one large ZFS filesystem, ~160GB of 
writes.
 - This morning we had a fair amount of activity on the system when the refresh 
started; zpool was reporting around 150MB/S of writes.
 - Our 'zfs destroy' commands took what I consider 'normal'; the FS that was 
fielding the bulk of the I/O took 15 minutes. During this time everything was 
crawling, or more accurately had come to a dead stop. A simple 'rm' would hang. I've 
reported this problem to the forum in the past. I also believe the fix for the 
problem is in Update 3 for Solaris 10, right?
 - [b]Surprisingly, today the ZFS 'snapshot & clone' took an inordinate amount of 
time. I observed that each snapshot & clone together took 10+ minutes. In 
the past the same activity has taken no more than a few seconds, even during 
busy times. The total end-to-end timing for all snapshots/clones was a whopping 
1:44:00!!![/b]
 - Even more surprising was that the local zone refused to start up (zoneadm -z 
bluenile boot) with no error messages.
 - I was able to start the zone only an hour or so after the completion 
of the ZFS commands.

[b]Questions:[/b]
 - Why is the destroy phase taking so long?
 - What can explain the unduly long snapshot/clone times?
 - Why didn't the zone start up?
 - More surprisingly, why did the zone start up after an hour?

Thanks in advance.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS behavior under heavy load (I/O that is)

2006-12-12 Thread Anantha N. Srirama
I'm observing the following behavior on our E2900 (24 x 96 config, 2 FCs, and 
...). I've a large filesystem (~758GB) with compression on. When this 
filesystem is under heavy load (150MB/S) I have problems saving files in 'vi'. I 
posted here about it and recall that the issue is addressed in Sol10U3. This 
morning I observed another variation of this problem, as follows:

- Create a file in 'vi' and save it; the session hangs as if it is waiting for 
the write to complete.
- In another session you'll observe the write from 'vi' is in fact complete, as 
evidenced by the contents of the file.

Am I repeating myself here, or is this a different problem altogether?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-11-28 Thread Anantha N. Srirama
Oh my, one day after I posted my horror story another one strikes. This is 
validation of the design objectives of ZFS; it looks like this type of thing 
happens more often than not. In the past we'd have just attributed this type of 
problem to some application-induced corruption; now ZFS is pinning the problem 
squarely on the storage sub-system.

If you didn't use any ZFS redundancy then your data is DONE, as the support 
person indicated. Make sure you follow the instructions in the ZFS FAQ, 
otherwise your server will end up in an endless 'panic-reboot cycle'.

Don't shoot the messenger (ZFS); consider running diags on your storage 
sub-system. Good luck.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Production ZFS Server Death (06/06)

2006-11-28 Thread Anantha N. Srirama
Glad it worked for you. I suspect in your case the corruption happened way down 
in the tree and you could get around it by pruning the tree (rm the file) below 
the point of corruption. I suspect this could be due to a very localized 
corruption, like an alpha-particle problem where a bit was flipped on the platter 
or in the cache of the storage sub-system before destaging to disk. In our case 
the problem was pervasive because it affected our data path (FC).

[b]You do raise a very, very valid point.[/b] It'd be nice if ZFS provided 
better diagnostics; namely, identifying where exactly in the tree it found 
corruption. At that point we could determine whether the remedy is to contain the 
damage (similar to fsck discarding all suspect inodes) and continue.

For example, I've very high regard for the space management in the Oracle DB. 
When it finds a bad block(s) it prints out the address of the block and marks 
it corrupt. [b]It doesn't put the whole file/tablespace/table/index in 
'suspect' mode like ZFS does[/b]. The DBA can then either drop the table/index that 
contains the bad block or extract data from the table minus the bad block. 
Oracle DB handles it very gracefully, giving the user/DBA a chance to recover 
the known good data.

For ZFS to achieve wide acceptance we [b]must[/b] have the ability to pinpoint 
the problem area and take remedial action (rm, for example), not simply give up. 
Yes, there are times when the corruption could affect a block high up in the 
chain, making the situation hopeless; in such a case we'd have to discard and 
restart. ZFS has now solved one part of the problem, namely identifying bad 
data and doing it reliably. It provides resiliency in the form of 
RAID-Z(2) and RAID-1. For it to realize its full potential it must also provide 
tools to discard corrupt parts (branches) of the tree and give us a chance to 
save the remaining data. We won't always have the luxury of rebuilding the pool 
and restoring in a production environment.

Easier said than done, methinks.

Good night.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Another win for ZFS

2006-11-27 Thread Anantha N. Srirama
Today ZFS proved its mettle at our site. We've a set of Sun servers (25K and 
2900s) that are all connected to a DMX3500 via a SAN. Different servers use the 
storage differently; some of the storage on the server side was configured with 
ZFS, some as UFS filesystems, and some more was used in 'raw' form by Oracle 
ASM. In all cases there was no mirroring or protection at the server level; we 
had delegated that function to the DMX3500. This decision came back to haunt 
us this morning.

One of the 25K domains panic'd this morning and ended up in the 'endless 
panic-reboot cycle'. As it turns out our trusted SAN was silently corrupting 
data due to a bad/flaky FC port in the switch. The DMX3500 faithfully wrote the bad 
data and returned normal ACKs back to the servers, so none of our servers reported 
storage problems.

ZFS was the first to pick up on the silent corruption this morning. We're 
still grateful for ZFS even though it put the server in the 'endless 
panic-reboot cycle' that we fixed by following the ZFS FAQ. It'd have been 
nicer if the bug were not present. Our data grows rapidly, and the earlier we 
know of corruption, the shorter the rebuild/restore cycle.

[b]Note to self: use RAID-Z or RAID-1 with ZFS next time around.[/b]
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Configuring a 3510 for ZFS

2006-10-18 Thread Anantha N. Srirama
Thanks for the stimulating exchange of ideas/thoughts. I've always been a 
believer in letting s/w do my RAID functions; for example, in the old days of 
VxVM I always preferred to do mirroring at the s/w level. It is my belief that 
there is more 'meta' information available at the OS level than at the storage 
level for s/w to make intelligent decisions; dynamic recordsize in ZFS is one 
example.

Any thoughts on the following approach?

1. I'll configure the 3511 to present multiple LUNs (mirrored internally) to the OS.
2. Lay down a ZFS pool/filesystem without RAID protection (RAID-Z, ...) in the OS.

With this approach I will enjoy the caching facility of the 3511 and the checksum 
protection afforded by ZFS.
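
In other words, a sketch like this (placeholder LUN names): the 3511 does the 
mirroring and caching, and ZFS can still detect, though not repair, corruption 
via its checksums:

# each LUN below is already mirrored inside the 3511
zpool create sanpool c4t0d0 c4t1d0 c4t2d0 c4t3d0
zfs create sanpool/data

# a periodic scrub will flag checksum errors, although with no ZFS-level
# redundancy it cannot repair them
zpool scrub sanpool
zpool status -v sanpool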
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Configuring a 3510 for ZFS

2006-10-16 Thread Anantha N. Srirama
I'm glad you asked this question. We are currently expecting 3511 storage 
sub-systems for our servers, and we were wondering about their configuration as 
well. This ZFS thing throws a wrench into the old-line thinking ;-) Seriously, we 
now have to put on a new hat to figure out the best way to leverage both the 
storage sub-system and ZFS.

As a sidebar, if the performance of ZFS keeps improving then I can tell you the 
ultra-expensive large arrays will be in trouble. ZFS falls into the category of 
'disruptive technologies' as discussed in the book The Innovator's Dilemma. In the 
short run it'll eat away at the bottom of the performance curve but will trend 
upwards and beat the incumbents (just like RAM took over from core memory).
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: I'm dancin' in the streets

2006-09-27 Thread Anantha N. Srirama
Some people have privately asked me the configuration details when the problem 
was encountered. Here they are:

zonecfg:bluenile> info
zonepath: /zones/bluenile
autoboot: false
pool:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: a.b.c.d
        physical: ce0
dataset:
        name: mtdc/bluenile/cloneu001
dataset:
        name: mtdc/bluenile/cloneu002
dataset:
        name: mtdc/bluenile/cloneu003
dataset:
        name: mtdc/bluenile/cloneu004
dataset:
        name: mtdc/bluenile/cloneu005
dataset:
        name: mtdc/bluenile/cloneu006
dataset:
        name: mtdc/bluenile/cloneu007
dataset:
        name: mtdc/bluenile/cloneu008
dataset:
        name: mtdc/bluenile/cloneu099
dataset:
        name: zfspool/bluenile/capps  [b]-- This is the dataset in question; if 
you replace 'capps' with 'cloneapps' the local zone stops seeing it.[/b]
dataset:
        name: zfspool/bluenile/home
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: I'm dancin' in the streets

2006-09-26 Thread Anantha N. Srirama
I've found a small bug in the ZFS & Zones integration in the Sol10 06/06 release.

This evening I started tweaking my configuration to make it consistent (I like 
orthogonal naming standards) and hit upon this situation:

- Set up a ZFS clone as zfspool/bluenile/cloneapps; this is a clone of my 
global zone's /apps filesystem.
- Updated my zone configuration for bluenile to use 
zfspool/bluenile/cloneapps.
- Booted my zone and couldn't see the just-provisioned ZFS filesystem.

On a hunch I recreated the ZFS clone, but this time I named it 
zfspool/bluenile/capps to reduce the overall length, and updated my zone 
config. Upon boot I was able to see the ZFS filesystem!
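
For completeness, this is roughly how the clone is handed to the zone in both 
cases (the snapshot name is made up for illustration):

# clone of the global zone's /apps filesystem
zfs snapshot zfspool/apps@forzone
zfs clone zfspool/apps@forzone zfspool/bluenile/cloneapps

# hand the clone to the zone and boot it
zonecfg -z bluenile "add dataset; set name=zfspool/bluenile/cloneapps; end"
zoneadm -z bluenile boot
# with the longer name the filesystem never shows up inside the zone;
# renaming the clone to zfspool/bluenile/capps and repeating the steps works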

I'm not sure whether this is a ZFS, Zones, or ZFS/Zones integration problem. It is 
not a show stopper, but in the spirit of ZFS being 'unlimited' in all dimensions, 
why are we limiting the length of a clone name?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] I'm dancin' in the streets

2006-09-22 Thread Anantha N. Srirama
Wow! I solved a tricky problem this morning thanks to the Zones & ZFS integration. 
We have a SAS SPDS database environment running on Sol10 06/06. The SPDS 
database is unique in that when a table is being updated by one user it is 
unavailable to the rest of the user community. Our nightly update jobs 
(occasionally they turn into day jobs when they take longer :-() were getting 
in the way of our normal usage.

So I put on my ZFS cap and figured it could be solved simply by deploying the 
'clone' feature. Simply stated, I'd create a clone of all the SPDS filesystems 
and start another instance of SPDS to read/write from the cloned data. 
Unfortunately I hit a wall when I realized that there is no way to update the 
SPDS metadata (a binary file containing a description of the physical structure 
of the database) with the new directory path.

I was stumped until it occurred to me that I could solve it by simply marrying 
the clones with a Solaris Zone. Now our problem is solved as follows:

1. Stop the local zone
2. Reclaim the ZFS clones in the global zone
3. Destroy the clone/snapshot
4. Recreate the clone/snapshot
5. Restart the local zone
6. Start SPDS in the local zone and it works beautifully because it sees all 
the files it needs per its metadata!!!

To accomplish the same with traditional methods would have required SAN disk, 
disk merge/split, ... You get the picture: ugly!

Chalk up one more victory for Solaris 10 Zones/ZFS! Thanks to the developers 
of these features for enabling me to elegantly solve a difficult problem.

-Anantha-
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Bizzare problem with ZFS filesystem

2006-09-18 Thread Anantha N. Srirama
I don't see a patch for this on the SunSolve website. I've opened a service 
request to get this patch for Sol10 06/06. Stay tuned.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Bizzare problem with ZFS filesystem

2006-09-13 Thread Anantha N. Srirama
I ran the DTrace script and the resulting output is rather large (1 million 
lines and 65MB), so I won't burden this forum with that much data. Here are the 
top 100 lines from the DTrace output. Let me know if you need the full output 
and I'll figure out a way for the group to get it.

dtrace: description 'fbt:zfs::' matched 2404 probes
CPU FUNCTION
520  - zfs_lookup 2929705866442880
520- zfs_zaccess  2929705866448160
520  - zfs_zaccess_common 2929705866451840
520- zfs_acl_node_read2929705866455040
520  - zfs_acl_node_read_internal 2929705866458400
520- zfs_acl_alloc2929705866461040
520- zfs_acl_alloc2929705866462880
520  - zfs_acl_node_read_internal 2929705866464080
520- zfs_acl_node_read2929705866465600
520- zfs_ace_access   2929705866467760
520- zfs_ace_access   2929705866468880
520- zfs_ace_access   2929705866469520
520- zfs_ace_access   2929705866470320
520- zfs_acl_free 2929705866471920
520- zfs_acl_free 2929705866472960
520  - zfs_zaccess_common 2929705866474720
520- zfs_zaccess  2929705866476320
520- zfs_dirlook  2929705866478320
520  - zfs_dirent_lock2929705866480880
520  - zfs_dirent_lock2929705866486560
520  - zfs_dirent_unlock  2929705866489840
520  - zfs_dirent_unlock  2929705866491600
520- zfs_dirlook  2929705866492560
520  - zfs_lookup 2929705866494080
520  - zfs_getattr2929705866499360
520- dmu_object_size_from_db  2929705866503520
520- dmu_object_size_from_db  2929705866507920
520  - zfs_getattr2929705866509280
520  - zfs_lookup 2929705866520400
520- zfs_zaccess  2929705866521200
520  - zfs_zaccess_common 2929705866521920
520- zfs_acl_node_read2929705866523280
520  - zfs_acl_node_read_internal 2929705866524800
520- zfs_acl_alloc2929705866526000
520- zfs_acl_alloc2929705866526800
520  - zfs_acl_node_read_internal 2929705866527280
520- zfs_acl_node_read2929705866528160
520- zfs_ace_access   2929705866528720
520- zfs_ace_access   2929705866529280
520- zfs_ace_access   2929705866529920
520- zfs_ace_access   2929705866530800
520- zfs_acl_free 2929705866531360
520- zfs_acl_free 2929705866531920
520  - zfs_zaccess_common 2929705866532560
520- zfs_zaccess  2929705866533440
520- zfs_dirlook  2929705866534000
520  - zfs_dirent_lock2929705866534640
520  - zfs_dirent_lock2929705866535600
520  - zfs_dirent_unlock  2929705866536480
520  - zfs_dirent_unlock  2929705866537120
520- zfs_dirlook  2929705866537760
520  - zfs_lookup 2929705866538400
520  - zfs_getsecattr 2929705866543600
520- zfs_getacl   2929705866546240
520  - zfs_zaccess2929705866546960
520- zfs_zaccess_common   2929705866547680
520  - zfs_acl_node_read  2929705866548720
520- zfs_acl_node_read_internal   2929705866549440
520  - zfs_acl_alloc  2929705866550080
520  - zfs_acl_alloc  2929705866550720
520- zfs_acl_node_read_internal   2929705866551600
520  - zfs_acl_node_read  2929705866552160
520  - zfs_ace_access 2929705866552720
520  - zfs_ace_access 2929705866553280
520  - zfs_ace_access 2929705866554160
520  - zfs_ace_access 2929705866554720
520  - zfs_ace_access 2929705866555600
520  - zfs_ace_access 2929705866556160
520  - zfs_ace_access 2929705866557040
520  - zfs_ace_access 2929705866557600
520  - zfs_ace_access 2929705866558160
520  - zfs_ace_access 2929705866558720
520  - zfs_ace_access 2929705866559760
520  - zfs_ace_access 

[zfs-discuss] Re: Bizzare problem with ZFS filesystem

2006-09-13 Thread Anantha N. Srirama
One more piece of information: I was able to ascertain that the slowdown happens 
only when ZFS is used heavily, meaning lots of in-flight I/O. This morning when 
the system was quiet my writes to the /u099 filesystem were excellent, and it has 
since gone south as I reported earlier. 

I am currently awaiting the completion of a write to /u099, well over 60 
seconds now. At the same time I was able to create/save files in /u001 without any 
problems. The only difference between /u001 and /u099 is the size of the 
filesystem (256GB vs 768GB).

Per your suggestion I ran a 'zfs set' command and it completed after a wait of 
around 20 seconds, while my file save from vi against /u099 is still pending!!!
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zfs and Oracle ASM

2006-09-13 Thread Anantha N. Srirama
I did a non-scientific benchmark of ASM against ZFS; just look for my posts 
and you'll see it. To summarize, it was a statistical tie for simple loads of 
around 2GB of data, and we've chosen to stick with ASM for a variety of reasons, 
not the least of which is its ability to rebalance when disks are 
added/removed. Better integration comes to mind too.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Bizzare problem with ZFS filesystem

2006-09-12 Thread Anantha N. Srirama
I'm experiencing a bizarre write performance problem with a ZFS 
filesystem. Here are the relevant facts:

[b]# zpool list[/b]
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
mtdc     3.27T   502G   2.78T   14%  ONLINE  -
zfspool  68.5G  30.8G   37.7G   44%  ONLINE  -

[b]# zfs list[/b]
NAME   USED  AVAIL  REFER  MOUNTPOINT
mtdc   503G  2.73T  24.5K  /mtdc
mtdc/sasmeta   397M   627M   397M  /sasmeta
mtdc/u001 30.5G   226G  30.5G  /u001
mtdc/u002 29.5G   227G  29.5G  /u002
mtdc/u003 29.5G   226G  29.5G  /u003
mtdc/u004 28.4G   228G  28.4G  /u004
mtdc/u005 28.3G   228G  28.3G  /u005
mtdc/u006 29.8G   226G  29.8G  /u006
mtdc/u007 30.1G   226G  30.1G  /u007
mtdc/u008 30.6G   225G  30.6G  /u008
mtdc/u099  266G   502G   266G  /u099
zfspool   30.8G  36.6G  24.5K  /zfspool
zfspool/apps  30.8G  33.2G  28.5G  /apps
zfspool/[EMAIL PROTECTED]  2.28G  -  29.8G  -
zfspool/home  15.4M  2.98G  15.4M  /home

[b]# zfs get all mtdc/u099[/b]
NAME PROPERTY   VALUE  SOURCE
mtdc/u099type   filesystem -
mtdc/u099creation   Thu Aug 17 10:21 2006  -
mtdc/u099used   267G   -
mtdc/u099available  501G   -
mtdc/u099referenced 267G   -
mtdc/u099compressratio  3.10x  -
mtdc/u099mountedyes-
mtdc/u099quota  768G   local
mtdc/u099reservationnone   default
mtdc/u099recordsize 128K   default
mtdc/u099mountpoint /u099  local
mtdc/u099sharenfs   offdefault
mtdc/u099checksum   on default
mtdc/u099compressionon local
mtdc/u099atime  offlocal
mtdc/u099deviceson default
mtdc/u099exec   on default
mtdc/u099setuid on default
mtdc/u099readonly   offdefault
mtdc/u099zoned  offdefault
mtdc/u099snapdirhidden default
mtdc/u099aclmodegroupmask  default
mtdc/u099aclinherit secure default

[b]No error messages are listed by zpool or in /var/adm/messages.[/b] When I try to 
save a file the operation takes an inordinate amount of time, in the 30+ second 
range!!! I truss'd the vi session to see the hangup and it waits at the write 
system call.

# truss -p pid
read(0, 0xFFBFD0AF, 1)  (sleeping...)
read(0,  w, 1)= 1
write(1,  w, 1)   = 1
read(0,  q, 1)= 1
write(1,  q, 1)   = 1
read(0, 0xFFBFD00F, 1)  (sleeping...)
read(0, \r, 1)= 1
ioctl(0, I_STR, 0x000579F8) Err#22 EINVAL
write(1, \r, 1)   = 1
write(1,   d e l e t e m e , 10)= 10
stat64(deleteme, 0xFFBFCFA0)  = 0
creat(deleteme, 0666) = 4
ioctl(2, TCSETSW, 0x00060C10)   = 0
[b]write(4,  l f f j d\n, 6) = 6[/b]  still waiting 
while I type this message!!

This problem manifests itself only on this filesystem and not on the other ZFS 
filesystems on the same server, built from the same ZFS pool. While I was 
awaiting completion of the above write I was able to start a new vi session in 
another window and save a file to the /u001 filesystem without any problem. 
System loads are very low. Can anybody comment on this bizarre behavior?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Bizzare problem with ZFS filesystem

2006-09-12 Thread Anantha N. Srirama
Here's the information you requested.

Script started on Tue Sep 12 16:46:46 2006
# uname -a
SunOS umt1a-bio-srv2 5.10 Generic_118833-18 sun4u sparc SUNW,Netra-T12
# prtdiag
System Configuration: Sun Microsystems  sun4u Sun Fire E2900
System clock frequency: 150 MHZ
Memory size: 96GB   

=== CPUs 
===
   E$  CPU  CPU
CPU  Freq  SizeImplementation   MaskStatus  Location
---    --  ---  -   --  
  0,512  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB0/P0
  1,513  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB0/P1
  2,514  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB0/P2
  3,515  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB0/P3
  8,520  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB2/P0
  9,521  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB2/P1
 10,522  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB2/P2
 11,523  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB2/P3
 16,528  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB4/P0
 17,529  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB4/P1
 18,530  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB4/P2
 19,531  1500 MHz  32MBSUNW,UltraSPARC-IV+   2.1on-line SB4/P3

# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md 
sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc 
logindmux cpc sppp fcip wrsmd ]
 arc::stat    print
{
anon = ARC_anon
mru = ARC_mru
mru_ghost = ARC_mru_ghost
mfu = ARC_mfu
mfu_ghost = ARC_mfu_ghost
size = 0x11917e1200
p = 0x116e8a1a40
c = 0x11917cf428
c_min = 0xbf77c800
c_max = 0x17aef9
hits = 0x489737a8
misses = 0x8869917
deleted = 0xc832650
skipped = 0x15b29b2
hash_elements = 0x1273d0
hash_elements_max = 0x17576f
hash_collisions = 0x4e0ceee
hash_chains = 0x3a9b2
Segmentation Fault - core dumped
# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md 
sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc 
logindmux cpc sppp fcip wrsmd ]
> ::kmastat
> ::pgrep vi | ::walk thread
3086600f660
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598d91
[ 02a104598d91 cv_wait_sig+0x114() ]
  02a104598e41 str_cv_wait+0x28()
  02a104598f01 strwaitq+0x238()
  02a104598fc1 strread+0x174()
  02a1045990a1 fop_read+0x20()
  02a104599161 read+0x274()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598c71
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598bb1
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660 (TS_FREE): 2a104598ba1
  02a104598fe1 segvn_unmap+0x1b8()
  02a1045990d1 as_free+0xf4()
  02a104599181 proc_exit+0x46c()
  02a104599231 exit+8()
  02a1045992e1 syscall_trap32+0xcc()
# df -h
Filesystem size   used  avail capacity  Mounted on
/dev/md/dsk/d10 32G   6.7G25G22%/
/devices 0K 0K 0K 0%/devices
ctfs 0K 0K 0K 0%/system/contract
proc 0K 0K 0K 0%/proc
mnttab   0K 0K 0K 0%/etc/mnttab
swap  

[zfs-discuss] Re: Oracle on ZFS

2006-09-09 Thread Anantha N. Srirama
I finally got around to running a 'benchmark' using the AOL clickstream data 
(2GB of text files and approximately 36 million rows). Here are the Oracle 
settings during the test:

- Same Oracle settings for all tests
- All disks in question are 32GB EMC hypers
- I had the standard Oracle tablespaces on one ASM group consisting of 1 disk
- I created a tablespace using ASM on 10 disks
- I created a tablespace using ZFS on 10 disks
- I created a tablespace using ZFS with compression on 10 disks

Test 1 (loading to ASM)
  I loaded the text file into Oracle using external table feature. Time 1m20s, 
system loads were in the 1-1.35 range.

Test 2 (loading to ZFS)
  I loaded the text file into Oracle using external table feature. Time 1m16s, 
system loads were in the 1.13 range.

Test 3 (loading from ASM to ASM)
  I loaded a new table from the just loaded Oracle table. Time 1m21s, system 
loads were in the 1-1.3 range.

Test 4 (loading from ZFS to ZFS)
  I loaded a new table from the just loaded Oracle table. Time 1m20s, system 
loads were in the 1-1.3 range

Test 5 (loading from ZFS to ZFS compress=ON)
  I loaded a new table from the just loaded Oracle table. Time 1m18s, system 
loads were in the 1-1.45 range, saw a compression in the 3.5-4x range.

Throughout the tests I had other stuff running on the machine as well (1 
additional database and a 10g GridControl repository). [b]All the tests yielded 
the same results, in my opinion.[/b]

We'll probably go with Oracle ASM because of its integration with other Oracle 
products/features. I'm not comfortable enough with ZFS to bet on it yet (I've 
only played with it for less than 2 months) while ASM has been around for 3 
years. The other contributing factor is ASM's ability to rebalance the data when 
disks are added/removed. ZFS at this time doesn't provide a facility to remove 
drives when I'm not using mirrors (my problem is that all our disks are 
provisioned from EMC and are already protected); ASM does.

While performing these tests I came across another (severe?) problem with ZFS 
that I'll post as a separate entry.

-Anantha-
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Oracle on ZFS

2006-09-09 Thread Anantha N. Srirama
One correction in the interest of full disclosure: the tests were conducted on a 
machine different from the one indicated in my original post. Here's the server 
config used in the tests:

- E25K domain (1 board: 4P/8-way x 32GB)
- 2 x 2Gbps FC
- MPxIO
- Solaris 10 Update 2 (06/06); no other patches
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Oracle on ZFS

2006-08-26 Thread Anantha N. Srirama
Good start; I'm now motivated to run the same test on my server. My h/w config 
for the test will be:

- E2900 (24-way x 96GB)
- 2 x 2Gbps QLogic cards
- 40 x 64GB EMC LUNs

I'll run the AOL de-identified clickstream database. It'll primarily be a write 
test. I intend to use the following scenarios:

- SVM/UFS (nologging, atime off, directio), data striped across all LUNs
- ZFS (compress=OFF, atime=OFF)
- ZFS (compress=ON, atime=OFF)
- Oracle 10g Automatic Storage Management (ASM)

I'll keep the same Oracle 10g settings for all tests. I'm really interested in 
the comparison between ASM and ZFS, especially with the compress=ON option. In a 
DW environment like ours this could lead to HUGE savings.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS compression / space efficiency

2006-08-22 Thread Anantha N. Srirama
We're running ZFS with compress=ON on an E2900. I'm hosting SAS/SPDS datasets 
(files) on these filesystems and am achieving 3.87x compression (as reported by 
zfs). Your mileage will vary depending on the data you are writing. If 
your data is already compressed (zip files, for instance) then don't expect any payback.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS write performance problem with compression set to ON

2006-08-21 Thread Anantha N. Srirama
I've a few questions:

 - Does 'zpool iostat' report numbers from the top of the ZFS stack or the 
bottom? I've correlated the zpool iostat numbers with the system iostat numbers 
and they match up. This tells me the numbers are from the 'bottom' of the ZFS 
stack, right? Having said that, it'd be nice to have zpool iostat return numbers 
at the top of the stack as well. This becomes relevant when we've compression=ON.

 - Secondly, I did some more tests and I see the same read waves and the 
same consistent write throughput. I've been reading another thread on this forum 
about Niagara and compression where Matt Ahrens noted that the compression 
at this time is single-threaded. Further, he stated that a bugfix may be 
released to use multiple threads. I eagerly await the fix.

Thanks again for a great feature. Looking forward to more fun stuff out of Sun 
and you, Mr. Bonwick.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS write performance problem with compression set to ON

2006-08-17 Thread Anantha N. Srirama
Therein lies my dilemma:

  - We know the I/O sub-system is capable of much higher I/O rates.
  - Under the test setup I've SAS datasets which lend themselves to 
compression. This should manifest itself as lots of read I/O resulting in much 
smaller (4x) write I/O due to compression. That means my read rates should be 
driven higher to keep the compression engine fed. I don't see this; as I said in my 
original post, the reads come in waves.

I'm beginning to think my write rates are hitting a bottleneck in compression, 
as follows:
  - ZFS issues reads.
  - ZFS starts compressing the data before the write and cannot drain the input 
buffers fast enough; this causes the reads to stop.
  - ZFS completes compression and writes out the data at a much smaller rate due to 
the smaller compressed data stream.

I'm not a filesystem wizard, but shouldn't ZFS take advantage of my available 
CPUs to drain the input buffer faster (in parallel)? It is possible that you've 
some internal throttles in place to make ZFS a good citizen in the Solaris 
landscape; a la the algorithms in place to prevent cache flooding by one 
host/device on EMC/Hitachi.

I'll perform some more tests with different datasets and report back to the forum. 
Now if only I can convince my storage administrator to provision me raw disks 
instead of mirrored disks so I can let ZFS do the mirroring for me; another 
battle, another day ;-)

Thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS write performance problem with compression set to ON

2006-08-16 Thread Anantha N. Srirama
Test setup:
  - E2900 with 12 US-IV+ 1.5GHz processors, 96GB memory, 2x2Gbps FC HBAs, MPxIO 
in round-robin config.
  - 50 x 64GB EMC disks presented over both FCs.
  - ZFS pool defined using all 50 disks.
  - Multiple ZFS filesystems built on the above pool.

I'm observing the following:
  - When the filesystems have compress=OFF and I do bulk reads/writes (8 
parallel 'cp's running between ZFS filesystems; a sketch of the test harness 
follows this list) I observe approximately 200-250MB/S consolidated I/O, with 
writes in the 100MB/S range. I get these numbers running 'zpool iostat 5'. I see 
the same read/write ratio for the duration of the test.
  - When the filesystems have compress=ON I see the following: reads from 
the compressed filesystems come in waves; zpool will report no read activity for 
long durations (60+ seconds) while the write activity is consistently reported at 
20MB/S (no variation in the write rate throughout the test).
  - The machine is mostly idling during the entire test, in both cases.
  - ZFS reports a 4:1 compression ratio for my filesystem.
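
The test harness itself is trivial, roughly this (source and target paths are 
examples):

#!/bin/sh
# kick off 8 parallel copies between two ZFS filesystems in the pool;
# watch 'zpool iostat 5' in another window while this runs
for i in 1 2 3 4 5 6 7 8; do
    cp /u001/testdata/file$i /u099/testdata/ &
done
wait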

I'm puzzled by the following:
  - Why do reads come in waves with compression=ON? It almost feels like ZFS 
reads a bunch of data and then proceeds to compress it before writing it out. 
This tells me there is not a read bottleneck; meaning there is no starvation of 
the compress routine, given that CPU/machine/IO is not saturated in any shape or 
form.
  - Why then does ZFS generate substantially lower write throughput (a magical 
20MB/S spread evenly across the 50 disks, 0.4MB/S each)?

Can anybody shed any light on this anomaly? Mr. Bonwick, I hope you're 
reading this post.

BTW, we love ZFS and are looking forward to rolling it out aggressively in our 
new project. I'd like to take advantage of the compression since we're mostly 
I/O bound and we've plenty of CPU/memory.

Thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss