[zfs-discuss] Why RAID 5 stops working in 2009
Has anyone here read the article "Why RAID 5 stops working in 2009" at http://blogs.zdnet.com/storage/?p=162 ? Does RAID-Z have the same chance of hitting an unrecoverable read error as RAID 5 on Linux when the array has to be rebuilt because of a faulty disk? I imagine so, because of the physical constraints that plague our hard disks. Granted, the chance of failure in my case shouldn't be nearly as high, as I will most likely recruit three or four 750 GB drives, not something on the order of 10 TB. With my OpenSolaris NAS, I will be scrubbing every week for consumer-grade drives (every month for enterprise-grade), as recommended in the ZFS Best Practices Guide. If I run zpool status and see that scrubs are fixing an increasing number of errors, would that mean that the disk is in fact headed toward failure, or could the natural growth of disk usage be to blame?
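For illustration, a weekly scrub can be driven from root's crontab; the pool name "tank" below is hypothetical:

# crontab -e    (add an entry to scrub every Sunday at 03:00)
0 3 * * 0 /usr/sbin/zpool scrub tank

# afterwards, check whether the scrub repaired anything:
# zpool status -v tank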
[zfs-discuss] Cannot replace a replacing device
I had a drive fail and replaced it with a new drive. During the resilvering process the new drive had write faults and was taken offline. These faults were caused by a broken SATA cable (the drive checked out fine with the manufacturer's software). A new cable fixed the failure. However, the drive now shows as faulted. I know the drive is healthy, so I want to force a rescrub, but this won't happen while it is showing FAULTED. I tried to force a replace, but this gives the error "cannot replace a replacing device". So I seem to be in a stuck state, where the replace won't complete. Please help - screen output below.

C3P0# zpool status
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ad4                    ONLINE       0     0     0
            ad6                    ONLINE       0     0     0
            replacing              UNAVAIL      0 1.06K     0  insufficient replicas
              1796873336336467178  UNAVAIL      0 1.23K     0  was /dev/ad7/old
              4407623704004485413  FAULTED      0 1.22K     0  was /dev/ad7

errors: No known data errors
C3P0# zpool replace -f tank 4407623704004485413 ad7
cannot replace 4407623704004485413 with ad7: cannot replace a replacing device
C3P0#
Re: [zfs-discuss] Cannot replace a replacing device
Yes - but it does nothing. The drive remains FAULTED.
Re: [zfs-discuss] Cannot replace a replacing device
Thanks for the suggestion, but I have tried detaching and it refuses, reporting "no valid replicas". Capture below.

C3P0# zpool status
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ad4                    ONLINE       0     0     0
            ad6                    ONLINE       0     0     0
            replacing              UNAVAIL      0 9.77K     0  insufficient replicas
              1796873336336467178  UNAVAIL      0 11.6K     0  was /dev/ad7/old
              4407623704004485413  FAULTED      0 10.4K     0  was /dev/ad7

errors: No known data errors
C3P0# zpool detach tank 1796873336336467178
cannot detach 1796873336336467178: no valid replicas
C3P0# zpool detach tank 4407623704004485413
cannot detach 4407623704004485413: no valid replicas
Re: [zfs-discuss] Cannot replace a replacing device
Thanks - have run it and it returns pretty quickly. Given the output (attached), what action can I take? Thanks, James

Dirty time logs:

tank
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
raidz
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
/dev/ad4
/dev/ad6
replacing
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
/dev/ad7/old
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
/dev/ad7
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6

Metaslabs:
    vdev 0

    offset   spacemap     free
    ------   --------     ----
         0         26    20.0M
         4         52     166M
         8         56    2.66G
         c         65    12.4M
        10         66    20.7M
        14         69    29.1M
        18         73    29.7M
        1c         77    29.6M
        20         81    79.2M
        24         91    87.9M
        28         92    63.2M
        2c         94    94.2M
        30         99     123M
        34        103     523M
        38        107    50.9M
        3c        111     117M
        40        116    54.3M
        44        119    60.2M
        48        123    97.4M
        4c        126    1.20G
        50        129    48.5M
        54        132     106M
        58        137    27.4M
        5c        140    39.6M
        60        146    45.3M
        64        149    34.9M
        68        151     544M
        6c        154    36.6M
        70        156    19.4M
        74        160    35.7M
        78        162    41.2M
        7c        166    23.1M
        9c         74    14.1M
        a0         78    15.2M
        a4         88    28.1M
        a8        174    23.3M
        ac        178    24.2M
        b0        181    26.3M
        b4        100    43.4M
        b8        104    33.6M
        bc        108    30.6M
        c0        113    59.8M
        c4        115    53.9M
        c8        120    30.8M
        cc        124    82.2M
        d0        127    36.9M
        d4        130    76.2M
        d8        133    39.7M
Re: [zfs-discuss] howto: make a pool with ashift=X
Well, for the sake of completeness (and perhaps to help users of snv_151a), there should also be links to the alternative methods:

1) Using a patched-and-recompiled, or already precompiled, zpool binary, i.e.:
http://www.solarismen.de/archives/4-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-1.html
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
http://www.solarismen.de/archives/6-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-3.html
http://www.solarismen.de/archives/9-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-4.html
http://www.kuehnke.de/christian/solaris/zpool-s10u8

2) Making the pool in an alternate OS, such as a FreeBSD LiveCD with its tricks, and then importing/upgrading it in Solaris. See www.zfsguru.org and numerous posts on the internet by its author, sub_mesa (or sub.mesa).

I am not promoting either of these methods. I've used (1) successfully on my OI_148a box with a precompiled binary, and I haven't gotten around to trying (2). Just my 2c :) //Jim
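For illustration, once a pool has been created with one of these methods, the resulting ashift can be inspected with zdb; the pool name "tank" is hypothetical and the exact output layout may differ between builds:

# zdb tank | grep ashift
            ashift: 12

A value of 12 corresponds to 4 KB sectors (2^12), while the default of 9 corresponds to 512-byte sectors.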
[zfs-discuss] Re: zfs panic when unpacking open solaris source
Looks like CR 6411261 "busy intent log runs out of space on small pools". I found this one. I just bumped up the priority. Jim

When unpacking the Solaris source onto a local disk on a system running build 39, I got the following panic:

panic[cpu0]/thread=d2c8ade0: really out of space

d2c8a7b4 zfs:zio_write_allocate_gang_members+3e6 (e4385ac0)
d2c8a7d0 zfs:zio_dva_allocate+81 (e4385ac0)
d2c8a7e8 zfs:zio_next_stage+66 (e4385ac0)
d2c8a800 zfs:zio_checksum_generate+5e (e4385ac0)
d2c8a81c zfs:zio_next_stage+66 (e4385ac0)
d2c8a83c zfs:zio_wait_for_children+46 (e4385ac0, 1, e4385c)
d2c8a850 zfs:zio_wait_children_ready+18 (e4385ac0)
d2c8a864 zfs:zio_next_stage_async+ac (e4385ac0, f8def9d0,)
d2c8a874 zfs:zio_nowait+e (e4385ac0)
d2c8a8d4 zfs:zio_write_allocate_gang_members+341 (e120e0c0)
d2c8a8f0 zfs:zio_dva_allocate+81 (e120e0c0)
d2c8a908 zfs:zio_next_stage+66 (e120e0c0)
d2c8a920 zfs:zio_checksum_generate+5e (e120e0c0)
d2c8a93c zfs:zio_next_stage+66 (e120e0c0)
d2c8a95c zfs:zio_wait_for_children+46 (e120e0c0, 1, e120e2)
d2c8a970 zfs:zio_wait_children_ready+18 (e120e0c0)
d2c8a984 zfs:zio_next_stage_async+ac (e120e0c0, f8def9d0,)
d2c8a994 zfs:zio_nowait+e (e120e0c0)
d2c8a9f4 zfs:zio_write_allocate_gang_members+341 (e3c0a580)
d2c8aa10 zfs:zio_dva_allocate+81 (e3c0a580)
d2c8aa28 zfs:zio_next_stage+66 (e3c0a580)
d2c8aa40 zfs:zio_checksum_generate+5e (e3c0a580)
d2c8aa54 zfs:zio_next_stage+66 (e3c0a580)
d2c8aaa0 zfs:zio_write_compress+236 (e3c0a580)
d2c8aabc zfs:zio_next_stage+66 (e3c0a580)
d2c8aadc zfs:zio_wait_for_children+46 (e3c0a580, 1, e3c0a7)
d2c8aaf0 zfs:zio_wait_children_ready+18 (e3c0a580)
d2c8ab04 zfs:zio_next_stage_async+ac (e3c0a580, 0, f8dbfe)
d2c8ab1c zfs:zio_nowait+e (e3c0a580)
d2c8ab3c zfs:arc_write+7b (e44c9780, d895e8c0,)
d2c8abec zfs:dbuf_sync+5f3 (dbd6ef00, e44c9780,)
d2c8ac4c zfs:dnode_sync+33a (d34fbb30, 1, e44c97)
d2c8ac80 zfs:dmu_objset_sync_dnodes+7e (d2380240, d23802fc,)
d2c8acd0 zfs:dmu_objset_sync+5d (d2380240, e96f1e80)
d2c8ad1c zfs:dsl_pool_sync+121 (d244a180, 15e234, 0)
d2c8ad6c zfs:spa_sync+10a (d895e8c0, 15e234, 0)
d2c8adc8 zfs:txg_sync_thread+1df (d244a180, 0)
d2c8add8 unix:thread_start+8 ()

I now have a chicken-and-egg problem: I need to unpack the source to work out what is going on, but I can't, as the system crashes unless I put it on my external USB drive, and there are some issues with that! Is this a known issue?
Some more data on the file systems:

: sigma IA 4 $; zfs list -r home/cjg
NAME                            USED  AVAIL  REFER  MOUNTPOINT
home/cjg                       7.81G   138M  1.99G  /export/home/cjg
home/[EMAIL PROTECTED]         1.91M      -  1.97G  -
home/[EMAIL PROTECTED]:53:46   2.38M      -  1.97G  -
home/[EMAIL PROTECTED]          433K      -  1.97G  -
home/[EMAIL PROTECTED]          492K      -  1.97G  -
home/[EMAIL PROTECTED]          409K      -  1.97G  -
home/[EMAIL PROTECTED]          474K      -  1.97G  -
home/[EMAIL PROTECTED]          314K      -  1.97G  -
home/[EMAIL PROTECTED]          314K      -  1.97G  -
home/[EMAIL PROTECTED]             0      -  1.97G  -
home/[EMAIL PROTECTED]             0      -  1.97G  -
home/[EMAIL PROTECTED]             0      -  1.97G  -
home/[EMAIL PROTECTED]          253K      -  1.97G  -
home/[EMAIL PROTECTED]          342K      -  1.97G  -
home/[EMAIL PROTECTED]          624K      -  1.98G  -
home/[EMAIL PROTECTED]          429K      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]          146K      -  1.98G  -
home/[EMAIL PROTECTED]          282K      -  1.98G  -
home/[EMAIL PROTECTED]          218K      -  1.98G  -
home/[EMAIL PROTECTED]          300K      -  1.98G  -
home/[EMAIL PROTECTED]          232K      -  1.98G  -
home/[EMAIL PROTECTED]          458K      -  1.98G  -
home/[EMAIL PROTECTED]          462K      -  1.98G  -
home/[EMAIL PROTECTED]          576K      -  1.98G  -
home/[EMAIL PROTECTED]          147K      -  1.98G  -
home/[EMAIL PROTECTED]          147K      -  1.98G  -
home/[EMAIL PROTECTED]          448K      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]          354K      -  1.98G  -
home/[EMAIL PROTECTED]          258K      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]          522K      -  1.98G  -
home/[EMAIL PROTECTED]          615K      -  1.98G  -
home/[EMAIL PROTECTED]          766K      -  1.98G  -
home/[EMAIL PROTECTED]          625K      -  1.98G  -
home/[EMAIL PROTECTED]          565K      -  1.98G  -
home/[EMAIL PROTECTED]          470K      -  1.98G  -
home/[EMAIL PROTECTED]          495K      -  1.98G  -
home/[EMAIL PROTECTED]          305K      -  1.98G  -
home/[EMAIL PROTECTED]          314K      -  1.98G  -
home
[zfs-discuss] Re: RE: [Security-discuss] Proposal for new basic privileges related with
I am also interested in writing some test cases that will check the correct semantics of access checks on files with different permissions and with different privileges set/unset by the process. Are there already file-access test cases at Sun I might expand? Do test suites for OpenSolaris have to be written in a special kind of programming language?

We do extensive file-access testing as part of the ZFS test suite. The test suite is mostly written in ksh scripts with some C code. We should have the test suite available externally via OpenSolaris.org sometime in July or August. In the meantime I would code up your unit tests in ksh so they can be more easily integrated. We'll keep you posted as progress is made in releasing the test suite. Cheers, Jim
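For illustration, a minimal ksh unit test in that spirit might look like the sketch below; the scratch path and the use of the unprivileged "nobody" account are assumptions, not part of the actual suite:

#!/bin/ksh
# Verify that mode 000 blocks an unprivileged read.
f=/var/tmp/acctest.$$
trap 'rm -f $f' EXIT
echo data > $f
chmod 000 $f
if su nobody -c "cat $f" > /dev/null 2>&1; then
        echo "FAIL: read succeeded despite mode 000"
        exit 1
fi
echo "PASS: mode 000 blocked unprivileged read"
exit 0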
[zfs-discuss] Let's get cooking...
http://www.tech-recipes.com/solaris_system_administration_tips1446.html
[zfs-discuss] ZFS components for a minimal Solaris 10 U2 install?
For an embedded application, I'm looking at creating a minimal Solaris 10 U2 image which would include ZFS functionality. In quickly taking a look at the opensolaris.org site under pkgdefs, I see three packages that appear to be related to ZFS: SUNWzfskr, SUNWzfsr, and SUNWzfsu. Is it naive to think that this would be all that is needed for ZFS? Thanks, -- Jim C
Re: [zfs-discuss] Big JBOD: what would you do?
I agree with Greg - for ZFS, I'd recommend a larger number of RAID-Z LUNs, with a smaller number of disks per LUN, up to 6 disks per RAID-Z LUN. This will more closely align with performance best practices, so it would be cool to find common ground in terms of a sweet spot for performance and RAS. /jim

Gregory Shaw wrote:
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns. Using that configuration, a full-width stripe write should be a single operation for each controller. In production, the application needs would probably dictate the resulting disk layout. If the application doesn't need tons of i/o, you could bind more disks together for larger luns...

On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:
ZFS fans, I'm preparing some analyses on RAS for large JBOD systems such as the Sun Fire X4500 (aka Thumper). Since there are zillions of possible permutations, I need to limit the analyses to some common or desirable scenarios. Naturally, I'd like your opinions. I've already got a few scenarios in analysis, and I don't want to spoil the brainstorming, so feel free to think outside of the box. If you had 46 disks to deploy, what combinations would you use? Why? Examples:

46-way RAID-0 (I'll do this just to show why you shouldn't do this)
22x2-way RAID-1+0 + 2 hot spares
15x3-way RAID-Z2+0 + 1 hot spare
...

Because some people get all wrapped up with the controllers, assume 5 8-disk SATA controllers plus 1 6-disk controller. Note: the reliability of the controllers is much greater than the reliability of the disks, so the data availability and MTTDL analysis will be dominated by the disks themselves. In part, this is due to using SATA/SAS (point-to-point disk connections) rather than a parallel bus or FC-AL, where we would also have to worry about bus or loop common-cause failures. I will be concentrating on data availability and MTTDL as two views of RAS. The intention is that the interesting combinations will also be analyzed for performance, and we can complete a full performability analysis on them. Thanks -- richard

- Gregory Shaw, IT Architect
Phone: (303) 673-8273  Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382, Louisville, CO 80028-4382
[EMAIL PROTECTED] (work)  [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
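For illustration, the 8x5-disk raid-z layout Greg describes could be created along these lines; the controller/target names are hypothetical and only the first two vdevs plus a spare are shown:

# zpool create tank \
    raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c1t0d0 \
    raidz c1t1d0 c1t2d0 c1t3d0 c2t0d0 c2t1d0 \
    spare c5t5d0

With each 5-disk raidz vdev spread across controllers like this, a full-width stripe write touches each controller roughly once.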
Re: [zfs-discuss] zfs sucking down my memory!?
I need to read through this more thoroughly to get my head around it, but on my first pass, what jumps out at me is that something significant _changed_ in terms of application behavior with the introduction of ZFS. I'm not saying that is a bad thing, or a good thing, but it is an important thing, and we should try to understand whether application behavior will, in general, change with the introduction of ZFS, so we can advise users accordingly. Joe appears to have been a user of Sun systems for some time, with a lot of experience deploying Solaris 8 and Solaris 9. He has successfully deployed systems without physical swap, and I understand his reason for doing so. If the introduction of Solaris 10 and ZFS means we need to change a system parameter when transitioning from S8 or S9, such as configured swap, we need to understand why, and make sure we understand the performance implications.

Why do you think your performance *improves* if you don't use swap? It is much more likely it *deteriorates*, because your swap accumulates stuff you do not use.

I'm not sure what this is saying, but I don't think it came out right. As I said, I need to do another pass on the information in the messages to get a better handle on the observed behaviour, but this certainly seems like something we should explore further. Watch this space. /jim

At any rate, I don't think adding swap will fix the problem I am seeing, in that ZFS is not releasing its unused cache when applications need it. Adding swap might allow the kernel to move it out of memory, but when the system needs it again it will have to swap it back in, and only performance suffers, no?

Well, you have decided that all application data needs to be memory resident all of the time; but executables don't need to be (they are now tossed out on memory shortage) and ZFS can use less cache than it wants to.

FWIW, here's the current ::memstat and swap output for my system. The reserved number is only about 46M, or about 2% of RAM. Considering the box has 3G, I'm willing to sacrifice 2% in the interest of performance.

Page Summary            Pages      MB   %Tot
Kernel                 249927    1952    64%
Anon                    34719     271     9%
Exec and libs            2415      18     1%
Page cache               1676      13     0%
Free (cachelist)        11796      92     3%
Free (freelist)         88288     689    23%
Total                  388821    3037
Physical               382802    2990

[EMAIL PROTECTED]: swap -s
total: 260008k bytes allocated + 47256k reserved = 307264k used, 381072k available

So there's 47MB of memory which is not used at all. (Adding swap will give you 47MB of additional free memory without anything being written to disk.) Execs are also pushed out on shortfall. There is 265 MB of anon memory and we have no clue how much of it is used at all; a large percentage is likely unused. But OTOH, you have sufficient memory on the freelist, so there is not much of an issue. Casper
Re: [zfs-discuss] ZFS components for a minimal Solaris 10 U2 install?
Included below is a thread which dealt with trying to find the packages necessary for a minimal Solaris 10 U2 install with ZFS functionality. In addition to SUNWzfskr, SUNWzfsr and SUNWzfsu, the SUNWsmapi package needs to be installed: its libdiskmgt.so.1 library is required by the zpool(1M) command. Having found this out via trial and error, I see there is no dependency mentioned for SUNWsmapi in the SUNWzfsr depend file. Apologies if this is nitpicking, but is this missing dependency worthy of submitting a P5 CR? -- Jim C

Jason Schroeder wrote:
Dale Ghent wrote:
On Jun 28, 2006, at 4:27 PM, Jim Connors wrote:
For an embedded application, I'm looking at creating a minimal Solaris 10 U2 image which would include ZFS functionality. In quickly taking a look at the opensolaris.org site under pkgdefs, I see three packages that appear to be related to ZFS: SUNWzfskr, SUNWzfsr, and SUNWzfsu. Is it naive to think that this would be all that is needed for ZFS?

Those packages, as well as what's listed in the depend files for those packages. Ahh, don't you love climbing the dependency tree? /dale

Glenn Brunette wrote a nifty little tool ... you have to assume that all of the dependencies are appropriately doc'ed, of course, cough. http://blogs.sun.com/roller/page/gbrunett?entry=solaris_package_companion /jason
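For illustration, installing the minimal set by hand might look like the following; the install media path is hypothetical:

# pkgadd -d /cdrom/Solaris_10/Product SUNWsmapi SUNWzfskr SUNWzfsr SUNWzfsu

Including SUNWsmapi in the same invocation avoids the undeclared libdiskmgt.so.1 dependency biting zpool(1M) at runtime.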
[zfs-discuss] ZFS state between reboots for RAM rsident OS?
Guys, Thanks for the help so far; now come the more interesting questions ... Piggybacking off of some work being done to minimize Solaris for embedded use, I have a version of Solaris 10 U2 with ZFS functionality with a disk footprint of about 60MB. Creating a miniroot based upon this image, it can be compressed to under 30MB. Currently, I load this image onto a USB keyring and boot from the USB device, running the Solaris miniroot out of RAM. Note: the USB keyring is a hideously slow device, but for the sake of this proof of concept it works fine. In addition, some more packages will need to be added later on (i.e. NFS, Samba?) which will increase the footprint. My ultimate goal here is to demonstrate a network storage appliance using ZFS, where the OS is effectively stateless, or as stateless as possible. ZFS goes a long way in assisting here since, for example, mount and NFS share information can be managed by ZFS. But I suppose it's not as stateless as I thought. Upon booting from the USB device into memory, I can do a `zpool create poo1 c1d0', but a subsequent reboot does not remember this work. Doing a `zpool list' yields 'no pools available'. So the question is, what sort of state is required between reboots for ZFS? Regards, -- Jim C
[zfs-discuss] Re: ZFS state between reboots for RAM rsident OS?
I understand. Thanks. Just curious: ZFS manages NFS shares. Have you given any thought to what might be involved for ZFS to manage SMB shares in the same manner? This all goes towards my stateless OS theme. -- Jim C

Eric Schrock wrote:
You need the following file:

/etc/zfs/zpool.cache

This file 'knows' about all the pools on the system. These pools can typically be discovered via 'zpool import', but we can't do this at boot because:

a. It can be really, really expensive (tasting every disk on the system)
b. Pools can be comprised of files or devices not in /dev/dsk

So, we have the cache file, which must be editable if you want to remember newly created pools. Note this only affects configuration changes to pools - everything else is stored within the pool itself. - Eric

On Tue, Jul 25, 2006 at 12:18:07PM -0400, Jim Connors wrote: [...]

-- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
[zfs-discuss] Re: ZFS state between reboots for RAM rsident OS?
Eric Schrock wrote:
You need the following file: /etc/zfs/zpool.cache

So as a workaround (or, more appropriately, a kludge) would it be possible to:

1. At boot time, do a 'zpool import' of some pool guaranteed to exist; for the sake of this discussion, call it 'system'.
2. Have /etc/zfs/zpool.cache be symbolically linked to /system/ZPOOL.CACHE?

-- Jim C

[...]
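For illustration, an untested sketch of that kludge, run from an early boot script; the pool and path names are hypothetical, and whether later pool creates update the cache correctly through a symlink would need verifying:

# zpool import system
# ln -sf /system/ZPOOL.CACHE /etc/zfs/zpool.cache

Pools recorded in the persistent copy on the 'system' pool would then survive the RAM-resident root being thrown away on reboot.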
Re: [zfs-discuss] Assertion raised during zfs share?
Eric Schrock wrote:
This indicates that share(1M) didn't produce any output, but returned a non-zero exit status. I'm not sure why this would happen - can you run the following by hand?

# share /export
# echo $?

bash-3.00# share
bash-3.00# share /export
bash-3.00# echo $?
0

Looks like the NFS server is not completely configured yet, and that it requires this zfs share stuff to work first.

bash-3.00# svcs -a | grep nfs/server
disabled        6:24:31  svc:/network/nfs/server:default
bash-3.00# more /var/svc/log/network-nfs-server\:default.log
[ Aug 4 06:15:31 Executing start method (/lib/svc/method/nfs-server start) ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 399, function zfs_share
Abort - core dumped
[ Aug 4 06:15:32 Method start exited with status 0 ]
[ Aug 4 06:15:32 Stopping because process dumped core. ]
[ Aug 4 06:15:32 Executing stop method (/lib/svc/method/nfs-server stop 30) ]
[ Aug 4 06:15:32 Method stop exited with status 0 ]
[ Aug 4 06:15:32 Executing start method (/lib/svc/method/nfs-server start) ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 399, function zfs_share
Abort - core dumped

-- Jim C

Incidentally, the explicit 'zfs share' isn't needed, as we automatically share the filesystem when the options are set (which did succeed). - Eric

On Fri, Aug 04, 2006 at 12:42:02PM -0400, Jim Connors wrote:
Working to get ZFS to run on a minimal Solaris 10 U2 configuration. In this scenario, ZFS is included in the miniroot which is booted into RAM. When trying to share one of the filesystems, an assertion is raised - see below. If the version of source on OpenSolaris.org matches Solaris 10 U2, then it looks like it's associated with a popen of /usr/sbin/share. Can anyone shed any light on this? Thanks, -- Jim C

# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
SYS            83K   163M  30.5K  /SYS
export        110K  72.8G  25.5K  /export
export/home  24.5K  72.8G  24.5K  /export/home
# zpool list
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
SYS       195M    90K    195M    0%  ONLINE  -
export     74G   114K   74.0G    0%  ONLINE  -
# zfs set sharenfs=on export
# zfs share export
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 399, function zfs_share
Abort - core dumped

-- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
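For illustration, once the zfs share assertion is resolved, the NFS service would be brought up through SMF in the usual way; a sketch with standard commands:

# svcadm enable -r svc:/network/nfs/server:default
# svcs -x nfs/server

The -r flag also enables the service's dependencies (rpc/bind and friends), which a stripped-down 47-package image may not have running by default.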
Re: [zfs-discuss] Assertion raised during zfs share?
Richard Elling wrote:
Jim Connors wrote:
Working to get ZFS to run on a minimal Solaris 10 U2 configuration.

What does minimal mean? Most likely, you are missing something. -- richard

Yeah. Looking at package and SMF dependencies, plus a whole lot of trial and error, I've currently got Solaris down to 47 packages. The nfs/server service for Solaris 10 U2 will first try to do a zfs share. For the next step, I'll probably comment out that stuff and see if I can bring up the NFS server code and share a UFS filesystem using the traditional methods. Once that's OK, I'll move on to the ZFS portion and investigate. Thanks, -- Jim C
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
Roch - PAE wrote:
The hard part is getting a set of simple requirements. As you go into more complex data center environments you get hit with older Solaris revs, other OSs, SOX compliance issues, etc. etc. etc. The world where most of us seem to be playing with ZFS is on the lower end of the complexity scale.

I've been watching this thread and unfortunately fit this model. I'd hoped that ZFS might scale enough to solve my problem, but you seem to be saying that it's mostly untested in large-scale environments. About 7 years ago we ran out of inodes on our UFS file systems. We used bFile as middleware for a while to distribute the files across multiple disks, and then switched to VFS on SAN about 5 years ago. Distribution across file systems and inode depletion continued to be a problem, so we switched middleware to another vendor that essentially compresses about 200 files into a single 10 MB archive and uses a DB to find the file within the archive on the correct disk. An expensive, complex and slow, but effective, solution until the latest license renewal, when we got hit with a huge bill. I'd love to go back to a pure file system model, and have looked at Reiser4, JFS, NTFS and now ZFS for a way to support over 100 million small documents and 16 TB. We average 2 file reads and 1 file write per second, 24/7, with expected growth to 24 TB. I'd be willing to scrap everything we have to find a non-proprietary long-term solution. ZFS looked like it might provide an answer. Are you saying it's not really suitable for this type of application?
[zfs-discuss] Re: zfs hot spare not automatically getting used
So is there a command to make the spare get used, or do I have to remove it as a spare and re-add it if it doesn't get used automatically? Is this a bug to be fixed, or will this always be the case when the disks aren't exactly the same size?
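For illustration, a spare can be pressed into service by hand with zpool replace; the pool and device names below are hypothetical:

# zpool replace tank c1t2d0 c1t5d0    (failed device first, then the spare)
# zpool status tank

The spare then shows as INUSE under the spares section while it resilvers.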
[zfs-discuss] Re: zfs hot spare not automatically getting used
I know this isn't necessarily ZFS-specific, but after I reboot I spin the drives back up, yet nothing I do (devfsadm, disks, etc.) can get them seen again until the next reboot. I've got some older SCSI drives in an old Andataco Gigaraid enclosure, which I thought supported hot-swap, but I seem unable to hot-swap them in. The PC has an Adaptec 39160 card in it and I'm running Nevada b51. Is this not a setup that can support hot swap? Or is there something I have to do other than devfsadm to get the SCSI bus rescanned?
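For illustration, on controllers whose HBA driver supports dynamic reconfiguration, the bus can usually be rescanned without a reboot; the controller id c3 is hypothetical:

# cfgadm -al                 (list attachment points and their state)
# cfgadm -c configure c3     (reconfigure devices on that controller)
# devfsadm -Cv               (rebuild /dev links, pruning stale ones)

Whether this works here depends on the Adaptec driver in question actually supporting cfgadm-style reconfiguration.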
[zfs-discuss] Managed to corrupt my pool
Platform:
- old Dell workstation with an Andataco Gigaraid enclosure plugged into an Adaptec 39160
- Nevada b51

Current zpool config:
- one two-disk mirror with two hot spares

In my ferocious pounding of ZFS I've managed to corrupt my data pool. This is what I've been doing to test it:
- set zil_disable to 1 in /etc/system
- continually untar a couple of files into the filesystem
- manually spin down a drive in the mirror by holding down the button on the enclosure
- for any system hangs, reboot with a nasty reboot -dnq

I've gotten different results after the spindown:
- works properly: short or no hang, hot spare successfully added to the mirror
- system hangs, and after a reboot the spare is not added
- tar hangs, but after running zpool status the hot spare is added properly and tar continues
- tar continues, but hangs on zpool status

The last is what happened just prior to the corruption. Here's the output of zpool status:

nextest-01# zpool status -v
  pool: zmir
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 1 errors on Thu Nov 30 11:37:21 2006
config:

        NAME        STATE     READ WRITE CKSUM
        zmir        DEGRADED     8     0     4
          mirror    DEGRADED     8     0     4
            c3t3d0  ONLINE       0     0    24
            c3t4d0  UNAVAIL      0     0     0  cannot open
        spares
          c0t0d0    AVAIL
          c3t1d0    AVAIL

errors: The following persistent errors have been detected:

        DATASET  OBJECT  RANGE
        15       0       lvl=4294967295 blkid=0

So the questions are:
- is this fixable? I don't see an inum I could run find on to remove, and I can't even do a zfs volinit anyway:

nextest-01# zfs volinit
cannot iterate filesystems: I/O error

- would not enabling zil_disable have prevented this?
- should I have been doing a 3-way mirror?
- is there a more optimal configuration to help prevent this kind of corruption?

Ultimately, I want to build a ZFS server with performance and reliability comparable to, say, a Netapp, but the fact that I appear to have been able to nuke my pool by simulating a hardware error gives me pause. I'd love to know if I'm off-base in my worries. Jim
[zfs-discuss] Re: Managed to corrupt my pool
So the questions are:
- is this fixable? I don't see an inum I could run find on to remove, and I can't even do a zfs volinit anyway:

nextest-01# zfs volinit
cannot iterate filesystems: I/O error

- would not enabling zil_disable have prevented this?
- should I have been doing a 3-way mirror?
- is there a more optimal configuration to help prevent this kind of corruption?

Anyone have any thoughts on this? I'd really like to be able to build a nice ZFS box for file service, but if a hardware failure can corrupt a disk pool I'll have to try to find another solution, I'm afraid.
[zfs-discuss] Re: Managed to corrupt my pool
Anyone have any thoughts on this? I'd really like to be able to build a nice ZFS box for file service but if a hardware failure can corrupt a disk pool I'll have to try to find another solution, I'm afraid.

Sorry, I worded this poorly -- if the loss of a disk in a mirror can corrupt the pool it's going to give me pause in implementing a ZFS solution. Jim
[zfs-discuss] Netapp to Solaris/ZFS issues
We have two aging Netapp filers and can't afford to buy new Netapp gear, so we've been looking with a lot of interest at building NFS fileservers running ZFS as a possible future approach. Two issues have come up in the discussion:

- Adding new disks to a RAID-Z pool (Netapps handle adding new disks very nicely). Mirroring is an alternative, but when you're on a tight budget, losing N/2 disk capacity is painful.

- The default scheme of one filesystem per user runs into problems with Linux NFS clients; on one Linux system, with 1300 logins, we already have to do symlinks with amd because Linux systems can't mount more than about 255 filesystems at once. We can of course just have one filesystem exported and make /home/student a subdirectory of that, but then we run into problems with quotas -- and on an undergraduate fileserver, quotas aren't optional!

Neither of these problems is necessarily a showstopper, but both make the transition more difficult. Any progress that could be made with them would help sites like us make the switch sooner.
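For illustration, the per-user-filesystem model that triggers the mount-count problem is precisely what lets ZFS enforce quotas today; dataset names are hypothetical:

# zfs create home/student/jdoe
# zfs set quota=5G home/student/jdoe

With a single exported filesystem there is currently no per-user quota mechanism in ZFS, which is exactly the tension described above.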
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hey Ben - I need more time to look at this and connect some dots, but real quick... Some nfsstat data that we could use to potentially correlate to the local server activity would be interesting. zfs_create() seems to be the heavy hitter, but a periodic kernel profile (especially if we can catch a 97% SYS period) would help:

# lockstat -i997 -Ik -s 10 sleep 60

Alternatively:

# dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()] = count(); }'

It would also be interesting to see what the zfs_create()'s are doing. Perhaps a quick:

# dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'

It would also be interesting to see the network stats. Grab Brendan's nicstat and collect some samples. Your reference to low traffic is in bandwidth, which, as you indicate, is really, really low. But the data, at least up to this point, suggests the workload is not data/bandwidth intensive, but more attribute intensive. Note again zfs_create() is the heavy ZFS function, along with zfs_getattr. Perhaps it's the attribute-intensive nature of the load that is at the root of this. I can spend more time on this tomorrow (traveling today). Thanks, /jim

Ben Rockwood wrote:
I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed, dns was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.
Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc    12/07/2006

10:38:05    %usr    %sys    %wio   %idle
10:38:06       0      27       0      73
10:38:07       0      27       0      73
10:38:09       0      27       0      73
10:38:10       1      26       0      73
10:38:11       0      26       0      74
10:38:12       0      26       0      74
10:38:13       0      24       0      76
10:38:14       0       6       0      94
10:38:15       0       7       0      93
10:38:22       0      99       0       1   --
10:38:23       0      94       0       6   --
10:38:24       0      28       0      72
10:38:25       0      27       0      73
10:38:26       0      27       0      73
10:38:27       0      27       0      73
10:38:28       0      27       0      73
10:38:29       1      30       0      69
10:38:30       0      27       0      73

And so we consider whether or not there is a pattern to the frequency. The following is sar output from any lines in which sys is above 90%:

10:40:04    %usr    %sys    %wio   %idle    Delta
10:40:11       0      97       0       3
10:40:45       0      98       0       2    34 seconds
10:41:02       0      94       0       6    17 seconds
10:41:26       0     100       0       0    24 seconds
10:42:00       0     100       0       0    34 seconds
10:42:25  (end of sample)                   25 seconds

Looking
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper?

It's not clear to me why NFS-enforced synchronous semantics would induce different behavior than the same load applied to a local ZFS. File creates are metadata-intensive, right? And these operations need to be synchronous to guarantee file system consistency (yes, I am familiar with the ZFS COW model). Anyway... I'm feeling rather naive here, but I've seen the "NFS-enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance of metadata-intensive operations when ZFS is the underlying file system, and I never really understood what is "unsynchronous" about doing the same thing to a local ZFS. And yes, there is certainly a network latency component to the NFS configuration, so for any synchronous operation I would expect things to be slower when done over NFS. Awaiting enlightenment :^) /jim
[zfs-discuss] Can't destroy corrupted pool
Ok, so I'm planning on wiping my test pool that seems to have problems with non-spare disks being marked as spares, but I can't destroy it:

# zpool destroy -f zmir
cannot iterate filesystems: I/O error

Anyone know how I can nuke this for good? Jim
[zfs-discuss] Re: Can't destroy corrupted pool
BTW, I'm also unable to export the pool -- same error. Jim
[zfs-discuss] Re: Can't destroy corrupted pool
Nevermind:

# zfs destroy [EMAIL PROTECTED]:28
cannot open '[EMAIL PROTECTED]:28': I/O error

Jim
[zfs-discuss] Re: Can't destroy corrupted pool
You are likely hitting:

6397052 unmounting datasets should process /etc/mnttab instead of traverse DSL

which was fixed in build 46 of Nevada. In the meantime, you can remove /etc/zfs/zpool.cache manually and reboot, which will remove all your pools (which you can then re-import on an individual basis).

I'm running b51, but I'll try deleting the cache. Jim
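For illustration, the suggested workaround looks like this; the pool name "tank" is hypothetical, and the damaged pool is simply never re-imported:

# rm /etc/zfs/zpool.cache
# reboot
(after boot, no pools are configured)
# zpool import            (lists pools visible on attached devices)
# zpool import tank       (re-import each pool you want, leaving the damaged one out)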
[zfs-discuss] Re: Can't destroy corrupted pool
This worked. I've restarted my testing, but I've been fdisking each drive before I add it to the pool, and so far the system is behaving as expected when I spin a drive down, i.e., the hot spare gets used automatically. This makes me wonder if it's possible to ensure that the forced addition of a drive to a pool wipes the drive of any previous data, especially any ZFS metadata. I'll keep the list posted as I continue my tests. Jim
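For illustration, stale ZFS labels can also be cleared by hand before reusing a disk; the device name is hypothetical, and note that ZFS keeps two label copies at the front of the device and two at the end, so zeroing only the start may not be enough:

# dd if=/dev/zero of=/dev/rdsk/c3t4d0s0 bs=1024k count=2

A subsequent zpool create -f then finds no leftover metadata at the front of the device.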
[zfs-discuss] zfs exported a live filesystem
By mistake, I just exported my test filesystem while it was up and being served via NFS, causing my tar over NFS to start throwing stale file handle errors. Should I file this as a bug, or should I just not do that? :-)
[zfs-discuss] Re: zfs exported a live filesystem
For the record, this happened with a new filesystem. I didn't muck about with an old filesystem while it was still mounted; I created a new one, mounted it, and then accidentally exported it.

Except that it doesn't:
# mount /dev/dsk/c1t1d0s0 /mnt
# share /mnt
# umount /mnt
umount: /mnt busy
# unshare /mnt
# umount /mnt
If you umount -f it will though!

Well, sure, but I was still surprised that it happened anyway.

The system is working as designed; the NFS client did what it was supposed to do. If you brought the pool back in again with zpool import, things should have picked up where they left off.

Yep -- an import/shareall made the FS available again.

What's more, you were probably running as root when you did that, so you got what you asked for - there is only so much protection we can give without being annoying!

Sure, but there are still safeguards in place even when running things as root, such as requiring umount -f as above, or warning you when running format on a disk with mounted partitions. Since this appeared to be an operation that may warrant such a safeguard, I thought I'd check and see if this was to be expected or if a safeguard should be put in. Annoying isn't always bad :-)

Now having said that, I personally wouldn't have expected that zpool export should work as easily as that while there were shared filesystems. I would have expected that exporting the pool should attempt to unmount all the ZFS filesystems first - which would have failed without a -f flag because they were shared. So IMO it is a bug, or at least an RFE.

Ok, where should I file an RFE? Jim
[zfs-discuss] Kickstart hot spare attachment
For my latest test I set up a stripe of two mirrors with one hot spare, like so:

zpool create -f -m /export/zmir zmir mirror c0t0d0 c3t2d0 mirror c3t3d0 c3t4d0 spare c3t1d0

I spun down c3t2d0 and c3t4d0 simultaneously, and while the system kept running (my tar over NFS barely hiccuped), the zpool command hung again. I rebooted the machine with -dnq, and although the system didn't come up the first time, it did after an fsck and a second reboot. However, once again the hot spare isn't getting used:

# zpool status -v
  pool: zmir
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Tue Dec 12 09:15:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        zmir        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0t0d0  ONLINE       0     0     0
            c3t2d0  UNAVAIL      0     0     0  cannot open
          mirror    DEGRADED     0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  UNAVAIL      0     0     0  cannot open
        spares
          c3t1d0    AVAIL

A few questions:
- I know I can attach it via the zpool commands, but is there a way to kickstart the attachment process if it fails to attach automatically upon disk failure?
- In this instance the spare is twice as big as the other drives -- does that make a difference?
- Is there something inherent to an old SCSI bus that causes spun-down drives to hang the system in some way, even if it's just hanging the zpool/zfs system calls? Would a thumper be more resilient to this?

Jim
Re: [zfs-discuss] Project Proposal: Availability Suite
Jason J. W. Williams wrote:
Could the replication engine eventually be integrated more tightly with ZFS?

Not in its present form. The architecture and implementation of Availability Suite are driven off block-based replication at the device level (/dev/rdsk/...), something that allows the product to replicate any Solaris file system, database, etc., without any knowledge of what it is actually replicating. To pursue ZFS replication in the manner of Availability Suite, one needs to see what replication looks like from an abstract point of view. Simplistically, remote replication is like the letter 'h', where the left side of the letter is the complete I/O path on the primary node, the horizontal part of the letter is the remote replication network link, and the right side of the letter is only the bottom half of the complete I/O path on the secondary node. ZFS would have to have its functional I/O path split into two halves, a top and a bottom piece. We would then configure replication, the letter 'h', between two given nodes, running both the top and bottom pieces of ZFS on the source node, and just the bottom half of ZFS on the secondary node.

Today, the SNDR component of Availability Suite works like the letter 'h', where we split the Solaris I/O stack into a top and bottom half. The top half is that software (file system, database or application I/O) that directs its I/Os to the bottom half (raw device, volume manager or block device). So all that needs to be done is to design and build a new variant of the letter 'h', and find the place to separate ZFS into two pieces. - Jim Dunham

That would be a slick alternative to send/recv. Best Regards, Jason

On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote:
Project Overview: I propose the creation of a project on opensolaris.org, to bring to the community two Solaris host-based data services; namely volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, a Solaris 8, 9 and 10 unbundled product set, consisting of Instant Image (II) and Network Data Replicator (SNDR).

Project Description: Although Availability Suite is typically known as just two data services (II and SNDR), there is an underlying Solaris I/O filter driver framework which supports these two data services. This framework provides the means to stack one or more block-based, pseudo device drivers on to any pre-provisioned cb_ops structure [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver (for driver-specific processing), then out the bottom of this filter driver, back into the original cb_ops entry points. Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database or application layer. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and also application I/Os. By providing a filter driver at this point in the Solaris I/O stack, it allows for any number of data services to be implemented, without regard to the underlying block storage that they will be configured on.
Today, as a snapshot and/or replication solution, the framework allows both the source and destination block storage devices to differ not only in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also in logical characteristics such as RAID type, volume-managed storage (i.e., SVM, VxVM), lofi, zvols, even ram disks.

Community Involvement: By providing this filter-driver framework, two working filter drivers (II and SNDR), and an extensive collection of supporting software and utilities, it is envisioned that those individuals and companies that adopt OpenSolaris as a viable storage platform will also utilize and enhance the existing II and SNDR data services, plus have offered to them the means with which to develop their own block-based filter driver(s), further enhancing the use and adoption of OpenSolaris. A very timely example that is very applicable to Availability Suite and the OpenSolaris community is the recent announcement of the Project Proposal: lofi [ compression & encryption ] - http://www.opensolaris.org/jive/click.jspa?messageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable to not only offer compression and encryption for lofi devices (as already proposed), but, by collectively leveraging these two projects, to create the means to support file systems, databases and applications across all block-based storage devices. Since Availability
[zfs-discuss] Re: ZFS panics system during boot, after 11/06 upgrade
There are ZFS file systems. There are no zones. Any help would be greatly appreciated; this is my everyday computer.

Take a look at page 167 of the admin guide: http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf You need to delete /etc/zfs/zpool.cache, and then use zpool import to recover. Cheers, Jim
Re: [zfs-discuss] Project Proposal: Availability Suite
Jason, Thank you very much for the detailed explanation. It is very helpful to understand the issue. Is anyone successfully using SNDR with ZFS yet?

Of the opportunities I've been involved with, the answer is yes, but so far I've not seen SNDR with ZFS in a production environment - although that does not mean such deployments don't exist. It was not until late June '06 that AVS 4.0, Solaris 10 and ZFS were generally available, and to date AVS has not been made available for the Solaris Express Community Release, but it will be real soon.

While I have your attention, there are two issues between ZFS and AVS that need mentioning.

1) When ZFS is given an entire LUN to place in a ZFS storage pool, ZFS detects this, enabling SCSI write caching on the LUN, and also opens the LUN with exclusive access, preventing other data services (like AVS) from accessing this device. The workaround is to manually format the LUN, typically placing all the blocks into a single partition, and then just place this partition into the ZFS storage pool. ZFS detects that it does not own the entire LUN, so it doesn't enable write caching, which means it also doesn't open the LUN with exclusive access; therefore AVS and ZFS can share the same LUN. I thought about submitting an RFE to have ZFS provide a means to override this restriction, but I am not 100% certain that a ZFS filesystem directly accessing a write-cache-enabled LUN is the same thing as a replicated ZFS filesystem accessing a write-cache-enabled LUN. Even though AVS is write-order consistent, there are disaster recovery scenarios which, when enacted, issue block-ordered rather than write-ordered I/Os.

2) One has to be very cautious in using zpool import -f (forced import), especially on a LUN or LUNs into which SNDR is actively replicating. If ZFS complains that the storage pool was not cleanly exported when issuing a zpool import, and one attempts a zpool import -f without checking the active replication state, they are sure to panic Solaris. Of course this failure scenario is no different from accessing a LUN or LUNs on dual-ported or SAN-based storage while another Solaris host is still accessing the ZFS filesystem, or from controller-based replication; these are all just different operational scenarios of the same issue: data blocks changing out from underneath the ZFS filesystem and its checksum-checking mechanisms.

Jim

Best Regards, Jason
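For illustration, the workaround in (1) boils down to giving ZFS a slice instead of the whole disk; the device names are hypothetical:

# format c1t0d0      (in the partition menu, put all blocks in slice 0 and label the disk)
# zpool create tank c1t0d0s0

Because the pool member is c1t0d0s0 rather than c1t0d0, ZFS leaves the write cache alone and does not take exclusive access, so SNDR can be configured on the same LUN.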
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Ben, I've been playing with replication of a ZFS Zpool using the recently released AVS. I'm pleased with things, but just replicating the data is only part of the problem. The big question is: can I have a zpool open in 2 places? No. The ability to have a zpool open in two places would require a shared ZFS. The semantics of remote replication can be viewed as those of two Solaris hosts looking at the same SAN or dual-ported storage. Today, ZFS detects this with both SNDR and shared storage, as part of zpool import, warning that the pool is active elsewhere. What I really want is a Zpool on node1 open and writable (production storage) and replicated to node2 where it's open for read-only access (standby storage). The best you can do for this is to use the II portion of Availability Suite to take a snapshot of the active SNDR replica on the remote node, getting a snapshot of the ZFS filesystem being replicated. Without this, ZFS on the remote node will see and detect replicated disk blocks changing in the zpool it is reading from. This is an old problem. I'm not sure it's remotely possible. It's bad enough with UFS, but ZFS maintains a hell of a lot more meta-data. How is node2 supposed to know that a snapshot has been created, for instance? With UFS you can at least get by some of these problems using directio, but that's not an option with a zpool. I know this is a fairly remedial issue to bring up... but if I think about what I want Thumper-to-Thumper replication to look like, I want 2 usable storage systems. As I see it now the secondary storage (node2) is useless until you break replication and import the pool, do your thing, and then re-sync storage to re-enable replication. Am I missing something? I'm hoping there is an option I'm not aware of. No. Also just to be clear, after you ... do your thing, and then re-sync storage ..., the re-sync either keeps all of the data on the SNDR primary OR keeps all the data on the SNDR secondary. There is no means to combine writes that occurred in two separate ZFS filesystems back into one filesystem. The remote ZFS filesystem is essentially a clone of the original filesystem, and once a write I/O occurs to either side, the two filesystems take on a life of their own. Of course this is not unique to the ZFS filesystem, as the same is true for all others, and this underlying storage behavior is not unique to SNDR, as it happens with other host-based replication and controller-based replication. Jim benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
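A sketch of the II approach on the remote node, assuming the standard AVS iiadm usage (the volume paths and set layout here are hypothetical):
# iiadm -e ind /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t1d0s0 /dev/rdsk/c2t2d0s0
  (master = the SNDR secondary volume, then shadow, then bitmap)
# ... read point-in-time consistent data from the shadow volume ...
# iiadm -u s /dev/rdsk/c2t1d0s0   (later, refresh the shadow from the master)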
Re: [zfs-discuss] Project Proposal: Availability Suite
Frank, On Fri, 2 Feb 2007, Torrey McMahon wrote: Jason J. W. Williams wrote: Hi Jim, Thank you very much for the heads up. Unfortunately, we need the write-cache enabled for the application I was thinking of combining this with. Sounds like SNDR and ZFS need some more soak time together before you can use both to their full potential together? Well...there is the fact that SNDR works with filesystems other than ZFS. (Yes, I know this is the ZFS list.) Working around architectural issues for ZFS and ZFS alone might cause issues for others. SNDR has some issues with logging UFS as well. If you start a SNDR live copy on an active logging UFS (not _writelocked_), the UFS log state may not be copied consistently. Treading very carefully, UFS logging may have issues with being replicated, not the other way around. SNDR replication (after synchronizing) maintains a write-order consistent volume; thus if there is an issue with UFS logging being able to access an SNDR secondary, then UFS logging will also have issues with accessing a volume after Solaris crashes. The end result of Solaris crashing, or SNDR replication stopping, is a write-ordered, crash-consistent volume. Given that both UFS logging and SNDR are (near) perfect (or there would be a flood of escalations), the cause of this issue, in all cases I've seen to date, is that the SNDR primary volume being replicated is mounted with UFS logging enabled, but the SNDR secondary is not mounted with UFS logging enabled. Once this condition happens, the problem can be resolved by fixing /etc/vfstab to correct the inconsistent mount options, and then performing an SNDR update sync. If you want a live remote replication facility, it _NEEDS_ to talk to the filesystem somehow. There must be a callback mechanism that the filesystem could use to tell the replicator: from exactly now on, you start replicating. The only entity which can truly give this signal is the filesystem itself. There is an RFE against SNDR for something called in-line PIT. I hope that this work will get done soon. And no, that's _not_ when the filesystem does a flush write cache ioctl, or when the user has just issued a sync command or similar. For ZFS, it'd be when a ZIL transaction is closed (as I understand it); for UFS, it'd be when the UFS log is fully rolled. There's no notification to external entities when these two events happen. Because ZFS is always on-disk consistent, this is not an issue. So far in ALL my testing with replicating ZFS with SNDR, I have not seen ZFS fail! Of course be careful to not confuse my stated position with another closely related scenario: that being accessing ZFS on the remote node via a forced import (zpool import -f name) with active SNDR replication, as ZFS is sure to panic the system. ZFS, unlike other filesystems, has 0% tolerance for corrupted metadata. Jim SNDR tries its best to achieve this detection, but without actually _stopping_ all I/O (on UFS: writelocking), there's a window of vulnerability still open. And SNDR/II don't stop filesystem I/O - by basic principle. That's how they're sold/advertised/intended to be used. I'm all willing to see SNDR/II go open - we could finally work these issues! FrankH. I think the best-of-both-worlds approach would be to let SNDR plug in to ZFS along the same lines the crypto stuff will be able to plug in, different compression types, etc. There once was a slide that showed how that worked... or I'm hallucinating again.
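Before any import on the remote node, the replication state can be checked and quiesced first; a minimal sketch using the standard sndradm commands (the pool name is a placeholder, and the exact output is illustrative):
# sndradm -P          (print the state of each SNDR set; look for logging vs. replicating)
# sndradm -l          (if a set is still replicating, place it into logging mode first)
# zpool import tank   (only then attempt the import)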
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Robert, Hello Ben, Monday, February 5, 2007, 9:17:01 AM, you wrote: BR I've been playing with replication of a ZFS Zpool using the BR recently released AVS. I'm pleased with things, but just BR replicating the data is only part of the problem. The big BR question is: can I have a zpool open in 2 places? BR What I really want is a Zpool on node1 open and writable BR (production storage) and replicated to node2 where it's open for BR read-only access (standby storage). BR This is an old problem. I'm not sure it's remotely possible. It's BR bad enough with UFS, but ZFS maintains a hell of a lot more BR meta-data. How is node2 supposed to know that a snapshot has been BR created, for instance? With UFS you can at least get by some of BR these problems using directio, but that's not an option with a zpool. BR I know this is a fairly remedial issue to bring up... but if I BR think about what I want Thumper-to-Thumper replication to look BR like, I want 2 usable storage systems. As I see it now the BR secondary storage (node2) is useless until you break replication BR and import the pool, do your thing, and then re-sync storage to re-enable replication. BR Am I missing something? I'm hoping there is an option I'm not aware of. You can't mount rw on one node and ro on another (not to mention that zfs doesn't let you import pools read-only right now). You can mount the same file system, like UFS, in RO on both nodes, but not ZFS (no ro import). One cannot just mount a filesystem in RO mode if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, except of course shared-reader QFS, this will fail given time. Even if one has the means to mount a filesystem with DIRECTIO (no caching) and READ-ONLY (no writes), it does not prevent a filesystem from looking at the contents of block A and then acting on block B. The reason is that during replication at time T1, both blocks A and B could be written and be consistent with each other. Next the file system reads block A. Now replication at time T2 updates blocks A and B, also consistent with each other. Next the file system reads block B and panics due to an inconsistency only it sees between old A and new B. I know this for a fact, since a forced zpool import -f name is a common instance of this exact failure, most likely due to checksum failures between metadata blocks A and B. Of course using an instantly accessible II snapshot of an SNDR secondary volume would work just fine, since the data being read is now point-in-time consistent, and static. - Jim I believe what you really need is a continuous 'zfs send' feature. We are developing something like this right now. I expect we can give more details really soon now. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Ben Rockwood wrote: Jim Dunham wrote: Robert, Hello Ben, Monday, February 5, 2007, 9:17:01 AM, you wrote: BR I've been playing with replication of a ZFS Zpool using the BR recently released AVS. I'm pleased with things, but just BR replicating the data is only part of the problem. The big BR question is: can I have a zpool open in 2 places? BR What I really want is a Zpool on node1 open and writable BR (production storage) and replicated to node2 where it's open for BR read-only access (standby storage). BR This is an old problem. I'm not sure it's remotely possible. It's BR bad enough with UFS, but ZFS maintains a hell of a lot more BR meta-data. How is node2 supposed to know that a snapshot has been BR created, for instance? With UFS you can at least get by some of BR these problems using directio, but that's not an option with a zpool. BR I know this is a fairly remedial issue to bring up... but if I BR think about what I want Thumper-to-Thumper replication to look BR like, I want 2 usable storage systems. As I see it now the BR secondary storage (node2) is useless until you break replication BR and import the pool, do your thing, and then re-sync storage to re-enable replication. BR Am I missing something? I'm hoping there is an option I'm not aware of. You can't mount rw on one node and ro on another (not to mention that zfs doesn't let you import pools read-only right now). You can mount the same file system, like UFS, in RO on both nodes, but not ZFS (no ro import). One cannot just mount a filesystem in RO mode if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, except of course shared-reader QFS, this will fail given time. Even if one has the means to mount a filesystem with DIRECTIO (no caching) and READ-ONLY (no writes), it does not prevent a filesystem from looking at the contents of block A and then acting on block B. The reason is that during replication at time T1, both blocks A and B could be written and be consistent with each other. Next the file system reads block A. Now replication at time T2 updates blocks A and B, also consistent with each other. Next the file system reads block B and panics due to an inconsistency only it sees between old A and new B. I know this for a fact, since a forced zpool import -f name is a common instance of this exact failure, most likely due to checksum failures between metadata blocks A and B. Ya, that bit me last night.
'zpool import' shows the pool fine, but when you force the import you panic: Feb 5 07:14:10 uma ^Mpanic[cpu0]/thread=fe8001072c80: Feb 5 07:14:10 uma genunix: [ID 809409 kern.notice] ZFS: I/O failure (write on unknown off 0: zio fe80c54ed380 [L0 unallocated] 400L/200P DVA[0]=0:36000:200 DVA[1]=0:9c0003800:200 DVA[2]=0:20004e00:200 fletcher4 lzjb LE contiguous birth=57416 fill=0 cksum=de2e56ffd:5591b77b74b:1101a91d58dfc:252efdf22532d0): error 5 Feb 5 07:14:11 uma unix: [ID 10 kern.notice] Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a40 zfs:zio_done+140 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ab0 zfs:zio_wait_for_children+5d () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ad0 zfs:zio_wait_children_done+20 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072af0 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b40 zfs:zio_vdev_io_assess+129 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bb0 zfs:vdev_mirror_io_done+2af () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bd0 zfs:zio_vdev_io_done+26 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c60 genunix:taskq_thread+1a7 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c70 unix:thread_start+8 () Feb 5 07:14:11 uma unix: [ID 10 kern.notice] So without using II, what's the best method of bringing up the secondary storage? Is just dropping the primary into logging acceptable? Yes, placing SNDR in logging mode stops the replication of writes. Also, performing a zpool export on the primary node and waiting (sndradm -w) until all writes are replicated means that on the SNDR secondary node a zpool import can be done without using the -f, as a forced import is not needed, since the zpool export operation got replicated. Be sure to remember to zpool export on the remote node before resuming replication on the primary node, or another panic will likely occur. Jim benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
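That clean hand-off, end to end, as a sketch (the pool name tank is a placeholder; sndradm -w is the wait described above, and the concluding update sync assumes you intend to keep the primary's data):
On the primary node:
# zpool export tank
# sndradm -w             (wait until all queued writes reach the secondary)
On the secondary node:
# zpool import tank      (no -f needed; the export was replicated)
# ... do your thing ...
# zpool export tank      (export again before replication resumes)
Back on the primary node:
# sndradm -u             (update re-sync to resume replication)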
[zfs-discuss] FROSUG February Meeting Announcement (2/22/2007)
This month's FROSUG (Front Range OpenSolaris User Group) meeting is on Thursday, February 22, 2007. Our presentation is ZFS as a Root File System by Lori Alt. In addition, Jon Bowman will be giving an OpenSolaris Update, and we will also be doing an InstallFest. So, if you want help installing an OpenSolaris distribution, back up your laptop and bring it to the meeting! About the presentation(s): One of the next steps in the evolution of ZFS is to enable its use as a root file system. This presentation will focus on how booting from ZFS will work, how installation will be affected by ZFS's feature set, and the many advantages that will result from being able to use ZFS as a root file system. The presentation(s) will be posted here prior to the meeting: http://www.opensolaris.org/os/community/os_user_groups/frosug/ About our presenter(s): Lori Alt is a Staff Engineer at Sun Microsystems, where she has worked since 1991. Lori worked on Solaris install and upgrade and then on UFS, where she led the multi-terabyte UFS project. She has Bachelor's and Master's degrees in computer science from Washington University in St. Louis, MO. - Meeting Details: When: Thursday, February 22, 2007 Times: 6:00pm - 6:30pm Doors open and Pizza 6:30pm - 6:45pm OpenSolaris Update (Jon Bowman) 6:45pm - 8:30pm ZFS as a Root File System (Lori Alt) Where: Sun Broomfield Campus Building 1 - Conference Center 500 Eldorado Blvd. Broomfield, CO 80021 Note: The location of this meeting may change. We will send out an additional email prior to the meeting if this happens. Pizza and soft drinks will be served at the beginning of the meeting. Please RSVP to frosug-rsvp(AT)opensolaris(DOT)org in order to help us plan for food and setup access to the Sun campus. We hope to see you there! Thanks, FROSUG +++ Future Meeting Plans: March 29, 2007: Doug McCallum presents sharemgr This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] UPDATE: FROSUG February Meeting (2/22/2007)
***Meeting Update*** We will be having this month's meeting at the Omni Interlocken Resort in Broomfield, and a conference call number is being provided for those who cannot make the meeting in person; see Meeting Details below for more information. In addition, we will be discussing Solaris Express Developer Edition during the OpenSolaris Update and providing free SXDE DVDs. Hope to see you there. This month's meeting is getting a lot of interest! ***Meeting Update*** This month's FROSUG (Front Range OpenSolaris User Group) meeting is on Thursday, February 22, 2007. Our presentation is ZFS as a Root File System by Lori Alt. In addition, Jon Bowman will be giving an OpenSolaris Update, and we will also be doing an InstallFest. So, if you want help installing Solaris Express Developer Edition, back up your laptop and bring it to the meeting! About the presentation: One of the next steps in the evolution of ZFS is to enable its use as a root file system. This presentation will focus on how booting from ZFS will work, how installation will be affected by ZFS's feature set, and the many advantages that will result from being able to use ZFS as a root file system. The presentation will be posted here prior to the meeting: http://www.opensolaris.org/os/community/os_user_groups/frosug/ About our presenter: Lori Alt is a Staff Engineer at Sun Microsystems, where she has worked since 1991. Lori worked on Solaris install and upgrade and then on UFS, where she led the multi-terabyte UFS project. She has Bachelor's and Master's degrees in computer science from Washington University in St. Louis, MO. - Meeting Details When: Thursday, February 22, 2007 Times: 6:00pm - 6:30pm Food and Drinks 6:30pm - 6:45pm OpenSolaris Update (Jon Bowman) 6:45pm - 8:30pm ZFS as a Root File System (Lori Alt) Where: Omni Interlocken Resort (Fir Conference Room) 500 Interlocken Blvd. Broomfield, CO 80021 Conference Call Information US: 866-545-5198 INTL: 865-521-8904 Access Code: 5518835 - The meeting is free and open to the public. Snacks and soft drinks will be served at the beginning of the meeting. Please RSVP to frosug-rsvp(AT)opensolaris(DOT)org in order to help us plan for food. We hope to see you there! Thanks, FROSUG - Future Meeting Plans: March 29, 2007: Doug McCallum presents sharemgr If you have ideas for meeting topics, send them to: ug-frosug(AT)opensolaris(DOT)org This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
You don't honestly, really, reasonably, expect someone, anyone, to look at the stack trace of a few hundred threads, and post something along the lines of This is what is wrong with your NFS server. Do you? Without any other information at all? We're here to help, but please reset your expectations around our abilities to root-cause pathological behavior based on almost no information. What size and type of server? What size and type of storage? What release of Solaris? How many networks, and what type? What is being used to generate the load for the testing? What is the zpool configuration? What do the system stats look like while under load (e.g. mpstat), and how do they change when you see this behavior? What does the zpool iostat zpool_name 1 output look like while under load? Are you collecting nfsstat data - what is the rate of incoming NFS ops? Can you characterize the load - read/write data intensive, metadata intensive? Are the client machines Solaris, or something else? Does this last for seconds, minutes, tens-of-minutes? Does the system remain in this state indefinitely until reboot, or does it normalize? Can you consistently reproduce this problem? /jim Leon Koll wrote: Hello, gurus, I need your help. During the benchmark test of NFS-shared ZFS file systems, at some moment the number of NFS threads jumps to the maximal value, 1027 (NFSD_SERVERS was set to 1024). The latency also grows and the number of IOPS is going down. I've collected the output of echo "::pgrep nfsd | ::walk thread | ::findstack -v" | mdb -k that can be seen here: http://tinyurl.com/yrvn4z Could you please look at it and tell me what's wrong with my NFS server. Appreciate, -- Leon This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
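As a starting point, a sketch of the kind of data worth collecting while the benchmark runs (these are the standard Solaris utilities; the pool name is a placeholder):
# mpstat 5                   (per-CPU utilization; watch for changes when the thread count jumps)
# zpool iostat mypool 5      (pool-level IOPS and bandwidth)
# iostat -xnz 5              (per-device latency and queueing)
# nfsstat -s                 (sample before and after to compute the rate of incoming NFS ops)
Capturing each of these across the moment the thread count jumps answers most of the questions above.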
Re: [zfs-discuss] C'mon ARC, stay small...
FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . . c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
How/when did you configure arc_c_max? Immediately following a reboot, I set arc.c_max using mdb, then verified by reading the arc structure again. arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Yep. I'm using an x4500 in-house to sort out performance of a customer test case that uses mmap. We acquired the new DIMMs to bring the x4500 to 32GB, since the workload has a 64GB working set size, and we were clobbering a 16GB thumper. We wanted to see how doubling memory may help. I'm trying to clamp the ARC size because for mmap-intensive workloads, it seems to hurt more than help (although, based on experiments up to this point, it's not hurting a lot). I'll do another reboot, and run it all down for you serially... /jim Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: ARC_mru::print -d size lsize size = 0t10224433152 lsize = 0t10218960896 ARC_mfu::print -d size lsize size = 0t303450112 lsize = 0t289998848 ARC_anon::print -d size size = 0 So it looks like the MRU is running at 10GB... What does this tell us? Thanks, /jim [EMAIL PROTECTED] wrote: This seems a bit strange. What's the workload, and also, what's the output for: ARC_mru::print size lsize ARC_mfu::print size lsize and ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . . c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
Following a reboot: arc::print -tad { . . . c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t33176457216 . . . } c02e2a08 /Z 0x20000000 --- set c_max to 512MB arc+0x48: 0x7b9789000 = 0x20000000 arc::print -tad { . . . c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 - c_max is 512MB . . . } ARC_mru::print -d size lsize size = 0t294912 lsize = 0t32768 Run the workload a couple times... c02e29e8 uint64_t size = 0t27121205248 --- ARC size is 27GB c02e29f0 uint64_t p = 0t10551351442 c02e29f8 uint64_t c = 0t27121332576 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 - c_max is 512MB ARC_mru::print -d size lsize size = 0t223985664 lsize = 0t221839360 ARC_mfu::print -d size lsize size = 0t26897219584 -- MFU list is almost 27GB ... lsize = 0t26869121024 Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
Will try that now... /jim [EMAIL PROTECTED] wrote: I suppose I should have been more forward about making my last point. If the arc_c_max isn't set in /etc/system, I don't believe that the ARC will initialize arc.p to the correct value. I could be wrong about this; however, next time you set c_max, set c to the same value as c_max and set p to half of c. Let me know if this addresses the problem or not. -j How/when did you configure arc_c_max? Immediately following a reboot, I set arc.c_max using mdb, then verified by reading the arc structure again. arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Yep. I'm using an x4500 in-house to sort out performance of a customer test case that uses mmap. We acquired the new DIMMs to bring the x4500 to 32GB, since the workload has a 64GB working set size, and we were clobbering a 16GB thumper. We wanted to see how doubling memory may help. I'm trying to clamp the ARC size because for mmap-intensive workloads, it seems to hurt more than help (although, based on experiments up to this point, it's not hurting a lot). I'll do another reboot, and run it all down for you serially... /jim Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: ARC_mru::print -d size lsize size = 0t10224433152 lsize = 0t10218960896 ARC_mfu::print -d size lsize size = 0t303450112 lsize = 0t289998848 ARC_anon::print -d size size = 0 So it looks like the MRU is running at 10GB... What does this tell us? Thanks, /jim [EMAIL PROTECTED] wrote: This seems a bit strange. What's the workload, and also, what's the output for: ARC_mru::print size lsize ARC_mfu::print size lsize and ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . . c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max.
Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
All righty...I set c_max to 512MB, c to 512MB, and p to 256MB... arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t33176457216 ... } c02e2a08 /Z 0x20000000 arc+0x48: 0x7b9789000 = 0x20000000 c02e29f8 /Z 0x20000000 arc+0x38: 0x7b9789000 = 0x20000000 c02e29f0 /Z 0x10000000 arc+0x30: 0x3dcbc4800 = 0x10000000 arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t268435456 -- p is 256MB c02e29f8 uint64_t c = 0t536870912 -- c is 512MB c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB ... } After a few runs of the workload ... arc::print -d size size = 0t536788992 Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Thanks! /jim [EMAIL PROTECTED] wrote: I suppose I should have been more forward about making my last point. If the arc_c_max isn't set in /etc/system, I don't believe that the ARC will initialize arc.p to the correct value. I could be wrong about this; however, next time you set c_max, set c to the same value as c_max and set p to half of c. Let me know if this addresses the problem or not. -j How/when did you configure arc_c_max? Immediately following a reboot, I set arc.c_max using mdb, then verified by reading the arc structure again. arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Yep. I'm using an x4500 in-house to sort out performance of a customer test case that uses mmap. We acquired the new DIMMs to bring the x4500 to 32GB, since the workload has a 64GB working set size, and we were clobbering a 16GB thumper. We wanted to see how doubling memory may help. I'm trying to clamp the ARC size because for mmap-intensive workloads, it seems to hurt more than help (although, based on experiments up to this point, it's not hurting a lot). I'll do another reboot, and run it all down for you serially... /jim Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: ARC_mru::print -d size lsize size = 0t10224433152 lsize = 0t10218960896 ARC_mfu::print -d size lsize size = 0t303450112 lsize = 0t289998848 ARC_anon::print -d size size = 0 So it looks like the MRU is running at 10GB... What does this tell us? Thanks, /jim [EMAIL PROTECTED] wrote: This seems a bit strange. What's the workload, and also, what's the output for: ARC_mru::print size lsize ARC_mfu::print size lsize and ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . .
c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS with raidz
(I'm probably not the best person to answer this, but that has never stopped me before, and I need to give Richard Elling a little more time to get the Goats, Cows and Horses fed, sip his morning coffee, and offer a proper response...) Would it benefit us to have the disk be setup as a raidz along with the hardware raid 5 that is already setup too? Way back when, we called such configurations plaiding, which described a host-based RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had potentially better data availability with a configuration that could survive more failure modes. Alternatively, we used the hardware RAID for the availability configuration (hardware RAID 5), and used host-based RAID to stripe across hardware RAID5 LUNs for performance. Seemed to work pretty well. In theory, a raidz pool spread across some number of underlying hardware raid 5 LUNs would offer protection against more failure modes, such as the loss of an entire raid5 LUN. So from a failure protection/data availability point of view, it offers some benefit. Now, as to whether or not you experience a real, measurable benefit over time is hard to say. Each additional level of protection/redundancy has a diminishing return, oftentimes at a dramatic incremental cost (e.g. getting from four nines to five nines). Or with this double raid slow our performance with both a software and hardware raid setup? You will certainly pay a performance penalty - using raidz across the raid5 luns will reduce deliverable IOPS from the raid 5 luns. Whether or not the performance trade-off is worth the RAS gain varies based on your RAS and data availability requirements. Or would raidz setup be better than the hardware raid5 setup? Assuming a robust raid5 implementation with battery-backed nvram (protecting against the write hole and partial stripe writes), I think a raidz zpool covers more of the datapath than a hardware raid 5 LUN, but I'll wait for Richard to elaborate here (or tell me I'm wrong). Also if we do set the disks as a raidz would it benefit us more if we specified each disk in the raidz or create them as LUNs then specify the setup in raidz. Isn't this the same question as the first question? I'm not sure what you're asking here... The questions you're asking are good ones, and date back to the decades-old struggle around configuration tradeoffs for performance / availability / cost. My knee-jerk reaction is that one level of RAID, either hardware raid5 or ZFS raidz, is sufficient for availability, and keeps things relatively simple (and simple also improves RAS). The advantage host-based RAID has always had over hardware RAID is the ability to create software LUNs (like a raidz1 or raidz2 zpool) across physical disk controllers, which may also cross SAN switches, etc. So, 'twas me, I'd go with non-hardware-RAID5 devices from the storage frame, and create raidz1 or raidz2 zpools across controllers. But, that's me... :^) /jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
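For concreteness, a sketch of the two layouts being weighed (the device names are hypothetical; each cNt0d0 below stands for a hardware RAID-5 LUN exported by the array, ideally spread across controllers):
# raidz across the hardware RAID-5 LUNs (the plaiding layout; survives the loss of a whole LUN)
# zpool create tank raidz c2t0d0 c3t0d0 c4t0d0 c5t0d0
# or a plain stripe across the same LUNs (availability left entirely to the array)
# zpool create tank c2t0d0 c3t0d0 c4t0d0 c5t0d0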
[zfs-discuss] The value of validating your backups...
http://www.cnn.com/2007/US/03/20/lost.data.ap/index.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS with raidz
Hi Kory - Your problem came our way through other Sun folks a few days ago, and I wish I had that magic setting to help, but the reality is that I'm not aware of anything that will improve the time required to mount 12k file systems. I would add (not that this helps) that I'm not convinced this problem is unique to ZFS, but I do not have experience or empirical data on mount time for 12k UFS, QFS, ext4, etc., file systems. There is an RFE filed on this: http://bugs.opensolaris.org/view_bug.do?bug_id=6478980 As I said, I wish I had a better answer. Thanks, /jim Kory Wheatley wrote: Currently we are trying to set up ZFS file systems for all our user accounts under /homea /homec /homef /homei /homem /homep /homes and /homet. Right now on our Sun Fire v890 with 4 dual-core processors and 16gb of memory we have 12,000 ZFS file systems set up, which Sun has promised will work, but we didn't know that it would take over an hour to do a reboot on this machine to mount and umount all these file systems. What we're trying to accomplish is the best performance along with the best data protection. Sun says that ZFS supports millions of file systems, but what they left out is how long it takes to do a reboot when you have thousands of file systems. Currently we have three LUNs on our EMC disk array from which we've created one ZFS storage pool, and we've created these 12,000 ZFS file systems in this pool. We really don't want to have to go to UFS to create our student accounts. We like the flexibility of ZFS, but the slow boot process will kill us when we have to implement patches that require a reboot. These ZFS file systems will contain all the student data, so reliability and performance are key for us. Do you know a way, or a different setup for ZFS, to allow our system to boot up faster? I know each mount takes up memory, so that's part of the slowness when mounting and umounting. We know when the system is up that the kernel is using 3gb of memory out of the 16gb, and there's nothing else on this box right now but ZFS. There's no data in those thousands of file systems yet. Richard Elling wrote: Jim Mauro wrote: (I'm probably not the best person to answer this, but that has never stopped me before, and I need to give Richard Elling a little more time to get the Goats, Cows and Horses fed, sip his morning coffee, and offer a proper response...) chores are done, wading through the morning e-mail... Would it benefit us to have the disk be setup as a raidz along with the hardware raid 5 that is already setup too? Way back when, we called such configurations plaiding, which described a host-based RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had potentially better data availability with a configuration that could survive more failure modes. Alternatively, we used the hardware RAID for the availability configuration (hardware RAID 5), and used host-based RAID to stripe across hardware RAID5 LUNs for performance. Seemed to work pretty well. Yep, there are various ways to do this and, in general, the more copies of the data you have, the better reliability you have. Space is also fairly easy to calculate. Performance can be tricky, and you may need to benchmark with your workload to see which is better, due to the difficulty in modeling such systems. In theory, a raidz pool spread across some number of underlying hardware raid 5 LUNs would offer protection against more failure modes, such as the loss of an entire raid5 LUN.
So from a failure protection/data availability point of view, it offers some benefit. Now, as to whether or not you experience a real, measurable benefit over time is hard to say. Each additional level of protection/redundancy has a diminishing return, oftentimes at a dramatic incremental cost (e.g. getting from four nines to five nines). If money were no issue, I'm sure we could come up with an awesome solution :-) Or with this double raid slow our performance with both a software and hardware raid setup? You will certainly pay a performance penalty - using raidz across the raid5 luns will reduce deliverable IOPS from the raid 5 luns. Whether or not the performance trade-off is worth the RAS gain varies based on your RAS and data availability requirements. Fast, inexpensive, reliable: pick two. Or would raidz setup be better than the hardware raid5 setup? Assuming a robust raid5 implementation with battery-backed nvram (protecting against the write hole and partial stripe writes), I think a raidz zpool covers more of the datapath than a hardware raid 5 LUN, but I'll wait for Richard to elaborate here (or tell me I'm wrong). In general, you want the data protection in the application, or as close to the application as you can get. Since programmers tend to be lazy (Gosling said it, not me! :-), most rely on the file system and underlying constructs to ensure data protection. So
[zfs-discuss] REMINDER: FROSUG March Meeting Announcement (3/29/2007)
== Reminder: this meeting is tomorrow == Also, we will briefly talk about the Project Blackbox tour that is coming to the Denver area April 12-13. More information is at: http://www.sun.com/emrkt/blackbox == Reminder: this meeting is tomorrow == This month's FROSUG (Front Range OpenSolaris User Group) meeting is on Thursday, March 29, 2007. Our presentation is on Sharemgr by Doug McCallum. In addition, we will be giving an OpenSolaris Update, and will be having an InstallFest. So, if you want help installing an OpenSolaris distribution, back up your laptop and bring it to the meeting! !! We will be providing FREE Solaris Express Developer Edition DVDs. !! About the presentation: The sharemgr project is a framework for managing file sharing servers. It provides a mechanism to manage groups of shares as a single object and integrates share and group configuration into the Service Management Facility (SMF). The presentation has been posted on the frosug web page: http://www.opensolaris.org/os/community/os_user_groups/frosug/ About our presenter: Doug McCallum has been an engineer at Sun for more than 15 years. He has worked on a variety of Solaris projects including the original Solaris x86 port, networking, device support and volume management. More recently he has been working on improving the manageability of file sharing. - Meeting Details: When: Thursday, March 29, 2007 Times: 6:00pm - 6:30pm Doors open and Pizza 6:30pm - 6:45pm OpenSolaris Update (Jim Walker) 6:45pm - 8:30pm Sharemgr (Doug McCallum) Where: Sun Broomfield Campus Building 1 - Conference Center 500 Eldorado Blvd. Broomfield, CO 80021 The meeting is free and open to the public. Pizza and soft drinks will be served at the beginning of the meeting. Please RSVP to frosug-rsvp(AT)opensolaris(DOT)org in order to help us plan for food and setup access to the Sun campus. We hope to see you there! Thanks, FROSUG - Future Meeting Plans: April 2007: Dave McLoughlin (OpenLogic) presents Open Source Management May 2007: SunStudio Compiler If you have ideas for meeting topics, send them to: ug-frosug(AT)opensolaris(DOT)org This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
So you're not really sure it's the ARC growing, but only that the kernel is growing to 6.8GB. Print the arc values via mdb: # mdb -k Loading modules: [ unix krtld genunix specfs dtrace uppc scsi_vhci ufs ip hook neti sctp arp usba nca lofs zfs random sppp crypto ptm ipc ] arc::print -t size c p c_max uint64_t size = 0x2a8000 uint64_t c = 0x1cdfe800 uint64_t p = 0xe707400 uint64_t c_max = 0x1cdfe800 Is size = c_max? Assuming it is, you need to look through the kmastats and see where the kernel memory is being used (again, inside mdb): ::kmastat The above generates a LOT of output that's not completely painless to parse, but it's not too bad either. If you think it's DNLC related, you can monitor the number of entries with: # kstat -p unix:0:dnlcstats:dir_entries_cached_current unix:0:dnlcstats:dir_entries_cached_current 9374 # You can also monitor kernel memory for the dnlc (just using grep with the kmastat in mdb): ::kmastat ! grep dnlc dnlc_space_cache 16 104 254 4096 104 0 The 5th column starting from the left is mem in use, in this example 4096. I'm not sure if the dnlc_space_cache represents all of the kernel memory used for the dnlc. It might, but I need to look at the code to be sure... Let's start with this... /jim Jason J. W. Williams wrote: Hi Guys, Rather than starting a new thread I thought I'd continue this thread. I've been running Build 54 on a Thumper since mid-January and wanted to ask a question about the zfs_arc_max setting. We set it to 0x100000000 # 4GB, however it's creeping over that till our kernel memory usage is nearly 7GB (::memstat inserted below). This is a database server, so I was curious if the DNLC would have this effect over time, as it does quite quickly when dealing with small files? Would it be worth upgrading to Build 59? Thank you in advance! Best Regards, Jason
Page Summary       Pages     MB   %Tot
Kernel           1750044   6836    42%
Anon             1211203   4731    29%
Exec and libs       7648     29     0%
Page cache        220434    861     5%
Free (cachelist)  318625   1244     8%
Free (freelist)   659607   2576    16%
Total            4167561  16279
Physical         4078747  15932
On 3/23/07, Roch - PAE [EMAIL PROTECTED] wrote: With latest Nevada, setting zfs_arc_max in /etc/system is sufficient. Playing with mdb on a live system is more tricky and is what caused the problem here. -r [EMAIL PROTECTED] writes: Jim Mauro wrote: All righty...I set c_max to 512MB, c to 512MB, and p to 256MB... arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t33176457216 ... } c02e2a08 /Z 0x20000000 arc+0x48: 0x7b9789000 = 0x20000000 c02e29f8 /Z 0x20000000 arc+0x38: 0x7b9789000 = 0x20000000 c02e29f0 /Z 0x10000000 arc+0x30: 0x3dcbc4800 = 0x10000000 arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t268435456 -- p is 256MB c02e29f8 uint64_t c = 0t536870912 -- c is 512MB c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB ... } After a few runs of the workload ... arc::print -d size size = 0t536788992 Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Is there a way to set these fields using /etc/system? Or does this require a new or modified init script to run and do the above with each boot?
Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
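Following up on Roch's point above, a minimal sketch of the boot-time setting (the value is in bytes; 0x20000000 clamps the ARC to the 512MB used in the experiments in this thread):
* in /etc/system, followed by a reboot:
set zfs:zfs_arc_max = 0x20000000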
Re: [zfs-discuss] Re: zfs boot image conversion kit is posted
I'm not sure I understand the question. Virtual machines are built by either running a virtualization technology in a host operating system, such as running VMware Workstation in Linux, running Parallels in Mac OS X, Linux or Windows, etc. These are sometimes referred to as Type II VMMs, where the VMM (Virtual Machine Monitor - the chunk of software responsible for running the guest operating system) is hosted by a traditional operating system. In Type I VMMs, the VMM runs on the hardware. VMware ESX Server is an example of this (although some argue it is not, since technically there's an ESX kernel that runs on the hardware in support of the VMM). So building a virtual machine on a zpool would require that the host operating system supports ZFS. An example here would be our forthcoming (no, I do not know when) Solaris/Xen integration, assuming there is support for putting Xen domU's on ZFS. It may help to point out that when a virtual machine is created, it includes defining a virtual hard drive, which is typically just a file in the file system space of the hosting operating system. Given that, a hosting operating system that supports ZFS can allow for configuring virtual hard drives in the ZFS space. So I guess the answer to your question is theoretically yes, but I'm not aware of an implementation that would allow for such a configuration that exists today. I think I just confused the issue...ah well... /jim PS - FWIW, I have a zpool configured in nv62 running in a Parallels virtual machine on Mac OS X. The nv62 system disk is a virtual hard disk that exists as a file in Mac OS X HFS+, thus this particular zpool is a partition on that virtual hard drive. Lori Alt wrote: I was hoping that someone more well-versed in virtual machines would respond to this so I wouldn't have to show my ignorance, but no such luck, so here goes: Is it even possible to build a virtual machine out of a zfs storage pool? Note that it isn't just zfs as a root file system we're trying out. It's the whole concept of booting from a dataset within a storage pool. I don't know enough about how one sets up a virtual machine to know whether it's possible or even meaningful to talk about generating a b62-on-zfs virtual machine. Lori MC wrote: If the goal is to test ZFS as a root file system, could I suggest making a virtual machine of b62-on-zfs available for download? This would reduce duplicated effort and encourage new people to try it out. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
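To make the "virtual hard drives in the ZFS space" idea concrete, a hedged sketch (the pool, dataset name and size are hypothetical; whether a given VMM accepts a zvol as a disk depends entirely on that VMM):
# zfs create -V 8g tank/guest0-disk0
# ...then point the VMM's disk configuration at /dev/zvol/dsk/tank/guest0-disk0
A plain file in a ZFS filesystem works the same way for Type II VMMs that expect file-backed virtual disks.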
Re: Fwd: [zfs-discuss] Re: Mac OS X Leopard to use ZFS
Hello - I think L4 still needs to evolve. BTW, I believe microkernels are the _right_ way and L4 is a first step in that direction. Perhaps you could elaborate on this? I thought the microkernel debate ended in the 1990s, in terms of being a compelling technology direction for kernel development targeting general purpose computing. Sure, there may be a niche market for microkernels (which depends, in part, on your definition of what a microkernel is), but in terms of broad applicability, I thought the jury was in. CMU's Mach was the last run at this that had any momentum. Thank you. /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [storage-discuss] Performance expectations of iscsi targets?
Paul, While testing iscsi targets exported from thumpers via 10GbE and imported via 10GbE on T2000s, I am not seeing the throughput I expect, and more importantly there is a tremendous amount of read IO happening on a purely sequential write workload. (Note all systems have Sun 10GbE cards and are running Nevada b65.) The read IO activity you are seeing is a direct result of re-writes on the ZFS storage pool. If you were to recreate the test from scratch, you would notice that on the very first pass of write I/Os from 'dd', there would be no reads. This is an artifact of using zvols as backing store for iSCSI Targets. The iSCSI Target software supports raw SCSI disks, Solaris raw devices (/dev/rdsk/), Solaris block devices (/dev/dsk/...), zvols, SVM volumes, and files in file systems, including tmpfs. Simple write workload (from T2000): # time dd if=/dev/zero of=/dev/rdsk/c6t01144F210ECC2A004675E957d0 bs=64k count=100 A couple of things, maybe missing here, or the commands are not a true cut-n-paste of what is being tested. 1). From the iSCSI initiator, there is no device at /dev/rdsk/c6t01144F210ECC2A004675E957d0; note the missing slice (s0, s1, s2, etc). 2). Even if one were to specify a slice, as in /dev/rdsk/c6t01144F210ECC2A004675E957d0s2, it is unlikely that the LUN has been formatted. When I run format the first time, I get the error message Please run fdisk first. Of course this does not have to be the case, because if the ZFS storage pool that backed this LUN had previously been formatted with either a Solaris VTOC or Intel EFI label, then the disk would show up correctly. Performance of iscsi target pool on new blocks: bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 17.4G 2.70T 0 526 0 63.6M thumper1-vdev0 17.5G 2.70T 0 564 0 60.5M thumper1-vdev0 17.5G 2.70T 0 0 0 0 thumper1-vdev0 17.5G 2.70T 0 0 0 0 thumper1-vdev0 17.5G 2.70T 0 0 0 0 Configuration of zpool/iscsi target: # zpool status thumper1-vdev0 pool: thumper1-vdev0 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM thumper1-vdev0 ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 errors: No known data errors The first thing is that for this pool I was expecting 200-300MB/s throughput, since it is a simple stripe across six 500G disks. In fact, a direct local workload (directly on thumper1) of the same type confirms what I expected: bash-3.00# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi bs=64k count=100 bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 20.4G 2.70T 0 2.71K 0 335M thumper1-vdev0 20.4G 2.70T 0 2.92K 0 374M thumper1-vdev0 20.4G 2.70T 0 2.88K 0 368M thumper1-vdev0 20.4G 2.70T 0 2.84K 0 363M thumper1-vdev0 20.4G 2.70T 0 2.57K 0 327M The second thing is that when overwriting already-written blocks via the iscsi target (from the T2000), I see a lot of read bandwidth for blocks that are being completely overwritten. This does not seem to slow down the write performance, but 1) it is not seen in the direct case; and 2) it consumes channel bandwidth unnecessarily.
bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 8.90G 2.71T 279 783 31.7M 95.9M thumper1-vdev0 8.90G 2.71T 281 318 31.7M 29.1M thumper1-vdev0 8.90G 2.71T 139 0 15.8M 0 thumper1-vdev0 8.90G 2.71T 279 0 31.7M 0 thumper1-vdev0 8.90G 2.71T 139 0 15.8M 0 Can anyone help to explain what I am seeing, or give me some guidance on diagnosing the cause of the following: - The bottleneck in accessing the iscsi target from the T2000 From the iSCSI Initiator's point of view, there are various (Negotiated) Login Parameters, which may have a direct effect on performance. Take a look at iscsiadm list target --verbose, then consult the iSCSI man pages, or documentation online at docs.sun.com. Remember to keep track of what you change on a per-target basis, only change one parameter at a time, and measure your results. - The cause of the extra read bandwidth when overwriting blocks on the iscsi target from the T2000. ZFS is the backing store, and it COWs (copy-on-write) when maintaining the ZFS zvols within the storage pool. Any help is much appreciated, paul ___ storage-discuss mailing list [EMAIL PROTECTED] http://mail.opensolaris.org/mailman/listinfo/storage-discuss Jim Dunham Solaris, Storage Software
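A quick way to confirm Jim's first-pass observation on a scratch zvol (the dataset name and sizes are hypothetical; the point is only the read column of zpool iostat):
# zfs create -V 10g thumper1-vdev0/iscsi-test
# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi-test bs=64k count=10000
# zpool iostat thumper1-vdev0 1   (first pass over fresh blocks: writes only)
# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi-test bs=64k count=10000
# zpool iostat thumper1-vdev0 1   (second pass: read traffic appears as existing zvol blocks are copy-on-written)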
[zfs-discuss] ZFS test suite released on OpenSolaris.org
The ZFS test suite is being released today on OpenSolaris.org along with the Solaris Test Framework (STF), Checkenv and Runwattr test tools. The source tarball, binary package and baseline can be downloaded from the test consolidation download center at http://dlc.sun.com/osol/test/downloads/current. And, the source code can be viewed in the Solaris Test Collection (STC) 2.0 source tree at: http://cvs.opensolaris.org/source/xref/test/ontest-stc2/src/suites/zfs. The STF, Checkenv and Runwattr packages must be installed prior to executing a ZFS test run. More information is available in the ZFS README file and on the ZFS test suite webpage at: http://opensolaris.org/os/community/zfs/zfstestsuite. Any questions about the ZFS test suite can be sent to zfs discuss at: http://www.opensolaris.org/os/community/zfs/discussions. Any questions about STF, and the test tools can be sent to testing discuss at: http://www.opensolaris.org/os/community/testing/discussions. Happy Hunting, Jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Sharemgr Test Suite Released on OpenSolaris.org
The Sharemgr test suite is available on OpenSolaris.org. The source tarball, binary package and baseline can be downloaded from the test consolidation download center at: http://dlc.sun.com/osol/test/downloads/current The source code can be viewed in the Solaris Test Collection (STC) 2.0 source tree at: http://cvs.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share The SUNWstc-tetlite package must be installed prior to executing a Sharemgr test run. More information on the Sharemgr test suite is available in the Sharemgr README file at: http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share/README Any questions about the Sharemgr test suite can be sent to testing discuss at: http://www.opensolaris.org/os/community/testing/discussions Cheers, Jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does iSCSI target support SCSI-3 PGR reservation ?
A quick look through the source would seem to indicate that the PERSISTENT RESERVE commands are not supported by the Solaris iSCSI target at all. Correct. There is an RFE outstanding for the iSCSI Target to implement PGR for both raw SCSI-3 devices and block devices. http://bugs.opensolaris.org/view_bug.do?bug_id=6415440 http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/iscsi/iscsitgtd/t10_spc.c This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Solaris, Storage Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 Email: [EMAIL PROTECTED] http://blogs.sun.com/avs ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] New version of the ZFS test suite released
Version 1.8 of the ZFS test suite was released today on opensolaris.org. The ZFS test suite source tarballs, packages and baseline can be downloaded at: http://dlc.sun.com/osol/test/downloads/current/ The ZFS test suite source can be browsed at: http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/zfs/ More information on the ZFS test suite is at: http://opensolaris.org/os/community/zfs/zfstestsuite/ Questions about the ZFS test suite can be sent to zfs-discuss at: http://www.opensolaris.org/jive/forum.jspa?forumID=80 Cheers, Jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] do zfs filesystems isolate corruption?
Chris, In the old days of UFS, on occasion one might create multiple file systems (using multiple partitions) on a large LUN if filesystem corruption was a concern. It didn't happen often, but filesystem corruption has happened. So, if filesystem X was corrupt, filesystem Y would be just fine. With ZFS, does the same logic hold true for two filesystems coming from the same pool? For the purposes of isolating corruption, the separation of two or more filesystems coming from the same ZFS storage pool does not help. An entire ZFS storage pool is the unit of I/O consistency, as all ZFS filesystems created within this single storage pool share the same physical storage. When configuring a ZFS storage pool, the [poor] decision of choosing a non-redundant (single disk or concatenation of disks) versus redundant (mirror, raidz, raidz2) storage pool offers no means for ZFS to automatically recover from some forms of corruption. Even when using a redundant storage pool, there are scenarios in which this is not good enough. This is when filesystem needs transition into availability needs, such as when the loss or inaccessibility of two or more disks causes mirroring or raidz to be ineffective. As of Solaris Express build 68, Availability Suite [http://www.opensolaris.org/os/project/avs/] is part of base Solaris, offering both local snapshots and remote mirrors, both of which work with ZFS. Locally, on a single Solaris host, snapshots of the entire ZFS storage pool can be taken at intervals of one's choosing, and with multiple snapshots of a single master, collections of snapshots, say at intervals of one hour, can be retained. Options allow for 100% independent snapshots (much like your UFS analogy above), dependent snapshots where only the copy-on-write data is retained, or compact dependent snapshots where the snapshot's physical storage is some percentage of the master. Remotely, between two or more Solaris hosts, remote mirrors of the entire ZFS storage pool can be configured, where synchronous replication can offer zero data loss, or asynchronous replication can offer near zero data loss, both offering write-order, on-disk consistency. A key aspect of remote replication with Availability Suite is that the replicated ZFS storage pool can be quiesced on the remote node and accessed, or in a disaster recovery scenario, take over instantly where the primary left off. When the primary site is restored, the MTTR (Mean Time To Recovery) is essentially zero, since Availability Suite supports on-demand pull, so yet-to-be-replicated blocks are retrieved synchronously, allowing the ZFS filesystem and applications to be resumed without waiting for a potentially lengthy resynchronization. Said slightly differently, I'm assuming that if the pool becomes mangled somehow then all filesystems will be toast ... but is it possible to have one filesystem be corrupted while the other filesystems are fine? Hmmm, does the answer depend on whether the filesystems are nested? ex 1: /my_fs_1 /my_fs_2 ex 2: /home_dirs /home_dirs/chris TIA! This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Solaris, Storage Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 Email: [EMAIL PROTECTED] http://blogs.sun.com/avs ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
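As a concrete illustration of the AVS features described above, a minimal sketch of enabling a local point-in-time snapshot and a remote mirror; all device paths and host names are hypothetical, and the option syntax should be checked against iiadm(1M) and sndradm(1M):

# independent point-in-time snapshot (master, shadow, bitmap volumes)
iiadm -e ind /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0
# synchronous, write-order consistent remote mirror of the same master to hostB
sndradm -n -e hostA /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t3d0s0 \
              hostB /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t3d0s0 ip sync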
Re: [zfs-discuss] ZFS Under the Hood Presentation Slides
Is the referenced Laminated Handout on slide 3 available anywhere in any form electronically? If not, I'd be happy to create an electronic copy and make it publicly available. Thanks, /jim Joy Marshall wrote: It's taken a while but at last we have been able to post the ZFS Under the Hood presentation slides from the session back at May's LOSUG. You can view both the presentation slides and a layered overview here: Presentation: http://www.opensolaris.org/os/community/os_user_groups/losug/ZFS-UTH_3_v1.1_LOSUG.pdf Overview: http://www.opensolaris.org/os/community/os_user_groups/losug/ZFS-UTH_LayeredOverview_v2.3.pdf Joy This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Single SAN Lun presented to 4 Hosts
Rainer, If you are looking for a means to safely READ any filesystem, please take a look at Availability Suite. One can safely take Point-in-Time copies of any Solaris-supported filesystem, including ZFS, at any snapshot interval of one's choosing, and then access the shadow volume on any system within the SAN, be it Fibre Channel or iSCSI. If the node wanting access to the data is distant, Availability Suite also offers Remote Replication. http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ Jim Ronald, thanks for your comments. I was thinking about this scenario: Host w continuously has a UFS mounted with read/write access. Host w writes to the file f/ff/fff. Host w ceases to touch anything under f. Three hours later, host r mounts the file system read-only, reads f/ff/fff, and unmounts the file system. My assumption was: a1) This scenario won't hurt w, a2) this scenario won't damage the data on the file system, a3) this scenario won't hurt r, and a4) the read operation will succeed, even if w continues with arbitrary I/O, except that it doesn't touch anything under f until after r has unmounted the file system. Of course everything that you and Tim and Casper said is true, but I'm still inclined to try that scenario. Rainer ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Solaris, Storage Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 Email: [EMAIL PROTECTED] http://blogs.sun.com/avs ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
I'll take a look at this. ZFS provides outstanding sequential IO performance (both read and write). In my testing, I can essentially sustain hardware speeds with ZFS on sequential loads. That is, assuming 30-60MB/sec per disk sequential IO capability (depending on hitting inner or outer cylinders), I get linear scale-up on sequential loads as I add disks to a zpool, e.g. I can sustain 250-300MB/sec on a 6 disk zpool, and it's pretty consistent for raidz and raidz2. Your numbers are in the 50-90MB/second range, or roughly 1/2 to 1/4 of what was measured on the other 2 file systems for the same test. Very odd. Still looking... Thanks, /jim Jeffrey W. Baker wrote: I have a lot of people whispering zfs in my virtual ear these days, and at the same time I have an irrational attachment to xfs based entirely on its lack of the 32000 subdirectory limit. I'm not afraid of ext4's newness, since really a lot of that stuff has been in Lustre for years. So a-benchmarking I went. Results at the bottom: http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html Short version: ext4 is awesome. zfs has absurdly fast metadata operations but falls apart on sequential transfer. xfs has great sequential transfer but really bad metadata ops, like 3 minutes to tar up the kernel. It would be nice if mke2fs would copy xfs's code for optimal layout on a software raid. The mkfs defaults and the mdadm defaults interact badly. Postmark is a somewhat bogus benchmark with some obvious quantization problems. Regards, jwb ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
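For anyone wanting to reproduce this kind of sequential measurement, the pattern used throughout this thread is simply (pool and file names are illustrative):

# stream sequential writes into the pool and watch the sustained bandwidth
dd if=/dev/zero of=/tank/seqtest bs=128k count=80000 &
zpool iostat tank 1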
Re: [zfs-discuss] (politics) Sharks in the waters
About 2 years ago I was able to get a little closer to the patent litigation process, by way of giving a deposition in litigation that was filed against Sun and Apple (and has since been settled). Apparently, there's an entire sub-economy built on patent litigation among the technology players. Suits, counter-suits, counter-counter-suits, etc., are just part of everyday business. And the money that gets poured down the drain! Here's an example. During my deposition, the lawyer questioning me opened a large box and removed 3 sets of a 500+ slide deck created by myself and Richard McDougall for seminars and tutorials on Solaris. Each set was color print on heavy, glossy paper. That represented color printing of about 1600 pages total. All so the attorney could question me about 2 of the slides. I almost fell off my chair. /jim Rob Windsor wrote: http://news.com.com/NetApp+files+patent+suit+against+Sun/2100-1014_3-6206194.html I'm curious how many of those patent filings cover technologies that they carried over from Auspex. While it is legal for them to do so, it is a bit shady to inherit technology (two paths; employees departing Auspex and the Auspex bankruptcy asset buyout), file patents against that technology, and then open suits against other companies based on (patents covering) that technology. (No, I'm not defending Sun in its apparent patent-growling, either; it all sucks IMO.) Rob++ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about uberblock blkptr
Hey Max - Check out the on-disk specification document at http://opensolaris.org/os/community/zfs/docs/. The illustration on page 32 shows the rootbp pointing to a dnode_phys_t object (the first member of an objset_phys_t data structure). The source code indicates ub_rootbp is a blkptr_t, which contains a 3-member array of dva_t's called blk_dva (blk_dva[3]). Each dva_t is a 2-member array of 64-bit unsigned ints (dva_word[2]). So it looks like each blk_dva contains 3 128-bit DVAs. You probably figured all this out already... did you try using an objset_phys_t to format the data? Thanks, /jim [EMAIL PROTECTED] wrote: Hi All, I have modified mdb so that I can examine data structures on disk using ::print. This works fine for disks containing ufs file systems. It also works for zfs file systems, but... I use the dva block number from the uberblock_t to print what is at the block on disk. The problem I am having is that I can not figure out what (if any) structure to use. All of the xxx_phys_t types that I try do not look right. So, the question is, just what is the structure that the uberblock_t dva's refer to on the disk? Here is an example: First, I use zdb to get the dva for the rootbp (should match the value in the uberblock_t(?)).

# zdb - usbhard | grep -i dva
Dataset mos [META], ID 0, cr_txg 4, 1003K, 167 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=0:111f79000:200 DVA[1]=0:506bde00:200 DVA[2]=0:36a286e00:200 fletcher4 lzjb LE contiguous birth=621838 fill=167 cksum=84daa9667:365cb5b02b0:b4e531085e90:197eb9d99a3beb
bp = [L0 DMU objset] 400L/200P DVA[0]=0:111f6ae00:200 DVA[1]=0:502efe00:200 DVA[2]=0:36a284e00:200 fletcher4 lzjb LE contiguous birth=621838 fill=34026 cksum=cd0d51959:4fef8f217c3:10036508a5cc4:2320f4b2cde529
Dataset usbhard [ZPL], ID 5, cr_txg 4, 15.7G, 34026 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=0:111f6ae00:200 DVA[1]=0:502efe00:200 DVA[2]=0:36a284e00:200 fletcher4 lzjb LE contiguous birth=621838 fill=34026 cksum=cd0d51959:4fef8f217c3:10036508a5cc4:2320f4b2cde529
first block: [L0 ZIL intent log] 9000L/9000P DVA[0]=0:36aef6000:9000 zilog uncompressed LE contiguous birth=263950 fill=0 cksum=97a624646cebdadb:fd7b50f37b55153b:5:1
^C
#

Then I run my modified mdb on the vdev containing the usbhard pool:

# ./mdb /dev/rdsk/c4t0d0s0

I am using the DVA[0] for the META dataset above. Note that I have tried all of the xxx_phys_t structures that I can find in the zfs source, but none of them look right. Here is example output dumping the data as an objset_phys_t. (The shift by 9 and adding 0x400000 is from the zfs on-disk format paper; I have tried without the addition, without the shift, in all combinations, but the output still does not make sense.)

> (111f79000<<9)+400000::print zfs`objset_phys_t
{
    os_meta_dnode = {
        dn_type = 0x4f
        dn_indblkshift = 0x75
        dn_nlevels = 0x82
        dn_nblkptr = 0x25
        dn_bonustype = 0x47
        dn_checksum = 0x52
        dn_compress = 0x1f
        dn_flags = 0x82
        dn_datablkszsec = 0x5e13
        dn_bonuslen = 0x63c1
        dn_pad2 = [ 0x2e, 0xb9, 0xaa, 0x22 ]
        dn_maxblkid = 0x20a34fa97f3ff2a6
        dn_used = 0xac2ea261cef045ff
        dn_pad3 = [ 0x9c2b4541ab9f78c0, 0xdb27e70dce903053, 0x315efac9cb693387, 0x2d56c54db5da75bf ]
        dn_blkptr = [ {
            blk_dva = [ {
                dva_word = [ 0x87c9ed7672454887, 0x760f569622246efe ]
            } {
                dva_word = [ 0xce26ac20a6a5315c, 0x38802e5d7cce495f ]
            } {
                dva_word = [ 0x9241150676798b95, 0x9c6985f95335742c ]
            } ]

None of this looks believable. So, just what is the rootbp in the uberblock_t referring to?
thanks, max ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] io:::start and zfs filenames?
Hi Neel - Thanks for pushing this out. I've been tripping over this for a while. You can instrument zfs_read() and zfs_write() to reliably track filenames:

#!/usr/sbin/dtrace -s
#pragma D option quiet
zfs_read:entry, zfs_write:entry
{
    printf("%s of %s\n", probefunc, stringof(args[0]->v_path));
}

I'm not sure why io:::start does not work for ZFS. I didn't spend any real time on this, but it appears none of the ZFS code calls bdev_strategy() directly, and instrumenting bdev_strategy:entry (which is where io:::start lives) to track filenames via stringof(args[0]->b_vp->v_path) does not work either. Use the zfs r/w function entry points for now. What sayeth the ZFS team regarding the use of a stable DTrace provider with their file system? Thanks, /jim Neelakanth Nadgir wrote: The io:::start probe does not seem to get zfs filenames in args[2]->fi_pathname. Any ideas how to get this info? -neel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] io:::start and zfs filenames?
What sayeth the ZFS team regarding the use of a stable DTrace provider with their file system? For the record, the above has a tone to it that I really did not intend (antagonistic?), so I had a good chat with Roch about this. The file pathname is derived via a translator from the vnode v_path structure member, and thus requires an instantiated vnode when the probe fires - this is why instrumenting bdev_strategy:entry and tracing args[0]->b_vp->v_path has the same problem; no vnode. An alternative approach to tracking filenames with IOs is using the fsinfo provider (Solaris 10 Update 2). This is a handy place to start:

#!/usr/sbin/dtrace -s
#pragma D option quiet
fsinfo:::
/ execname != "dtrace" /
{
    @[execname, args[0]->fi_pathname, args[0]->fi_fs, probename] = count();
}
END
{
    printf("%-16s %-24s %-8s %-16s %-8s\n", "EXEC", "PATH", "FS", "NAME", "COUNT");
    printa("%-16s %-24s %-8s %-16s %@8d\n", @);
}

Which yields...

EXEC             PATH                     FS       NAME             COUNT
gnome-panel      /zp                      ufs      lookup           1
gnome-panel      /zp/home                 zfs      lookup           1
gnome-panel      /zp/home/mauroj          zfs      lookup           1
gnome-panel      /zp/home/mauroj/.recently-used.xbel.HKF3YT  zfs  getattr  1
gnome-panel      /zp/home/mauroj/.recently-used.xbel.HKF3YT  zfs  lookup   1
<snip>
metacity         <unknown>                sockfs   poll             1031
vmware-user      <unknown>                sockfs   poll             1212
Xorg             <unknown>                sockfs   rwlock           1573
Xorg             <unknown>                sockfs   rwunlock         1573
gnome-terminal   <unknown>                sockfs   poll             2084
dbwriter         /zp/space                zfs      realvp           4254
dbwriter         /zp/space                zfs      remove           4254
dbwriter         /zp/space/f33            zfs      close            4254
dbwriter         /zp/space/f33            zfs      lookup           4254
dbwriter         /zp/space/f33            zfs      read             4254
dbwriter         /zp/space/f33            zfs      realvp           4254
dbwriter         /zp/space/f33            zfs      seek             4254
dbwriter         /zp/space/f33            zfs      write            4254
dbwriter         /zp/space                zfs      getsecattr       4255
dbwriter         /zp/space/f33            zfs      ioctl            4255
dbwriter         /zp/space/f33            zfs      open             4255
dbwriter         <unknown>                zfs      create           4255
dbwriter         /zp/space/f33            zfs      rwunlock         8508
dbwriter         /zp/space                zfs      lookup           8509
dbwriter         /zp/space/f33            zfs      rwlock           8509
dbwriter         /zp                      ufs      lookup           8515

Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] io:::start and zfs filenames?
Hey Neel - Try this:

nv70b# cat zfs_page.d
#!/usr/sbin/dtrace -s
#pragma D option quiet
zfs_putpage:entry
{
    printf("zfs write to %s\n", stringof(args[0]->v_path));
}
zfs_getpage:entry
{
    printf("zfs read from %s\n", stringof(args[0]->v_path));
}

I did some quick tests with mmap'd ZFS files, and it seems to work. /jim Neelakanth Nadgir wrote: Jim, I can't use zfs_read/write as the file is mmap()'d, so no read/write! -neel On Sep 26, 2007, at 5:07 AM, Jim Mauro [EMAIL PROTECTED] wrote: Hi Neel - Thanks for pushing this out. I've been tripping over this for a while. You can instrument zfs_read() and zfs_write() to reliably track filenames: #!/usr/sbin/dtrace -s #pragma D option quiet zfs_read:entry, zfs_write:entry { printf("%s of %s\n", probefunc, stringof(args[0]->v_path)); } I'm not sure why io:::start does not work for ZFS. I didn't spend any real time on this, but it appears none of the ZFS code calls bdev_strategy() directly, and instrumenting bdev_strategy:entry (which is where io:::start lives) to track filenames via stringof(args[0]->b_vp->v_path) does not work either. Use the zfs r/w function entry points for now. What sayeth the ZFS team regarding the use of a stable DTrace provider with their file system? Thanks, /jim Neelakanth Nadgir wrote: The io:::start probe does not seem to get zfs filenames in args[2]->fi_pathname. Any ideas how to get this info? -neel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolar ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Hey Roch - We do not retain 2 copies of the same data. If the DB cache is made large enough to consume most of memory, the ZFS copy will quickly be evicted to stage other I/Os on their way to the DB cache. What problem does that pose? Can't answer that question empirically, because we can't measure this, but I imagine there's some overhead to ZFS cache management in evicting and replacing blocks, and that overhead could be eliminated if ZFS could be told not to cache the blocks at all. Now, obviously, whether this overhead would be in the noise level, or something that actually hurts sustainable performance, will depend on several things, but I can envision scenarios where it's overhead I'd rather avoid if I could. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
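Worth noting for later readers: more recent ZFS builds added a per-dataset cache control that approximates this "don't cache the blocks" request. A minimal sketch, assuming a build where the primarycache property exists (it is not in all Solaris 10 updates; dataset name is illustrative):

# keep only metadata in the ARC for the database's dataset
zfs set primarycache=metadata tank/db
# or cache nothing at all for that dataset
zfs set primarycache=none tank/db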
Re: [zfs-discuss] Direct I/O ability with zfs?
Where does the win come from with direct I/O? Is it 1), 2), or some combination? If it's a combination, what's the percentage of each towards the win? That will vary based on workload (I know, you already knew that ... :^). Decomposing the performance win between what is gained as a result of single-writer lock breakup and no caching is something we can only guess at, because, at least for UFS, you can't do just one - it's all or nothing. We need to tease 1) and 2) apart to have a full understanding. We can't. We can only guess (for UFS). My opinion - it's a must-have for ZFS if we're going to get serious attention in the database space. I'll bet dollars-to-donuts that, over the next several years, we'll burn many tens of millions of dollars on customer support escalations that come down to memory utilization issues and contention between database-specific buffering and the ARC. This is entirely my opinion (not that of Sun), and I've been wrong before. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS File system and Oracle raw files compatibility
If the question is whether Oracle files (datafiles, log files, etc.) can exist on ZFS, the answer is absolutely yes. More simply put, can you configure your Oracle database on ZFS - absolutely. The question, as stated, is confusing, because the term compatible can have pretty broad meaning, so I answered the question I think you wanted to ask. Thanks, /jim Dale Pannell wrote: I have a customer that would like to know if the ZFS file system is compatible with Oracle raw files. Any help you can provide is greatly appreciated. Please respond directly to me since I am not part of the zfs-discuss email alias. //Dale Pannell// SR Systems Engineer Office: 972.546.4111 Mobile: 214.284.6057 Email: [EMAIL PROTECTED] *Sun Storage Group* ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
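One practical, hedged note when laying out Oracle datafiles on ZFS (the dataset name is illustrative; 8 KB matches a typical Oracle db_block_size, and recordsize only affects files written after it is set):

# match the ZFS recordsize to the database block size before creating datafiles
zfs create tank/oradata
zfs set recordsize=8k tank/oradata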
Re: [zfs-discuss] ZFS mirroring
Mertol, Hi; Do any of you know when ZFS remote mirroring will be available? Host-based replication of ZFS, and all other Solaris filesystems, is available using Sun StorageTek Availability Suite. AVS has been part of OpenSolaris since build 68. http://www.opensolaris.org/os/project/avs/ Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 http://blogs.sun.com/avs regards Mertol Ozyoney Storage Practice - Sales Manager Sun Microsystems, TR Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iSCSI target using ZFS filesystem as backing
John, I'm working on a Sun Ultra 80 M2 workstation. It has eight 750 GB SATA disks installed. I've tried the following on both ON build 72, Solaris 10 update 4, and Indiana with the same results. If I create a ZFS filesystem using 1-7 hard drives (I've tried 1 and 7), and then try to make an iSCSI target on that pool, when a client machine tries to access the iSCSI volume, the memory usage on the Ultra 80 grows to the same size as the ZFS filesystem. For example: I'm creating a RaidZ ZFS pool:

zpool create -f telephone raidz c9d0 c10d0 c11d0 c12d0 c13d0 c14d0 c15d0

I then create a two terabyte ZFS volume (zvol) in that pool:

zfs create -V 2000g telephone/jelley

And make it into an iSCSI target:

iscsitadm create target -b /dev/zvol/dsk/telephone/jelley jelley

Try changing from a cached ZVOL to a raw ZVOL:

iscsitadm create target -b /dev/zvol/rdsk/telephone/jelley jelley

You can also try:

zfs set shareiscsi=on telephone/jelley

- Jim

Now if I perform an 'iscsitadm list target', the iSCSI target appears like it should:

Target: jelley
 iSCSI Name: iqn.1986-03.com.sun:02:fcaa1650-f202-4fef-b44b-b9452a237511.jelley
 Connections: 0

Now when I try to connect to it with my Windows 2003 server running the MS iSCSI initiator, I see the memory usage climb to the point that it totally exhausts all available physical memory (prstat):

  PID USERNAME  SIZE   RSS  STATE PRI NICE     TIME  CPU PROCESS/NLWP
  511 root     2000G  106M  sleep  59    0  0:02:58 1.1% iscsitgtd/15
 2139 root     8140K 4204K  sleep  59    0  0:00:00 0.0% sshd/1
 2164 root     3276K 2740K  cpu1   49    0  0:00:00 0.0% prstat/1
 2144 root     2672K 1752K  sleep  49    0  0:00:00 0.0% bash/1
  574 noaccess  173M   92M  sleep  59    0  0:03:18 0.0% java/25

Do you see the iscsitgtd process trying to use 2000 gigabytes of RAM? I can sit there and hold down spacebar while the Windows workstation is trying to access it, and the memory usage climbs at an astronomical rate, until it exhausts all the available memory on the box (several hundred megabytes per minute). The total RAM it tries to allocate depends totally on the size of the iSCSI volume. If it's a 1000 megabyte volume, then it only allocates a gig... if it's 600 gigs, it tries to allocate 600 gigs. Now here is the real kicker. I took this down to as simple a configuration as possible--one single drive with a ZFS filesystem on it. The memory utilization was the same. I then tried creating the iSCSI target on a UFS filesystem. Everything worked beautifully, and memory utilization was no longer directly proportional to the size of the iSCSI volume. If I create something small, like a 100 gig iSCSI target, the system does eventually get around to finishing and releases the RAM. What's really strange is when I try to access the iSCSI volume, the memory usage then climbs megabyte by megabyte until it is exhausted, and then access to the iSCSI volume is terribly slow. I can copy a 300 meg file in just six seconds when the memory utilization on the iscsitgtd process is low. But if I try a 2.5 gig file, once it gets about 1500 megs into it, performance drops about 99.9% and it's incredibly slow... again, until it's done and iscsitgtd releases the RAM, then it's plenty zippy for small IO operations. Has anybody else been making iSCSI targets on ZFS pools? I've had a case open with Sun since Oct 3, if any Sun folks want to look at the details (case #65684887). I'm getting very desperate to get this fixed, as this massive amount of storage was the only reason I got this M80... Any pointers would be greatly appreciated.
Thanks- John Tracy This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Would you two please SHUT THE F$%K UP. Dear God, my kids don't go on like this. Please - let it die already. Thanks very much. /jim can you guess? wrote: Hello can, Thursday, December 13, 2007, 12:02:56 AM, you wrote: cyg> On the other hand, there's always the possibility that someone cyg> else learned something useful out of this. And my question about To be honest - there's basically nothing useful in the thread, perhaps except one thing - it doesn't make any sense to listen to you. I'm afraid you don't qualify to have an opinion on that, Robert - because you so obviously *haven't* really listened. Until it became obvious that you never would, I was willing to continue to attempt to carry on a technical discussion with you, while ignoring the morons here who had nothing whatsoever in the way of technical comments to offer (but continued to babble on anyway). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What does dataset is busy actually mean?
I've hit the problem myself recently, and mounting the filesystem cleared something in the brains of ZFS and allowed me to snapshot. http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg00812.html PS: I'll use Google before asking some questions, a'la (C) Bart Simpson. That's how I found your question ;) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
With Point-in-Time Copy software, the software can be configured to automatically take a snapshot prior to re-synchronization, and automatically delete the snapshot if it completed successfully. The use of I/O consistency groups assures not only that the replicas are write-order consistent during replication, but also that snapshots taken prior to re-synchronization are consistent too. Thanks Steve This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Auto backup and auto restore of ZFS via Firewire drive
It's good he didn't mail you, now we all know some under-the-hood details via Googling ;) Thanks to both of you for this :) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup/replication system
Łukasz K wrote: Hi, I'm using ZFS on a few X4500s and I need to back them up. The data on the source pool keeps changing, so online replication would be the best solution. As I understand it, AVS doesn't support ZFS - there is a problem with mounting the backup pool. This is not true, if replication is configured correctly. Where are you getting information about the aforementioned problem? Have you looked at the following? http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting a ZFS pool. I hope I'm wrong? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution? --Lukas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup/replication system
Eric, On Jan 10, 2008, at 4:50 AM, Łukasz K wrote: Hi I'm using ZFS on few X4500 and I need to backup them. The data on source pool keeps changing so the online replication would be the best solution. As I know AVS doesn't support ZFS - there is a problem with mounting backup pool. Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting ZFS pool. I hope I'm wrong ? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution ? If it doesn't need to be synchronous, then you can use 'zfs send -R'. The prior statement could lead one to believe that 'zfs send -R' is asynchronous replication, which it is not. The functionality ZFS provides via send/recv is known as time-fixed, or snapshot replication. Here, a non-changing data source, the snapshot, is synchronized from the source to destination node based on either a full or differential set of changes. Unlike synchronous or asynchronous replication, where data is continuously replicated in a write-order consistent manner, time-fixed replication is discontinuous, often driven by taking periodic snapshots of the changing data, performing the differential synchronization of the non-changing source data to the remote host, then waiting until the next interval. The most common problem with time-fixed replication is trying to determine, or calculate the periodic interval to use, since its optimal value is based on many variables, most of which are changing over time and usage patterns. eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
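To make the time-fixed (snapshot) model described above concrete, a minimal sketch of one replication interval with zfs send/recv; the pool, snapshot, and host names are hypothetical:

# initial full replication of the pool
zfs snapshot -r tank@rep1
zfs send -R tank@rep1 | ssh backuphost zfs recv -d backup
# each interval thereafter: snapshot again, send only the differences
zfs snapshot -r tank@rep2
zfs send -R -I tank@rep1 tank@rep2 | ssh backuphost zfs recv -d backup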
Re: [zfs-discuss] Break a ZFS mirror and concatenate the disks
Kory, Yes, I get it now. You want to detach one of the disks and then re-add the same disk, but lose the redundancy of the mirror. Just as long as you realize you're losing the redundancy. I'm wondering if zpool add will complain. I don't have a system to try this on at the moment. The correct, just-verified steps are as follows:

zpool detach moodle c2t3d0
zpool add moodle c2t3d0

I performed these steps while the zpool was online, under heavy I/O, with an I/O tool that does data validation. When done, I then performed a final zpool scrub moodle, with no issues, and then revalidated all the data. As stated earlier, sacrificing redundancy (RAID 1 mirroring) for double the storage (RAID 0 concatenation) is being penny wise and pound foolish. Jim Cindy Kory Wheatley wrote: Currently c2t2d0 and c2t3d0 are set up in a mirror. I want to break the mirror and save the data on c2t2d0 (both drives are 73 GB). Then I want to concatenate c2t2d0 to c2t3d0 so I have a pool of 146 GB, no longer in a mirror, just concatenated. But since they're mirrored right now I need the data saved on one disk so I don't lose everything. I don't need to add new disks - that's not an option. I want to break the mirror so I can expand the disks together in a pool but save the data. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iscsi on zvol
Jan, I'm wondering if it's possible to import a zpool on an iscsi-device LOCALLY. Following scenario: HostA (Sol10u4): - Pool-1 (a striped-raidz-pool) - iscsi-zvol on Pool-1 HostB (Sol10u3): - Pool-2 is a mirror of one local device and the iscsi-vol of HostA. Is it possible to mount the iscsi-vol (or import Pool-2) on HostA? No, due to a common misconception in the iSCSI space concerning an iSCSI Target's backing store versus the resulting LUN as seen by iSCSI Initiators. On HostA, the ZVOL called iscsi-zvol has a volume size, a size specified via zfs create -V <size> Pool-1/iscsi-zvol. When an iSCSI Target is created out of this ZVOL, the iSCSI Initiator discovers and enables this LUN on HostB, but this LUN is unformatted. In other words, this LUN does not contain a Solaris VTOC or an Intel EFI disk label, as it's just a bunch of blocks. When issuing the zpool create Pool-2 mirror local-disk iscsi-vol, an Intel EFI disk label is placed on the disk (consuming some of the blocks), then all the remaining space is placed in partition 0 (slice 0), after which ZFS lays down its filesystem metadata in the space occupied by partition 0. Now back on HostA, the ZVOL ( /dev/zvol/rdsk/Pool-1/iscsi-zvol ) looks like a bunch of blocks. Since this is a ZVOL, not a SCSI or iSCSI LUN, Solaris does not see the Intel EFI disk label, thus ZFS will not be able to see the ZFS filesystem metadata. So even though the ZVOL contains all the right data, from the point of view of Solaris, this disk is not a LUN, and thus can not be accessed as such. Jim I know, this is (also) iSCSI-related, but mostly a ZFS question. Thanks for your answers, Jan Dreyer ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iscsi on zvol
After posting my reply to the initial note on this thread, and then reading it again, I have some followup comments: The following statement should have said "... this ZVOL is not a LUN, ...". So even though the ZVOL contains all the right data, from the point of view of Solaris, this disk is not a LUN, and thus can not be accessed as such. But then could it be? On HostA, where the ZVOL (iscsi-zvol) is served out as an iSCSI Target, there is nothing to prevent the iSCSI Initiator on HostA from discovering the iSCSI Target on its own node. Doing so will create an iSCSI LUN, which will be seen by Solaris. This is an example of iSCSI loopback, which works quite well. This raises a key point that you should be aware of. ZFS does not support shared access to the same ZFS filesystem. If the ZFS storage pool Pool-2 is currently imported on HostB, an attempt to zpool import the iSCSI LUN on HostA will cause ZFS to report that this zpool is being accessed on another host, which it is: HostB. Do not try to force a zpool import of this iSCSI LUN, or a Solaris panic will soon follow. (See key point above.) If the ZFS storage pool Pool-2 is currently exported on HostB, an attempt to zpool import the iSCSI LUN on HostA will work, except that now 1/2 of the mirrored zpool will not be accessible, since it's a local device on HostB, and therefore not accessible. Maybe the local device on HostB should also be an iSCSI Target too. One more thing. ZFS and iSCSI start and stop at different times during Solaris boot and shutdown, so I would recommend using legacy mount points, or manual zpool import / exports, when trying configurations at this level. Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
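A short sketch of the loopback discovery described above, run on HostA against its own target (the discovery address is illustrative; commands per iscsiadm(1M)):

# point the initiator at the local target and enable SendTargets discovery
iscsiadm add discovery-address 127.0.0.1:3260
iscsiadm modify discovery --sendtargets enable
# create the device nodes for the newly discovered LUN
devfsadm -i iscsi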
Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]
Enrico, Hello, I'm offering a solution based on our disks where replication and storage management should be done using only ZFS... The test changes a few bytes in one file (10 bytes) and checks how many bytes the source sends to the target. The customer tried the replication between 2 volumes... They compared ZFS replication with TrueCopy replication and came to the following conclusions: 1. ZFS uses a bigger block than HDS TrueCopy. 2. TrueCopy sends 32Kbytes and ZFS 100K and more when changing only 10 file bytes. Can we configure ZFS to improve replication efficiency? The solution should consider 5 remote sites replicating to one central data-center. Considering the ZFS block overhead, the customer is thinking of buying a solution based on traditional storage arrays like HDS entry-level arrays (our 2530/2540). If so, with ZFS the network traffic and storage space become big problems for the customer's infrastructure. Is there any documentation explaining the internal ZFS replication mechanism to address the customer's doubts? Thanks. Do we need AVS in our solution to solve the problem? AVS, not unlike HDS, does block-based replication based on actual write I/Os to configured devices. Therefore if the means for ZFS to change 10 bytes results in ZFS writing 100KB or more, AVS will essentially be no different from HDS in this specific area. Of course this begs the question: is the measure of a 10 byte change to a given file a viable metric for choosing one form of replication over another? I think, or would hope, not. What is needed is a characterization of the application(s) write-rate to one or more ZFS filesystems, weighed against the customer's requirements for data replication. A good place to start is: http://www.sun.com/storagetek/white-papers/data_replication_strategies.pdf http://www.sun.com/storagetek/white-papers/enterprise_continuity.pdf Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29 The above link shows how to disable the ZIL for testing purposes (it's not generally recommended to keep it disabled in production). As to the putback schedule of recent ZFS features into Solaris 10, I'm afraid I don't have the information. Hopefully, someone else will know... Thanks, /jim Jonathan Loran wrote: Is it true that Solaris 10 u4 does not have any of the nice ZIL controls that exist in the various recent Open Solaris flavors? I would like to move my ZIL to solid state storage, but I fear I can't do it until I have another update. Heck, I would be happy to just be able to turn the ZIL off to see how my NFS on ZFS performance is affected before spending the $'s. Anyone know when will we see this in Solaris 10? Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
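For convenience, what the Evil Tuning Guide page linked above describes boils down to the following (test environments only; the tunable name is from that guide and applies to this era of ZFS):

# /etc/system - takes effect on the next boot
set zfs:zil_disable = 1

# or dynamically, on a live system
echo zil_disable/W0t1 | mdb -kw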
Re: [zfs-discuss] ZFS replication strategies
Erast, Take a look at NexentaStor - it's a complete 2nd-tier solution: http://www.nexenta.com/products and AVS is nicely integrated via a management RPC interface which connects multiple NexentaStor nodes together and greatly simplifies AVS usage with ZFS... See demo here: http://www.nexenta.com/demos/auto-cdp.html Very nice job. It's refreshing to see something I know all too well, with an updated management interface, and a good portion of the plumbing hidden away. - Jim On Fri, 2008-02-01 at 10:15 -0800, Vincent Fox wrote: Does anyone have any particularly creative ZFS replication strategies they could share? I have 5 high-performance Cyrus mail-servers, with about a terabyte of storage each, of which only 200-300 gigs is used, even including 14 days of snapshot space. I am thinking about setting up a single 3511 with 4 terabytes of storage at a remote site as a backup device for the content. Struggling with how to organize the idea of wedging 5 servers into the one array though. Simplest way that occurs to me is one big RAID-5 storage pool with all disks. Then slice out 5 LUNs, each as its own ZFS pool. Then use zfs send/receive to replicate the pools. Ideally I'd love it if ZFS directly supported the idea of rolling snapshots out into slower secondary storage disks on the SAN, but in the meanwhile it looks like we have to roll our own solutions. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mounting a copy of a zfs pool /file system while orginal is still active
Darren J Moffat wrote: Dave Lowenstein wrote: Nope, doesn't work. Try presenting one of those lun snapshots to your host, run cfgadm -al, then run zpool import.

# zpool import
no pools available to import

Does format(1M) see the luns? If format(1M) can't see them it is unlikely that ZFS will either. It would make my life so much simpler if you could do something like this: zpool import --import-as yourpool.backup yourpool

zpool import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile] [-D] [-f] [-R root] pool | id [newpool]

Imports a specific pool. A pool can be identified by its name or the numeric identifier. If newpool is specified, the pool is imported using the name newpool. Otherwise, it is imported with the same name as its exported name. Given that the pool is a snapshot of one or more vdevs in an existing ZFS storage pool, not only is the name identical, so is the numeric identifier. It can be determined that when using zpool import, duplicates are suppressed, even if those duplicates are entirely separate vdevs containing block-based snapshots, physical copies, remote mirrors or iSCSI Targets. The steps to reproduce this behavior on a single node, using files and standard Solaris utilities, are as follows:

# mkfile 500m /var/tmp/pool_file
# zpool create pool /var/tmp/pool_file
# zpool status pool
  pool: pool
 state: ONLINE
 scrub: none requested
config:
        NAME                 STATE  READ WRITE CKSUM
        pool                 ONLINE    0     0     0
        /var/tmp/pool_file   ONLINE    0     0     0
errors: No known data errors
# zpool export pool
# dd if=/var/tmp/pool_file of=/var/tmp/pool_snapshot
{ wait, wait, wait, ... more on this later ...}
1024000+0 records in
1024000+0 records out
# zpool import -d /var/tmp
  pool: pool
    id: 14424098069460077054
 state: ONLINE
action: The pool can be imported using its name or numeric identifier
config:
        pool                 ONLINE
        /var/tmp/pool_file   ONLINE

Question: What happened to the other ZFS storage pool called pool_snapshot? Answer: Its presence is suppressed by zpool import. If one was to rename /var/tmp/pool_file into some other directory, /var/tmp/pool_snapshot will now appear.

# mv /var/tmp/pool_file /var/pool_file
# zpool import -d /var/tmp
  pool: pool
    id: 14424098069460077054
 state: ONLINE
action: The pool can be imported using its name or numeric identifier
config:
        pool                     ONLINE
        /var/tmp/pool_snapshot   ONLINE

At this point, if one was to go ahead with the import of pool (which would work), then rename /var/pool_file back to /var/tmp/pool_file, its presence would now be suppressed. Conversely, if the rename was done first, then a zpool import was attempted, again only one storage pool would exist at any given time. Clearly there is some explicit suppressing of duplicate storage pools going on here. Browsing the ZFS code looking for an answer, the logic surrounding vdev_inuse() seems to cause this behavior, expected or not. http://cvs.opensolaris.org/source/search?q=vdev_inuse&project=%2Fonnv As mentioned earlier, the {wait, wait, wait, ...} can be eliminated by using Availability Suite Point-in-Time Copy, by itself, or in combination with Availability Suite Remote Copy or iSCSI Target, all of which are present in OpenSolaris today, and all are much faster than the dd utility. As one that supports both Availability Suite and iSCSI Target, not suppressing duplicate pool names and pool identifiers, in combination with a rename on import (zpool import <pool> <newpool>), would provide a means to support various copies, or nearly identical copies, of a ZFS storage pool on the same Solaris host.
While browsing the ZFS source code, I noticed that usr/src/cmd/ztest/ztest.c includes ztest_spa_rename(), a ZFS test which renames a ZFS storage pool to a different name, tests the pool under its new name, and then renames it back. I wonder why this functionality was not exposed as part of zpool support? - Jim # zpool import foopool barpool -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. work: 781.442.4042 cell: 603-724-3972 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
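For what it's worth, the rename-on-import that zpool already supports covers part of this wish: importing a pool under a new name, per the synopsis quoted above (the new name below is hypothetical):

# import the pool from /var/tmp, giving it a different name on this host
zpool import -d /var/tmp pool pool_backup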
Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]
Enrico, Is there any forecast to improve the efficiency of the replication mechanisms of ZFS? Fishworks - the new NAS release. I would take some time to talk with the customer and understand exactly what their expectations are for replication. I would not base my decision on the cost of replicating 10 bytes, regardless of how inefficient it may be. These two documents should help: http://www.sun.com/storagetek/white-papers/data_replication_strategies.pdf http://www.sun.com/storagetek/white-papers/enterprise_continuity.pdf Two key metrics of replication are: Recovery Point Objective (RPO), the amount of data lost (or less), measured as a unit of time. Once-a-day backups yield a 24 hour RPO, once-an-hour snapshots yield a ~1 hour RPO, asynchronous replication yields zero seconds to a few minutes RPO, and synchronous replication means zero seconds RPO. Recovery Time Objective (RTO), the amount of time after a failure until normal operations are restored. Tape backups could be minutes to hours; local snapshots could be nearly instantaneous, assuming the local site survived the failure; remote snapshots or replicas could be minutes, hours or days, depending on the amount of data to resynchronize, impacted by network bandwidth and latency. Availability Suite has a unique feature in this last area, called on-demand pull. Assuming that the primary site's volumes are lost, after they have been re-provisioned, a reverse update can be initiated. Besides the background resilvering in the reverse direction being active, eventually restoring all lost data, on-demand pull performs synchronous replication of data blocks on demand, as needed by the filesystem, database or application. Although the performance will be less than synchronous replication, the RTO is quite low. This type of recovery is analogous to losing one's entire email account, having recovery initiated, but also having selected email opened as needed before the entire volume is restored, using on-demand requests to satisfy data blocks for relevant email requests. Jim Considering the solution we are offering to our customer (5 remote sites replicating to one central data-center) with ZFS (the cheapest solution), should I consider 3 times the network load of a solution based on SNDR-AVS and 3 times the storage space too... correct? Is there any documentation on that? Thanks Richard Elling wrote: Enrico Rampazzo wrote: Hello, I'm offering a solution based on our disks where replication and storage management should be done using only ZFS... The test changes a few bytes in one file (10 bytes) and checks how many bytes the source sends to the target. The customer tried the replication between 2 volumes... They compared ZFS replication with TrueCopy replication and realized the following: 1. ZFS uses a block bigger than HDS TrueCopy. ZFS uses dynamic block sizes. Depending on the configuration and workload, just a few disk blocks will change, or a bunch of redundant metadata might change. In either case, changing the ZFS recordsize will make little, if any, difference. 2. TrueCopy sends 32Kbytes and ZFS 100K and more when changing only 10 file bytes. Can we configure ZFS to improve replication efficiency? By default, ZFS writes two copies of metadata. I would not recommend reducing this because it will increase your exposure to faults. What may be happening here is that a 10 byte write may cause a metadata change resulting in a minimum of three 512 byte physical blocks being changed.
The metadata copies are spatially diverse, so you may see these three blocks starting at non-contiguous boundaries. If TrueCopy sends only 32kByte blocks (speculation), then the remote transfer will be 96kBytes for 3 local, physical block writes. OTOH, ZFS will coalesce writes. So you may be able to update a number of files yet still only replicate 96kBytes through TrueCopy. YMMV. Since the customer is performing replication, I'll assume they are very interested in data protection, so keeping the redundant metadata is a good idea. The customer should also be aware that replication at the application level is *always* more efficient than replicating somewhere down the software stack, where you lose data context. -- richard The solution should consider 5 remote sites replicating to one central data-center. Considering the ZFS block overhead the customer is thinking of buying a solution based on traditional storage arrays like HDS entry-level arrays (our 2530/2540). If so, with ZFS the network traffic and storage space become big problems for the customer's infrastructure. Is there any documentation explaining the internal ZFS replication mechanism to address the customer's doubts? Thanks. Do we need AVS in our solution to solve the problem? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http
Re: [zfs-discuss] iscsi core dumps when under IO
Stephen,

I am getting a strange issue when using zfs/iscsi shares out of it. When I have attached a CentOS 5 initiator to the ZFS target, it works fine normally until I start doing heavy 100MB/s+ copies to a separate CIFS/NFS export on the same ZFS pool. The error I am getting is:

[ Feb 20 10:41:07 Stopping because process dumped core. ]
[ Feb 20 10:41:07 Executing stop method (/lib/svc/method/svc-iscsitgt stop 143) ]

I was wondering if anyone had any ideas. I am running 10U4 with all of the latest and greatest patches. Thank you.

There is a set of issues, recently resolved in Nevada, regarding the iSCSI Target under load. We are looking at back-porting these changes to S10. The nature of the failure appears to be an iSCSI initiator seeing long service times (in seconds) and triggering a LUN reset. The LUN reset causes all I/O specific to that LUN to be cleaned up. Given the multi-threaded nature of the iSCSI Target, the odds are pretty high that cleanup could be attempted from every possible I/O state, and some of those states were not handled correctly. The following commands are likely to show the reason the process dumped core, namely an assert in the T10 state machine:

# mdb /core
> ::status
> ::quit

This message posted from opensolaris.org

Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
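If it helps while waiting for the back-port, a short sketch for confirming the diagnosis - the service name and core-file locations are assumptions for a stock S10 iscsitgt setup:

# svcs -xv iscsitgt        (shows the maintenance/restart history of the target service)
# ls /core /var/core       (locate the core file the SMF log refers to)
# mdb /core
> ::status                 (should report the assert in the T10 state machine)
> ::stack
> ::quit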
Re: [zfs-discuss] Copying between pools
Vahid,

We need to move about 1T of data from one zpool on an EMC DMX-3000 to another storage device (a DMX-3). The DMX-3 can be made visible on the same host where the DMX-3000 is in use, or on another host. What is the best way to transfer the data from the DMX-3000 to the DMX-3? Is it possible to add the new DMX as a submirror of the old DMX and, after the sync is finished, remove the old DMX from the mirror?

See: zpool replace [-f] pool old_device [new_device]

- Jim

Thank you,

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
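That attach-sync-detach cycle is exactly what zpool replace automates: it mirrors the new device onto the old one and detaches the old device once resilvering completes. A minimal sketch, with hypothetical pool and device names:

# zpool replace tank c1t0d0 c2t0d0
# zpool status tank     (watch the resilver; the old device drops off when it finishes)

Note this works device by device, so each DMX-3 LUN must be at least as large as the DMX-3000 LUN it replaces.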
Re: [zfs-discuss] iSCSI targets mapped to a VMWare ESX server
Mertol Ozyoney wrote:

Hi All;

There is a set of issues being looked at that prevent the VMware ESX server from working with the Solaris iSCSI Target: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6597310 At this time there is no target date for when these issues will be resolved.

Jim

We are running the latest Solaris 10 on an X4500 Thumper. We defined a test iSCSI LUN. Output below:

Target: AkhanTemp/VM
    iSCSI Name: iqn.1986-03.com.sun:02:72406bf8-2f5f-635a-f64c-cb664935f3d1
    Alias: AkhanTemp/VM
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 01144fa709302a0047fa50e6
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size: 100G
            Backing store: /dev/zvol/rdsk/AkhanTemp/VM
            Status: online

We tried to access the LUN from a Windows laptop, and it worked without any problems. However, the VMware ESX 3.2 server is unable to access the LUNs. We checked that the virtual interface can ping the X4500. Sometimes it sees the LUN, but then 200+ LUNs with the same properties are listed and we can't add them as storage. Then, after a rescan, they vanish. Any help appreciated.

Mertol

Mertol Ozyoney Storage Practice - Sales Manager Sun Microsystems, TR Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED]
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
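For anyone reproducing this, a listing of that shape can be regenerated on the target side - a sketch, assuming the stock Solaris 10 iscsitgt tooling:

# iscsitadm list target -v

Comparing that output before and after an ESX rescan at least shows whether the phantom 200+ LUNs are a target-side or an initiator-side artifact.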
Re: [zfs-discuss] Moving zfs pool to new machine?
Steve,

Can someone tell me, or point me to links that describe, how to do the following? I had a machine that crashed, and I want to move to a newer machine anyway. The boot disk on the old machine is fried. The two disks I was using for a ZFS pool on that machine need to be moved to a newer machine, now running 2008.05 OpenSolaris. What is the procedure for getting back the pool on the new machine without losing any of the files I had in that pool? I searched the docs but did not find a clear answer, and experimenting with various zfs and zpool commands did not show the two disks or their contents.

To see all pools available to import:

# zpool import

The list should include your prior storage pool's name. Then:

# zpool import pool-name

- Jim

The new disks are c6t0d0s0 and c6t1d0s0. They are identical disks that were set up in a mirrored pool on the old machine.

Thanks, Steve Christensen
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
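One wrinkle worth noting: since the old machine crashed rather than cleanly exporting the pool, zpool import may refuse with a warning that the pool was last accessed by another system. In that case the import must be forced - a sketch, with a hypothetical pool name:

# zpool import
  (lists importable pools, e.g. pool: mypool ... state: ONLINE)
# zpool import -f mypool

The -f only overrides the stale ownership recorded by the dead host; the mirrored data itself is untouched.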
Re: [zfs-discuss] Image with DD from ZFS partition
Hans,

hello, can I create an image from ZFS with the dd command?

Yes, with restrictions. First, a ZFS storage pool must be in the exported state (zpool export) while it is copied, so that a write-order-consistent set of data exists in the copy. ZFS does an excellent job of detecting inconsistencies in the volumes making up a single ZFS storage pool, so a copy of an imported storage pool is sure to be inconsistent, and thus unusable by ZFS.

Although there are various means to copy ZFS (actually, to copy the individual vdevs in a single ZFS storage pool), one cannot zpool import this copy on the same node as the original ZFS storage pool. Unlike other Solaris filesystems, ZFS maintains metadata on each vdev that is used to reconstruct a ZFS storage pool at zpool import time. The logic within zpool import processing will correctly find all constituent volumes (vdevs) of a single ZFS storage pool, but ultimately hides/excludes the other volumes (the copies) from being considered as part of the current or any other zpool import operation. Only the original, not its copy, can be seen or utilized by zpool import.

If possible, the ZFS copy can be moved to, or accessed from, another host (using dual-ported disks, FC SAN, iSCSI SAN, Availability Suite, etc.), and only there can the ZFS copy undergo a successful zpool import.

As a slight segue, Availability Suite (AVS) can create an instantly accessible copy of the constituent volumes (vdevs) of a ZFS storage pool, in lieu of using dd, which can take minutes or hours. This is the Point-in-Time Copy, or II (Instant Image), part of AVS. This copy can also be replicated to a remote Solaris host, where it can be imported. This is the Remote Copy, or SNDR (Network Data Replicator), part of AVS. AVS also supports synchronously or asynchronously replicating the actual ZFS storage pool to another host (no local copy needed), where the replica can then be zpool imported. See: opensolaris.org/os/project/avs/, plus the demos.

when I work with Linux I use partimage to create an image from one partition and store it on another, so I can restore it if an error occurs. partimage does not work with ZFS, so I must use the dd command. I think so:

dd if=/dev/sda1 of=/backup/image

can I create an image this way, and restore it the other way:

dd if=/backup/image of=/dev/sda1

when I have two partitions with ZFS, can I boot from the live CD and mount one partition to use it as a backup target? Or is it possible to create an ext2 partition and use a Linux rescue CD to back up the ZFS partition with dd?

This message posted from opensolaris.org

-- Jim Dunham Engineering Manager Storage Platform Software Group Sun Microsystems, Inc.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
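To make the restriction concrete, a minimal sketch of a consistent dd copy on Solaris - the pool and device names are hypothetical, and every vdev in the pool must be copied:

# zpool export tank
# dd if=/dev/rdsk/c1t0d0s0 of=/backup/c1t0d0s0.img bs=1048576
# (repeat for each vdev in tank)
# zpool import tank

Restoring is each dd reversed; as noted above, the images can only be zpool imported on a host that does not also see the original vdevs.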
[zfs-discuss] SMC Webconsole 3.1 and ZFS Administration 1.0 - stacktraces in snv_b89
I've installed SXDE (snv_89) and found that the web console only listens on https://localhost:6789/ now, and the module for ZFS administration doesn't work. When I open the link, the left frame lists a stacktrace (below) and the right frame is plain empty. Any suggestions?

I tried substituting different SUNWzfsgr and SUNWzfsgu packages from older Solarises (x86/sparc, snv_77/84/89, sol10u3/u4), and directly substituting the zfs.jar file, but these actions resulted in either the same error or a crash-and-restart of the SMC webserver. I didn't yet try installing older SUNWmco* packages (a 10u4 system with SMC 3.0.2 works ok); I'm not sure it's a good idea ;) The system has JDK 1.6.0_06 by default - maybe that's the culprit? I tried setting it to JDK 1.5.0_15 and the zfs web-module refused to start and register itself...

===
Application Error
com.iplanet.jato.NavigationException: Exception encountered during forward
Root cause = [java.lang.IllegalArgumentException: No enum const class com.sun.zfs.common.model.AclInheritProperty$AclInherit.restricted]
Notes for application developers:
* To prevent users from seeing this error message, override the onUncaughtException() method in the module servlet and take action specific to the application
* To see a stack trace from this error, see the source for this page
Generated Thu May 29 17:39:50 MSD 2008
===

In fact, the traces in the logs are quite long (several screenfuls) and nearly the same; this one starts as:

===
com.iplanet.jato.NavigationException: Exception encountered during forward
Root cause = [java.lang.IllegalArgumentException: No enum const class com.sun.zfs.common.model.AclInheritProperty$AclInherit.restricted]
    at com.iplanet.jato.view.ViewBeanBase.forward(ViewBeanBase.java:380)
    at com.iplanet.jato.view.ViewBeanBase.forwardTo(ViewBeanBase.java:261)
    at com.iplanet.jato.ApplicationServletBase.dispatchRequest(ApplicationServletBase.java:981)
    at com.iplanet.jato.ApplicationServletBase.processRequest(ApplicationServletBase.java:615)
    at com.iplanet.jato.ApplicationServletBase.doGet(ApplicationServletBase.java:459)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
...
===

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
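Tangentially, for the localhost-only listening: builds in this range bind the console to 127.0.0.1 unless told otherwise. A sketch of re-enabling remote access, assuming the SMF webconsole service and its tcp_listen option behave as on Solaris 10/Nevada:

# svccfg -s svc:/system/webconsole setprop options/tcp_listen=true
# svcadm refresh svc:/system/webconsole
# /usr/sbin/smcwebserver restart

This does not address the ZFS-module stacktrace itself; from the enum name in the root cause, that looks like the console's zfs.jar not recognizing the newer aclinherit "restricted" property value.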
[zfs-discuss] Liveupgrade snv_77 with a ZFS root to snv_89
We have a test machine installed with a ZFS root (snv_77/x86 and rootpool/rootfs with grub support). Recently we tried to update it to snv_89, which (in the Flag Days list) claimed more support for ZFS boot roots, but the installer disk didn't find any previously installed operating system to upgrade.

Then we tried to install the SUNWlu* packages from the snv_89 disk onto the snv_77 system. That worked in terms of package updates, but lucreate fails:

# lucreate -n snv_89
ERROR: The system must be rebooted after applying required patches. Please reboot and try again.

Apparently we rebooted a lot and it did not help... How can we upgrade the system? In particular, how does LU do it? :)

Now working on an idea to update all existing packages in the cloned root, using pkgrm/pkgadd -R. Updating only some packages (kernel, zfs, libs) didn't help much. A backup plan is to move the ZFS root back to UFS, update, and move it back. That would probably work, but it's not an elegant job ;)

Suggestions welcome - maybe we'll try some of them and report ;)

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
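A minimal sketch of the pkgrm/pkgadd -R idea, assuming the clone is mounted at /a and the snv_89 media is mounted under /cdrom (the product directory and package name are illustrative only):

# pkgrm -R /a SUNWzfsu
# pkgadd -R /a -d /cdrom/Solaris_11/Product SUNWzfsu

-R points the package-database operations at the alternate root instead of /, which is essentially what Live Upgrade automates across the whole package set.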
Re: [zfs-discuss] Liveupgrade snv_77 with a ZFS root to snv_89
You mean this: https://www.opensolaris.org/jive/thread.jspa?threadID=46626&tstart=120

Elegant script, I like it, thanks :) Trying now... Some patching follows:

-for fs in `zfs list -H | grep ^$ROOTPOOL/$ROOTFS | awk '{ print $1 };'`
+for fs in `zfs list -H | grep ^$ROOTPOOL/$ROOTFS | grep -w $ROOTFS | grep -v '@' | awk '{ print $1 };'`

In essence, skip snapshots (@) and non-rootpool/rootfs/subfs paths. On my system I happen to have both problems (a clone rootpool/rootfs_snv77 and some snapshots of both the clone and rootfs).

Alas, so far the upgrade didn't get going (ttinstall doesn't see the old system, neither the ZFS root nor the older UFS SVM-mirror root), although rootpool/rootfs got mounted to /a. I'm now reboot-and-retrying - perhaps early tests and script rewrites/reruns messed something up.

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
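For illustration, a hypothetical zfs list -H listing and what the patched pipeline does with each entry (the names mirror the ones mentioned above):

rootpool/rootfs               kept
rootpool/rootfs/var           kept (a subfs; grep -w still matches the word rootfs across the /)
rootpool/rootfs@pre-upgrade   dropped by grep -v '@'
rootpool/rootfs_snv77         dropped by grep -w (the underscore makes rootfs_snv77 one word)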
Re: [zfs-discuss] Liveupgrade snv_77 with a ZFS root to snv_89
Alas, didn't work so far. Can the problem be that the zfs-root disk is not the first on the controller (the system boots from the grub on the older ufs-root slice), and/or that the zfs root is mirrored? And that I have snapshots and a data pool too?

These are the boot disks (SVM mirror with ufs and grub):

[EMAIL PROTECTED] /]# metastat -c
d1       m  4.0GB d12 d10
    d12  s  4.0GB c3t2d0s0
    d10  s  4.0GB c3t0d0s0

This is the actual system:

[EMAIL PROTECTED] /]# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        pool          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c3t0d0s3  ONLINE       0     0     0
            c3t1d0s3  ONLINE       0     0     0
            c3t2d0s3  ONLINE       0     0     0
            c3t3d0s3  ONLINE       0     0     0

errors: No known data errors

  pool: rootpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rootpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0
            c3t3d0s0  ONLINE       0     0     0

errors: No known data errors

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss