Re: [zfs-discuss] Enabling compression/encryption on a populated filesystem

2006-07-17 Thread Darren J Moffat

Jeff Victor wrote:
Why?  Is the 'data is encrypted' flag only stored in filesystem 
metadata, or is that flag stored in each data block?  


Like compression and the choice of checksum algorithm, it will be stored in
every DMU object.

If the latter is 
true, it would be possible (though potentially time-consuming) to 
determine which files are encrypted, and which are not.


Exactly, time consuming and expensive.

But the real question is how you tell the admin "it's done, the filesystem
is now safe".  With compression you don't generally care if some old stuff
didn't compress (and with the current implementation it has to compress a
certain amount or it gets written uncompressed anyway).  With encryption the
human admin really needs to be told.


So far there really isn't any time-consuming or offline action for ZFS 
datasets that actually impacts your data, so I'd rather not introduce 
one now (I wouldn't really put scrub in this category).



--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Enabling compression/encryption on a populated filesystem

2006-07-17 Thread Darren J Moffat

Bill Sommerfeld wrote:

On Fri, 2006-07-14 at 07:03, Darren J Moffat wrote:
The current plan is that encryption must be turned on when the file 
system is created and can't be turned on later.  This means that the 
zfs-crypto work depends on the RFE to set properties at file system 
creation time.


You also won't be able to turn crypto off for a given filesystem later 
(because you won't know when all the data is back in the clear again and 
you can safely destroy the key).


So, I'd think that, in the fullness of time, you'd want some sort of
mechanism for graceful key roll-over -- i.e., you'd set a new key,
migrate existing data encrypted using the old key to the new key, then
forget the old key; the whole point of keyed cryptography is that the
key is kept both small (so it can more easily remain secret) AND
changeable.


One way, and the initial way we will deal with this, is to have the key 
change be done on the master wrapping key, not on the actual per-dataset 
encryption keys.
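
To illustrate why rolling the wrapping key is cheap (this is generic key
wrapping shown with openssl, not ZFS syntax, and all file names here are made
up): the bulk data is encrypted under a small per-dataset key, and only that
key needs to be re-encrypted when the master key changes.

$ openssl rand -out master.key 32          # current wrapping key
$ openssl rand -out new-master.key 32      # replacement wrapping key
$ openssl rand -out dataset.key 32         # per-dataset data encryption key
$ openssl enc -aes-256-cbc -in dataset.key -out dataset.key.wrapped \
      -pass file:master.key                # wrap the data key
$ openssl enc -d -aes-256-cbc -in dataset.key.wrapped -pass file:master.key | \
      openssl enc -aes-256-cbc -out dataset.key.rewrapped -pass file:new-master.key
      # the key roll re-wraps ~32 bytes; the encrypted data is untouched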


One of the goals of the ZFS crypto project is to support multiple 
different key management strategies with the same on-disk capabilities.


Key rollover is on the agenda for a later phase, as is key expiry 
(manual and time-based).


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Mikael Kjerrman
Hi,

so it happened...

I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot the whole 
pool became unavailable after apparently losing a disk drive. (The drive seems 
OK as far as I can tell from other commands.)

--- bootlog ---
Jul 17 09:57:38 expprd fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-CS, 
TYPE: Fault, VER: 1, SEVERITY: Major
Jul 17 09:57:38 expprd EVENT-TIME: Mon Jul 17 09:57:38 MEST 2006
Jul 17 09:57:38 expprd PLATFORM: SUNW,UltraAX-i2, CSN: -, HOSTNAME: expprd
Jul 17 09:57:38 expprd SOURCE: zfs-diagnosis, REV: 1.0
Jul 17 09:57:38 expprd EVENT-ID: e2fd61f7-a03d-6279-d5a5-9b8755fa1af9
Jul 17 09:57:38 expprd DESC: A ZFS pool failed to open.  Refer to 
http://sun.com/msg/ZFS-8000-CS for more information.
Jul 17 09:57:38 expprd AUTO-RESPONSE: No automated response will occur.
Jul 17 09:57:38 expprd IMPACT: The pool data is unavailable
Jul 17 09:57:38 expprd REC-ACTION: Run 'zpool status -x' and either attach the 
missing device or
Jul 17 09:57:38 expprd  restore from backup.
---

--- zpool status -x ---
bash-3.00# zpool status -x
  pool: data
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
dataUNAVAIL  0 0 0  insufficient replicas
  c1t0d0ONLINE   0 0 0
  c1t1d0ONLINE   0 0 0
  c1t2d0ONLINE   0 0 0
  c1t3d0ONLINE   0 0 0
  c2t0d0ONLINE   0 0 0
  c2t1d0ONLINE   0 0 0
  c2t2d0ONLINE   0 0 0
  c2t3d0ONLINE   0 0 0
  c2t4d0ONLINE   0 0 0
  c1t4d0UNAVAIL  0 0 0  cannot open
--

The problem as I see it is that the pool should be able to handle one disk 
failure, no?  And the online, attach, and replace commands don't work when the 
pool is unavailable. I've filed a case with Sun, but thought I'd ask around here 
to see if anyone has experienced this before.


cheers,

//Mikael
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Jeff Bonwick
 I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot
 the whole pool became unavailable after apparently losing a disk drive.
 [...]
 NAMESTATE READ WRITE CKSUM
 dataUNAVAIL  0 0 0  insufficient replicas
   c1t0d0ONLINE   0 0 0
 [...]
   c1t4d0UNAVAIL  0 0 0  cannot open
 --
 
 The problem as I see it is that the pool should be able to handle
 1 disk error, no?

If it were a raidz pool, that would be correct.  But according to
zpool status, it's just a collection of disks with no replication.
Specifically, compare these two commands:

(1) zpool create data A B C

(2) zpool create data raidz A B C

Assume each disk has 500G capacity.

The first command will create an unreplicated pool with 1.5T capacity.
The second will create a single-parity RAID-Z pool with 1.0T capacity.
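
A quick way to check which layout a pool actually has (a sketch, using the
placeholder disks above):

(1) zpool create data raidz A B C
(2) zpool status data
        (the disks should appear indented under a raidz vdev line, rather
         than directly under the pool name as in the output quoted above)
(3) zfs list data
        (usable space should be roughly 1.0T here, versus roughly 1.5T for
         the unreplicated pool)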

My guess is that you intended the latter, but actually typed the former,
perhaps assuming that RAID-Z was always present.  If so, I apologize for
not making this clearer.  If you have any suggestions for how we could
improve the zpool(1M) command or documentation, please let me know.

One option -- I confess up front that I don't really like it -- would be
to make 'unreplicated' an explicit replication type (in addition to
mirror and raidz), so that you couldn't get it by accident:

zpool create data unreplicated A B C

The extra typing would be annoying, but would make it almost impossible
to get the wrong behavior by accident.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS bechmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Jonathan Wheeler
Hi All,

I've just built an 8 disk zfs storage box, and I'm in the testing phase before 
I put it into production. I've run into some unusual results, and I was hoping 
the community could offer some suggestions. I've basically made the switch to 
Solaris on the promises of ZFS alone (yes I'm that excited about it!), so 
naturally I'm looking forward to some great performance - but it appears I'm 
going to need some help finding all of it.

I was having even lower numbers with filebench, so I decided to dial back to a 
really simple app for testing - bonnie.

The system is a nevada_41 EM64T 3 GHz Xeon with 1 GB RAM, 8x Seagate SATA II 
300 GB disks, and a Supermicro SAT2-MV8 8-port SATA controller running on a 
133 MHz 64-bit PCI-X bus.
The bottleneck here, by my thinking, should be the disks themselves.
It's not the disk interfaces (300 MB/s), the disk bus (300 MB/s each), or the 
PCI-X bus (1.1 GB/s), and I'd hope a 64-bit 3 GHz CPU would be sufficient.

Tests were run on a fresh clean zpool, on an idle system. Rogue results were 
dropped, and as you can see below, all tests were run more than once. 8 GB 
should be far more than the 1 GB of RAM that the system has, eliminating 
caching issues.

If I've still managed to overlook something in my testing setup, please let me 
know - I sure did try!
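
For reference, the pools under test would be created and exercised roughly
like this (device names, the mirror layout, and the bonnie flags are
assumptions, not taken from the original run):

# zpool create tank c2t0d0 c2t1d0 c2t2d0 c2t3d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0
        (plain stripe, the "raid0" case - no redundancy)
# zpool create tank mirror c2t0d0 c2t1d0 c2t2d0 c2t3d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0
        (one 8-way mirror vdev, which matches the "keeping 8 disks in sync"
         comment below)
# zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0
        (a single raidz group of all 8 disks)
# bonnie -d /tank -s 8196 -m "8 disk"
        (8196 MB working set against 1 GB of RAM)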

Sorry about the formatting - this is bound to end up ugly

Bonnie
         ---Sequential Output--- ---Sequential Input--- --Random--
         -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raid0    MB   K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
8 disk   8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0 2.0
8 disk   8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9 2.1

so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be 
LOWER than writes??

         ---Sequential Output--- ---Sequential Input--- --Random--
         -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
mirror   MB   K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5 1.3
8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4 1.8

46MB/sec writes; each disk individually can do better, but I guess keeping 8 
disks in sync is hurting performance. The 94MB/sec read figure is interesting. 
On the one hand, that's greater than one disk's worth, so I'm getting striping 
performance out of a mirror - GO ZFS. On the other, if I can get striping 
performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not 
cpu bound.


Now for the important test, raid-z

         ---Sequential Output--- ---Sequential Input--- --Random--
         -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raidz    MB   K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
8 disk   8196 61785 70.9 142797 29.3 89342 19.9 64197 85.7 320554 32.6 131.3 1.0
8 disk   8196 62869 72.4 131801 26.7 90692 20.7 63986 85.7 306152 33.4 127.3 1.0
8 disk   8196 63103 72.9 128164 25.9 86175 19.4 64126 85.7 320410 32.7 124.5 0.9
7 disk   8196 51103 58.8  93815 19.1 74093 16.1 64705 86.5 331865 32.8 124.9 1.0
7 disk   8196 49446 56.8  93946 18.7 73092 15.8 64708 86.7 331458 32.7 127.1 1.0
7 disk   8196 49831 57.1  81305 16.2 78101 16.9 64698 86.4 331577 32.7 132.4 1.0
6 disk   8196 62360 72.3 157280 33.4 99511 21.9 65360 87.3 288159 27.1 132.7 0.9
6 disk   8196 63291 72.8 152598 29.1 97085 21.4 65546 87.2 292923 26.7 133.4 0.8
4 disk   8196 57965 67.9 123268 27.6 78712 17.1 66635 89.3 189482 15.9 134.1 0.9

I'm getting distinctly non-linear scaling here.
Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8 = 33MB/sec with 
cpu to spare (roughly half of what each individual disk should be capable of). 
Here I'm getting 123/4 = 30MB/sec, or should that be 123/3 = 41MB/sec?
Using 30 as a baseline, I'd be expecting to see twice that with 8 disks 
(240ish?). What I end up with is ~135 - clearly not good scaling at all.
The really interesting numbers happen at 7 disks - it's slower than with 4, in 
all tests.
I ran it 3x to be sure.
Note this was a native 7 disk raid-z, it wasn't 8 running in degraded mode with 
7.
Something is really wrong with my write performance here across the board.

Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks 
should scale to 380 then; well, 320 isn't all that far off - no biggie.
Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are 
good for 60+MB/sec individually. 290 is 48/disk - note also that this is better 
than my raid0 performance?!
Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? 
Something is going very wrong here too.

The 7 disk raidz read test is about what I'd expect (330/7 = 47/disk), but it 
shows that the 8 disk is actually going backwards.
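
For what it's worth, watching per-vdev and per-device throughput while a run
is in flight should show whether one disk or the controller is lagging the
rest (the pool name here is assumed):

# zpool iostat -v tank 5
        (per-vdev read/write bandwidth and operations every 5 seconds)
# iostat -xn 5
        (per-device service times and %busy from the OS side)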

hmm...


I understand that 

Re: [zfs-discuss] Re: zvol Performance

2006-07-17 Thread Neil Perrin

This is change request:

6428639 large writes to zvol synchs too much, better cut down a little

which I have a fix for, but it hasn't been put back.

Neil.

Jürgen Keil wrote On 07/17/06 04:18,:

Further testing revealed
that it wasn't an iSCSI performance issue but a zvol
issue.  Testing on a SATA disk locally, I get these
numbers (sequential write):

UFS: 38MB/s
ZFS: 38MB/s
Zvol UFS: 6MB/s
Zvol Raw: ~6MB/s

ZFS is nice and fast but Zvol performance just drops
off a cliff.  Suggestion or observations by others
using zvol would be extremely helpful.   



# zfs create -V 1g data/zvol-test
# time dd if=/data/media/sol-10-u2-ga-x86-dvd.iso of=/dev/zvol/rdsk/data/zvol-test bs=32k count=10000
10000+0 records in
10000+0 records out
0.08u 9.37s 2:21.56 6.6%

That's ~ 2.3 MB/s.

I do see *frequent* DKIOCFLUSHWRITECACHE ioctls
(one flush write cache ioctl after writing ~36KB of data, needs ~6-7 
milliseconds per flush):


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02778, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 5736778 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e027c0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6209599 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02808, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6572132 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02850, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6732316 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02898, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6175876 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e028e0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6251611 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02928, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7756397 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02970, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6393356 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e029b8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6147003 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a00, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6247036 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a48, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6061991 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a90, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6284297 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02ad8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6174818 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02b20, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6245923 
nsec, error 0



dtrace with stack backtraces:


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec10, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6638189 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec58, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7881400 
nsec, error 0



Re: [zfs-discuss] Large device support

2006-07-17 Thread Robert Milkowski
Hello J.P.,

Monday, July 17, 2006, 2:15:56 PM, you wrote:

JPK Possibly not the right list, but the only appropriate one I knew about.

JPK I have a Solaris box (just reinstalled to Sol 10 606) with a 3.19TB device
JPK hanging off it, attached by fibre.

JPK Solaris refuses to see this device except as a 1.19 TB device.

JPK Documentation that I have found
JPK (http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1k?a=view#disksconcepts-17)
JPK suggests that this will not work with the ssd or sd drivers.

JPK Is this really the case?  If that isn't supported, what is?

Well, if sd/ssd with EFI labels really do still have a 2TB limit, then create
an SMI label with one slice representing the whole disk and put zfs on that
slice. You would then want to turn on the write cache manually.
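
A rough sketch of that approach (the device name is made up, and the exact
format menu steps vary by disk, so check against your own hardware first):

# format -e c5t0d0
        (put an SMI/VTOC label on the disk, then make slice 0 span the whole
         disk)
# zpool create data c5t0d0s0
        (ZFS is given a slice rather than the whole disk, so it will not
         enable the disk's write cache itself)
# format -e c5t0d0
        (expert mode exposes a cache -> write_cache -> enable menu to turn
         the write cache on manually)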


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Enabling compression/encryption on a populated filesystem

2006-07-17 Thread Luke Scharf




Darren J Moffat wrote:
But the real
question is how you tell the admin "it's done, the filesystem
is now safe". With compression you don't generally care if some old stuff
didn't compress (and with the current implementation it has to compress
a certain amount or it gets written uncompressed anyway). With
encryption the human admin really needs to be told.

As a sysadmin, I'd be happy with another scrub-type command. Something
with the following meaning: 
"Reapply all block-level properties such as compression,
encryption, and checksum to every block in the volume. Have the admin
come back tomorrow and run 'zpool status' to see if it's done." 

Mad props if I can do this on a live filesystem (like the other ZFS
commands, which also get mad props for being good tools).

A natural command for this would be something like "zfs blockscrub
tank/volume". Also, "zpool blockscrub tank" would make sense to me as
well, even though it might touch more data.
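
In use, that hypothetical command might look like this ('zfs blockscrub' does
not exist today; the pool and dataset names are made up):

# zfs set compression=on tank/volume
        (the new property only affects newly written blocks)
# zfs blockscrub tank/volume
        (rewrite existing blocks so the property applies everywhere)
# zpool status tank
        (come back tomorrow and check whether it has finished)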

Of course, it's easy for me to just say this, since I'm not thinking
about the implementation very deeply...

-Luke





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King

Well, if sd/ssd with EFI labels really do still have a 2TB limit, then create
an SMI label with one slice representing the whole disk and put zfs on that
slice. You would then want to turn on the write cache manually.


How do you suggest that I create a slice representing the whole disk?
format (with or without -e) only sees 1.19TB




Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King

Well, if sd/ssd with EFI labels really do still have a 2TB limit, then create
an SMI label with one slice representing the whole disk and put zfs on that
slice. You would then want to turn on the write cache manually.


Well, in fact it turned out that the firmware on the device needed 
upgrading to support the appropriate SCSI extensions.


The documentation is still wrong in that it suggests that the ssd/sd 
driver shouldn't work with devices larger than 2TB, but I am happy, so no problem.



Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Al Hopper
On Mon, 17 Jul 2006, Darren J Moffat wrote:

 Jeff Bonwick wrote
  zpool create data unreplicated A B C
 
  The extra typing would be annoying, but would make it almost impossible
  to get the wrong behavior by accident.

 I think that is a very good idea from a usability view point.  It is
 better to have to type a few more chars to explicitly say "I know ZFS
 isn't going to do all the data replication" when you run zpool than to
 find out later you aren't protected (by ZFS anyway).

+1

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King

On Mon, 17 Jul 2006, Cindy Swearingen wrote:


Hi Julian,

Can you send me the documentation pointer that says 2 TB isn't supported
on the Solaris 10 6/06 release?


As per my original post:
http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1k?a=view#disksconcepts-17

This doesn't say which version of Solaris 10 it is talking about.


The  2 TB limit was lifted in the Solaris 10 1/06 release, as described
here:

http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1j?a=view#ftzen


Thanks.  That was exactly what I was looking for but had failed to find, 
confirmation that it should work.



Cindy


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Richard Elling

I too have seen this recently, due to a partially failed drive.
When I physically removed the drive, ZFS figured everything out and
I was back up and running.  Alas, I have been unable to recreate the problem.
There is a bug lurking here; if someone has a more clever way to
test, we might be able to nail it down.
 -- richard

Mikael Kjerrman wrote:

Hi,

so it happened...

I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot the whole 
pool became unavailable after apparently losing a disk drive. (The drive seems 
OK as far as I can tell from other commands.)

[...]

The problem as I see it is that the pool should be able to handle one disk 
failure, no?  And the online, attach, and replace commands don't work when the 
pool is unavailable. I've filed a case with Sun, but thought I'd ask around here 
to see if anyone has experienced this before.


cheers,

//Mikael
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS bechmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Al Hopper
On Mon, 17 Jul 2006, Roch wrote:


 Sorry to plug my own blog but have you had a look at these ?

   http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to (raidz)
   http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs

 Also, my thinking is that raid-z is probably more friendly
 when the config contains (power-of-2 + 1) disks (or + 2 for
 raid-z2).

+1

I think that 5 disks for a raidz is the sweet spot IMHO.  But ... YMMV etc. etc.

FWIW: here's a datapoint from a dirty raidz system with 8Gb of RAM and 5 x
300Gb SATA disks:

Version  1.03       --Sequential Output-- --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfs0            16G 88937  99 195973  47 95536  29 75279  95 228022  27 433.9   1
                    --Sequential Create-- Random Create
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 31812  99 + +++ + +++ 28761  99 + +++ + +++
zfs0,16G,88937,99,195973,47,95536,29,75279,95,228022,27,433.9,1,16,31812,99,+,+++,+,+++,28761,99,+,+++,+,+++

I'm *very* pleased with the current release of ZFS.  That being said, ZFS
can be frustrating at times.  Occasionally it'll issue in excess of 1k I/O
ops a second (IOPS) and you'll say "holy snit, look at..." - and then
there are times you wonder why it won't issue more than ~250 IOPS.  But,
for a Rev 1 filesystem, with the technical complexity of ZFS, this level
of performance is excellent IMHO and I expect that all kinds of
improvements will continue to be made to the code over time.

Jonathan - I expect the answer to your performance expectations is that
ZFS is-what-it-is at the moment.  A suggestion is to split your 8 drives
into a 5 disk raidz pool and a 2 disk mirror with one spare drive
remaining.  Of course this is from my ZFS experience and for my intended
usage and may not apply to your intended application(s).
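
That split might look something like this (device names are placeholders, and
whether the mirror becomes a second pool or a second vdev depends on what the
data is for):

# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c2t0d0
# zpool create fast mirror c2t1d0 c2t2d0
        (c2t3d0 left over as the spare drive)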

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Bart Smaalders

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote:

Mark Shellenbaum wrote:

PERMISSION GRANTING

zfs allow -c ability[,ability...] dataset

-c Create means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

ALLOW EXAMPLE 

Let's set up a public build machine where engineers in group staff can 
create ZFS file systems, clones, snapshots and so on, but you want to allow 
only the creator of the file system to destroy it.


# zpool create sandbox disks
# chmod 1777 /sandbox
# zfs allow -l staff create sandbox
# zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?


I think you're asking for the -c Creator flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt


Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.

- Bart

--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

Bart Smaalders wrote:

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote:

Mark Shellenbaum wrote:

PERMISSION GRANTING

zfs allow -c ability[,ability...] dataset

-c Create means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

ALLOW EXAMPLE
Let's set up a public build machine where engineers in group staff 
can create ZFS file systems, clones, snapshots and so on, but you want 
to allow only the creator of the file system to destroy it.


# zpool create sandbox disks
# chmod 1777 /sandbox
# zfs allow -l staff create sandbox
# zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?


I think you're asking for the -c Creator flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt


Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.



Yes, you can delegate snapshot,clone,...

# zfs allow user snapshot,mount,clone,whatever pool

that will allow the above permissions to be inherited by all datasets in 
the pool.


If you wanted to open it up even more you could do

# zfs allow everyone snapshot,mount,clone,whatever pool
That would allow anybody to create a snapshot,clone,...

The -l and -d control the inheritance of the allow permissions.
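
For example, using the proposed syntax from this thread (none of this is in
shipping bits yet):

# zfs allow -l bart snapshot tank/home
        (tank/home itself only)
# zfs allow -d bart snapshot tank/home
        (descendants only, e.g. tank/home/barts)
# zfs allow bart snapshot tank/home
        (this dataset and all of its descendants)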


- Bart



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Matthew Ahrens
On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:
 So as administrator what do I need to do to set
 /export/home up for users to be able to create their own
 snapshots, create dependent filesystems (but still mounted
 underneath their /export/home/usrname)?
 
 In other words, is there a way to specify the rights of the
 owner of a filesystem rather than the individual - eg, delayed
 evaluation of the owner?
 
 I think you're asking for the -c Creator flag.  This allows
 permissions (eg, to take snapshots) to be granted to whoever creates the
 filesystem.  The above example shows how this might be done.
 
 --matt
 
 Actually, I think I mean owner.
 
 I want root to create a new filesystem for a new user under
 the /export/home filesystem, but then have that user get the
 right privs via inheritance rather than requiring root to run
 a set of zfs commands.

In that case, how should the system determine who the owner is?  We
toyed with the idea of figuring out the user based on the last component
of the filesystem name, but that seemed too tricky, at least for the
first version.

FYI, here is how you can do it with an additional zfs command:

# zfs create tank/home/barts
# zfs allow barts create,snapshot,... tank/home/barts

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Bart Smaalders

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?

I think you're asking for the -c Creator flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt

Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.


In that case, how should the system determine who the owner is?  We
toyed with the idea of figuring out the user based on the last component
of the filesystem name, but that seemed too tricky, at least for the
first version.

FYI, here is how you can do it with an additional zfs command:

# zfs create tank/home/barts
# zfs allow barts create,snapshot,... tank/home/barts

--matt


Owner of the top level directory is the owner of the filesystem?

- Bart


--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread Torrey McMahon

Or if you have the right patches ...

http://blogs.sun.com/roller/page/torrey?entry=really_big_luns

Cindy Swearingen wrote:

Hi Julian,

Can you send me the documentation pointer that says 2 TB isn't supported
on the Solaris 10 6/06 release?

The  2 TB limit was lifted in the Solaris 10 1/06 release, as described
here:

http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1j?a=view#ftzen

Thanks,

Cindy



J.P. King wrote:

Well, if sd/ssd with EFI labels really do still have a 2TB limit, then create
an SMI label with one slice representing the whole disk and put zfs on that
slice. You would then want to turn on the write cache manually.



Well, in fact it turned out that the firmware on the device needed 
upgrading to support the appropriate SCSI extensions.


The documentation is still wrong in that it suggests that the ssd/sd 
driver shouldn't work with devices larger than 2TB, but I am happy, so no problem.



Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Nicolas Williams
On Mon, Jul 17, 2006 at 10:11:35AM -0700, Matthew Ahrens wrote:
  I want root to create a new filesystem for a new user under
  the /export/home filesystem, but then have that user get the
  right privs via inheritance rather than requiring root to run
  a set of zfs commands.
 
 In that case, how should the system determine who the owner is?  We
 toyed with the idea of figuring out the user based on the last component
 of the filesystem name, but that seemed too tricky, at least for the
 first version.

The owner of the root directory of the ZFS filesystem in question.
Could delegation be derived from the ACL of the directory that would
contain a new ZFS filesystem?

E.g.,

# zfs create pool/foo
# chown joe pool/foo
# su - joe
% zfs create pool/foo/a
% chmod A+user:jane:add_subdirectory:allow /pool/foo/a
      (an ACE that allows jane to create directories there)
% exit
# su - jane
% zfs create pool/foo/a/b
% 
...

After all, with cheap filesystems creating a filesystem is almost like
creating a directory (I know, not quite the same, but perhaps close
enough for reusing the add_subdirectory ACE flag).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread James Dickens

On 7/17/06, Mark Shellenbaum [EMAIL PROTECTED] wrote:

The following is the delegated admin model that Matt and I have been
working on.  At this point we are ready for your feedback on the
proposed model.

   -Mark




PERMISSION GRANTING

zfs allow [-l] [-d] everyone|user|group ability[,ability...] \
dataset
zfs allow [-l] [-d] -u user ability[,ability...] dataset
zfs allow [-l] [-d] -g group ability[,ability...] dataset
zfs allow [-l] [-d] -e ability[,ability...] dataset
zfs allow -c ability[,ability...] dataset

If no flags are used, the ability will be allowed for the specified
dataset and all of its descendents.

-l Local means that the permission will be allowed for the
specified dataset, and not its descendents (unless -d is also
specified).

-d Descendents means that the permission will be allowed for
descendent datasets, and not for this dataset (unless -l is also
specified).  (needed for 'zfs allow -d ahrens quota tank/home/ahrens')

When using the first form (without -u, -g, or -e), the
everyone|user|group argument will be interpreted as the keyword
everyone if possible, then as a user if possible, then as a group if
possible.  The -u user, -g group, and -e (everyone) forms
allow one to specify a user named everyone, or a group whose name
conflicts with a user (or everyone).  (note: the -e form is not
necessary since zfs allow everyone will always mean the keyword
everyone not the user everyone.)

As a possible extension, multiple who's could be allowed in one
command (eg. 'zfs allow -u ahrens,marks create tank/project')

-c Create means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

Abilities are mostly self-explanatory: the ability to run
'zfs [set] ability ds'.  Note, this implicitly collapses the
subcommand and property namespaces into one.  (I think that the 'set' is
superfluous anyway; it would be more convenient to just say
'zfs property=value'.)

create  create descendent datasets
destroy
snapshot
rollback
clone   create clone of any of the ds's snaps
(must also have 'create' ability in clone's parent)
promote (must also have 'promote' ability in origin fs)
rename  (must also have 'create' ability in new parent)
mount   mount and unmount the ds
share   share and unshare this ds
send    send any of the ds's snapshots
receive create a descendent with 'zfs receive'
(must also have 'create' ability)
quota
reservation
volsize
recordsize
mountpoint
sharenfs
checksum
compression
atime
devices
exec
setuid
readonly
zoned
snapdir
aclmode
aclinherit


Hi

just one addition: an "all" or "full" ability, for the case where you want
to grant full permissions to the user or group

zfs create p1/john
zfs  allow  p1/john john  full

so we don't have to type out every attribute.


James Dickens
uadmin.blogspot.com





PERMISSION REVOKING

zfs unallow dataset [-r] [-l] [-d]
everyone|user|group[,everyone|user|group...] \
ability[,ability...] dataset
zfs unallow [-r][-l][-d] -u user ability[,ability...]  dataset
zfs unallow [-r][-l][-d] -g group ability[,ability...]  dataset
zfs unallow [-r][-l][-d] -e ability[,ability...]  dataset

'zfs unallow' removes permissions that were granted with 'zfs allow'.
Note that this does not explicitly deny any permissions; the permissions
may still be allowed by ancestors of the specified dataset.

-l Local will cause only the Local permission to be removed.

-d Descendents will cause only the Descendant permissions to be
removed.

-r Recursive will remove the specified permissions from all descendant
datasets, as if 'zfs unallow' had been run on each descendant.

Note that '-r' removes abilities that have been explicitly set on
descendants, whereas '-d' removes abilities that have been set on *this*
dataset but apply to descendants.
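
For example, with the proposed syntax above (again, nothing here is shipping
yet):

# zfs unallow -d bart snapshot tank/home
        (removes only the Descendent portion of a grant made on tank/home)
# zfs unallow -r bart snapshot tank/home
        (walks the descendants and removes grants made directly on each of
         them)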


PERMISSION PRINTING

zfs allow [-1] dataset

prints permissions that are set or allowed on this dataset, in the
following format:

whotype who ability[,ability...] (type)

whotype is user, group, or everyone
who is the user or group name, or blank for everyone and create
type can be:
Local (ie. set here with -l)
Descendent (ie. set here with -d)
Local+Descendent (ie. set here with no flags)
Create (ie. set here with -c)
Inherited from dataset (ie. set on an ancestor without -l)

By default, only one line with a given whotype,who,type will be
printed (ie. abilities will be consolidated into one line of output 
where possible).

-1 One will cause each line of output to print only a single ability,

Re: [zfs-discuss] ZFS bechmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread James Dickens

On 7/17/06, Jonathan Wheeler [EMAIL PROTECTED] wrote:

Hi All,

I've just built an 8 disk zfs storage box, and I'm in the testing phase before 
I put it into production. I've run into some unusual results, and I was hoping 
the community could offer some suggestions. I've basically made the switch to 
Solaris on the promises of ZFS alone (yes I'm that excited about it!), so 
naturally I'm looking forward to some great performance - but it appears I'm 
going to need some help finding all of it.

[...]

I'm getting distinctly non-linear scaling here.
Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8 = 33MB/sec with 
cpu to spare (roughly half of what each individual disk should be capable of). 
Here I'm getting 123/4 = 30MB/sec, or should that be 123/3 = 41MB/sec?
Using 30 as a baseline, I'd be expecting to see twice that with 8 disks 
(240ish?). What I end up with is ~135 - clearly not good scaling at all.
The really interesting numbers happen at 7 disks - it's slower than with 4, in 
all tests.
I ran it 3x to be sure.
Note this was a native 7 disk raid-z, it wasn't 8 running in degraded mode with 
7.
Something is really wrong with my write performance here across the board.

Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks 
should scale to 380 then; well, 320 isn't all that far off - no biggie.
Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are 
good for 60+MB/sec individually. 290 is 48/disk - note also that this is better 
than my raid0 performance?!
Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? 
Something is going very wrong here too.


I'm not an expert, but would be great if you could run at least one more test.

can you try  2x 4disks in 

Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

Glenn Skinner wrote:

The following is a nit-level comment, so I've directed it only to you,
rather than to the entire list.

Date: Mon, 17 Jul 2006 09:57:35 -0600
From: Mark Shellenbaum [EMAIL PROTECTED]
Subject: [zfs-discuss] Proposal: delegated administration

The following is the delegated admin model that Matt and I have been 
working on.  At this point we are ready for your feedback on the 
proposed model.


...
PERMISSION REVOKING

zfs unallow dataset [-r] [-l] [-d]
everyone|user|group[,everyone|user|group...] \
ability[,ability...] dataset
zfs unallow [-r][-l][-d] -u user ability[,ability...]  dataset
zfs unallow [-r][-l][-d] -g group ability[,ability...]  dataset
	zfs unallow [-r][-l][-d] -e ability[,ability...]  dataset 


Please, can we have disallow instead of unallow?  The former is a
real word, the latter isn't.

-- Glenn



The reasoning behind unallow was to imply that you are simply removing 
an allow.  With *disallow* it would sound more like you are denying a 
permission.


  -Mark

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

James Dickens wrote:

On 7/17/06, Mark Shellenbaum [EMAIL PROTECTED] wrote:

The following is the delegated admin model that Matt and I have been
working on.  At this point we are ready for your feedback on the
proposed model.

   -Mark




[...]


Hi

just one addition, all or full attributes, for the case you want
to get full permissions to the user or group

zfs create p1/john
zfs  allow  p1/john john  full

so we don't have to type out every attribute.



I think you wanted

zfs allow john full p1/john

We could have either a full or all to represent all permissions, but 
the problem with that is that you will then end up granting more 
permissions than are necessary to achieve the desired goal.


If enough people think its useful then we can do it.



James Dickens
uadmin.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Gregory Shaw
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns.  Using that
configuration, a full-width stripe write should be a single operation for
each controller.

In production, the application needs would probably dictate the resulting
disk layout.  If the application doesn't need tons of i/o, you could bind
more disks together for larger luns...

On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:

ZFS fans,
I'm preparing some analyses on RAS for large JBOD systems such as
the Sun Fire X4500 (aka Thumper).  Since there are zillions of possible
permutations, I need to limit the analyses to some common or desirable
scenarios.  Naturally, I'd like your opinions.  I've already got a few
scenarios in analysis, and I don't want to spoil the brain storming, so
feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use?  Why?

Examples,
	46-way RAID-0  (I'll do this just to show why you shouldn't do this)
	22x2-way RAID-1+0 + 2 hot spares
	15x3-way RAID-Z2+0 + 1 hot spare
	...

Because some people get all wrapped up with the controllers, assume 5
8-disk SATA controllers plus 1 6-disk controller.  Note: the reliability of
the controllers is much greater than the reliability of the disks, so
the data availability and MTTDL analysis will be dominated by the disks
themselves.  In part, this is due to using SATA/SAS (point-to-point disk
connections) rather than a parallel bus or FC-AL where we would also have
to worry about bus or loop common cause failures.

I will be concentrating on data availability and MTTDL as two views of RAS.
The intention is that the interesting combinations will also be analyzed
for performance and we can complete a full performability analysis on them.

Thanks -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382              [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382                 [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King



I take it you already have solved the problem.


Yes, my problems went away once my device supported the extended SCSI 
instruction set.


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS needs a viable backup mechanism

2006-07-17 Thread Matthew Ahrens
On Fri, Jul 07, 2006 at 04:00:38PM -0400, Dale Ghent wrote:
 Add an option to zpool(1M) to dump the pool config as well as the  
 configuration of the volumes within it to an XML file. This file  
 could then be sucked in to zpool at a later date to recreate/ 
 replicate the pool and its volume structure in one fell swoop. After  
 that, Just Add Data(tm).

Yep, this has been on our to-do list for quite some time:

RFE #6276640 zpool config
RFE #6276912 zfs config

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Fun with ZFS and iscsi volumes

2006-07-17 Thread Jason Hoffman

Hi Everyone,

I thought I'd share some benchmarking and playing around that we had
done with making zpools from disks that were iSCSI volumes. The
numbers are representative of 6 benchmarking rounds per configuration.


The interesting finding at least for us was the filebench varmail  
(50:50 reads-writes) results where we had a RAIDZ pool containing 9  
volumes vs other combinations (we also did 8,7,6,5,4 volumes. Groups  
of 3 and 9 seemed to form the boundary cases).


Regards, Jason


More detailed at
http://svn.joyent.com/public/experiments/equal-sun-iscsi-zfs-fun.txt



## Hardware Setup

- A T1000 server, 8 core, 8GB of RAM
- Equallogic PS300E storage arrays
- Standard 4 dedicated switch arrangements
  (http://joyeur.com/2006/05/04/what-sysadmins-start-doing-when-hanging-around-designers)


## Software benchmarks

- FileBench (http://www.opensolaris.org/os/community/performance/filebench/)
  with varmail and webserver workloads

- Bonnie++

## Questions
1) In a zpool of 3x RAIDZ groups of 3 volumes each, can we offline a  
total of 3 drives (that could come from 3 different physical  
arrays)? What are the performance differences between 9 online and 6  
online drives?

2) What are the differences between zpools containing
- 3x RAIDZ groups of 3 volumes each,
- a single zpool of 9 volumes with no mirroring or RAIDZ,
- a single RAIDZ group with 9 volumes, and
- two RAIDZ groups with 9 volumes each?
3) Can we saturate a single gigabit connection between the server and  
iSCSI storage?


## Findings
1) Tolerated offlining 3 of 9 drives in a 3x RAIDZ of 3 drives each.  
DEGRADED (6 of 9 online) versus ONLINE (9 of 9)

a) Filebench varmail (50:50 reads-writes):
- 2045.2 ops/s in state: DEGRADED
- 2473.0 ops/s in state: ONLINE
b) Filebench webserver (90:10 reads-writes)
- 54530.5 ops/s in state: DEGRADED
- 54328.1 ops/s in state: ONLINE.

2) Filebench RAIDZ of 3x3 vs RAID0 vs RAIDZ of 1x9 vs RAIDZ of 2x9
a) Varmail (50:50 reads-writes):
- 2473.0 ops/s (RAIDZ of 3x3)
- 4316.8 ops/s (RAID0),
- 13144.8 ops/s (RAIDZ of 1x9),
- 11363.7 ops/s (RAIDZ of 2x9)
b) Webserver (90:10 reads-writes):
- 54328.1 ops/s (RAIDZ of 3x3),
- 54386.9 ops/s (RAID0),
- 53960.1 ops/s (RAIDZ of 1x9),
- 56897.2 ops/s (RAIDZ of 2x9)

3) We could saturate a single gigabit connection out to the storage.
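
For reference, the pool layouts compared in finding 2 would be built roughly
as follows (the device names stand in for the iSCSI LUNs; this is a sketch,
not the exact commands used):

# zpool create tank raidz c3t1d0 c3t2d0 c3t3d0 \
                    raidz c3t4d0 c3t5d0 c3t6d0 \
                    raidz c3t7d0 c3t8d0 c3t9d0
        (3x RAIDZ groups of 3 volumes each)
# zpool create tank c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0
        (9 volumes, no mirroring or RAIDZ)
# zpool create tank raidz c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0
        (a single RAIDZ group of all 9 volumes)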



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Jim Mauro
I agree with Greg - for ZFS, I'd recommend a larger number of raidz luns,
with a smaller number of disks per LUN, up to 6 disks per raidz lun.

This will more closely align with performance best practices, so it would be
cool to find common ground in terms of a sweet-spot for performance and RAS.

/jim


Gregory Shaw wrote:
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns.  
 Using that configuration, a full-width stripe write should be a 
single operation for each controller.


In production, the application needs would probably dictate the 
resulting disk layout.  If the application doesn't need tons of i/o, 
you could bind more disks together for larger luns...


On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:


ZFS fans,
I'm preparing some analyses on RAS for large JBOD systems such as
the Sun Fire X4500 (aka Thumper).  Since there are zillions of possible
permutations, I need to limit the analyses to some common or desirable
scenarios.  Naturally, I'd like your opinions.  I've already got a few
scenarios in analysis, and I don't want to spoil the brain storming, so
feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use?  Why?

Examples,
46-way RAID-0  (I'll do this just to show why you shouldn't do this)
22x2-way RAID-1+0 + 2 hot spares
15x3-way RAID-Z2+0 + 1 hot spare
...

Because some people get all wrapped up with the controllers, assume 5
8-disk SATA controllers plus 1 6-disk controller.  Note: the reliability
of the controllers is much greater than the reliability of the disks, so
the data availability and MTTDL analysis will be dominated by the disks
themselves.  In part, this is due to using SATA/SAS (point-to-point disk
connections) rather than a parallel bus or FC-AL, where we would also
have to worry about bus or loop common-cause failures.

I will be concentrating on data availability and MTTDL as two views 
of RAS.

The intention is that the interesting combinations will also be analyzed
for performance and we can complete a full performability analysis on 
them.

Thanks
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382  [EMAIL PROTECTED] 
mailto:[EMAIL PROTECTED] (work)
Louisville, CO 80028-4382[EMAIL PROTECTED] 
mailto:[EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus 
Torvalds






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Nathan Kroenert
Jeff -

That sounds like a great idea... 

Another idea might be to have zpool create announce the 'availability'
of any given configuration, and output the single points of failure.

# zpool create mypool a b c
NOTICE: This pool has no redundancy. 
Without hardware redundancy (raid1 / 5), 
a single disk failure will destroy the whole pool.

# zpool create mypool raidz a b c
NOTICE: This pool has single disk redundancy. 
Without hardware redundancy (raid1 / 5), 
this pool can survive at most 1 disk failing.

# zpool create mypool raidz2 a b c
NOTICE: This pool has double disk redundancy. 
Without hardware redundancy (raid1 / 5), 
this pool can survive at most 2 disks failing.

It would be especially nice if it were able to detect silly
configurations too (like adding simple disks to a raidz pool, or
something like that, if it's even possible) and announce the reduction
in reliability.

Thoughts? :)
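
(For what it's worth, you can fake a notice like that today with a tiny
wrapper around zpool create - purely a sketch, nothing that actually
ships:

  #!/bin/sh
  # nag about the redundancy implied by the zpool create arguments,
  # then run the real command
  case "$*" in
  *raidz2*) echo "NOTICE: double-parity RAID-Z2 - survives 2 disk failures per vdev" ;;
  *raidz*)  echo "NOTICE: single-parity RAID-Z - survives 1 disk failure per vdev" ;;
  *mirror*) echo "NOTICE: mirrored pool" ;;
  *)        echo "NOTICE: no ZFS-level redundancy - one disk failure can destroy the pool" ;;
  esac
  zpool create "$@"

Crude, but it would catch the 'I forgot to type raidz' case that
started this thread.)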

Nathan.

On Mon, 2006-07-17 at 18:35, Jeff Bonwick wrote:
  I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot
  the whole pool became unavailable after apparently losing a disk drive.
  [...]
  NAME        STATE     READ WRITE CKSUM
  data        UNAVAIL      0     0     0  insufficient replicas
    c1t0d0    ONLINE       0     0     0
  [...]
    c1t4d0    UNAVAIL      0     0     0  cannot open
  --
  
  The problem as I see it is that the pool should be able to handle
  1 disk error, no?
 
 If it were a raidz pool, that would be correct.  But according to
 zpool status, it's just a collection of disks with no replication.
 Specifically, compare these two commands:
 
 (1) zpool create data A B C
 
 (2) zpool create data raidz A B C
 
 Assume each disk has 500G capacity.
 
 The first command will create an unreplicated pool with 1.5T capacity.
 The second will create a single-parity RAID-Z pool with 1.0T capacity.
 
 My guess is that you intended the latter, but actually typed the former,
 perhaps assuming that RAID-Z was always present.  If so, I apologize for
 not making this clearer.  If you have any suggestions for how we could
 improve the zpool(1M) command or documentation, please let me know.
 
 One option -- I confess up front that I don't really like it -- would be
 to make 'unreplicated' an explicit replication type (in addition to
 mirror and raidz), so that you couldn't get it by accident:
 
   zpool create data unreplicated A B C
 
 The extra typing would be annoying, but would make it almost impossible
 to get the wrong behavior by accident.
 
 Jeff
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Eric Schrock
On Tue, Jul 18, 2006 at 10:10:33AM +1000, Nathan Kroenert wrote:
 Jeff -
 
 That sounds like a great idea... 
 
 Another idea might be to have zpool create announce the 'availability'
 of any given configuration, and output the single points of failure.
 
   # zpool create mypool a b c
   NOTICE: This pool has no redundancy. 
   Without hardware redundancy (raid1 / 5), 
   a single disk failure will destroy the whole pool.
 
   # zpool create mypool raidz a b c
   NOTICE: This pool has single disk redundancy. 
   Without hardware redundancy (raid1 / 5), 
   this pool can survive at most 1 disk failing.
 
   # zpool create mypool raidz2 a b c
   NOTICE: This pool has double disk redundancy. 
   Without hardware redundancy (raid1 / 5), 
   this pool can survive at most 2 disks failing.
 
 It would be especially nice if it were able to detect silly
 configurations too (like adding simple disks to a raidz pool, or
 something like that, if it's even possible) and announce the
 reduction in reliability.

FYI, zpool(1M) will already detect some variations of silly and force
you to use the '-f' option if you really mean it (for add and create).
Examples include using vdevs of different redundancy (raidz + mirror),
as well as using different size devices.  If you have other definitions
of silly, let us know what we should be looking for.
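
(Roughly the sort of thing that trips the check today - device names
are placeholders, and zpool refuses and points you at -f unless you
force it:

  # zpool create tank mirror c1t0d0 c1t1d0 raidz c1t2d0 c1t3d0 c1t4d0
(mismatched redundancy: mirror and raidz vdevs in one pool)
  # zpool add tank c1t5d0
(adding a plain, unreplicated disk to a redundant pool)

Both go through if you insist with -f.)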

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Richard Elling

[stirring the pot a little...]

Jim Mauro wrote:
I agree with Greg - for ZFS, I'd recommend a larger number of raidz
LUNs, with a smaller number of disks per LUN, up to 6 disks per raidz LUN.


For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
or RAID-Z2.  For 3-5 disks, RAID-Z2 offers better resiliency, even
over split-disk RAID-1+0.
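
(Concretely, the two 6-disk layouts being compared look something like
this - disk names invented:

  # zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0
  # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

The mirrored layout survives one failure per pair - up to three
failures if they land in different pairs, but only one if both hit the
same pair - while raidz2 survives any two.)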

This will more closely align with performance best practices, so it
would be cool to find common ground in terms of a sweet-spot for
performance and RAS.


It is clear that a single 46-way RAID-Z or RAID-Z2 zpool won't be
popular :-)
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] metadata inconsistency?

2006-07-17 Thread Matthew Ahrens
On Thu, Jul 06, 2006 at 12:46:57AM -0700, Patrick Mauritz wrote:
 Hi,
 after some unscheduled reboots (to put it lightly), I've got an interesting 
 setup on my notebook's zfs partition:
 setup: simple zpool, no raid or mirror, a couple of zfs partitions, one zvol 
 for swap. /foo is one such partition, /foo/bar the directory with the issue.
 
 directly after the reboot happened:
 $ ls /foo/bar
 test.h
 $ ls -l /foo/bar
 Total 0
 
 the file wasn't accessible with cat, etc.

This can happen when the file appears in the directory listing (i.e.
getdents(2)), but a stat(2) on the file fails.  Why that stat would fail
is a bit of a mystery, given that ls doesn't report the error.

It could be that the underlying hardware has failed, and the directory
is still intact but the file's metadata has been damaged.  (Note, this
would be hardware error, not metadata inconsistency.)

Another possibility is that the file's inode number is too large to be
expressed in 32 bits, thus causing a 32-bit stat() to fail.  However,
I don't think that Sun's ls(1) should be issuing any 32-bit stats (even
on a 32-bit system, it should be using stat64).
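
(If it ever reproduces again, one way to check that theory is to run ls
under truss and watch which stat flavour it issues and what error comes
back - just a debugging sketch:

  $ truss -t stat,stat64,lstat,lstat64 ls -l /foo/bar

An EOVERFLOW return from a 32-bit stat on test.h would point at the
large-object-number case.)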

 somewhat later (new data appeared on /foo, in /foo/baz):
 $ ls -l /foo/bar
 Total 3
 -rw-r--r-- 1 user group 1400 Jul 6 02:14 test.h
 
 the content of test.h is the same as the content of /foo/baz/quux now,
 but the refcount is 1!
 
 $ chmod go-r /foo/baz/quux
 $ ls -l /foo/bar
 Total 3
 -rw--- 1 user group 1400 Jul 6 02:14 test.h

This behavior could also be explained if there is an unknown bug which
causes the object representing the file to be deleted, but not the
directory entry pointing to it.

 anyway, how do I get rid of test.h now without making quux unreadable?
 (the brute-force approach would be a new partition, moving the data
 over - copying, rather than moving, the troublesome file, just in case;
 I'm not sure if zfs allows links that cross zfs partitions and thus
 optimizes such moves - then zfs destroy data/test, but there might be a
 better way?)

Before trying to rectify the problem, could you email me the output of
'zpool status' and 'zdb -vvv foo'?  
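
(Something along these lines captures both - zdb output can get large,
so writing it to a file and attaching that is kindest to the list:

  # zpool status -v > /tmp/zpool-status.txt
  # zdb -vvv foo > /tmp/zdb-foo.txt 2>&1

The 2>&1 just catches any errors zdb prints along the way.)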

FYI, there are no cross-filesystem links, even with ZFS.

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss