Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log

2007-07-11 Thread Cyril Plisko
Neil,

Many thanks for publishing this doc - it is exactly
what I was looking for!


On 7/9/07, Neil Perrin [EMAIL PROTECTED] wrote:
 Er with attachment this time.


  So I've attached the accepted proposal. There was (as expected) not
  much discussion of this case as it was considered an obvious extension.
  The actual PSARC case materials, when opened, will not have much more
  info than this.

 PSARC CASE: 2007/171 ZFS Separate Intent Log

 SUMMARY:

 This is a proposal to allow separate devices to be used
 for the ZFS Intent Log (ZIL). The sole purpose of this is
 performance. The devices can be disks, solid state drives,
 nvram drives, or any device that presents a block interface.

 PROBLEM:

 The ZIL satisfies the synchronous requirements of POSIX.
 For instance, databases often require their
 transactions to be on stable storage on return from the system
 call.  NFS and other applications can also use fsync() to ensure
 data stability. The speed of the ZIL is therefore essential in
 determining the latency of writes for these critical applications.

 Currently the ZIL is allocated dynamically from the pool.
 It consists of a chain of varying block sizes which are
 anchored in fixed objects. Blocks are sized to fit the
 demand and will come from different metaslabs and thus
 different areas of the disk. This causes more head movement.

 Furthermore, the log blocks are freed as soon as the intent
 log transaction (system call) is committed, so a Swiss-cheese
 effect can occur, leading to pool fragmentation.

 PROPOSED SOLUTION:

 This proposal takes advantage of the greatly faster media speeds
 of nvram, solid state disks, or even dedicated disks.
 To this end, additional extensions to the zpool command
 are defined:

 zpool create <pool> <pool devices> log <log devices>
     Creates a pool with a separate log. If more than one
     log device is specified then writes are load-balanced
     between devices. It's also possible to mirror log
     devices. For example, a log consisting of
     two sets of two mirrors could be created thus:

         zpool create <pool> <pool devices> \
             log mirror c1t8d0 c1t9d0 mirror c1t10d0 c1t11d0

     A raidz/raidz2 log is not supported.

 zpool add <pool> log <log devices>
     Creates a separate log if it doesn't exist, or
     adds extra devices if it does.

 zpool remove <pool> <log devices>
     Removes the log devices. If all log devices are removed
     we revert to placing the log in the pool.  Evacuating a
     log is easily handled by ensuring all txgs are committed.

 zpool replace <pool> <old log device> <new log device>
     Replaces the old log device with the new log device.

 zpool attach <pool> <log device> <new log device>
     Attaches a new log device to an existing log device. If
     the existing device is not a mirror then a two-way mirror
     is created. If the device is already part of a two-way log
     mirror, attaching the new device creates a three-way log
     mirror, and so on.

 zpool detach <pool> <log device>
     Detaches a log device from a mirror.

 zpool status
     Additionally displays the log devices.

 zpool iostat
     Additionally shows I/O statistics for log devices.

 zpool export/import
     Will export and import the log devices.

 When a separate log that is not mirrored fails, logging
 reverts to using chained logs within the main pool.

 The name "log" will become a reserved word. Attempts to create
 a pool with the name "log" will fail with:

     cannot create 'log': name is reserved
     pool name may have been omitted

 Hot spares cannot replace log devices.
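
 As a concrete sketch of the intended usage (the device names are made
 up, and the syntax follows the proposal above rather than any final
 implementation):

     zpool create tank c0t0d0 c0t1d0 c0t2d0 c0t3d0 log c2t0d0 c2t1d0
     zpool attach tank c2t0d0 c2t2d0
     zpool status tank
     zpool iostat tank

 This creates a pool with two load-balanced log devices, then turns one
 of them into a two-way log mirror and inspects the result.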





-- 
Regards,
Cyril
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pseudo file system access to snapshots?

2007-07-11 Thread Darren J Moffat
Mike Gerdts wrote:
 Perhaps a better approach is to create a pseudo file system that looks like:
 
 mntpt/pool
/@@
/@today
/@yesterday
/fs
   /@@
   /@2007-06-01
/otherfs
/@@

How is this different from cd mntpt/.zfs/snapshot/   ?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pseudo file system access to snapshots?

2007-07-11 Thread Mike Gerdts
On 7/11/07, Darren J Moffat [EMAIL PROTECTED] wrote:
 Mike Gerdts wrote:
  Perhaps a better approach is to create a pseudo file system that looks like:
 
  mntpt/pool
 /@@
 /@today
 /@yesterday
 /fs
/@@
/@2007-06-01
 /otherfs
 /@@

 How is this different from cd mntpt/.zfs/snapshot/   ?


mntpt/.zfs/snapshot provides file-level access to the contents of
the snapshot.  If you back those up, then restore every snapshot, you
will potentially be using way more disk space.

What I am proposing is that "cat mntpt/pool/@snap1" delivers a data
stream corresponding to the output of "zfs send", and that "cat
mntpt/pool/@snap1@snap2" delivers a data stream corresponding to "zfs
send -i snap1 snap2".

This would allow existing backup tools to perform block level
incremental backups.  Assuming that writing to the various files is
the equivalent of the corresponding zfs receive commands, it
provides for block level restores that preserve space efficiency as
well.
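
For illustration only (paths and snapshot names are hypothetical, and the
pseudo-file layout is the one sketched above), a plain file copy would
then replace today's explicit piping of zfs send:

    # today: a block-level incremental means invoking zfs send yourself
    zfs send -i snap1 pool/fs@snap2 > /backup/fs.snap1-snap2.zsend

    # with the proposed pseudo filesystem, any file-based tool works
    cp mntpt/pool/fs/@snap1@snap2 /backup/fs.snap1-snap2.zsend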

Why?

Suppose I have a server with 50 full root zones on it.  Each zone has
a zonepath at /zones/zonename that is about 8 GB.  This implies that
I need 400 GB just for zone paths.  Using ZFS clones, I can likely
trim that down to far less than 100 GB and probably less than 20 GB.  I
can't trim it down that far if I don't have a way to restore the
system.

This restore problem is my key worry in deploying ZFS in the area
where I see it as most beneficial.  Another solution that would deal
with the same problem is block-level deduplication.  So far my queries
in this area have been met with silence.

Hmmm... I just ran into another snag with this.  I had been assuming
that clones and snapshots were more closely related.  But when I tried
to send the differences between the source of a clone and a snapshot
within that clone I got this message:

incremental source must be in same filesystem
usage:
send [-i snapshot] snapshot
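
For concreteness, the failing invocation was presumably of this shape
(dataset and snapshot names made up; pool/zone1 is a clone of
pool/golden@base):

    # zfs send -i pool/golden@base pool/zone1@backup1
    incremental source must be in same filesystem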

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pseudo file system access to snapshots?

2007-07-11 Thread Matthew Ahrens
> This restore problem is my key worry in deploying ZFS in the area
> where I see it as most beneficial.  Another solution that would deal
> with the same problem is block-level deduplication.  So far my queries
> in this area have been met with silence.

I must have missed your messages on deduplication.  But did you see the
thread on it, "zfs space efficiency", from 6/24 - 7/7?

We've been thinking about ZFS dedup for some time, and want to do it but have 
other priorities at the moment.

> Hmmm... I just ran into another snag with this.  I had been assuming
> that clones and snapshots were more closely related.  But when I tried
> to send the differences between the source of a clone and a snapshot
> within that clone I got this message:

I'm not sure what you mean by "more closely related".  The only reason we
don't support that is that we haven't gotten around to adding the special
cases and error checking for it (and I think you're the first person to notice
its omission).  But it's actually in the works now, so stay tuned for an update
in a few months.

--matt

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and IBM's TSM

2007-07-11 Thread Hans-Juergen Schnitzer


Our main problem with TSM and ZFS is currently that there seems to be
no efficient way to do a disaster restore when the backup
resides on tape - due to the large number of filesystems/TSM filespaces.
The graphical client (dsmj) does not work at all and with dsmc one
has to start a separate restore session for each filespace.
This results in an impractically large number of tape mounts.
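
In practice the restore ends up being a shell loop along these lines (a
sketch only - the filespace names are made up and the exact dsmc options
depend on the TSM client configuration), with one dsmc session, and hence
one set of tape mounts, per filespace:

    for fs in /tank/home /tank/mail /tank/www; do
        dsmc restore -subdir=yes "${fs}/*"
    done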

Hans




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] pool analysis

2007-07-11 Thread Kent Watsen

Richard's blog analyzes MTTDL as a function of N+P+S:
http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl

But to understand how to best utilize an array with a fixed number of 
drives, I add the following constraints:
  - N+P should follow ZFS best-practice rule of N={2,4,8} and P={1,2}
  - all sets in an array should be configured similarly
  - the MTTDL for S sets is equal to (MTTDL for one set)/S
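
For reference, the per-set MTTDL approximations I am using (stated here
explicitly; they reproduce the numbers below, with G = N+P drives per set,
the result divided by the number of sets S, and hours converted to years):

    MTTDL(P=1) = MTBF^2 / ( G * (G-1) * MTTR )           / S
    MTTDL(P=2) = MTBF^3 / ( G * (G-1) * (G-2) * MTTR^2 ) / S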

I got the following results by varying the NUM_BAYS parameter in the 
source code below:

*_4 bays w/ 300 GB drives having MTBF=4 years_*
  - can have 1 (2+1) w/ 1 spares providing 600 GB with MTTDL of
5840.00 years
  - can have 1 (2+2) w/ 0 spares providing 600 GB with MTTDL of
799350.00 years
  - can have 0 (4+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (4+2) w/ 4 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (8+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (8+2) w/ 4 spares providing 0 GB with MTTDL of Inf years

*_8 bays w/ 300 GB drives having MTBF=4 years_*
  - can have 2 (2+1) w/ 2 spares providing 1200 GB with MTTDL of
2920.00 years
  - can have 2 (2+2) w/ 0 spares providing 1200 GB with MTTDL of
399675.00 years
  - can have 1 (4+1) w/ 3 spares providing 1200 GB with MTTDL of
1752.00 years
  - can have 1 (4+2) w/ 2 spares providing 1200 GB with MTTDL of
2557920.00 years
  - can have 0 (8+1) w/ 8 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (8+2) w/ 8 spares providing 0 GB with MTTDL of Inf years

*_12 bays w/ 300 GB drives having MTBF=4 years_*
  - can have 4 (2+1) w/ 0 spares providing 2400 GB with MTTDL of
365.00 years
  - can have 3 (2+2) w/ 0 spares providing 1800 GB with MTTDL of
266450.00 years
  - can have 2 (4+1) w/ 2 spares providing 2400 GB with MTTDL of
876.00 years
  - can have 2 (4+2) w/ 0 spares providing 2400 GB with MTTDL of
79935.00 years
  - can have 1 (8+1) w/ 3 spares providing 2400 GB with MTTDL of
486.67 years
  - can have 1 (8+2) w/ 2 spares providing 2400 GB with MTTDL of
426320.00 years

*_16 bays w/ 300 GB drives having MTBF=4 years_*
  - can have 5 (2+1) w/ 1 spares providing 3000 GB with MTTDL of
1168.00 years
  - can have 4 (2+2) w/ 0 spares providing 2400 GB with MTTDL of
199837.50 years
  - can have 3 (4+1) w/ 1 spares providing 3600 GB with MTTDL of
584.00 years
  - can have 2 (4+2) w/ 4 spares providing 2400 GB with MTTDL of
1278960.00 years
  - can have 1 (8+1) w/ 7 spares providing 2400 GB with MTTDL of
486.67 years
  - can have 1 (8+2) w/ 6 spares providing 2400 GB with MTTDL of
426320.00 years

*_20 bays w/ 300 GB drives having MTBF=4 years_*
  - can have 6 (2+1) w/ 2 spares providing 3600 GB with MTTDL of
973.33 years
  - can have 5 (2+2) w/ 0 spares providing 3000 GB with MTTDL of
159870.00 years
  - can have 4 (4+1) w/ 0 spares providing 4800 GB with MTTDL of
109.50 years
  - can have 3 (4+2) w/ 2 spares providing 3600 GB with MTTDL of
852640.00 years
  - can have 2 (8+1) w/ 2 spares providing 4800 GB with MTTDL of
243.33 years
  - can have 2 (8+2) w/ 0 spares providing 4800 GB with MTTDL of
13322.50 years

*_24 bays w/ 300 GB drives having MTBF=4 years_*
  - can have 8 (2+1) w/ 0 spares providing 4800 GB with MTTDL of
182.50 years
  - can have 6 (2+2) w/ 0 spares providing 3600 GB with MTTDL of
133225.00 years
  - can have 4 (4+1) w/ 4 spares providing 4800 GB with MTTDL of
438.00 years
  - can have 4 (4+2) w/ 0 spares providing 4800 GB with MTTDL of
39967.50 years
  - can have 2 (8+1) w/ 6 spares providing 4800 GB with MTTDL of
243.33 years
  - can have 2 (8+2) w/ 4 spares providing 4800 GB with MTTDL of
213160.00 years

While it's true that RAIDZ2 is much safer than RAIDZ, it seems that 
any RAIDZ configuration will outlive me, and so I conclude that RAIDZ2 
is unnecessary in a practical sense...  This conclusion surprises me 
given the amount of attention people give to double-parity solutions - 
what am I overlooking?

Thanks,
Kent



_*Source Code*_  (compile with: cc -std=c99 -lm filename) [it's more 
than 80 columns - sorry!]

#include <stdio.h>
#include <math.h>

#define NUM_BAYS 24
#define DRIVE_SIZE_GB 300
#define MTBF_YEARS 4
#define MTTR_HOURS_NO_SPARE 16
#define MTTR_HOURS_SPARE 4

int main() {

    printf("\n");
    printf("%u bays w/ %u GB drives having MTBF=%u years\n", NUM_BAYS,
        DRIVE_SIZE_GB, MTBF_YEARS);
    for (int num_drives=2; num_drives<=8; num_drives*=2) {
        for (int num_parity=1; num_parity<=2; num_parity++) {
            double  mttdl;

            int mtbf_hours       = MTBF_YEARS * 365 * 24;
            int total_num_drives = num_drives + num_parity;
            int num_instances    = NUM_BAYS / total_num_drives;
            int num_spares       = NUM_BAYS % total_num_drives;

Re: [zfs-discuss] pool analysis

2007-07-11 Thread Kent Watsen

Resent as HTML to avoid line-wrapping:


Richard's blog analyzes MTTDL as a function of N+P+S:
  http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl

But to understand how to best utilize an array with a fixed number of 
drives, I add the following constraints:

- N+P should follow ZFS best-practice rule of N={2,4,8} and P={1,2}
- all sets in an array should be configured similarly
- the MTTDL for S sets is equal to (MTTDL for one set)/S

I got the following results by varying the NUM_BAYS parameter in the 
source code below:


  _*4 bays w/ 300 GB drives having MTBF=4 years*_
- can have 1 (2+1) w/ 1 spares providing 600 GB with MTTDL of 
5840.00 years
- can have 1 (2+2) w/ 0 spares providing 600 GB with MTTDL of 
799350.00 years

- can have 0 (4+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
- can have 0 (4+2) w/ 4 spares providing 0 GB with MTTDL of Inf years
- can have 0 (8+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
- can have 0 (8+2) w/ 4 spares providing 0 GB with MTTDL of Inf years

  _*8 bays w/ 300 GB drives having MTBF=4 years*_
- can have 2 (2+1) w/ 2 spares providing 1200 GB with MTTDL of 
2920.00 years
- can have 2 (2+2) w/ 0 spares providing 1200 GB with MTTDL of 
399675.00 years
- can have 1 (4+1) w/ 3 spares providing 1200 GB with MTTDL of 
1752.00 years
- can have 1 (4+2) w/ 2 spares providing 1200 GB with MTTDL of 
2557920.00 years

- can have 0 (8+1) w/ 8 spares providing 0 GB with MTTDL of Inf years
- can have 0 (8+2) w/ 8 spares providing 0 GB with MTTDL of Inf years

  _*12 bays w/ 300 GB drives having MTBF=4 years*_
- can have 4 (2+1) w/ 0 spares providing 2400 GB with MTTDL of 
365.00 years
- can have 3 (2+2) w/ 0 spares providing 1800 GB with MTTDL of 
266450.00 years
- can have 2 (4+1) w/ 2 spares providing 2400 GB with MTTDL of 
876.00 years
- can have 2 (4+2) w/ 0 spares providing 2400 GB with MTTDL of 
79935.00 years
- can have 1 (8+1) w/ 3 spares providing 2400 GB with MTTDL of 
486.67 years
- can have 1 (8+2) w/ 2 spares providing 2400 GB with MTTDL of 
426320.00 years


  _*16 bays w/ 300 GB drives having MTBF=4 years*_
- can have 5 (2+1) w/ 1 spares providing 3000 GB with MTTDL of 
1168.00 years
- can have 4 (2+2) w/ 0 spares providing 2400 GB with MTTDL of 
199837.50 years
- can have 3 (4+1) w/ 1 spares providing 3600 GB with MTTDL of 
584.00 years
- can have 2 (4+2) w/ 4 spares providing 2400 GB with MTTDL of 
1278960.00 years
- can have 1 (8+1) w/ 7 spares providing 2400 GB with MTTDL of 
486.67 years
- can have 1 (8+2) w/ 6 spares providing 2400 GB with MTTDL of 
426320.00 years


  _*20 bays w/ 300 GB drives having MTBF=4 years*_
- can have 6 (2+1) w/ 2 spares providing 3600 GB with MTTDL of 
973.33 years
- can have 5 (2+2) w/ 0 spares providing 3000 GB with MTTDL of 
159870.00 years
- can have 4 (4+1) w/ 0 spares providing 4800 GB with MTTDL of 
109.50 years
- can have 3 (4+2) w/ 2 spares providing 3600 GB with MTTDL of 
852640.00 years
- can have 2 (8+1) w/ 2 spares providing 4800 GB with MTTDL of 
243.33 years
- can have 2 (8+2) w/ 0 spares providing 4800 GB with MTTDL of 
13322.50 years


  _*24 bays w/ 300 GB drives having MTBF=4 years*_
- can have 8 (2+1) w/ 0 spares providing 4800 GB with MTTDL of 
182.50 years
- can have 6 (2+2) w/ 0 spares providing 3600 GB with MTTDL of 
133225.00 years
- can have 4 (4+1) w/ 4 spares providing 4800 GB with MTTDL of 
438.00 years
- can have 4 (4+2) w/ 0 spares providing 4800 GB with MTTDL of 
39967.50 years
- can have 2 (8+1) w/ 6 spares providing 4800 GB with MTTDL of 
243.33 years
- can have 2 (8+2) w/ 4 spares providing 4800 GB with MTTDL of 
213160.00 years


While it's true that RAIDZ2 is much safer than RAIDZ, it seems that 
any RAIDZ configuration will outlive me, and so I conclude that RAIDZ2 
is unnecessary in a practical sense...  This conclusion surprises me 
given the amount of attention people give to double-parity solutions - 
what am I overlooking?


Thanks,
Kent



_Source Code_  (compile with: cc -std=c99 -lm filename) [it's more than 
80 columns - sorry!]


#include <stdio.h>
#include <math.h>

#define NUM_BAYS 24
#define DRIVE_SIZE_GB 300
#define MTBF_YEARS 4
#define MTTR_HOURS_NO_SPARE 16
#define MTTR_HOURS_SPARE 4

int main() {

	printf("\n");
	printf("%u bays w/ %u GB drives having MTBF=%u years\n", NUM_BAYS,
	    DRIVE_SIZE_GB, MTBF_YEARS);

	for (int num_drives=2; num_drives<=8; num_drives*=2) {
		for (int num_parity=1; num_parity<=2; num_parity++) {
			double  mttdl;

			int    mtbf_hours       = MTBF_YEARS * 365 * 24;
			int    total_num_drives = num_drives + num_parity;
			int    num_instances    = NUM_BAYS / total_num_drives;
			int    num_spares       = NUM_BAYS % total_num_drives;
			double mttr             = num_spares==0 ?
			                          MTTR_HOURS_NO_SPARE :
			                          MTTR_HOURS_SPARE;

			/* The listing was cut off here in the archive; the
			 * remainder is a reconstruction that reproduces the
			 * numbers posted above: standard single- and
			 * double-parity MTTDL, divided by the number of
			 * sets, converted from hours to years. */
			if (num_parity == 1)
				mttdl = (double)mtbf_hours * mtbf_hours /
				    (total_num_drives * (total_num_drives-1) * mttr);
			else
				mttdl = (double)mtbf_hours * mtbf_hours * mtbf_hours /
				    (total_num_drives * (total_num_drives-1) *
				    (total_num_drives-2) * mttr * mttr);
			mttdl /= num_instances;   /* Inf when no set fits */
			mttdl /= 365 * 24;        /* hours -> years */

			printf("  - can have %u (%u+%u) w/ %u spares providing "
			    "%u GB with MTTDL of %.2f years\n",
			    num_instances, num_drives, num_parity, num_spares,
			    num_instances * num_drives * DRIVE_SIZE_GB, mttdl);
		}
	}
	return (0);
}
Re: [zfs-discuss] pool analysis

2007-07-11 Thread Darren Dunham
> While it's true that RAIDZ2 is much safer than RAIDZ, it seems that
> any RAIDZ configuration will outlive me and so I conclude that RAIDZ2
> is unnecessary in a practical sense...  This conclusion surprises me
> given the amount of attention people give to double-parity solutions -
> what am I overlooking?

When talking to Netapp, some of their folks have mentioned their DP
solution wasn't necessarily so useful for handling near-simultaneous
disk loss (although it does do that).

Rather, they said that when a disk fails, it is not uncommon for reconstruction
to be unable to read some data off the remaining disks (perhaps a bad
sector, or bad data that fails a checksum).  With 1P, you have to shut down
the volume or leave a hole in the filesystem.  With 2P, you reconstruct
that one read and continue.

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool analysis

2007-07-11 Thread Anton B. Rang
> Are Netapp using some kind of block checksumming?

They provide an option for it; I'm not sure how often it's used.

> If Netapp doesn't do something like [ZFS checksums], that would
> explain why there's frequently trouble reconstructing, and point up a
> major ZFS advantage.

Actually, the real problem is uncorrectable errors on drives. On a 1 TB SATA 
drive, there's a good chance (over 1%) that at least one block will be 
unreadable once written.

Scrubbing tries to catch these, but if an error develops between the last scrub 
and the need to read the data as part of reconstruction, you're out of luck.

This is the big advantage of RAID-6 / RAIDZ2; the combination of a drive 
failure and a single-block failure on a second drive won't lead to data loss.
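
Rough numbers behind the "over 1%" figure (typical vendor specifications,
my arithmetic rather than Anton's): desktop SATA drives of this era quote
an unrecoverable read error rate on the order of 1 in 10^14 to 1 in 10^15
bits, and reading a full 1 TB drive is about 8 x 10^12 bits, so

    8e12 bits * 1e-14 errors/bit ~= 0.08   (about 8% per full read)
    8e12 bits * 1e-15 errors/bit ~= 0.008  (about 0.8% per full read)

Either way it's large enough that hitting at least one unreadable sector
during a full reconstruction is a real possibility.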
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool analysis

2007-07-11 Thread Kent Watsen

>> But to understand how to best utilize an array with a fixed number of
>> drives, I add the following constraints:
>>   - N+P should follow ZFS best-practice rule of N={2,4,8} and P={1,2}
>>   - all sets in an array should be configured similarly
>>   - the MTTDL for S sets is equal to (MTTDL for one set)/S
>
> Yes, these are reasonable and will reduce the problem space, somewhat.

Actually, I wish I could get more insight into why N can only be 2, 4, or
8.  In contemplating a 16-bay array, I many times think that 3 (3+2) + 1
spare would be perfect, but I have no understanding of what N=3 implies...

>> While it's true that RAIDZ2 is much safer than RAIDZ, it seems that
>> any RAIDZ configuration will outlive me and so I conclude that
>> RAIDZ2 is unnecessary in a practical sense...  This conclusion
>> surprises me given the amount of attention people give to
>> double-parity solutions - what am I overlooking?
>
> You are overlooking statistics :-).  As I discuss in
> http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent
> the MTBF (F == death) of children aged 5-14 in the US is 4,807 years, but
> clearly no child will live anywhere close to 4,807 years.

Thanks - I hadn't seen that blog entry yet...

>> #define MTTR_HOURS_NO_SPARE 16
>
> I think this is optimistic :-)

Not really for me as the array is in my basement - so I assume that I'll
swap in a drive when I get home from work  ;)

>> There are many more facets of looking at these sorts of analysis,
>> which is why I wrote RAIDoptimizer.

Is RAIDoptimizer the name of a spreadsheet you developed - is it
publicly available?


Thanks,
Kent
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool analysis

2007-07-11 Thread David Dyer-Bennet
Kent Watsen wrote:
>>> #define MTTR_HOURS_NO_SPARE 16
>>
>> I think this is optimistic :-)
>
> Not really for me as the array is in my basement - so I assume that I'll
> swap in a drive when I get home from work  ;)

Yes, it's interesting how the parameters for home setups differ from
professional ones (not meaning to denigrate the professionalism of
anybody's home network, of course).  We can run to the store and buy
something rather quicker than most professional outfits seem to be
able to get spares in hand.

But what if you're away on business that week?

-- 
David Dyer-Bennet, [EMAIL PROTECTED]; http://dd-b.net/dd-b
Pics: http://dd-b.net/dd-b/SnapshotAlbum, http://dd-b.net/photography/gallery
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool analysis

2007-07-11 Thread Richard Elling
Kent Watsen wrote:
 
>>> But to understand how to best utilize an array with a fixed number of
>>> drives, I add the following constraints:
>>>   - N+P should follow ZFS best-practice rule of N={2,4,8} and P={1,2}
>>>   - all sets in an array should be configured similarly
>>>   - the MTTDL for S sets is equal to (MTTDL for one set)/S
>>
>> Yes, these are reasonable and will reduce the problem space, somewhat.
>
> Actually, I wish I could get more insight into why N can only be 2, 4, or
> 8.  In contemplating a 16-bay array, I many times think that 3 (3+2) + 1
> spare would be perfect, but I have no understanding of what N=3 implies...

There was a discussion a while back which centered around this topic.
I don't recall the details, and I think it needs to be revisited, but
there was consensus that, for the time being, best performance was thus
achieved.  I'd like to revisit this, since I think best performance is
more difficult to predict, due to the dynamic nature of ZFS making it
particularly sensitive to the workload.

>>> While it's true that RAIDZ2 is much safer than RAIDZ, it seems that
>>> any RAIDZ configuration will outlive me and so I conclude that
>>> RAIDZ2 is unnecessary in a practical sense...  This conclusion
>>> surprises me given the amount of attention people give to
>>> double-parity solutions - what am I overlooking?
>>
>> You are overlooking statistics :-).  As I discuss in
>> http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent
>> the MTBF (F == death) of children aged 5-14 in the US is 4,807 years, but
>> clearly no child will live anywhere close to 4,807 years.
>
> Thanks - I hadn't seen that blog entry yet...

>>> #define MTTR_HOURS_NO_SPARE 16
>>
>> I think this is optimistic :-)
>
> Not really for me as the array is in my basement - so I assume that I'll
> swap in a drive when I get home from work  ;)

It is an average value, so you have some leeway there.  I work from home,
so in theory I should have fast response time :-)

>> There are many more facets of looking at these sorts of analysis,
>> which is why I wrote RAIDoptimizer.
>
> Is RAIDoptimizer the name of a spreadsheet you developed - is it
> publicly available?

It is a Java application.  I plan to open-source it, but that may take a
while to get through the process.  I'll check to see if there is a way to
make it available as a webstart client (which is how I deploy it).
  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal

2007-07-11 Thread Richard Elling
Adam Leventhal wrote:
> This is a great idea. I'd like to add a couple of suggestions:
>
> It might be interesting to focus on compression algorithms which are
> optimized for particular workloads and data types, an Oracle database for
> example.

N.B. Oracle 11g has built-in compression.  In general, for such problems, solving
them closer to the application is better.

> It might be worthwhile to have some sort of adaptive compression whereby
> ZFS could choose a compression algorithm based on its detection of the
> type of data being stored.

I think there is fertile ground in this area.  As CPU threads approach $0,
it might be a good idea to use more of them :-)
  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] [AVS] Question concerning reverse synchronization of a zpool

2007-07-11 Thread Ralf Ramge
Hi,

I have been struggling for several weeks now to get stable ZFS replication
using Solaris 10 11/06 (current patches) and AVS 4.0. We tried it on
VMware first and ended up in kernel panics en masse (yes, we read Jim
Dunham's blog articles :-). Now we are trying on the real thing, two X4500
servers. Well, I have no trouble reproducing our kernel panics there
either ... but I think I learned some important things along the way. One
problem still remains.

I have a zpool on host A. Replication to host B works fine.

* zpool export tank on the primary - works.
* sndradm -d on both servers - works (paranoia mode)
* zpool import id on the secondary - works.

So far, so good. I change the contents of the file system, add some
files, delete some others ... no problems. The secondary is in
production use now, everything is fine.

Okay, let's imagine I switched to the secondary host because I had a
problem with the primary. Now it's repaired, and I want my redundancy back.

* sndradm -E -f  on both hosts - works.
* sndradm -u -r on the primary for refreshing the primary - works. 
`nicstat` shows me a bit of traffic.

Good, let's switch back to the primary. Current status: the zpool is imported
on the secondary and NOT imported on the primary.

* zpool export tank on the secondary - *kernel panic*

Sadly, the machine dies so fast that I don't see the kernel panic with `dmesg`.
And disabling the replication again later and mounting the zpool on the
primary again shows me that the update sync didn't take place; the file
system changes I made on the secondary weren't replicated. Exporting the
zpool on the secondary works *after* the system has rebooted.

I use slices for the zpool, not LUNs, because I think many of my
problems were caused by exclusive locking, but it doesn't help with this
one.

Questions:

a) I don't understand why the kernel panics at the moment. The zpool
isn't mounted on both systems, the zpool itself seems to be fine after a
reboot ... and switching the primary and secondary hosts just for
resyncing seems to force a full sync, which isn't an option.

b) I'll try a sndradm -m -r the next time ... but I'm not sure if I 
like that thought. I would accept this if I replaced the primary host 
with another server, but having to do a 24 TB full sync just because the 
replication itself had been disabled for a few minutes would be hard to 
swallow. Or did I do something wrong?

c) What performance can I expect from an X4500 with a 40-disk zpool when
using slices, compared to LUNs? Any experiences?

And another thing: I did some experiments with zvols, because I wanted
to make disaster handling and the AVS configuration itself easier -
there won't be a full sync after replacing a disk, because AVS doesn't
see that a hot spare is being used, and hot spares won't be replicated
to the secondary host either, although the original drive on the
secondary never failed.  I used the zvol with UFS, and this kind of
hardware RAID controller emulation by ZFS works pretty well; just the
performance went off a cliff. SunSolve told me that this is a flushing
problem and that there's a workaround in Nevada build 53 and higher.
Has somebody done a comparison, and can you share some experiences? I only
have a few days left and I don't want to waste time installing Nevada for
nothing ...

Thanks,

  Ralf

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss