Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!

2009-06-19 Thread Moore, Joe
Scott Meilicke wrote:
 Obviously iSCSI and NFS are quite different at the storage level, and I
 actually like NFS for the flexibility over iSCSI (quotas, reservations,
 etc.)

Another key difference between them is that with iSCSI, the VMFS filesystem 
(built on the zvol presented as a block device) never frees up unused disk 
space.

Once ESX has written to a block on that zvol, it will always be taking up space 
in your zpool, even if you delete the .vmdk file that contains it.  The zvol 
has no idea that the block is not used any more.

With NFS, ZFS is aware that the file is deleted, and can deallocate those 
blocks.

This would be less of an issue if we had deduplication on the zpool (have ESX 
write blocks of all-0 and those would be deduped down to a single block) or if 
there was some way (like the SSD TRIM command) for the VMFS filesystem to tell 
the block device that a block is no longer used.
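
As a rough illustration (the pool, zvol name, and sizes here are made up): once 
ESX has written 120G worth of blocks into the zvol, the REFER number never goes 
back down, no matter what gets deleted inside VMFS:

# zfs list -o name,volsize,refer tank/esx-lun
NAME          VOLSIZE  REFER
tank/esx-lun     200G   120G
  (delete a 40G .vmdk from the datastore on the ESX side)
# zfs list -o name,volsize,refer tank/esx-lun
NAME          VOLSIZE  REFER
tank/esx-lun     200G   120G

The NFS-backed equivalent would show the space coming back as soon as the .vmdk 
is unlinked.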

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring ZFS host memory use

2009-05-07 Thread Moore, Joe
Carson Gaspar wrote:
 Not true. The script is simply not intelligent enough. There are really
 3 broad kinds of RAM usage:
 
 A) Unused
 B) Unfreeable by the kernel (normal process memory)
 C) Freeable by the kernel (buffer cache, ARC, etc.)
 
 Monitoring usually should focus on keeping (A+C) above some threshold.
 On Solaris, this means parsing some rather obscure kstats, sadly (not
 that Linux's /proc/meminfo is much better).

B) is freeable but requires moving pages to spinning rust.  There's a subset of 
B (Call it B1) that is the active processes' working sets which are basically 
useless to swap out, since they'll be swapped right back in again.

Two other important types of RAM usage in many modern situations:
D) Unpageable (pinned) memory
E) Memory that is presented to the OS but that is thin-provisioned by a 
hypervisor or other virtualization layer.  (Use of this memory may mean that the 
hypervisor moves pages to spinning rust.)

For virtualized systems, you should limit the size of A+B1+C so that it does 
not get into memory E.  There's no point in having data in the ARC if the 
hypervisor has to go to disk to get it.  Considering that the size of E is 
dependent on the memory demands on the host server (which the guest has no 
insight into), this is a Very Hard problem.

Often this is arranged by having the hypervisor break the virtualization 
containment via a memory-management driver (VMware Tools provides a memory 
control driver, for example) which steals pages of guest memory to keep the 
hypervisor from swapping.
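
(As an aside, the obscure kstats Carson mentions can be read directly; a quick 
sketch, with illustrative values:

# kstat -p unix:0:system_pages:freemem
unix:0:system_pages:freemem     262144
# kstat -p zfs:0:arcstats:size
zfs:0:arcstats:size     2147483648

freemem is in pages, so multiply by `pagesize`; arcstats:size is in bytes and is 
usually the bulk of C on a ZFS box.)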

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Moore, Joe
Joerg Schilling wrote:
 James Andrewartha jam...@daa.com.au wrote:
  Recently there's been discussion [1] in the Linux community about how 
  filesystems should deal with rename(2), particularly in the case of a crash.
  ext4 was found to truncate files after a crash, that had been written with
  open(foo.tmp), write(), close() and then rename(foo.tmp, foo). This is
   because ext4 uses delayed allocation and may not write the contents to disk
  immediately, but commits metadata changes quite frequently. So when
  rename(foo.tmp,foo) is committed to disk, it has a length of zero which
  is later updated when the data is written to disk. This means after a crash,
  foo is zero-length, and both the new and the old data has been lost, which
  is undesirable. This doesn't happen when using ext3's default settings
  because ext3 writes data to disk before metadata (which has performance
  problems, see Firefox 3 and fsync[2])
 
  Ted Ts'o's (the main author of ext3 and ext4) response is that applications
  which perform open(),write(),close(),rename() in the expectation that they
  will either get the old data or the new data, but not no data at all, are
  broken, and instead should call open(),write(),fsync(),close(),rename().

 The only granted way to have the file new in a stable state on the
 disk
 is to call:
 
 f = open(new, O_WRONLY|O_CREATE|O_TRUNC, 0666);
 write(f, dat, size);
 fsync(f);
 close(f);

AFAIUI, the ZFS transaction group maintains write ordering; at the least, the 
write()s to the file would be in the ZIL ahead of the rename() metadata updates.

So I think the atomicity is maintained without requiring the application to 
call fsync() before closing the file.  If the TXG is applied and the rename() 
is included, then the file writes have been too, so foo would have the new 
contents.  If the TXG containing the rename() isn't complete and on the ZIL 
device at crash time, foo would have the old contents.

Posix doesn't require the OS to sync() the file contents on close for local 
files like it does for NFS access?  How odd.

--Joe

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-11 Thread Moore, Joe
Lars-Gunnar Persson wrote:
 I would like to go back to my question for a second:
 
 I checked with my Nexsan supplier and they confirmed that access to
 every single disk in SATABeast is not possible. The smallest entities
 I can create on the SATABeast are RAID 0 or 1 arrays. With RAID 1 I'll
 loose too much disk space and I believe that leaves me with RAID 0 as
 the only reasonable option. But with this unsecure RAID format I'll
 need higher redundancy in the ZFS configuration. I think I'll go with
 the following configuration:
 
 On the Nexsan SATABeast:
 * 14 disks configured in 7 RAID arrays with RAID level 0 (each disk is
 1 TB which gives me a total of 14 TB raw disk space).
 * Each RAID 0 array configured as one volume.

So what the front end will see is 7 disks of 2TB each.

 
 On the Sun Fire X4100 M2 with Solaris 10:
 * Add all 7 volumes to one zpool configured in on raidz2 (gives me
 approx. 8,8 TB available disk space)

You'll get 5 LUNs worth of space in this config, or 10TB of usable space.

 
 Any comments or suggestions?

Given the hardware constraints (no single-disk volumes allowed) this is a good 
configuration for most purposes.

The advantages/disadvantages are:
* 10TB of usable disk space, out of 14TB purchased.
* At least three hard disk failures are required to lose the ZFS pool.
* Random non-cached read performance will be about 300 IO/sec.
* Sequential reads and writes of the whole ZFS blocksize will be fast (up to 
  2000 IO/sec).
* One hard drive failure will cause the used blocks of the 2TB LUN (RAID0 pair) 
  to be resilvered, even though the other half of the pair is not damaged.  The 
  other half of the pair is more likely to fail during the ZFS resilvering 
  operation because of the increased load.
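
For reference, building the pool over those 7 LUNs would look something like 
this (the device names are hypothetical):

# zpool create tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0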

You'll want to pay special attention to the cache settings on the Nexsan.  You 
earlier showed that the write cache is enabled, but IIRC the array doesn't have 
a nonvolatile (battery-backed) cache.  If that's the case, MAKE SURE it's 
hooked up to a UPS that can support it for the 30-second cache flush timeout on 
the array.  And make sure you don't power it down hard.  I think you want to 
uncheck the ignore FUA setting, so that FUA requests are respected.  My guess 
is that this will cause the array to properly handle the cache-flush requests 
that ZFS uses to ensure data consistency.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Moore, Joe
Bob Friesenhahn wrote:
 Your idea to stripe two disks per LUN should work.  Make sure to use
 raidz2 rather than plain raidz for the extra reliability.  This
 solution is optimized for high data throughput from one user.

Striping two disks per LUN (RAID0 on 2 disks) and then adding a ZFS form of 
redundancy (either mirror or raidz[2]) would be an efficient use of space.  
There would be no additional space overhead caused by running that way.

Note, however, that if you do this, ZFS must resilver the larger LUN in the 
event of a single disk failure on the backend.  This means a longer time to 
rebuild, and a lot of extra work on the other (non-failed) half of the RAID0 
stripe.

 
 An alternative is to create individual RAID 0 LUNs which actually
 only contain a single disk.  

This is certainly preferable, since the unit of failure at the hardware level 
corresponds to the unit of resilvering at the ZFS level.  And at least on my 
Nexsan SATAboy(2f) this configuration is possible.

 Then implement the pool as two raidz2s
 with six LUNs each, and two hot spares.  That would be my own
 preference.  Due to ZFS's load share this should provide better
 performance (perhaps 2X) for multi-user loads.  Some testing may be
 required to make sure that your hardware is happy with this.

I disagree with this suggestion.  With this config, you only get 8 disks worth 
of storage, out of the 14, which is a ~42% overhead.  In order to lose data in 
this scenario, 3 disks would have to fail out of a single 6-disk group before 
zfs is able to resilver any of them to the hot spares.  That seems (to me) a 
lot more redundancy than is needed.

As far as workload, any time you use RAIDZ[2], ZFS must read the entire stripe 
(across all of the disks) in order to verify the checksum for that data block.  
This means that a 128KB read (the default ZFS blocksize) requires a 32KB read 
from each of 6 disks, which may include a relatively slow seek to the relevant 
part of the spinning rust.  So for random I/O, even though the data is striped 
across all the disks, you will see only a single disk's worth of throughput.  
For sequential I/O, you'll see the full RAID set's worth of throughput.

If you are expecting a non-sequential workload, you would be better off taking 
the 50% storage overhead to do ZFS mirroring.
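
For comparison, the mirrored layout over single-disk LUNs would be built roughly 
like this (device names hypothetical):

# zpool create tank mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0 \
    mirror c3t4d0 c3t5d0 mirror c3t6d0 c3t7d0 mirror c3t8d0 c3t9d0 \
    mirror c3t10d0 c3t11d0 mirror c3t12d0 c3t13d0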

 
 Avoid RAID5 if you can because it is not as reliable with today's
 large disks and the resulting huge LUN size can take a long time to
 resilver if the RAID5 should fail (or be considered to have failed).

Here's a place that ZFS shines: it doesn't resilver the whole disk, just the 
data blocks.  So it doesn't have to read the full array to rebuild a failed 
disk, so it's less likely to cause a subsequent failure during parity rebuild.

My $.02.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs streams data corruption

2009-02-25 Thread Moore, Joe
Miles Nordin wrote:
   that SQLite2 should be equally as tolerant of snapshot backups as it
   is of cord-yanking.
 
 The special backup features of databases including ``performing a
 checkpoint'' or whatever, are for systems incapable of snapshots,
 which is most of them.  Snapshots are not writeable, so this ``in the
 middle of a write'' stuff just does not happen.

This is correct.  The general term for these sorts of point-in-time backups is 
crash-consistent.  If the database can be recovered easily (and/or 
automatically) from pulling the plug (or a kill -9), then a snapshot is an 
instant backup of that database.

In-flight transactions (ones that have not been committed) at the database 
level are rolled back.  Applications using the database will be confused by 
this in a recovery scenario, since transactions that were reported as committed 
are gone when the database comes back.  But that's the case any time a database 
moves backward in time.

 Of course Toby rightly pointed out this claim does not apply if you
 take a host snapshot of a virtual disk, inside which a database is
 running on the VM guest---that implicates several pieces of
 untrustworthy stacked software.  But for snapshotting SQLite2 to clone
 the currently-running machine I think the claim does apply, no?


Snapshots of a virtual disk are also crash-consistent.  If the VM has not 
committed its transactionally-committed data and is still holding it in 
volatile memory, that VM is not maintaining its ACID requirements, and that's a 
bug in either the database or in the OS running on the VM.  The snapshot 
represents the disk state as if the VM were instantly gone.  If the VM or the 
database can't recover from pulling the virtual plug, the snapshot can't help 
that.

That said, it is a good idea to quiesce the software stack as much as possible 
to make the recovery from the crash-consistent image as painless as possible.  
For example, if you take a snapshot of a VM running on an EXT2 filesystem (or 
unlogged UFS, for that matter) the recovery will require an fsck of that 
filesystem to ensure that the filesystem structure is consistent.  Performing a 
lockfs on the filesystem while the snapshot is taken could mitigate that, but 
that's still out of the scope of the ZFS snapshot.
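
As a sketch of that kind of quiesce (the dataset and mount names are made up): 
inside a Solaris guest running UFS you could write-lock and flush the 
filesystem, take the ZFS snapshot on the host, then unlock:

guest# lockfs -w /export/data
host#  zfs snapshot tank/vmstore@nightly
guest# lockfs -u /export/data

The snapshot itself doesn't need this; the lockfs just shortens the fsck/log 
replay at recovery time.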

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-23 Thread Moore, Joe
Mario Goebbels wrote:
 One thing I'd like to see is an _easy_ option to fall back onto older
 uberblocks when the zpool went belly up for a silly reason. Something
 that doesn't involve esoteric parameters supplied to zdb.

Between uberblock updates, there may be many write operations to a data file, 
each requiring a copy on write operation.  Some of those operations may reuse 
blocks that were metadata blocks pointed to by the previous uberblock.

In which case the old uberblock points to a metadata tree full of garbage.

Jeff, you must have some idea of how to overcome this in your bugfix; would you 
care to share?
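
(For the record, the esoteric route today is along the lines of dumping a device 
label with zdb -l /dev/dsk/c0t0d0s0 and the active uberblock with zdb -u <pool>, 
and then somehow convincing the pool to start from an older one; hardly the 
_easy_ option Mario is asking for.)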

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-20 Thread Moore, Joe
  Ross wrote:
  The problem is they might publish these numbers, but we 
 really have  
  no way of controlling what number manufacturers will 
 choose to use  
  in the future.
 
  If for some reason future 500GB drives all turn out to be slightly  
  smaller than the current ones you're going to be stuck.  Reserving  
  1-2% of space in exchange for greater flexibility in replacing  
  drives sounds like a good idea to me.  As others have said, RAID  
  controllers have been doing this for long enough that even the very  
  basic models do it now, and I don't understand why such simple  
  features like this would be left out of ZFS.

It would certainly be terrible to go back to the days where 5% of the filesystem 
space is inaccessible to users, forcing the sysadmin to manually change that 
percentage to 0 to get full use of the disk.

Oh wait, UFS still does that, and it's a configurable parameter at mkfs time 
(and one that can be tuned on the fly).

For a ZFS pool, (until block pointer rewrite capability) this would have to be 
a pool-create-time parameter.  Perhaps a --usable-size=N[%] option which would 
either cut down the size of the EFI slices or fake the disk geometry so the EFI 
label ends early.

Or it would be a small matter of programming to build a perl wrapper for zpool 
create that would accomplish the same thing.
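
A sketch of that wrapper's effect, assuming hypothetical devices and a slice 0 
sized to ~98% of each disk in format(1M) beforehand: build the pool on slices 
instead of whole disks,

# zpool create tank raidz c1t1d0s0 c1t2d0s0 c1t3d0s0

so a replacement disk only has to be big enough to hold the slightly smaller 
slice.  (The tradeoff is that you give up the automatic enabling of the disk 
write cache that whole-disk vdevs get.)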

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-20 Thread Moore, Joe
 Miles Nordin wrote:
  mj == Moore, Joe joe.mo...@siemens.com writes:
 
 mj For a ZFS pool, (until block pointer rewrite capability) this
 mj would have to be a pool-create-time parameter. 
 
 naw.  You can just make ZFS do it all the time, like the other storage
 vendors do.  no parameters.

Other storage vendors have specific compatibility requirements for the disks 
you are allowed to install in their chassis.

On the other hand, OpenSolaris is intended to work on commodity hardware.

And there is no way to change this after the pool has been created, since after 
that time, the disk size can't be changed.  So whatever policy is used by 
default, it is very important to get it right.


(snip)
 
 Most people will not even notice the feature exists except by getting
 errors less often.  AIUI this is how it works with other RAID layers,
 the cheap and expensive alike among ``hardware'' RAID, and this
 common-practice is very ZFS-ish.  except hardware RAID is proprietary
 so you cannot determine their exact policy, while in ZFS you would be
 able to RTFS and figure it out.

Sysadmins should not be required to RTFS.  Behaviors should be documented in 
other places too.

 
 But there is still no need for parameters.  There isn't even a need to
 explain the feature to the user.

There isn't a need to explain the feature to the user?  That's one of the most 
irresponsible responses I've heard lately.  A user is expecting their 500GB 
disk to be 500,000,000,000 bytes, not 499,950,000,000 bytes, unless that 
feature is explained.

Parameters with reasonable defaults (and a reasonable way to change them) allow 
users who care about the parameter and understand the tradeoffs involved in 
changing from the default to make their system work better.

If I didn't want to be able to tune my system for performance, I would be 
running Windows.  OpenSolaris is about transparency, not just Open Source.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs subdirectories to data set conversion

2009-01-12 Thread Moore, Joe
Nicolas Williams wrote:
 It'd be awesome to have a native directory-dataset conversion feature
 in ZFS.  And, relatedly, fast moves of files across datasets 
 in the same
 volume.  These two RFEs have been discussed to death in the list; see
 the archives.

This would be a nice feature to have.  The most compelling technical problem 
I've seen in the idea of reparenting a directory to be a top-level dataset is 
that when a zfs filesystem is used, open files on that filesystem have a 
particular devid.  In order to split off the directory onto a new zfs 
filesystem, you'd have to atomically change the devid inside all the processes 
that have open files under that directory.  Finding those open files is 
practically impossible.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Moore, Joe
Ross Smith wrote:
 My justification for this is that it seems to me that you can split
 disk behavior into two states:
 - returns data ok
 - doesn't return data ok
 
 And for the state where it's not returning data, you can again split
 that in two:
 - returns wrong data
 - doesn't return data

The state under discussion in this thread is "the I/O requested by ZFS hasn't 
finished after 60, 120, 180, 3600, etc. seconds."

The pool is waiting (for device timeouts) to distinguish between the first two 
states.

More accurate state descriptions are:
- The I/O has returned data
- The I/O hasn't yet returned data and the user (admin) is justifiably 
impatient.

For the first state, the data is either correct (verified by the ZFS checksums, 
or ESUCCESS on write) or incorrect and retried.

 
 The first of these is already covered by ZFS with its checksums (with
 FMA doing the extra work to fault drives), so it's just the second
 that needs immediate attention, and for the life of me I can't think
 of any situation that a simple timeout wouldn't catch.
 
 Personally I'd love to see two parameters, allowing this behavior to
 be turned on if desired, and allowing timeouts to be configured:
 
 zfs-auto-device-timeout
 zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:
zpool-device-timeout
zpool-device-timeout-fail-delay

with specific per-VDEV overrides possible:
vdev-device-timeout and vdev-device-fail-delay

This would allow but not require slower VDEVs to be tuned specifically for that 
case without hindering the default pool behavior on the local fast disks.  
Specifically, consider where I'm using mirrored VDEVs with one half over iSCSI, 
and want to have the iSCSI retry logic to still apply.  Writes that failed 
while the iSCSI link is down would have to be resilvered, but at least reads 
would switch to the local devices faster.

Set them to the default magic 0 value to have the system use the current 
behavior of relying on the device drivers to report failures.
Set them to a number (in ms, probably) and the pool would consider an I/O that 
takes longer than that as having returned invalid data.

When the FMA work discussed below is done, these could be augmented by the 
pool's best heuristic guess as to what the proper timeouts should be, which 
could be saved in (kstat?) vdev-device-autotimeout.

If you set the timeout to the magic -1 value, the pool would use 
vdev-device-autotimeout.

All that would be required is for the I/O that caused the disk to take a long 
time to be given a deadline (now + (vdev-device-timeout ?: 
(zpool-device-timeout?: forever)))* and consider the I/O complete with whatever 
data has returned after that deadline: if that's a bunch of 0's in a read, 
which would have a bad checksum; or a partially-completed write that would have 
to be committed somewhere else.
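
Purely to illustrate the proposal (none of these properties exist today; the 
names and syntax are hypothetical), usage might look like:

# zpool set zpool-device-timeout=3000 tank
# zpool set vdev-device-timeout=15000 tank <iscsi-side-vdev>

with the second line loosening the deadline only for the iSCSI half of the 
mirror.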

Unfortunately, I'm not enough of a programmer to implement this.

--Joe
* with the -1 magic, it would be a little more complicated calculation.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Moore, Joe
C. Bergström wrote:
 Will Murnane wrote:
  On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:

  Still don't understand why even the one on 
 http://www.opensolaris.com/, ZFS - A Smashing Hit, doesn't 
 show the app running in the moment the HD is smashed... weird...
  
 Sorry this is OT, but is it just me or does is only seem 
 proper to have 
 Gallagher do this? ;)

Absolutely not.  Under no circumstances should you attempt to create a striped 
ZFS pool on a watermelon, nor on any other type of epigynous berry.

If you try, you will certainly rind up with a mess, if not a core dump.  And 
let me tell you, that's the pits.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris, thumper and hd

2008-10-15 Thread Moore, Joe
 Tommaso Boccali wrote:
  Ciao, I have a thumper with Opensolaris (snv_91), and 48 disks.
  I would like to try a new brand of  HD, by replacing a
 spare disk with a new one and build on it a zfs pool.
 
  Unfortunately the official utility to map a disk to the
 physical position inside the thumper (hd, in /opt/SUNWhd) is
 not present in OpenSolaris.
 
  Any idea on how
  - get it
 

 It should have shipped with the system.  But you can also download it
 http://www.sun.com/servers/x64/x4500/downloads.jsp

 Get it, install it, be happy :-)


Or if you haven't tweaked the discovery order, there's a map at 
http://docs.sun.com/source/819-4359-14/figures/CH2-power-bios-9.gif

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)

2008-10-08 Thread Moore, Joe
Brian Hechinger
 On Mon, Oct 06, 2008 at 10:47:04AM -0400, Moore, Joe wrote:
 
  I wonder if an AVS-replicated storage device on the
 backends would be appropriate?
 
  write -> ZFS-mirrored slog -> ramdisk -AVS-> physical disk
                            \
                             +-iscsi-> ramdisk -AVS-> physical disk
 
  You'd get the continuous replication of the ramdisk to
 physical drive (and perhaps automagic recovery on reboot) but
 not pay the syncronous write to remote physical disk penalty

 It looks like the answer is no.

 [EMAIL PROTECTED] sudo sndradm -e localhost
 /dev/rramdisk/avstest1 /dev/zvol/rdsk/SYS0/bitmap1
 \wintermute /dev/zvol/dsk/SYS0/avstest2
 /dev/zvol/rdsk/SYS0/bitmap2 ip async
 Enable Remote Mirror? (Y/N) [N]: y
 sndradm: Error: both localhost and wintermute are local

I've not worked with AVS beyond looking at the basic concepts, but to me this 
looks like a don't-shoot-yourself-in-the-foot critical warning rather than an 
actual functionality restriction.  Is there a -force option to override this 
normally quite reasonable sanity check?

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)

2008-10-06 Thread Moore, Joe
Nicolas Williams wrote
 There have been threads about adding a feature to support slow mirror
 devices that don't stay synced synchronously.  At least IIRC.  That
 would help.  But then, if the pool is busy writing then your slow ZIL
 mirrors would generally be out of sync, thus being of no help in the
 even of a power failure given fast slog devices that don't
 survive power
 failure.

I wonder if an AVS-replicated storage device on the backends would be 
appropriate?

write -> ZFS-mirrored slog -> ramdisk -AVS-> physical disk
                          \
                           +-iscsi-> ramdisk -AVS-> physical disk

You'd get the continuous replication of the ramdisk to the physical drive (and 
perhaps automagic recovery on reboot) but not pay the synchronous 
write-to-remote-physical-disk penalty.
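
A rough sketch of the local half of that picture, with a hypothetical pool 
tank and made-up device names (c4t1d0 standing in for the iSCSI LUN backed by 
the remote ramdisk):

# ramdiskadm -a slog1 4g
/dev/ramdisk/slog1
# zpool add tank log mirror /dev/ramdisk/slog1 c4t1d0

AVS would then replicate each ramdisk to its local physical disk underneath ZFS.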


 Also, using remote devices for a ZIL may defeat the purpose of fast
 ZILs, even if the actual devices are fast, because what really matters
 here is latency, and the farther the device, the higher the latency.

A .5-ms RTT on an ethernet link to the iSCSI disk may be faster than a 9-ms 
latency on physical media.

There was a time when it was better to place workstations' swap files on the 
far side of a 100Mbps ethernet link rather than using the local spinning rust.  
Ah, the good old days...

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Moore, Joe

Toby Thain Wrote:
 ZFS allows the architectural option of separate storage without losing end to 
 end protection, so the distinction is still important. Of course this means 
 ZFS itself runs on the application server, but so what?

The OP in question is not running his network clients on Solaris or OpenSolaris 
or FreeBSD or MacOSX, but rather a collection of Linux workstations.  Unless 
there's been a recent port of ZFS to Linux, that makes a big What.

Given the fact that NFS, as implemented in his client systems, provides no 
end-to-end reliability, the only data protection that ZFS has any control over 
is after the write() is issued by the NFS server process.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Moore, Joe


Ian Collins wrote:
 I think you'd be surprised how large an organisation can migrate most,
 if not all of their application servers to zones one or two Thumpers.

 Isn't that the reason for buying in server appliances?


Assuming that the application servers can coexist within the only 16GB of RAM 
available on a Thumper and the only 8GHz or so of total CPU, and can live with 
the fact that the system controller is a massive single point of failure for 
both the applications and the storage.

You may have a difference of opinion as to what a large organization is, but 
the reality is that the Thumper series is good for some things in a large 
enterprise, and not good for others.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Moore, Joe
Darren J Moffat wrote:
 Moore, Joe wrote:
  Given the fact that NFS, as implemented in his client
 systems, provides no end-to-end reliability, the only data
 protection that ZFS has any control over is after the write()
 is issued by the NFS server process.

 NFS can provided on the wire protection if you enable Kerberos support
 (there are usually 3 options for Kerberos: krb5 (or sometimes called
 krb5a) which is Auth only, krb5i which is Auth plus integrity provided
 by the RPCSEC_GSS layer, krb5p Auth+Integrity+Encrypted data.

 I have personally seen krb5i NFS mounts catch problems when
 there was a
 router causing failures that the TCP checksum don't catch.

No doubt, additional layers of data protection are available.  I don't know the 
state of RPCSEC on Linux, so I can't comment on this, certainly your experience 
brings valuable insight into this discussion.
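
(For example, on a Solaris client a krb5i mount looks roughly like

# mount -F nfs -o sec=krb5i server:/export /mnt

assuming the Kerberos plumbing is already in place.)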

It is also recommended (when iSCSI is an appropriate transport) to run over 
IPsec in ESP mode to also ensure data-packet-content consistency.  Certainly 
NFS over IPsec/ESP would be more resistant to on-the-wire corruption.

Either of these would give better data reliability than pure NFS, just like ZFS 
on the backend gives better data reliability than for example, UFS or EXT3.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] iscsi target problems on snv_97

2008-09-17 Thread Moore, Joe
 I believe the problem you're seeing might be related to deadlock
 condition (CR 6745310), if you run pstack on the
 iscsi target  daemon you might find a bunch of zombie
 threads.  The fix
 is putback to snv-99, give snv-99 a try.

Yes, a pstack of the core I've generated from iscsitgtd does have a number of 
zombie threads.

I'm afraid I can't make heads nor tails of the bug report at 
http://bugs.opensolaris.org/view_bug.do?bug_id=6658836 nor its duplicate-of 
6745310, nor any of the related bugs (all are unavailable except for 6676298, 
and the stack trace reported in that bug doesn't look anything like mine).

As far as I can tell snv-98 is the latest build, from Sep 10 according to 
http://dlc.sun.com/osol/on/downloads/.  So snv-99 should be out next week, 
correct?

Anything I can do in the mean time?  Do I need to BFU to the latest nightly 
build?  Or would just taking the iscsitgtd from that build suffice?

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] iscsi target problems on snv_97

2008-09-16 Thread Moore, Joe
I've recently upgraded my x4500 to Nevada build 97, and am having problems with 
the iscsi target.

Background: this box is used to serve NFS underlying a VMware ESX environment 
(zfs filesystem-type datasets) and presents iSCSI targets (zfs zvol datasets) 
for a Windows host and to act as zoneroots for Solaris 10 hosts.  For optimal 
random-read performance, I've configured a single zfs pool of mirrored VDEVs of 
all 44 disks (+2 boot disks, +2 spares = 48)

Before the upgrade, the box was flaky under load: all I/Os to the ZFS pool 
would stop occasionally.

Since the upgrade, that hasn't happened, and the NFS clients are quite happy.  
The iSCSI initiators are not.

The windows initiator is running the Microsoft iSCSI initiator v2.0.6 on 
Windows 2003 SP2 x64 Enterprise Edition.  When the system reboots, it is not 
able to connect to its iscsi targets.  No devices are found until I restart the 
iscsitgt process on the x4500, at which point the initiator will reconnect and 
find everything.  I notice that on the x4500, it maintains an active TCP 
connection (according to netstat -an | grep 3260) to the Windows box through 
the reboot and for a long time afterwards.  The initiator starts a second 
connection, but it seems that the target doesn't let go of the old one.  Or 
something.  At this point, every time I reboot the Windows system I have to 
`pkill iscsitgtd`

The Solaris system is running S10 Update 4.  Every once in a while (twice 
today, and not correlated with the pkill's above) the system reports that all 
of the iscsi disks are unavailable.  Nothing I've tried short of a reboot of 
the whole host brings them back.  All of the zones on the system remount their 
zoneroots read-only (and give I/O errors when read or zlogin'd to)

There are a set of TCP connections from the zonehost to the x4500 that remain 
even through disabling the iscsi_initiator service.  There's no process holding 
them as far as pfiles can tell.

Does this sound familiar to anyone?  Any suggestions on what I can do to 
troubleshoot further?  I have a kernel dump from the zonehost and a snoop 
capture of the wire for the Windows host (but it's big).

I'll be opening a bug too.

Thanks,
--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540

2008-07-11 Thread Moore, Joe
Bob Friesenhahn
 I expect that Sun is realizing that it is already 
 undercutting much of 
 the rest of its product line.  These minor updates would allow the 
 X4540 to compete against much more expensive StorageTek SAN hardware. 

Assuming, of course, that the requirements for the more expensive SAN
hardware don't include, for example, surviving a controller or
motherboard failure (or gracefully handling a RAM chip failure) without
requiring extensive downtime for replacement, or other extended downtime
because there's only one set of chips that can talk to those disks.

Real SAN storage is dual-ported to dual controller nodes so that you
can replace a motherboard without taking down access to the disk.  Or
install a new OS version without waiting for the system to POST.

 How can other products remain profitable when competing 
 against such a 
 star performer?

Features.  RAS.  Simplicity.  Corporate Inertia (having storage admins
who don't know OpenSolaris).  Executive outings with StorageTek-logo'd
golfballs.  The last 2 aren't something I'd build a business case
around, but they're a reality.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] proposal partial/relative paths for zfs(1)

2008-07-10 Thread Moore, Joe
Carson Gaspar wrote:
 Darren J Moffat wrote:
  $ pwd
  /cube/builds/darrenm/bugs
  $ zfs create -c 6724478
  
  Why -c ?  -c for current directory  -p partial is 
 already taken to 
  mean create all non existing parents and -r relative is 
 already used 
  consistently as recurse in other zfs(1) commands (as well 
 as lots of 
  other places).
 
 Why not zfs create $PWD/6724478. Works today, traditional UNIX 
 behaviour, no coding required. Unles you're in some bizarroland shell 
 (like csh?)...

Because the zfs dataset mountpoint may not be the same as the zfs pool
name.  This makes things a bit complicated for the initial request.

Personally, I haven't played with datasets where the mountpoint is
different.  If you have a zpool tank mounted on /tank and /tank/homedirs
with mountpoint=/export/home, do you create the next dataset
/tank/homedirs/carson, or /export/home/carson ?  And does the mountpoint
get inherited in the obvious (vs. the simple vs. not at all) way?  I
don't know.
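
For concreteness, the situation in question, with made-up names and the 
proposed -c flag:

# zpool create tank c0t0d0
# zfs create tank/homedirs
# zfs set mountpoint=/export/home tank/homedirs
# cd /export/home
# zfs create -c carson

Does that become tank/homedirs/carson, and where does it get mounted?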

Also $PWD has a leading / in this example.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-08 Thread Moore, Joe
Bob Friesenhahn wrote:
 Something else came to mind which is a negative regarding 
 deduplication.  When zfs writes new sequential files, it 
 should try to 
 allocate blocks in a way which minimizes fragmentation 
 (disk seeks). 

It should, but because of its copy-on-write nature, fragmentation is a
significant part of the ZFS data lifecycle.

There was a discussion of this on this list at the beginning of the
year...
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-November/044077.html

 Disk seeks are the bane of existing storage systems since they come 
 out of the available IOPS budget, which is only a couple hundred 
 ops/second per drive.  The deduplication algorithm will surely result 
 in increasing effective fragmentation (decreasing sequential 
 performance) since duplicated blocks will result in a seek to the 
 master copy of the block followed by a seek to the next block.  Disk 
 seeks will remain an issue until rotating media goes away, which (in 
 spite of popular opinion) is likely quite a while from now.

On ZFS, sequential files are rarely sequential anyway.  The SPA tries to
keep blocks nearby, but when dealing with snapshotted sequential files
being rewritten, there is no way to keep everything in order.

But if you read through the thread referenced above, you'll see that
there's no clear data about just how that impacts performance (I still
owe Mr. Elling a filebench run on one of my spare servers)

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help! ZFS pool is UNAVAILABLE

2008-01-02 Thread Moore, Joe
I AM NOT A ZFS DEVELOPER.  These suggestions should work, but there
may be other people who have better ideas.

Aaron Berland wrote:
 Basically, I have a 3 drive raidz array on internal Seagate 
 drives. running build 64nv. I purchased 3 add'l USB drives 
 with the intention of mirroring and then migrating the data 
 to the new USB drives.
 (snip)
 Below is my current zpool status.  Note the USB drives are 
 showing up as the same device.  They are plugged into 3 
 different port and they used to show up as different controllers??  
 
 This whole thing was supposed to duplicate my data and have 
 more redundancy, but now it looks like I could be loosing it 
 all?!  I have some data backed up on other devices, but not all.
 
 NAME        STATE     READ WRITE CKSUM
 zbk         UNAVAIL      0     0     0  insufficient replicas
   raidz1    ONLINE       0     0     0
     c2d0p2  ONLINE       0     0     0
     c1d0    ONLINE       0     0     0
     c1d1    ONLINE       0     0     0
   raidz1    UNAVAIL      0     0     0  insufficient replicas
     c5t0d0  ONLINE       0     0     0
     c5t0d0  FAULTED      0     0     0  corrupted data
     c5t0d0  FAULTED      0     0     0  corrupted data

Ok, from here, we can see that you have a single pool, with two striped
components: a raidz set from c1 and c2 disks, and the (presumably new)
raidz set from c5 -- I'm guessing this is where the USB disks show up.

Unfortunately, it is not possible to remove a component from a zfs pool.

On the bright side, it might be possible to trick it, at least for long
enough to get the data back.

First, we'll want to get the system booted.  You'll connect the USB
devices, but DON'T try to do anything with your pool (especially don't
put more data on it).

You should then be able to get a consistent pool up and running -- the
devices will be scanned and detected and automatically reenabled.  You
might have to do a zpool import to search all of the /dev/dsk/
devices.

From there, pull out one of the USB drives and do a zpool scrub to
resilver the failed RAID group.  Now wipe off the removed USB disk
(format it with UFS or something... it just needs to lose the ZFS
identifiers.  And while we're at it, UFS is probably a good choice
anyway, given the next steps).  One of the disks will show FAULTED at
this point; I'll call it c5t2d0.

Now, mount up that extra disk, and run mkfile -n 500g
/mnt/theUSBdisk/disk1.img (the -n makes it a sparse file).

Then do a zpool replace zbk c5t2d0 /mnt/theUSBdisk/disk1.img

Then you can also replace the other 2 USB disks with other img files
too... as long as the total data written to these stripes doesn't exceed
the actual size of the disk, you'll be OK.  At this point, back up your
data (zfs send <snapshot> | bzip2 -9 > /mnt/theUSBdisk/backup.dat).

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZIL and snapshots

2007-12-13 Thread Moore, Joe
I'm using an x4500 as a large data store for our VMware environment.  I
have mirrored the first 2 disks, and created a ZFS pool of the other 46:
22 pairs of mirrors, and 2 spares (optimizing for random I/O performance
rather than space).  Datasets are shared to the VMware ESX servers via
NFS.  We noticed that VMware mounts its NFS datastore with the SYNC
option, so every NFS write gets flagged with FILE_SYNC.  In testing,
synchronous writes are significantly slower than async, presumably
because of the strict ordering required for correctness (cache flushing
and ZIL).
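
(For reference, the knob in question here is the /etc/system tunable, i.e.
something like

set zfs:zil_disable = 1

followed by a reboot.)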

Can anyone tell me if a ZFS snapshot taken when zil_disable=1 will be
crash-consistent with respect to the data written by VMware?  Are the
snapshot metadata updates serialized with pending non-metadata writes?
If an asynchronous write is issued before the snapshot is initiated, is
it guaranteed to be in the snapshot data, or can it be reordered to
after the snapshot?  Does a snapshot flush pending writes to disk?

To increase performance, the users are willing to lose an hour or two
of work (these are development/QA environments): In the event that the
x4500 crashes and loses the 16GB of cached (zil_disable=1) writes, we
roll back to the last hourly snapshot, and everyone's back to the way
they were.  However, I want to make sure that we will be able to boot a
crash-consistent VM from that rolled-back virtual disk.

Thanks for any knowledge you might have,
--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL and snapshots

2007-12-13 Thread Moore, Joe
 Have you thought of solid state cache for the ZIL?  There's a 
 16GB battery backed PCI card out there, I don't know how much 
 it costs, but the blog where I saw it mentioned a 20x 
 improvement in performance for small random writes.

Thought about it, looked in the Sun Store, couldn't find one, and cut
the PO.

Haven't gone back to get a new approval.  I did put a couple of the
MTron 32GB SSD drives on the christmas wishlist (aka 2008 budget)

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-21 Thread Moore, Joe
BillTodd wrote:
 In order to be reasonably representative of a real-world 
 situation, I'd suggest the following additions:
 

Your suggestions (make the benchmark big enough so seek times are really
noticed) are good.  I'm hoping that over the holidays, I'll get to play
with an extra server...  If I'm lucky, I'll have 2x36GB drives (in a
1-2GB memory server) that I can dedicate to their own mirrored zfs pool.
I figure a 30GB test file should make the seek times interesting.

There's also a needed step:
5) Run the same microbenchmark against a UFS filesystem to compare the
step2/step4 ratio with what a non-COW filesystem offers.

In theory, the UFS ratio should be 1:1, that is, sequential read
performance should not be affected by the intervening random writes.
(In the case of my test server, I'll make it an SVM mirror of the same 2
drives)

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Moore, Joe
Louwtjie Burger wrote:
 Richard Elling wrote:
 
  - COW probably makes that conflict worse
  
  
 
  This needs to be proven with a reproducible, real-world 
 workload before it
  makes sense to try to solve it.  After all, if we cannot 
 measure where
  we are,
  how can we prove that we've improved?
 
 I agree, let's first find a reproducible example where updates
 negatively impacts large table scans ... one that is rather simple (if
 there is one) to reproduce and then work from there.

I'd say it would be possible to define a reproducible workload that
demonstrates this using the Filebench tool... I haven't worked with it
much (maybe over the holidays I'll be able to do this), but I think a
workload like:

1) create a large file (bigger than main memory) on an empty ZFS pool.
2) time a sequential scan of the file
3) random write i/o over say, 50% of the file (either with or without
matching blocksize)
4) time a sequential scan of the file

The difference between times 2 and 4 are the penalty that COW block
reordering (which may introduce seemingly-random seeks between
sequential blocks) imposes on the system.
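
A rough command-level sketch of that workload, with hypothetical device names
(the random-rewrite step is easiest done with filebench itself or a small
program):

# zpool create testpool mirror c1t0d0 c1t1d0
# mkfile 30g /testpool/bigfile                        (step 1: file > RAM)
# time dd if=/testpool/bigfile of=/dev/null bs=128k   (step 2: sequential scan)
  ... random 128k rewrites over ~50% of the file ...  (step 3)
# time dd if=/testpool/bigfile of=/dev/null bs=128k   (step 4: scan again)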

It would be interesting to watch seeksize.d's output during this run
too.

--Joe

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HAMMER

2007-11-05 Thread Moore, Joe
Peter Tribble wrote: 
 I'm not worried about the compression effect. Where I see problems is
 backing up million/tens of millions of files in a single 
 dataset. Backing up
 each file is essentially a random read (and this isn't helped by raidz
 which gives you a single disks worth of  random read I/O per vdev). I
 would love to see better ways of backing up huge numbers of files.

It's worth correcting this point... the RAIDZ behavior you mention only
occurs if the read size is not aligned to the dataset's block size.  The
checksum verifier must read the entire stripe to validate the data, but
it does that in parallel across the stripe's vdevs.  The whole block is
then available for delivery to the application.

Although, backing up millions/tens of millions of files in a single
backup dataset is a bad idea anyway.  The metadata searches will kill
you, no matter what backend filesystem is supporting it.

zfs send is the faster way of backing up huge numbers of files.  But
you pay the price in restore time.  (But that's the normal tradeoff)

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future ZFS Boot and ZFS copies

2007-10-03 Thread Moore, Joe
 
Jesus Cea wrote:
 Darren J Moffat wrote:
  Why would you do that when it would reduce your protection 
 and ZFS boot 
  can boot from a mirror anyway.
 
 I guess ditto blocks would be protection enough, since the 
 data would be
 duplicated between both disks. Of course, backups are your friend.

I asked almost the exact same question when I first heard about ditto
blocks.  (See
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040596.html
and followups)

There are 2 key differences between ditto blocks and mirrors:

1) The ZFS pool is considered unprotected.  That means a device
failure will result in a kernel panic.

2) Ditto block separation is not enforced.  The allocator tries to keep
the second copy far from the first one, but it is possible that both
copies of your /etc/passwd file are on the same VDEV.  This means that a
device failure could result in real loss of data.

It would be really nice if there was some sort of
enforced-ditto-separation (fail w/ device full if unable to satisfy) but
that doesn't exist currently.
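
(For reference, ditto blocks for user data are controlled per-dataset, e.g.

# zfs set copies=2 tank/export/home

Metadata always gets extra copies regardless of this setting.)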

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] space allocation vs. thin provisioning

2007-09-14 Thread Moore, Joe
Mike Gerdts wrote: 
 I'm curious as to how ZFS manages space (free and used) and how
 its usage interacts with thin provisioning provided by HDS
 arrays.  Is there any effort to minimize the number of provisioned
 disk blocks that get writes so as to not negate any space
 benefits that thin provisioning may give?

I was trying to compose an email asking almost the exact same question,
but in the context of array-based replication.  They're similar in the
sense that you're asking about using already-written space, rather than
to go off into virgin sectors of the disks (in my case, in the hope that
the previous write is still waiting to be replicated and thus can be
replaced by the current data)

 
 
 Background  more detailed questions:
 
 In Jeff Bonwick's blog[1], he talks about free space management
 and metaslabs.  Of particular interest is the statement: ZFS
 divides the space on each virtual device into a few hundred
 regions called metaslabs.
 
 1. http://blogs.sun.com/bonwick/entry/space_maps

I wish I'd have seen this blog while I was composing my question... it
answers some of my questions about how things work (plus Jeff's
zfs_block_allocation entry actually moots most of my comments since
they've already been implemented)

(snip)
 
 As data is deleted, do the freed blocks get reused before never
 used blocks?

I didn't see any code where this would happen.  

I would really love to see a zpool setting where I can specify the reuse
algorithm.  (For example: zpool set block_reuse_policy=mru or =dense or
=broad or =low)

MRU (most recently used) in the hopes that the storage replication
hasn't yet committed the previous write to the other side of the WAN
DENSE (reuse any previously-written space) in the thin-provisioning case
BROAD (venture off into new space when possible) for media that has a
rewrite cycle limitations (flash drives) to spread the writes over as
much of the media as possible
LOW (prioritize low-block# space) would provide optimal rotational
latency for random i/o in the future and might be a special case of the
above.  The corresponding HIGH would improve sequential i/o.

(Implementation is left as an exercise to the reader ;)

 
 Is there any collaboration between the storage vendors and ZFS
 developers to allow the file system to tell the storage array
 this range of blocks is unused so that the array can reclaim
 the space?  I could see this as useful when doing re-writes of
 data (e.g. crypto rekey) to concentrate data that had become
 scattered into contiguous space.

Deallocating storage space is something that nobody seems to be good at:
ever tried to shrink a filesystem?  Or a ZFS pool?  Or a SAN RAID group?

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Moore, Joe
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Frank Cusack
 Sent: Friday, August 10, 2007 7:26 AM
 To: Tuomas Leikola
 Cc: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Force ditto block on different vdev?
 
 On August 10, 2007 2:20:30 PM +0300 Tuomas Leikola 
 [EMAIL PROTECTED] wrote:
   We call that a mirror :-)
  
  
   Mirror and raidz suffer from the classic blockdevice abstraction
   problem in that they need disks of equal size.
 
  Not that I'm aware of.  Mirror and raid-z will simply use 
 the smallest
  size of your available disks.
 
 
  Exactly. The rest is not usable.
 
 Well I don't understand how you suggest to use it if you want 
 redundancy.

Since copies=N is a per-filesystem setting, writes to
/tank/important_documents (copies=2) would start failing once there's no
room left for the ditto copy on another VDEV, while /tank/torrentcache
(copies=1) could still use the remaining space.

With disks of 100 and 50 GB mirrored, /tank/torrentcache would be more
redundant than necessary, and you run out of capacity too soon.

Wishlist: It would be nice to put the whole redundancy definitions into
the zfs filesystem layer (rather than the pool layer):  Imagine being
able to set copies=5+2 for a filesystem... (requires a 7-VDEV pool,
and stripes via RAIDz2, otherwise the zfs create/set fails)

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and powerpath

2007-07-23 Thread Moore, Joe
Brian Wilson wrote:
 On Jul 16, 2007, at 6:06 PM, Torrey McMahon wrote:
  Darren Dunham wrote:
  My previous experience with powerpath was that it rode below the  
  Solaris
  device layer.  So you couldn't cause trespass by using the wrong
  device.  It would just go to powerpath which would choose the link
to
  use on its own.
 
  Is this not true or has it changed over time?
  I haven't looked at power path for some time but it used to be the
  opposite. The powerpath node sat on top of the actual device paths.

  One of the selling points of mpxio is that it doesn't have that  
  problem. (At least for devices it supports.) Most of the multipath
software had  
  that same limitation
 
 
 I agree, it's not true.  I don't know how long it hasn't been true,  
 but the last year and a half I've been implementing PowerPath on  
 Solaris 8, 9, 10, the way to make it work is to point whatever disk  
 tool you're using to the emcpower device.  The other paths are there  
 because leadville finds them and creates them (if you're using  
 leadville), but PowerPath isn't doing anything to make them  
 redundant, it's giving you the emcpower device and the emcp, etc.  
 drivers to front end them and give you a multipathed device (the  
 emcpower device).  It DOES choose which one to use, for all 
 I/O going  
 through the emcpower device.  In a situation where you lose 
 paths and  
 I/O is moving, you'll see scsi errors down one path, then the next,  
 then the next, as PowerPath gets fed the scsi error and tries the  
 next device path.  If you use those actual device paths, you're not  
 actually getting a device that PowerPath is multipathing for you  
 (i.e. it does not dig in beneath the scsi driver)

I'm afraid I have to disagree with you: I'm using the
/dev/dsk/c2t$WWNdXs2 devices quite happily with powerpath handling
failover for my clariion.

# powermt version
EMC powermt for PowerPath (c) Version 4.4.0 (build 274)
# powermt display dev=58
Pseudo name=emcpower58a
CLARiiON ID=APM00051704678 [uscicsap1]
Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1:
/oracle/Q02/saparch]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP A

==============================================================================
---------------- Host ----------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path                 I/O Paths                 Interf.  Mode    State   Q-IOs  Errors
==============================================================================
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016130202E48d58s0  SP A1  active  alive      0      0
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016930202E48d58s0  SP B1  active  alive      0      0
# fsck /dev/dsk/c2t5006016130202E48d58s0
** /dev/dsk/c2t5006016130202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups

FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n

144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0%
fragmentation)
# fsck /dev/dsk/c2t5006016930202E48d58s0
** /dev/dsk/c2t5006016930202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups

FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n

144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0%
fragmentation)

### So at this point, I can look down either path and get to my data.
Now I kill 1 of the 2 paths via SAN zoning.  cfgadm -c configure c2, and
powermt check reports that the path to SP A is now dead.  I'm still able
to fsck the dead path:
# cfgadm -c configure c2
# powermt check
Warning: CLARiiON device path c2t5006016130202E48d58s0 is currently
dead.
Do you want to remove it (y/n/a/q)? n
# powermt display dev=58
Pseudo name=emcpower58a
CLARiiON ID=APM00051704678 [uscicsap1]
Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1:
/oracle/Q02/saparch]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP B

==============================================================================
---------------- Host ----------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path                 I/O Paths                 Interf.  Mode    State   Q-IOs  Errors
==============================================================================
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016130202E48d58s0  SP A1  active  dead       0      1
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016930202E48d58s0  SP B1  active  alive      0      0
# fsck /dev/dsk/c2t5006016130202E48d58s0
** /dev/dsk/c2t5006016130202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** 

[zfs-discuss] ZFS mirroring vs. ditto blocks

2007-05-23 Thread Moore, Joe
Has anyone done a comparison of the reliability and performance of a
mirrored zpool vs. a non-redundant zpool using ditto blocks?  What about
a gut-instinct about which will give better performance?  Or do I have
to wait until my Thumper arrives to find out for myself?

Also, in selecting where a ditto block is written, (other than far
away) does the system take into account the disk's path, so for
example, would it write both copies down a single controller?

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss