Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Jim Klimov
2012-10-03 22:03, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

If you are going to be an initiator only, then it makes sense for 
svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local
If you are going to be a target only, then it makes sense for 
svc:/system/filesystem/local to be required by svc:/network/iscsi/target


Well, on my system that I complained a lot about last year,
I've had a physical pool, a zvol in it, shared and imported
over iscsi on loopback (or sometimes initiated from another
box), and another pool inside that zvol ultimately.

Since the overall construct including hardware lent itself to
many problems and panics as you may remember, I ultimately did
not import the data pool nor the pool in the zvol via common
services and /etc/zfs/zpool.cache, but made new services for
that. If you want, I'll try to find the pieces and send them
(off-list?), but the general idea was that I made two services -
one for import (without cachefile) of the physical pool and
one for virtual dcpool, *maybe* I also made instances of the
iscsi initiator and/or target services, and overall meshed
it with proper dependencies to start in order:
  OS milestone svcs - pool - target - initiator - dcpool
and stop in proper reverse order.

Ultimately, since the pool imports could occasionally crash
that box, there were files to touch or remove, in order to
delay or cancel imports of pool or dcpool easily.
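For illustration, a minimal sketch of what such a start/stop method might look like - the pool name, guard-file path and service wiring here are hypothetical stand-ins, not the exact code:

#!/sbin/sh
# Sketch of a method script for a custom pool-import SMF service.
. /lib/svc/share/smf_include.sh

POOL=dcpool                           # hypothetical pool name
GUARD=/etc/opt/site/no-import-$POOL   # touch this file to delay or cancel the import

case "$1" in
start)
        [ -f "$GUARD" ] && exit $SMF_EXIT_OK     # admin asked us to skip the risky import
        zpool import -o cachefile=none "$POOL" || exit $SMF_EXIT_ERR_FATAL
        ;;
stop)
        zpool export "$POOL" || exit $SMF_EXIT_ERR_FATAL
        ;;
*)
        echo "Usage: $0 { start | stop }"
        exit $SMF_EXIT_ERR_CONFIG
        ;;
esac
exit $SMF_EXIT_OK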

Overall, this let the system boot into interactive mode,
enable all of its standard services and mount the rpool
filesystems well before attempting the risky pool imports and iSCSI.

Of course, on a particular system you might reconfigure SMF
services for zones or VMs to depend on accessibility of their
particular storage pools to start up - and reversely for
shutdowns.


These sorts of problems seem like they should be solvable by introducing some 
service manifest dependencies...  But there's no way to make it a 
generalization for the distribution as a whole (illumos/openindiana/oracle).  
It's just something that should be solvable on a case-by-case basis.


I think I got pretty close to generalization, so after some
code cleanup (things were hard-coded for this box) and even
real-world testing on your setup, we might try to push this
into OI or whoever picks it up. Now, I'll try to find these
manifests and methods ;)

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Schweiss, Chip
Thanks for all the input.  It seems information on the performance of the
ZIL is sparse and scattered.   I've spent significant time researching this
the past day.  I'll summarize what I've found.   Please correct me if I'm
wrong.

   - The ZIL can have any number of SSDs attached, either mirrored or
   individually.   ZFS will stripe across these in a raid0 or raid10 fashion
   depending on how you configure them.
   - To determine the true maximum streaming performance of the ZIL, setting
   sync=disabled will only use the in-RAM ZIL.   This gives up power
   protection for synchronous writes.
   - Many SSDs do not help protect against power failure because they have
   their own ram cache for writes.  This effectively makes the SSD useless for
   this purpose and potentially introduces a false sense of security.  (These
   SSDs are fine for L2ARC)
   - Mirroring SSDs is only helpful if one SSD fails at the time of a power
   failure.  This leaves several unanswered questions.  How good is ZFS at
   detecting that an SSD is no longer a reliable write target?   The chance of
   silent data corruption is well documented for spinning disks.  What
   chance of data corruption does this introduce with up to 10 seconds of data
   written on the SSD?  Does ZFS read the ZIL during a scrub to determine if our
   SSD is returning what we write to it?
   - Zpool versions 19 and higher should be able to survive a ZIL failure,
   only losing the uncommitted data.   However, I haven't seen good enough
   information that I would necessarily trust this yet.
   - Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
   SSDs.   I'm not sure if that is current, but I can't find any reports of
   better performance.   I would suspect that a DDRdrive or ZeusRAM as ZIL
   would push past this.

Anyone care to post their performance numbers on current hardware with E5
processors, and ram based ZIL solutions?

Thanks to everyone who has responded and contacted me directly on this
issue.

-Chip
On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel 
andrew.gabr...@cucumber.demon.co.uk wrote:

 Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip

 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the
 ZIL to
 make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.


 Temporarily set sync=disabled
 Or, depending on your application, leave it that way permanently.  I
 know, for the work I do, most systems I support at most locations have
 sync=disabled.  It all depends on the workload.


 Noting of course that this means that in the case of an unexpected system
 outage or loss of connectivity to the disks, synchronous writes since the
 last txg commit will be lost, even though the applications will believe
 they are secured to disk. (ZFS filesystem won't be corrupted, but it will
 look like it's been wound back by up to 30 seconds when you reboot.)

 This is fine for some workloads, such as those where you would start again
 with fresh data and those which can look closely at the data to see how far
 they got before being rudely interrupted, but not for those which rely on
 the Posix semantics of synchronous writes/syncs meaning data is secured on
 non-volatile storage when the function returns.

 --
 Andrew

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Andrew Gabriel [mailto:andrew.gabr...@cucumber.demon.co.uk]
 
  Temporarily set sync=disabled
  Or, depending on your application, leave it that way permanently.  I know,
 for the work I do, most systems I support at most locations have
 sync=disabled.  It all depends on the workload.
 
 Noting of course that this means that in the case of an unexpected system
 outage or loss of connectivity to the disks, synchronous writes since the last
 txg commit will be lost, even though the applications will believe they are
 secured to disk. (ZFS filesystem won't be corrupted, but it will look like 
 it's
 been wound back by up to 30 seconds when you reboot.)
 
 This is fine for some workloads, such as those where you would start again
 with fresh data 

It's fine for any load where you don't have clients keeping track of your state.

Examples where it's not fine:  You're processing credit card transactions.  You 
just finished processing a transaction, system crashes, and you forget about 
it.  Not fine, because systems external to yourself are aware of state that is 
in the future of your state, and you aren't aware of it.

You're an NFS server.  Some clients write some files, you say they're written,
you crash, and forget about it.  Now you reboot, start serving NFS again, and 
the client still has a file handle for something it thinks exists ... but 
according to you in your new state, it doesn't exist.

You're a database server, and your clients are external to yourself.  They do 
transactions against you, you say they're complete, and you forget about it.

But it's ok:  

You're an NFS server, and you're configured to NOT restart NFS automatically 
upon reboot.  In the event of an ungraceful crash, admin intervention is 
required, and the admin is aware that he needs to reboot the NFS clients before 
starting the NFS services again.

You're a database server, and your clients are all inside yourself, either as 
VMs or services of various kinds.

You're a VM server, and all of your VMs are desktop clients, like a Windows 7 
machine for example.  None of your guests are servers in and of themselves 
maintaining state with external entities (such as processing credit card 
transactions, serving a database, or serving files).  The mere fact that you 
crash ungracefully implies that your guests also crash ungracefully.  You 
all reboot, rewind a few seconds, no problem.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 . The ZIL can have any number of SSDs attached either mirror or
 individually.   ZFS will stripe across these in a raid0 or raid10 fashion
 depending on how you configure.

I'm regurgitating something somebody else said - but I don't know where.  I 
believe multiple ZIL devices don't get striped.  They get round-robin'd.  This 
means your ZIL can absolutely become a bottleneck, if you're doing sustained 
high throughput (not high IOPS) sync writes.  But the way to prevent that 
bottleneck is by tuning the ... I don't know the names of the parameters.  Some 
parameters that indicate a sync write larger than X should skip the ZIL and go 
directly to pool.


 . To determine the true maximum streaming performance of the ZIL setting
 sync=disabled will only use the in RAM ZIL.   This gives up power protection 
 to
 synchronous writes.

There is no RAM ZIL.  The basic idea behind ZIL is like this:  Some 
applications simply tell the system to write and the system will buffer these 
writes in memory, and the application will continue processing.  But some 
applications do not want the OS to buffer writes, so they issue writes in 
sync mode.  These applications will issue the write command, and they will 
block there, until the OS says it's written to nonvolatile storage.  In ZFS, 
this means the transaction gets written to the ZIL, and then it gets put into 
the memory buffer just like any other write.  Upon reboot, when the filesystem 
is mounting, ZFS will always look in the ZIL to see if there are any 
transactions that have not yet been played to disk.
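As an aside, an easy way to watch this happening - and to judge whether the log device is the choke point - is to look at the per-vdev numbers while the sync workload runs; the pool name below is a placeholder:

# zpool iostat -v tank 5     # the "logs" section shows the write ops/bandwidth hitting the slog
# iostat -xn 5               # if the slog SSD sits near 100 in %b while the pool disks idle,
                             # the ZIL device is the bottleneck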

So, when you set sync=disabled, you're just bypassing that step.  You're lying 
to the applications: if they say "I want to know when this is written to disk," 
you just immediately say "Yup, it's done," unconditionally.  This is the 
highest-performance thing you could possibly do - but depending on your system 
workload, it could put you at risk for data loss.
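The switch itself is a per-dataset property and can be flipped on the fly (dataset name is a placeholder):

# zfs get sync tank/vmstore
# zfs set sync=disabled tank/vmstore     # acknowledge sync writes immediately, accepting the risk above
# zfs set sync=standard tank/vmstore     # back to honoring synchronous semantics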


 . Mirroring SSDs is only helpful if one SSD fails at the time of a power
 failure.  This leave several unanswered questions.  How good is ZFS at
 detecting that an SSD is no longer a reliable write target?   The chance of
 silent data corruption is well documented about spinning disks.  What chance
 of data corruption does this introduce with up to 10 seconds of data written
 on SSD.  Does ZFS read the ZIL during a scrub to determine if our SSD is
 returning what we write to it?

Not just power loss -- any ungraceful crash.  

ZFS doesn't have any way to scrub ZIL devices, so it's not very good at 
detecting failed ZIL devices.  There is definitely the possibility for an SSD 
to enter a failure mode where you write to it, it doesn't complain, but you 
wouldn't be able to read it back if you tried.  Also, upon ungraceful crash, 
even if you try to read that data, and fail to get it back, there's no way to 
know that you should have expected something.  So you still don't detect the 
failure.

If you want to maintain your SSD periodically, you should do something like:  
Remove it as a ZIL device, create a new pool with just this disk in it, write a 
bunch of random data to the new junk pool, scrub the pool, then destroy the 
junk pool and return it as a ZIL device to the main pool.  This does not 
guarantee anything - but then - nothing anywhere guarantees anything.  This is 
a good practice, and it definitely puts you into a territory of reliability 
better than the competing alternatives.
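In command form, that maintenance cycle looks roughly like the sketch below - device and pool names are hypothetical, and you would wait for the scrub to finish before destroying the junk pool:

# zpool remove tank c4t1d0          # detach the SSD from its slog role
# zpool create junk c4t1d0
# dd if=/dev/urandom of=/junk/fill bs=1M count=4096
# zpool scrub junk
# zpool status -v junk              # confirm the scrub completed with no checksum errors
# zpool destroy junk
# zpool add tank log c4t1d0         # return it to duty as the slog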


 . Zpool versions 19 and higher should be able to survive a ZIL failure only
 loosing the uncommitted data.   However, I haven't seen good enough
 information that I would necessarily trust this yet.

That was a very long time ago.  (What, 2-3 years?)  It's very solid now.


 . Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
 SSDs.   I'm not sure if that is current, but I can't find any reports of 
 better
 performance.   I would suspect that DDR drive or Zeus RAM as ZIL would push
 past this.

Whenever I measure the sustainable throughput of an SSD, HDD, DDRdrive, or 
anything else ... very few devices can actually sustain faster than 1Gb/s, for 
use as a ZIL or anything else.  Published specs are often higher, but not 
realistic.

If you are ZIL bandwidth limited, you should consider tuning the size of stuff 
that goes to ZIL.
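The knob being reached for here is most likely the per-dataset logbias property (and, at the kernel level, the zfs_immediate_write_sz threshold); treat the lines below as a hedged sketch with a made-up dataset name, and verify the behavior against your release:

# zfs set logbias=throughput tank/bigstream   # large sync writes go straight to the main pool and the
                                              # log records only a pointer, easing bandwidth on the slog
# zfs get logbias tank/bigstream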

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Jim Klimov [mailto:jimkli...@cos.ru]
 
 Well, on my system that I complained a lot about last year,
 I've had a physical pool, a zvol in it, shared and imported
 over iscsi on loopback (or sometimes initiated from another
 box), and another pool inside that zvol ultimately.

Ick.  And it worked?


  These sorts of problems seem like they should be solvable by introducing
 some service manifest dependencies...  But there's no way to make it a
 generalization for the distribution as a whole (illumos/openindiana/oracle).
 It's just something that should be solvable on a case-by-case basis.

I started looking at that yesterday, and was surprised by how complex the 
dependency graph is.  Also, can't get graphviz to install or build, so I don't 
actually have a graph.

In any event, rather than changing the existing service dependencies, I decided 
to just make a new service, which would zpool import, and zpool export the 
pools that are on iscsi, before and after the iscsi initiator.

At present, the new service correctly mounts & dismounts the iscsi pool while 
I'm sitting there, but for some reason it fails during reboot.  I ran out of 
time digging into it ... I'll look some more tomorrow.
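A minimal sketch of the dependency wiring involved, assuming a hypothetical service name site/iscsi-pools: the new service is tied to the initiator so that SMF starts it after the initiator and, more importantly, stops it before the initiator on the way down.

# svccfg -s site/iscsi-pools addpg iscsi_initiator dependency
# svccfg -s site/iscsi-pools setprop iscsi_initiator/grouping = astring: require_all
# svccfg -s site/iscsi-pools setprop iscsi_initiator/restart_on = astring: none
# svccfg -s site/iscsi-pools setprop iscsi_initiator/type = astring: service
# svccfg -s site/iscsi-pools setprop iscsi_initiator/entities = fmri: svc:/network/iscsi/initiator:default
# svcadm refresh site/iscsi-pools:default

SMF stops dependents before the things they depend on, so with this in place the export method should run while the initiator is still up.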
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Andrew Gabriel

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Schweiss, Chip

How can I determine for sure that my ZIL is my bottleneck?  If it is the
bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.


Temporarily set sync=disabled
Or, depending on your application, leave it that way permanently.  I know, for 
the work I do, most systems I support at most locations have sync=disabled.  It 
all depends on the workload.


Noting of course that this means that in the case of an unexpected system 
outage or loss of connectivity to the disks, synchronous writes since the last 
txg commit will be lost, even though the applications will believe they are 
secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's 
been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with 
fresh data and those which can look closely at the data to see how far they got 
before being rudely interrupted, but not for those which rely on the Posix 
semantics of synchronous writes/syncs meaning data is secured on non-volatile 
storage when the function returns.

--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Jim Klimov
2012-10-04 16:06, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: Jim Klimov [mailto:jimkli...@cos.ru]

Well, on my system that I complained a lot about last year,
I've had a physical pool, a zvol in it, shared and imported
over iscsi on loopback (or sometimes initiated from another
box), and another pool inside that zvol ultimately.


Ick.  And it worked?


Um, well. Kind of yes, but it ran into many rough corners -
many of which I posted and asked about.

The fatal one was my choice of smaller blocks in the zvol:
I learned that metadata (on 4k-sectored disks) could
consume about as much space as userdata in that zvol/pool,
so I ultimately migrated the data off that pool into the
system's physical pool - not very easy given that there was
little free space left (unexpectedly for me, until I
understood the inner workings).

Still, technically, there is little problem building such
a setup - it just needs some more thorough understanding
and planning. I did learn a lot, so it wasn't in vain, too.
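For anyone repeating the experiment: volblocksize is fixed at creation time, so that is the point to get it right. A hypothetical example, with made-up names and sizes:

# zfs create -s -V 200G -o volblocksize=64K tank/dcvol   # larger blocks keep metadata overhead down on 4K-sector vdevs
# zfs get volblocksize,refreservation tank/dcvol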


These sorts of problems seem like they should be solvable by introducing

some service manifest dependencies...  But there's no way to make it a
generalization for the distribution as a whole (illumos/openindiana/oracle).
It's just something that should be solvable on a case-by-case basis.


I started looking at that yesterday, and was surprised by how complex the 
dependency graph is.  Also, can't get graphviz to install or build, so I don't 
actually have a graph.


There are also loops ;)

# svcs -d filesystem/usr
STATE  STIMEFMRI
online Aug_27   svc:/system/scheduler:default
...

# svcs -d scheduler
STATE  STIMEFMRI
online Aug_27   svc:/system/filesystem/minimal:default
...

# svcs -d filesystem/minimal
STATE  STIMEFMRI
online Aug_27   svc:/system/filesystem/usr:default
...



In any event, rather than changing the existing service dependencies, I decided 
to just make a new service, which would zpool import, and zpool export the 
pools that are on iscsi, before and after the iscsi initiator.

At present, the new service correctly mounts & dismounts the iscsi pool while 
I'm sitting there, but for some reason, it fails during reboot.  I ran out of time 
digging into it ... I'll look some more tomorrow.


That's about what I did and described.
I too avoid hacking into the distro-provided services, so
that upgrades don't break my customizations and vice versa.

My code is not yet accessible to me, but I think my instance
of the target/initiator services did a temp-enable/disable of
stock services as its start/stop methods, and the system iscsi
services were kept disabled by default. This way I could start
them at a needed moment in time without changing their service
definitions.

Also note that if you do prefer to rely on stock services, you
can define reverse dependencies in your own new services (i.e.
I declare that iscsi/target depends on me. Yours, pool-import).

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Neil Perrin

  
  
On 10/04/12 05:30, Schweiss, Chip wrote:

 Thanks for all the input.  It seems information on the performance of the
 ZIL is sparse and scattered.  I've spent significant time researching this
 the past day.  I'll summarize what I've found.  Please correct me if I'm wrong.

 - The ZIL can have any number of SSDs attached, either mirrored or
   individually.  ZFS will stripe across these in a raid0 or raid10 fashion
   depending on how you configure them.

The ZIL code chains blocks together and these are allocated round robin among
slogs or, if they don't exist, then the main pool devices.

 - To determine the true maximum streaming performance of the ZIL, setting
   sync=disabled will only use the in-RAM ZIL.  This gives up power
   protection for synchronous writes.

There is no RAM ZIL. If sync=disabled then all writes are asynchronous and are
written as part of the periodic ZFS transaction group (txg) commit that occurs
every 5 seconds.

 - Many SSDs do not help protect against power failure because they have
   their own RAM cache for writes.  This effectively makes the SSD useless
   for this purpose and potentially introduces a false sense of security.
   (These SSDs are fine for L2ARC.)

The ZIL code issues a write cache flush to all devices it has written before
returning from the system call. I've heard that not all devices obey the
flush, but we consider them broken hardware. I don't have a list to avoid.

 - Mirroring SSDs is only helpful if one SSD fails at the time of a power
   failure.  This leaves several unanswered questions.  How good is ZFS at
   detecting that an SSD is no longer a reliable write target?  The chance
   of silent data corruption is well documented for spinning disks.  What
   chance of data corruption does this introduce with up to 10 seconds of
   data written on the SSD?  Does ZFS read the ZIL during a scrub to
   determine if our SSD is returning what we write to it?

If the ZIL code gets a block write failure it will force the txg to commit
before returning. It will depend on the drivers and IO subsystem as to how
hard it tries to write the block.

 - Zpool versions 19 and higher should be able to survive a ZIL failure,
   only losing the uncommitted data.  However, I haven't seen good enough
   information that I would necessarily trust this yet.

This has been available for quite a while and I haven't heard of any bugs in
this area.

 - Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
   SSDs.  I'm not sure if that is current, but I can't find any reports of
   better performance.  I would suspect that a DDRdrive or ZeusRAM as ZIL
   would push past this.

1GB/s seems very high, but I don't have any numbers to share.

 Anyone care to post their performance numbers on current hardware with E5
 processors, and RAM-based ZIL solutions?

 Thanks to everyone who has responded and contacted me directly on this issue.

 -Chip

 On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel
 andrew.gabr...@cucumber.demon.co.uk wrote:

  Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

   From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
   boun...@opensolaris.org] On Behalf Of Schweiss, Chip

   How can I determine for sure that my ZIL is my bottleneck?  If it is the
   bottleneck, is it possible to keep adding mirrored pairs of SSDs to the
   ZIL to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

  Temporarily set sync=disabled
  Or, depending on your application, leave it that way permanently.  I know,
  for the work I do, most systems I support at most locations have
  sync=disabled.  It all depends on the workload.

 Noting of course that this means that in the case of an unexpected system
 outage or loss of connectivity to the disks, synchronous writes since the
 last txg commit will be lost, even though the applications will believe
 they are secured to disk.

Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Dan Swartzendruber


This whole thread has been fascinating.  I really wish we (OI) had the 
two following things that freebsd supports:


1. HAST - provides a block-level driver that mirrors a local disk to a 
network disk presenting the result as a block device using the GEOM API.


2. CARP.

I have a prototype with two freebsd VMs where I can fail over back and 
forth and it works beautifully.  Block-level replication using all open 
source software.  There were some glitches involving boot and shutdown 
(dependencies that are not set up properly), but I think if there was 
enough love in the freebsd community, that could be fixed.  I could be 
wrong, but it doesn't *seem* as if either HAST (or an equivalent) or 
CARP exists in the OI (or other *solaris derivatives) space.  Shame if so...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Dan Swartzendruber


Forgot to mention: my interest in doing this was so I could have my ESXi 
host point at a CARP-backed IP address for the datastore, and I would 
have no single point of failure at the storage level.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] removing upgrade notice from 'zpool status -x'

2012-10-04 Thread Jan Owoc
Hi,

I have a machine whose zpools are at version 28, and I would like to
keep them at that version for portability between OSes. I understand
that 'zpool status' asks me to upgrade, but so does 'zpool status -x'
(the man page says it should only report errors or unavailability).
This is a problem because I have a script that assumes 'zpool status -x'
only returns errors requiring user intervention.

Is there a way to either:
A) suppress the upgrade notice from 'zpool status -x' ?
B) use a different command to get information about actual errors
w/out encountering the upgrade notice ?

I'm using OpenIndiana 151a6 on x86.

Jan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Dan Swartzendruber

On 10/4/2012 11:48 AM, Richard Elling wrote:
On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber dswa...@druber.com wrote:




This whole thread has been fascinating.  I really wish we (OI) had 
the two following things that freebsd supports:


1. HAST - provides a block-level driver that mirrors a local disk to 
a network disk presenting the result as a block device using the 
GEOM API.


This is called AVS in the Solaris world.

In general, these systems suffer from a fatal design flaw: the 
authoritative view of the
data is not also responsible for the replication. In other words, you 
can provide coherency
but not consistency. Both are required to provide a single view of the 
data.


Can you expand on this?


2. CARP.


This exists as part of the OHAC project.
 -- richard



These are both freely available?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] removing upgrade notice from 'zpool status -x'

2012-10-04 Thread Richard Elling
On Oct 4, 2012, at 8:58 AM, Jan Owoc jso...@gmail.com wrote:

 Hi,
 
 I have a machine whose zpools are at version 28, and I would like to
 keep them at that version for portability between OSes. I understand
 that 'zpool status' asks me to upgrade, but so does 'zpool status -x'
 (the man page says it should only report errors or unavailability).
 This is a problem because I have a script that assumes zpool status
 -x only returns errors requiring user intervention.

The return code for zpool is ambiguous. Do not rely upon it to determine
if the pool is healthy. You should check the health property instead.

 Is there a way to either:
 A) suppress the upgrade notice from 'zpool status -x' ?

Pedantic answer, it is open source ;-)

 B) use a different command to get information about actual errors
 w/out encountering the upgrade notice ?
 
 I'm using OpenIndiana 151a6 on x86.


 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Richard Elling
On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber dswa...@druber.com wrote:

 On 10/4/2012 11:48 AM, Richard Elling wrote:
 
 On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber dswa...@druber.com wrote:
 
 
 This whole thread has been fascinating.  I really wish we (OI) had the two 
 following things that freebsd supports:
 
 1. HAST - provides a block-level driver that mirrors a local disk to a 
 network disk presenting the result as a block device using the GEOM API.
 
 This is called AVS in the Solaris world.
 
 In general, these systems suffer from a fatal design flaw: the authoritative 
 view of the 
 data is not also responsible for the replication. In other words, you can 
 provide coherency
 but not consistency. Both are required to provide a single view of the data.
 
 Can you expand on this?

I could, but I've already written a book on clustering. For a more general 
approach
to understanding clustering, I can highly recommend Pfister's In Search of 
Clusters.
http://www.amazon.com/In-Search-Clusters-2nd-Edition/dp/0138997098

NB, clustered storage is the same problem as clustered compute wrt state.

 2. CARP.
 
 This exists as part of the OHAC project.
  -- richard
 
 
 These are both freely available?

Yes.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Jim Klimov

2012-10-04 19:48, Richard Elling wrote:

2. CARP.


This exists as part of the OHAC project.
  -- richard


Wikipedia says CARP is the open-source equivalent of VRRP.
And we have that in OI, don't we? Would it suffice?

# pkg info -r vrrp
  Name: system/network/routing/vrrp
   Summary: Solaris VRRP protocol
   Description: Solaris VRRP protocol service
  Category: System/Administration and Configuration
 State: Not installed
 Publisher: openindiana.org
   Version: 0.5.11
 Build Release: 5.11
Branch: 0.151.1.5
Packaging Date: Sat Jun 30 20:01:06 2012
  Size: 275.57 kB
  FMRI: 
pkg://openindiana.org/system/network/routing/vrrp@0.5.11,5.11-0.151.1.5:20120630T200106Z


HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] removing upgrade notice from 'zpool status -x'

2012-10-04 Thread Freddie Cash
On Thu, Oct 4, 2012 at 9:14 AM, Richard Elling richard.ell...@gmail.com wrote:
 On Oct 4, 2012, at 8:58 AM, Jan Owoc jso...@gmail.com wrote:
 The return code for zpool is ambiguous. Do not rely upon it to determine
 if the pool is healthy. You should check the health property instead.

Huh.  Learn something new every day.  You just simplified my pool
health check script immensely.  Thank you!

pstatus=$( zpool get health storage | grep health | awk '{ print $3 }' )
if [ "${pstatus}" != "ONLINE" ]; then

Much simpler than the nested ifs and grep pipelines I was using before.

Not sure why I didn't see 'health' in the list of pool properties all
the times I've read the zpool man page.

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] removing upgrade notice from 'zpool status -x'

2012-10-04 Thread Jim Klimov

2012-10-04 20:36, Freddie Cash wrote:

On Thu, Oct 4, 2012 at 9:14 AM, Richard Elling richard.ell...@gmail.com wrote:

On Oct 4, 2012, at 8:58 AM, Jan Owoc jso...@gmail.com wrote:
The return code for zpool is ambiguous. Do not rely upon it to determine
if the pool is healthy. You should check the health property instead.


Huh.  Learn something new everyday.  You just simplified my pool
health check script immensely.  Thank you!

pstatus=$( zpool get health storage | grep health | awk '{ print $3 }' )
if [ ${pstatus} != ONLINE ]; then


Simplify that too with zpool list:

# zpool list -H -o health rpool
ONLINE

;)
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] removing upgrade notice from 'zpool status -x'

2012-10-04 Thread Freddie Cash
On Thu, Oct 4, 2012 at 9:45 AM, Jim Klimov jimkli...@cos.ru wrote:
 2012-10-04 20:36, Freddie Cash wrote:

 On Thu, Oct 4, 2012 at 9:14 AM, Richard Elling richard.ell...@gmail.com
 wrote:

 On Oct 4, 2012, at 8:58 AM, Jan Owoc jso...@gmail.com wrote:
 The return code for zpool is ambiguous. Do not rely upon it to determine
 if the pool is healthy. You should check the health property instead.


 Huh.  Learn something new everyday.  You just simplified my pool
 health check script immensely.  Thank you!

 pstatus=$( zpool get health storage | grep health | awk '{ print $3 }' )
 if [ ${pstatus} != ONLINE ]; then


 Simplify that too with zpool list:

 # zpool list -H -o health rpool
 ONLINE

Thanks!  Was trying to figure out how to remove the heading as I use
-H a lot with zfs commands, but it didn't work for zpool get and I
didn't bother reading to figure it out.  :)


-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Dan Swartzendruber

On 10/4/2012 12:19 PM, Richard Elling wrote:
On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber dswa...@druber.com wrote:



On 10/4/2012 11:48 AM, Richard Elling wrote:
On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber dswa...@druber.com wrote:




This whole thread has been fascinating.  I really wish we (OI) had 
the two following things that freebsd supports:


1. HAST - provides a block-level driver that mirrors a local disk 
to a network disk presenting the result as a block device using 
the GEOM API.


This is called AVS in the Solaris world.

In general, these systems suffer from a fatal design flaw: the 
authoritative view of the
data is not also responsible for the replication. In other words, 
you can provide coherency
but not consistency. Both are required to provide a single view of 
the data.




Sorry to be dense here, but I'm not getting how this is a cluster setup, 
or what your point wrt authoritative vs replication meant.  In the 
scenario I was looking at, one host is providing access to clients - on 
the backup host, no services are provided at all.  The master node does 
mirrored writes to the local disk and the network disk.  The mirrored 
write does not return until the backup host confirms the data is safely 
written to disk.  If a failover event occurs, there should not be any 
writes the client has been told completed that were not completed on both 
sides.  The master node stops responding to the virtual IP, and the 
backup starts responding to it.  Any pending NFS writes will presumably 
be retried by the client, and the new master node has completely up to 
date data on disk to respond with.  Maybe I am focusing too narrowly 
here, but in the case I am looking at, there is only a single node which 
is active at any time, and it is responsible for replication and access 
by clients, so I don't see the failure modes you allude to.  Maybe I 
need to shell out for that book :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sudden and Dramatic Performance Drop-off

2012-10-04 Thread Knipe, Charles
Hey guys,

I've run into another ZFS performance disaster that I was hoping someone might 
be able to give me some pointers on resolving.  Without any significant change 
in workload write performance has dropped off dramatically.  Based on previous 
experience we tried deleting some files to free space, even though we're not 
near 60% full yet.  Deleting files seemed to help for a little while, but now 
we're back in the weeds.

We already have our metaslab_min_alloc_size set to 0x500, so I'm reluctant to 
go lower than that.  One thing we noticed, which is new to us, is that 
zio_state shows a large number of threads in CHECKSUM_VERIFY.  I'm wondering if 
that's generally indicative of anything in particular.  I've got no errors on 
any disks, either in zpool status or iostat -e.  Any ideas as to where else I 
might want to dig in to figure out where my performance has gone?

Thanks

-Charles

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Jim Klimov

2012-10-04 21:19, Dan Swartzendruber writes:

Sorry to be dense here, but I'm not getting how this is a cluster setup,
or what your point wrt authoritative vs replication meant.  In the
scenario I was looking at, one host is providing access to clients - on
the backup host, no services are provided at all.  The master node does
mirrored writes to the local disk and the network disk.  The mirrored
write does not return until the backup host confirms the data is safely
written to disk.  If a failover event occurs, there should not be any
writes the client has been told completed that was not completed to both
sides.  The master node stops responding to the virtual IP, and the
backup starts responding to it.  Any pending NFS writes will presumably
be retried by the client, and the new master node has completely up to
date data on disk to respond with.  Maybe I am focusing too narrowly
here, but in the case I am looking at, there is only a single node which
is active at any time, and it is responsible for replication and access
by clients, so I don't see the failure modes you allude to.  Maybe I
need to shell out for that book :)


What if the backup host is down (i.e. the ex-master after the failover)?
Will your failed-over pool accept no writes until both storage machines
are working?

What if internetworking between these two heads has a glitch, and as
a result both of them become masters of their private copies (mirror
halves), and perhaps both even manage to accept writes from clients?

This is the clustering part, which involves fencing around the node
which is considered dead, perhaps including a hardware reset request
just to make sure it's dead, before taking over resources it used to
master (STONITH - Shoot The Other Node In The Head). In particular,
clusters suggest that for heartbeats, so as to make sure both machines
are indeed working, you use at least two separate wires (i.e. serial and LAN)
without active hardware (switches) in-between, separate from data
networking.


HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Richard Elling
Thanks Neil, we always appreciate your comments on ZIL implementation.
One additional comment below...

On Oct 4, 2012, at 8:31 AM, Neil Perrin neil.per...@oracle.com wrote:

 On 10/04/12 05:30, Schweiss, Chip wrote:
 
 Thanks for all the input.  It seems information on the performance of the 
 ZIL is sparse and scattered.   I've spent significant time researching this 
 the past day.  I'll summarize what I've found.   Please correct me if I'm 
 wrong.
 The ZIL can have any number of SSDs attached either mirror or individually.  
  ZFS will stripe across these in a raid0 or raid10 fashion depending on how 
 you configure.
 
 The ZIL code chains blocks together and these are allocated round robin among 
 slogs or
 if they don't exist then the main pool devices.
 
 To determine the true maximum streaming performance of the ZIL setting 
 sync=disabled will only use the in RAM ZIL.   This gives up power protection 
 to synchronous writes.
 
 There is no RAM ZIL. If sync=disabled then all writes are asynchronous and 
 are written
 as part of the periodic ZFS transaction group (txg) commit that occurs every 
 5 seconds.
 
 Many SSDs do not help protect against power failure because they have their 
 own ram cache for writes.  This effectively makes the SSD useless for this 
 purpose and potentially introduces a false sense of security.  (These SSDs 
 are fine for L2ARC)
 
 The ZIL code issues a write cache flush to all devices it has written before 
 returning
 from the system call. I've heard, that not all devices obey the flush but we 
 consider them
 as broken hardware. I don't have a list to avoid.
 
 
 Mirroring SSDs is only helpful if one SSD fails at the time of a power 
 failure.  This leave several unanswered questions.  How good is ZFS at 
 detecting that an SSD is no longer a reliable write target?   The chance of 
 silent data corruption is well documented about spinning disks.  What chance 
 of data corruption does this introduce with up to 10 seconds of data written 
 on SSD.  Does ZFS read the ZIL during a scrub to determine if our SSD is 
 returning what we write to it?
 
 If the ZIL code gets a block write failure it will force the txg to commit 
 before returning.
 It will depend on the drivers and IO subsystem as to how hard it tries to 
 write the block.
 
 
 Zpool versions 19 and higher should be able to survive a ZIL failure only 
 loosing the uncommitted data.   However, I haven't seen good enough 
 information that I would necessarily trust this yet. 
 
 This has been available for quite a while and I haven't heard of any bugs in 
 this area.
 
 Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs.   
 I'm not sure if that is current, but I can't find any reports of better 
 performance.   I would suspect that DDR drive or Zeus RAM as ZIL would push 
 past this.
 
 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device.
For example, if you have a device that can achieve 700 MB/sec, but a workload
generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it
should be immediately obvious that the slog needs to be striped. Empirically,
this is also easy to measure.
 -- richard

 
   
 Anyone care to post their performance numbers on current hardware with E5 
 processors, and ram based ZIL solutions?  
 
 Thanks to everyone who has responded and contacted me directly on this issue.
 
 -Chip
 On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel 
 andrew.gabr...@cucumber.demon.co.uk wrote:
 Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL 
 to
 make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.
 
 Temporarily set sync=disabled
 Or, depending on your application, leave it that way permanently.  I know, 
 for the work I do, most systems I support at most locations have 
 sync=disabled.  It all depends on the workload.
 
 Noting of course that this means that in the case of an unexpected system 
 outage or loss of connectivity to the disks, synchronous writes since the 
 last txg commit will be lost, even though the applications will believe they 
 are secured to disk. (ZFS filesystem won't be corrupted, but it will look 
 like it's been wound back by up to 30 seconds when you reboot.)
 
 This is fine for some workloads, such as those where you would start again 
 with fresh data and those which can look closely at the data to see how far 
 they got before being rudely interrupted, but not for those which rely on 
 the Posix semantics of synchronous writes/syncs meaning data is secured on 
 non-volatile storage when the function returns.
 
 -- 
 Andrew
 
 
 
 

Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Schweiss, Chip
Again thanks for the input and clarifications.

I would like to clarify the numbers I was talking about with ZiL
performance specs I was seeing talked about on other forums.   Right now
I'm getting streaming performance of sync writes at about 1 Gbit/S.   My
target is closer to 10Gbit/S.   If I get to build it this system, it will
house a decent size VMware NFS storage w/ 200+ VMs, which will be dual
connected via 10Gbe.   This is all medical imaging research.  We move data
around by the TB and fast streaming is imperative.

On the system I've been testing with is 10Gbe connected and I have about 50
VMs running very happily, and haven't yet found my random I/O limit.
However, every time I storage vMotion a handful of additional VMs, the ZIL
seems to max out its writing speed to the SSDs and random I/O also
suffers.   Without the SSD ZIL, random I/O is very poor.   I will be doing
some testing with sync=disabled tomorrow and see how things perform.

If anyone can testify to a ZIL device (or devices) that can keep up with 10GbE
or more of streaming synchronous writes, please let me know.

-Chip



On Thu, Oct 4, 2012 at 1:33 PM, Richard Elling richard.ell...@gmail.comwrote:


 This has been available for quite a while and I haven't heard of any bugs
 in this area.


- Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
SSDs.   I'm not sure if that is current, but I can't find any reports of
better performance.   I would suspect that DDR drive or Zeus RAM as ZIL
would push past this.


 1GB/s seems very high, but I don't have any numbers to share.


 It is not unusual for workloads to exceed the performance of a single
 device.
 For example, if you have a device that can achieve 700 MB/sec, but a
 workload
 generated by lots of clients accessing the server via 10GbE (1 GB/sec),
 then it
 should be immediately obvious that the slog needs to be striped.
 Empirically,
 this is also easy to measure.
  -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sudden and Dramatic Performance Drop-off

2012-10-04 Thread Cindy Swearingen

Hi Charles,

Yes, a faulty or failing disk can kill performance.

I would see if FMA has generated any faults:

# fmadm faulty

Or, if any of the devices are collecting errors:

# fmdump -eV | more

Thanks,

Cindy

On 10/04/12 11:22, Knipe, Charles wrote:

Hey guys,

I’ve run into another ZFS performance disaster that I was hoping someone
might be able to give me some pointers on resolving. Without any
significant change in workload write performance has dropped off
dramatically. Based on previous experience we tried deleting some files
to free space, even though we’re not near 60% full yet. Deleting files
seemed to help for a little while, but now we’re back in the weeds.

We already have our metaslab_min_alloc_size set to 0x500, so I’m
reluctant to go lower than that. One thing we noticed, which is new to
us, is that zio_state shows a large number of threads in
CHECKSUM_VERIFY. I’m wondering if that’s generally indicative of
anything in particular. I’ve got no errors on any disks, either in zpool
status or iostat –e. Any ideas as to where else I might want to dig in
to figure out where my performance has gone?

Thanks

-Charles



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sudden and Dramatic Performance Drop-off

2012-10-04 Thread Schweiss, Chip
Sounds similar to the problem discussed here:

http://blogs.everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/

Check 'iostat -xn' and see if one or more disks is stuck at 100%.
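For example, using the interval form so it repeats every 5 seconds:

# iostat -xn 5     # look for a device pinned near 100 in the %b column,
                   # with asvc_t far above its peers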

-Chip

On Thu, Oct 4, 2012 at 3:42 PM, Cindy Swearingen 
cindy.swearin...@oracle.com wrote:

 Hi Charles,

 Yes, a faulty or failing disk can kill performance.

 I would see if FMA has generated any faults:

 # fmadm faulty

 Or, if any of the devices are collecting errors:

 # fmdump -eV | more

 Thanks,

 Cindy


 On 10/04/12 11:22, Knipe, Charles wrote:

 Hey guys,

 I’ve run into another ZFS performance disaster that I was hoping someone
 might be able to give me some pointers on resolving. Without any
 significant change in workload write performance has dropped off
 dramatically. Based on previous experience we tried deleting some files
 to free space, even though we’re not near 60% full yet. Deleting files
 seemed to help for a little while, but now we’re back in the weeds.

 We already have our metaslab_min_alloc_size set to 0x500, so I’m
 reluctant to go lower than that. One thing we noticed, which is new to
 us, is that zio_state shows a large number of threads in
 CHECKSUM_VERIFY. I’m wondering if that’s generally indicative of
 anything in particular. I’ve got no errors on any disks, either in zpool
 status or iostat –e. Any ideas as to where else I might want to dig in
 to figure out where my performance has gone?

 Thanks

 -Charles



 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 There are also loops ;)
 
 # svcs -d filesystem/usr
 STATE  STIMEFMRI
 online Aug_27   svc:/system/scheduler:default
 ...
 
 # svcs -d scheduler
 STATE  STIMEFMRI
 online Aug_27   svc:/system/filesystem/minimal:default
 ...
 
 # svcs -d filesystem/minimal
 STATE  STIMEFMRI
 online Aug_27   svc:/system/filesystem/usr:default
 ...

How is that possible?  Why would the system be willing to start up in a 
situation like that?  It *must* be launching one of those, even without its 
dependencies met ... 

The answer to this question will, in all likelihood, shed some light on my 
situation: trying to understand why my iscsi-mounted zpool import/export 
service is failing to go down or come up in the order I expected, when it's 
dependent on the iscsi initiator.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 If I get to build it this system, it will house a decent size VMware
 NFS storage w/ 200+ VMs, which will be dual connected via 10Gbe.   This is all
 medical imaging research.  We move data around by the TB and fast
 streaming is imperative.

This might not carry over to vmware, iscsi vs nfs.  But with virtualbox, using 
a local file versus using a local zvol, I have found the zvol is much faster 
for the guest OS.  Also, by default the zvol will have smarter reservations 
(refreservation) which seems to be a good thing.
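A hedged example of the zvol route (names and size are made up); the volume shows up as a block device the hypervisor can point at directly:

# zfs create -V 40G tank/vm/win7-disk0
# zfs get volsize,refreservation tank/vm/win7-disk0
# ls -l /dev/zvol/rdsk/tank/vm/win7-disk0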

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Neil Perrin
 
 The ZIL code chains blocks together and these are allocated round robin
 among slogs or
 if they don't exist then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more 
bandwidth by adding multiple slog devices?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Jim Klimov
2012-10-05 1:44, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

There are also loops ;)

# svcs -d filesystem/usr
STATE  STIMEFMRI
online Aug_27   svc:/system/scheduler:default
...

# svcs -d scheduler
STATE  STIMEFMRI
online Aug_27   svc:/system/filesystem/minimal:default
...

# svcs -d filesystem/minimal
STATE  STIMEFMRI
online Aug_27   svc:/system/filesystem/usr:default
...


How is that possible?  Why would the system be willing to startup in a 
situation like that?  It *must* be launching one of those, even without its 
dependencies met ...


Well, it seems just like a peculiar effect of required vs. optional
dependencies. The loop is in the default installation. Details:

# svcprop filesystem/usr | grep scheduler
svc:/system/filesystem/usr:default/:properties/scheduler_usr/entities 
fmri svc:/system/scheduler
svc:/system/filesystem/usr:default/:properties/scheduler_usr/external 
boolean true
svc:/system/filesystem/usr:default/:properties/scheduler_usr/grouping 
astring optional_all
svc:/system/filesystem/usr:default/:properties/scheduler_usr/restart_on 
astring none
svc:/system/filesystem/usr:default/:properties/scheduler_usr/type 
astring service


# svcprop scheduler | grep minimal
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/entities 
fmri svc:/system/filesystem/minimal
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/grouping 
astring require_all
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/restart_on 
astring none
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/type 
astring service


# svcprop filesystem/minimal | grep usr
usr/entities fmri svc:/system/filesystem/usr
usr/grouping astring require_all
usr/restart_on astring none
usr/type astring service


The answer to this question, will in all likelihood, shed some light on my 
situation, trying to understand why my iscsi mounted zpool import/export 
service is failing to go down or come up in the order I expected, when it's 
dependent on the iscsi initiator.


Likewise - see what dependency type you introduced, and verify
that you've svcadm refreshed the service after config changes.
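The quick checks, assuming a hypothetical service name site/iscsi-pools and the dependency property group set up as sketched earlier in the thread:

# svcs -d site/iscsi-pools:default                       # what the new service depends on
# svcs -D network/iscsi/initiator                        # what depends on the initiator
# svcprop -p iscsi_initiator site/iscsi-pools:default    # grouping / restart_on / entities as configured
# svcadm refresh site/iscsi-pools:default                # needed after any svccfg change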

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Richard Elling

On Oct 4, 2012, at 1:33 PM, Schweiss, Chip c...@innovates.com wrote:

 Again thanks for the input and clarifications.
 
 I would like to clarify the numbers I was talking about with ZiL performance 
 specs I was seeing talked about on other forums.   Right now I'm getting 
 streaming performance of sync writes at about 1 Gbit/S.   My target is closer 
 to 10Gbit/S.   If I get to build it this system, it will house a decent size 
 VMware NFS storage w/ 200+ VMs, which will be dual connected via 10Gbe.   
 This is all medical imaging research.  We move data around by the TB and fast 
 streaming is imperative.  
 
 On the system I've been testing with is 10Gbe connected and I have about 50 
 VMs running very happily, and haven't yet found my random I/O limit. However 
 every time, I storage vMotion a handful of additional VMs, the ZIL seems to 
 max out it's writing speed to the SSDs and random I/O also suffers.   With 
 out the SSD ZIL, random I/O is very poor.   I will be doing some testing with 
 sync=off, tomorrow and see how things perform.
 
 If anyone can testify to a ZIL device(s) that can keep up with 10GBe or more 
 streaming synchronous writes please let me know.  

Quick datapoint, with qty 3 ZeusRAMs as striped slog, we could push 1.3 
GBytes/sec of 
storage vmotion on a relatively modest system. To sustain that sort of thing 
often requires
full system-level tuning and proper systems engineering design. Fortunately, 
people 
tend to not do storage vmotion on a continuous basis.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-04 Thread Jerry Kemp
It's been a while, but it seems like in the past, you would power the
system down, boot from removable media, import your pool then destroy or
archive the /etc/zfs/zpool.cache, and possibly your /etc/path_to_inst
file, power down again and re-arrange your hardware, then come up one
final time with a reconfigure boot.  Or something like that.
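Roughly, from a live-media boot, something like the following (the altroot path, dataset name and exact steps are a from-memory sketch, so verify before relying on it):

# zpool import -f -R /a rpool             # import the root pool under an altroot
# zfs mount rpool/ROOT/openindiana        # mount the boot environment (dataset name varies)
# mv /a/etc/zfs/zpool.cache /a/etc/zfs/zpool.cache.bak
# bootadm update-archive -R /a            # rebuild the boot archive without the stale cache
# touch /a/reconfigure                    # request a reconfigure boot on the next startup
# init 5                                  # power off, re-arrange the hardware, boot again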

I remember a similar video that was up on YouTube as done by some of the
Sun guys employed in Germany.  They build a big array from USB drives,
then exported the pool.  Once the system was down, they re-arranged all
the drives in random order and ZFS was able to figure out how to put the
raid all back together.  I need to go find that video.

Jerry

On 10/ 3/12 07:04 AM, Fajar A. Nugraha wrote:
 On Wed, Oct 3, 2012 at 5:43 PM, Jim Klimov jimkli...@cos.ru wrote:
 2012-10-03 14:40, Ray Arachelian wrote:

 On 10/03/2012 05:54 AM, Jim Klimov wrote:

 Hello all,

It was often asked and discussed on the list about how to
 change rpool HDDs from AHCI to IDE mode and back, with the
 modern routine involving reconfiguration of the BIOS, bootup
 from separate live media, simple import and export of the
 rpool, and bootup from the rpool.
 
 IIRC when working with xen I had to boot with live cd, import the
 pool, then poweroff (without exporting the pool). Then it can boot.
 Somewhat inline with what you described.
 
 The documented way is to
 reinstall the OS upon HW changes. Both are inconvenient to
 say the least.


 Any chance to touch /reconfigure, power off, then change the BIOS
 settings and reboot, like in the old days?   Or maybe with passing -r
 and optionally -s and -v from grub like the old way we used to
 reconfigure Solaris?


 Tried that, does not help. Adding forceloads to /etc/system
 and remaking the boot archive - also no.
 
 On Ubuntu + zfsonlinux + root/boot on zfs, the boot script helper is
 smart enough to try all available device nodes, so it wouldn't
 matter if the dev path/id/name changed. But ONLY if there's no
 zpool.cache in the initramfs.
 
 Not sure how easy it would be to port that functionality to solaris.
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-04 Thread Jens Elkner
On Thu, Oct 04, 2012 at 07:57:34PM -0500, Jerry Kemp wrote:
 I remember a similar video that was up on YouTube as done by some of the
 Sun guys employed in Germany.  They build a big array from USB drives,
 then exported the pool.  Once the system was down, they re-arranged all
 the drives in random order and ZFS was able to figure out how to put the
 raid all back together.  I need to go find that video.

http://constantin.glez.de/blog/2011/01/how-save-world-zfs-and-12-usb-sticks-4th-anniversary-video-re-release-edition
?

Have fun,
jel.
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 12768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-04 Thread Jerry Kemp
thanks for the link.

This was the youtube link that I had.

http://www.youtube.com/watch?v=1zw8V8g5eT0


Jerry




On 10/ 4/12 08:07 PM, Jens Elkner wrote:
 On Thu, Oct 04, 2012 at 07:57:34PM -0500, Jerry Kemp wrote:
 I remember a similar video that was up on YouTube as done by some of the
 Sun guys employed in Germany.  They build a big array from USB drives,
 then exported the pool.  Once the system was down, they re-arranged all
 the drives in random order and ZFS was able to figure out how to put the
 raid all back together.  I need to go find that video.
 
 http://constantin.glez.de/blog/2011/01/how-save-world-zfs-and-12-usb-sticks-4th-anniversary-video-re-release-edition
 ?
 
 Have fun,
 jel.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Neil Perrin

On 10/04/12 15:59, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

The ZIL code chains blocks together and these are allocated round robin
among slogs or
if they don't exist then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more 
bandwidth by adding multiple slog devices?


In general - yes, but it really depends. Multiple synchronous writes of any size
across multiple file systems will fan out across the log devices. That is
because there is a separate independent log chain for each file system.

Also large synchronous writes (eg 1MB) within a specific file system will be 
spread out.
The ZIL code will try to allocate a block to hold all the records it needs to
commit up to the largest block size - which currently for you should be 128KB.
Anything larger will allocate a new block - on a different device if there are
multiple devices.

However, lots of small synchronous writes to the same file system might not
use more than one 128K block and benefit from multiple slog devices.

Neil.
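So for the bandwidth-limited case, adding a second (mirrored) log vdev is a one-liner; device names below are placeholders:

# zpool add tank log mirror c4t0d0 c4t1d0 mirror c4t2d0 c4t3d0
# zpool status tank        # the "logs" section now lists two top-level mirrors;
                           # log blocks are allocated round robin across them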


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss