Re: [zfs-discuss] ZFS snapshot GUI
How does the ability to set a snapshot schedule for a particular *file* or *folder* interact with the fact that ZFS snapshots are taken per filesystem? This seems a poor fit. If I choose to snapshot my "Important Documents" folder every 5 minutes, that implicitly creates snapshots of my "Giant Video Downloads" folder every 5 minutes too, if they're both in the same file system. It seems unwise not to expose this to the user. One possibility would be for the "Enable Automatic Snapshots" menu item to apply implicitly to the root of the file system containing the selected item. So in the example shown, right-clicking on "Documents" would bring up a dialog labeled something like "Automatic snapshots for /home/cb114949".

==

I don't think it's a good idea to replace "Enable Automatic Snapshots" with "Restore from Snapshot", because there's no obvious way to "Disable Automatic Snapshots" (or change their properties). It appears one could probably do that from the properties dialog, but that's certainly not obvious to a user who has turned this on from the menu and now wants to make a change -- if you can turn it on in the menu, you should be able to turn it off in the menu too.

==

If "Roll back" affects the whole file system, it definitely should NOT be an option when right-clicking on a file or folder within the file system! This is a recipe for disaster. I would not present it as an option at all -- it's already in the "Restore Files" dialog. Also, "All files will be restored" is not a good description of rollback. It really means "All changes since the selected snapshot will be lost." I can readily imagine a user thinking, "I deleted three files, so if I choose to restore all files, I'll get those three back [without losing the other work I've done]."

==

Just a few random comments.
Re: [zfs-discuss] ZFS + DB + "fragments"
> ... just rearrange your blocks sensibly -
> and to at least some degree you could do that while
> they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, but it may have contained the core of a virtually free solution: let your table become fragmented, but each time a sequential scan is performed on it, determine whether the region you're currently scanning is *sufficiently* fragmented. If it is, retain the sequential blocks you've just had to access anyway in cache until you've built up around 1 MB of them, and then (in a background thread) flush the result contiguously back to a new location in a single bulk 'update' that changes only their location rather than their contents.

1. You don't incur any extra reads, since you were reading sequentially anyway and already have the relevant blocks in cache. Yes, if you had reorganized earlier in the background, the current scan would have gone faster; but if scans occur frequently enough for their performance to be a significant issue, the *previous* scan will probably not have left things *all* that fragmented. This is why you choose a fragmentation threshold to trigger the reorg rather than doing it whenever there's any fragmentation at all: the latter would probably not be cost-effective in some circumstances. Conversely, if you only perform sequential scans once in a blue moon, every one may find its region completely fragmented, but it probably wouldn't have been worth defragmenting constantly in the background to avoid this, and the occasional reorg triggered by the rare scan won't constitute enough additional overhead to justify heroic efforts to avoid it. Such a 'threshold' is a crude but possibly adequate metric. A better but more complex one would nudge up the threshold value every time a sequential scan took place without an intervening update, so that rarely-updated but frequently-scanned files would eventually approach full contiguity; an even finer-grained metric would maintain such information about each individual *region* in a file. But absent evidence that a single, crude, unchanging threshold (probably set to defragment moderately aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 1 MB region) is inadequate, these sound a bit like overkill.

2. You don't defragment data that's never sequentially scanned, avoiding unnecessary system activity and snapshot space consumption.

3. You still incur additional snapshot overhead, for each defragmented block that hadn't already been modified since the most recent snapshot; but performing the local reorg as a batch operation means that only a single copy of all affected ancestor blocks winds up in the snapshot due to the reorg (rather than potentially multiple copies in multiple snapshots, if snapshots were frequent and movement were performed one block at a time).

- bill
Re: [zfs-discuss] raidz DEGRADED state
On Nov 20, 2007 6:34 AM, MC <[EMAIL PROTECTED]> wrote:
> > So there is no current way to specify the creation of a 3 disk raid-z
> > array with a known missing disk?
>
> Can someone answer that? Or does the zpool command NOT accommodate the
> creation of a degraded raidz array?

You can't start degraded, but you can make it so. If one can make a sparse file, then you'd be set. Just create the file, make a zpool out of the two disks and the file, and then drop the file from the pool _BEFORE_ copying over the data. I believe you can then add the third disk as a replacement. The gotcha (and why the sparse file may be needed) is that the pool will only use, per disk, the size of the smallest device.
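A rough, untested sketch of that approach (device names and sizes are placeholders; adjust to your disks):

    mkfile -n 500g /var/tmp/fakedisk          # sparse file the same size as the real disks
    zpool create -f tank raidz c0t1d0 c0t2d0 /var/tmp/fakedisk   # -f: vdev mixes a file and whole disks
    zpool offline tank /var/tmp/fakedisk      # degrade the pool before copying data in
    rm /var/tmp/fakedisk                      # the sparse file never held pool data
    # ... copy the data over ...
    zpool replace tank /var/tmp/fakedisk c0t3d0   # later, swap in the real third disk and resilver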
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
I am "rolling my own" replication using zfs send|recv through the cluster agent framework and a custom HA shared local storage set of scripts(similar to http://www.posix.brte.com.br/blog/?p=75 but without avs). I am not using zfs off of shared storage in the supported way. So this is a bit of a lonely area. =) As these are two different zfs volumes on different zpools of differing underlying vdev topology, it appears they are not sharing the same fsid and so are assumedly presenting different file handles from each other. I have the cluster parts out of the way(mostly =)), I now need to solve the nfs side of things so that at the point of failing over. I have isolated zfs out of the equation, I receive the same stale file handle errors if I try and share an arbitrary UFS folder to the client through the cluster interface. Yeah I am a hack. Asa On Nov 20, 2007, at 7:27 PM, Richard Elling wrote: > asa wrote: >> Well then this is probably the wrong list to be hounding >> >> I am looking for something like >> http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/ >> Where when fileserver A dies, fileserver B can come up, grab the >> same IP address via some mechanism(in this case I am using sun >> cluster) and keep on trucking without the lovely stale file handle >> errors I am encountering. >> > > If you are getting stale file handles, then the Solaris cluster is > misconfigured. > Please double check the NFS installation guide for Solaris Cluster and > verify that the paths are correct. > -- richard > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] which would be faster
> On the other hand, the pool of 3 disks is obviously
> going to be much slower than the pool of 5

While today that's true, "someday" IO will be balanced by the latency of vdevs rather than their number... plus two vdevs are always going to be faster than one vdev, even if one is slower than the other. So do 4+1 and 2+1 in the same pool rather than separate pools. This will let zfs balance the load (always) between the two vdevs rather than you trying to balance the load between pools.

Rob
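As an untested sketch, a single pool built from both raidz vdevs would look something like this (device names are placeholders; zpool may ask for -f because the two vdevs have different widths):

    zpool create tank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        raidz c2t0d0 c2t1d0 c2t2d0
    # zfs then stripes writes across both raidz vdevs automatically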
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
asa wrote:
> Well then this is probably the wrong list to be hounding.
>
> I am looking for something like
> http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/
> where when fileserver A dies, fileserver B can come up, grab the same
> IP address via some mechanism (in this case I am using sun cluster) and
> keep on trucking without the lovely stale file handle errors I am
> encountering.

If you are getting stale file handles, then the Solaris cluster is misconfigured. Please double check the NFS installation guide for Solaris Cluster and verify that the paths are correct.

-- richard
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
Well then this is probably the wrong list to be hounding.

I am looking for something like http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/ where when fileserver A dies, fileserver B can come up, grab the same IP address via some mechanism (in this case I am using Sun Cluster) and keep on trucking without the lovely stale file handle errors I am encountering. My clients are Linux; the servers are Sol 10u4.

It seems that it is impossible to change the fsid on Solaris. Can you point me towards the appropriate NFS client behavior option lingo if you have a minute? (Just the terminology would be great; there are a ton of confusing options in the land of NFS: client recovery, failover, replicas, etc.)

I am unable to use block-based replication (AVS) underneath the ZFS layer because I would like to run with different zpool schemes on each server (fast primary server; slower, larger failover server only to be used during downtime on the main server).

Worst case scenario here seems to be that I would have to forcibly unmount and remount all my client mounts. I'll start bugging the nfs-discuss people. Thank you.

Asa

On Nov 12, 2007, at 1:21 PM, Darren J Moffat wrote:
> asa wrote:
>> I would like for all my NFS clients to hang during the failover,
>> then pick up trucking on this new filesystem, perhaps obviously
>> failing their writes back to the apps which are doing the
>> writing. Naive?
>
> The OpenSolaris NFS client does this already - has done since IIRC
> around Solaris 2.6. The knowledge is in the NFS client code.
>
> For NFSv4 this functionality is part of the standard.
>
> --
> Darren J Moffat
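For what it's worth, the "forcibly unmount and remount" worst case on the Linux clients would be something along these lines (hypothetical mount point and server name; -l lazily detaches a mount that is hung on the dead server):

    umount -f /mnt/data 2>/dev/null || umount -l /mnt/data
    mount -t nfs fileserver:/export/data /mnt/data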
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
>>> the 3124 looks perfect. The only problem is the only thing I found on ebay
>>> was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
>>> anything for 3124 other than the data on silicon image's site. Do you know
>>> of any cards I should be looking for that uses this chip?
>>
>> http://www.cooldrives.com/sata-cards.html
>>
>> There are a couple on there for about $80. Not quite where you want to
>> get I am sure but it is an option.
>
> Yep - I see: http://www.cooldrives.com/saiiraco2esa.html for $60.

I got a Sil3114 (4 internal ports) off ebay for $AU30 including postage. Didn't look at any PCIe stuff since I'm building up from old parts.
Re: [zfs-discuss] which would be faster
On Tue, 20 Nov 2007, Tim Cook wrote:
> So I have 8 drives total.
>
> 5x500GB seagate 7200.10
> 3x300GB seagate 7200.10
>
> I'm trying to decide, would I be better off just creating two separate pools?
>
> pool1 = 5x500gb raidz
> pool2 = 3x300gb raidz
>
> or would I be better off creating one large pool, with two raid
> sets? I'm trying to figure out if it would be faster this way since
> it should be striping across the two pools (from what I understand).
> On the other hand, the pool of 3 disks is obviously going to be much
> slower than the pool of 5.
>
> In a perfect world I'd just benchmark both ways, but due to some
> constraints, that may not be possible. Any insight?

Hi Tim,

Let me give you a 3rd option for your consideration. In general, there is no "one-pool-fits-all-workloads" solution. On a 10 disk system here, we ended up with a:

 5 disk raidz1 pool
 2 disk mirror pool
 3 disk mirror pool

Each has its strengths/weaknesses. The raidz set is ideal for large-file, sequential-access workloads - but the IOPS are limited to the IOPS of a single drive. The 3-way mirror is ideal for a workload with a high read-to-write ratio - which describes many real-world workloads (e.g. software development) - since ZFS will load balance read ops among all members of the mirror set. So read IOPS is 3x the IOPS rating of a single disk.

I would suggest/recommend you configure a 5 disk raidz1 pool (with the 500GB disks) and a 2nd pool using a 3-way mirror. You can then match pool/filesystems to the best fit for your different workloads.

Remember the incredibly useful blogs at: http://blogs.sun.com/relling/ (thank you Richard) to determine the relative reliability/failure rates of different ZFS configs.

PS: If we had to do it over, I'd probably go with a 6-disk raidz2 in place of the 5-disk raidz1, due to the much higher reliability of that config.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
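That recommended layout, expressed as zpool commands (illustrative only; controller/target names are placeholders for Tim's actual devices):

    zpool create tank  raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0   # 5 x 500GB raidz1
    zpool create fast  mirror c2t0d0 c2t1d0 c2t2d0                 # 3 x 300GB, 3-way mirror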
Re: [zfs-discuss] raidz2 testing
Brian Lionberger wrote:
> Is there a preferred method to test a raidz2?
> I would like to see the disks recover on their own after simulating
> a disk failure.
> I have a 4 disk configuration.

It really depends on what failure mode you're interested in. The most common failure we see from disks in the field is an uncorrectable read. Pulling a disk will not simulate an uncorrectable read.

For such tests, there are really two different parts of the system you are exercising: the fault detection and the recovery/reconfiguration. When we do RAS benchmarking, we often find that the recovery/reconfiguration code path is the interesting part and the fault detection less so. In other words, there will be little difference in the recovery/reconfiguration between initiating a zpool replace from the command line and fault injection. Unless you are really interested in the maze of fault detection code, you might want to stick with the command line interfaces to stimulate a reconfiguration.

If you really do want to stimulate the fault detection code, then a simple online test which requires no hands-on changes is to change the partition table to zero out the size of the partition or slice. This will have the effect of causing an I/O to receive an ENXIO error, which should then kick off the recovery. prtvtoc will show you a partition map which can be sent to fmthard -s to populate the VTOC. Be careful here; this is a place where mistakes can be painful to overcome. Dtrace can be used to perform all sorts of nasty fault injection, but that may be more than you want to bite off at first.

b77 adds a zpool failmode property which will allow you to set the mode to something other than panic -- options are: wait (default), continue, and panic. See zpool(1m) for more info. You will want to know the failmode if you are experimenting with fault injection.

Finally, you will want to be aware of the FMA commands for viewing reports and diagnosis status. See fmadm(1m), fmdump(1m), and fmstat(1m). If you want to experiment with fault injection, you'll want to pay particular attention to the SERD engines and reset them between runs.

-- richard
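A minimal sketch of the command-line route described above (untested; pool and device names are placeholders):

    # exercise recovery/reconfiguration without touching the hardware
    zpool offline raidpool c1t2d0          # or: zpool replace raidpool c1t2d0 c1t5d0
    zpool status -x raidpool               # watch the resilver progress
    zpool online raidpool c1t2d0

    # failmode pool property (build 77 and later)
    zpool set failmode=continue raidpool

    # FMA views of what was detected and diagnosed
    fmdump -eV | tail                      # recent error telemetry
    fmadm faulty                           # current diagnoses
    fmstat                                 # fault manager module statistics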
[zfs-discuss] which would be faster
So I have 8 drives total.

5x500GB seagate 7200.10
3x300GB seagate 7200.10

I'm trying to decide, would I be better off just creating two separate pools?

pool1 = 5x500gb raidz
pool2 = 3x300gb raidz

or would I be better off creating one large pool, with two raid sets? I'm trying to figure out if it would be faster this way since it should be striping across the two pools (from what I understand). On the other hand, the pool of 3 disks is obviously going to be much slower than the pool of 5.

In a perfect world I'd just benchmark both ways, but due to some constraints, that may not be possible. Any insight?
Re: [zfs-discuss] zpool io to 6140 is really slow
Asif Iqbal wrote:
> On Nov 19, 2007 11:47 PM, Richard Elling <[EMAIL PROTECTED]> wrote:
>> Asif Iqbal wrote:
>>> I have the following layout
>>>
>>> A 490 with 8 1.8GHz CPUs and 16G mem. 6 6140s with 2 FC controllers using
>>> A1 and B1 controller ports at 4Gbps speed.
>>> Each controller has 2G NVRAM.
>>>
>>> On the 6140s I set up a raid0 lun per SAS disk with 16K segment size.
>>>
>>> On the 490 I created a zpool with 8 4+1 raidz1s.
>>>
>>> I am getting zpool IO of only 125MB/s with zfs:zfs_nocacheflush = 1 in
>>> /etc/system.
>>>
>>> Is there a way I can improve the performance? I'd like to get 1GB/sec IO.
>>
>> I don't believe a V490 is capable of driving 1 GByte/s of I/O.
>
> Well, I am getting ~190MB/s right now. I am surely not hitting anywhere
> close to that ceiling.
>
>> The V490 has two schizos and the schizo is not a full speed
>> bridge. For more information see Section 1.2 of:
>> http://www.sun.com/processors/manuals/External_Schizo_PRM.pdf

[err - see Section 1.3] You will notice from Table 1-1 that the read bandwidth limit for a schizo PCI leaf is 204 MBytes/s. With two schizos, you can expect to max out at 816 MBytes/s or less, depending on resource contention. It makes no difference that a 4 Gbps FC card could read 400 MBytes/s; the best you can do for the card is 204 MBytes/s. 1 GByte/s of read throughput will not be attainable with a V490.

-- richard
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
On Tue, 20 Nov 2007, Jason P. Warr wrote:
>> the 3124 looks perfect. The only problem is the only thing I found on ebay
>> was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
>> anything for 3124 other than the data on silicon image's site. Do you know
>> of any cards I should be looking for that uses this chip?
>
> http://www.cooldrives.com/sata-cards.html
>
> There are a couple on there for about $80. Not quite where you want to get I
> am sure but it is an option.

Yep - I see: http://www.cooldrives.com/saiiraco2esa.html for $60.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
> the 3124 looks perfect. The only problem is the only thing I found on ebay
> was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
> anything for 3124 other than the data on silicon image's site. Do you know
> of any cards I should be looking for that uses this chip?

http://www.cooldrives.com/sata-cards.html

There are a couple on there for about $80. Not quite where you want to get I am sure, but it is an option.
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
On Tue, Nov 20, 2007 at 02:01:34PM -0600, Al Hopper wrote:
>
> a) the SuperMicro AOC-SAT2-MV8 is an 8-port SATA card available for
> around $110 IIRC.

Yeah, I'd like to spend a lot less than that, especially as I only need 2 ports. :)

> b) There is also a PCI-X version of the older LSI 4-port (internal)
> PCI Express SAS3041E card which is still available for around $165 and
> works well with ZFS (SATA or SAS drives).

I actually just picked up a SAS3080X for my Ultra80 on ebay for $30. I guess I can always scour ebay for something similar.

> c) Any card based on the SiliconImage 3124/3132 chips will work. But,
> ensure you're running an OS with the latest version of the si3124
> drivers - or - you can swap out the older drivers using the files
> from:

The 3124 looks perfect. The only problem is the only thing I found on ebay was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding anything for 3124 other than the data on silicon image's site. Do you know of any cards I should be looking for that uses this chip?

These will be OS disks, and I'm willing to run whichever version is best for this hardware and ZFS (I'm going to try the most recent SXCE once I have all the hardware together). Any recommendations as related to this card?

-brian

--
"Perl can be fast and elegant as much as J2EE can be fast and elegant. In the hands of a skilled artisan, it can and does happen; it's just that most of the shit out there is built by people who'd be better suited to making sure that my burger is cooked thoroughly."  -- Jonathan Patschke
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On Nov 20, 2007 10:40 AM, Andrew Wilson <[EMAIL PROTECTED]> wrote:
>
> What kind of workload are you running? If you are doing these
> measurements with some sort of "write as fast as possible" microbenchmark,

Oracle database with blocksize 16K .. populating the database as fast as I can.

> once the 4 GB of nvram is full, you will be limited by backend performance
> (FC disks and their interconnect) rather than the host / controller bus.
>
> Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10
> seconds, you will fill it up, even with the backend writing out data as fast
> as it can, in about 20 seconds. Once the nvram is full, you will only see
> the backend (e.g. 2 Gbit) rate.
>
> The reason these controller buffers are useful with real applications is
> that they smooth the bursts of writes that real applications tend to
> generate, thus reducing the latency of those writes and improving
> performance. They will then "catch up" during periods when few writes are
> being issued. But a typical microbenchmark that pumps out a steady stream of
> writes won't see this benefit.
>
> Drew Wilson
>
> Asif Iqbal wrote:
>> On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote:
>>> On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote:
>>>> On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote:
>>>>> On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote:
>>>>>> (Including storage-discuss)
>>>>>> I have 6 6140s with 96 disks. Out of which 64 of them are Seagate
>>>>>> ST337FC (300GB - 10K RPM FC-AL)
>>>>>
>>>>> Those disks are 2Gb disks, so the tray will operate at 2Gb.
>>>>
>>>> That is still 256MB/s. I am getting about 194MB/s.
>>>
>>> 2Gb fibre channel is going to max out at a data transmission rate
>>> around 200MB/s rather than the 256MB/s that you'd expect. Fibre
>>> channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data
>>> in 10 bits on the wire. So while 256MB/s is being transmitted on the
>>> connection itself, only 200MB/s of that is the data that you're
>>> transmitting.
>>>
>>> Chad Mynhier
>>
>> But I am running 4Gb fiber channels with 4GB NVRAM on a 6 tray of
>> 300GB FC 10K rpm (2Gb/s) disks.
>>
>> So I should get "a lot" more than ~200MB/s. Shouldn't I?

--
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
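The rough arithmetic behind Drew's 10 and 20 second estimates (assuming the backend drains at roughly the 2Gb disk-side rate):

    4 Gbit/s FC x 8/10 (8b/10b encoding)  ~=  400 MByte/s of payload from the host
    4 GB NVRAM / 400 MB/s                 ~=  10 s to fill if nothing drained
    with ~200 MB/s draining to the disks, net fill is ~200 MB/s  ->  ~20 s to fill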
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
On Mon, 19 Nov 2007, Brian Hechinger wrote:
> On Sun, Nov 18, 2007 at 02:18:21PM +0100, Peter Schuller wrote:
>>> Right now I have noticed that LSI has recently began offering some
>>> lower-budget stuff; specifically I am looking at the MegaRAID SAS
>>> 8208ELP/XLP, which are very reasonably priced.
>
> I looked up the 8204XLP, which is really quite expensive compared to
> the Supermicro MV based card.
>
> That being said, for a small 1U box that is only going to have two SATA
> disks, the Supermicro card is way overkill/overpriced for my needs.
>
> Does anyone know if there are any PCI-X cards based on the MV88SX6041?
>
> I'm not having much luck finding any.

A few options:

a) the SuperMicro AOC-SAT2-MV8 is an 8-port SATA card available for around $110 IIRC.

b) There is also a PCI-X version of the older LSI 4-port (internal) SAS3041E PCI Express card, which is still available for around $165 and works well with ZFS (SATA or SAS drives).

c) Any card based on the SiliconImage 3124/3132 chips will work. But ensure you're running an OS with the latest version of the si3124 drivers - or - you can swap out the older drivers using the files from:
http://www.opensolaris.org/jive/servlet/JiveServlet/download/80-32437-138083-3390/si3124.tar.gz

Note: if these drives are your boot drives, you'll need to do this after booting from a CDROM/DVD disk; otherwise you can unload the driver and swap out the files.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
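A hedged sketch of the driver swap (the layout of the linked tarball and the exact copy destinations are assumptions that depend on release and architecture -- verify before touching a live system, and do it from a CD/DVD boot if the controller hosts the boot disks):

    cd /var/tmp && gzcat si3124.tar.gz | tar xf -
    modinfo | grep si3124            # note the module id, if currently loaded
    modunload -i <id>                # only works if no devices on the controller are in use
    cp si3124 /kernel/drv/si3124     # 64-bit builds also have /kernel/drv/amd64 (or sparcv9)
    update_drv si3124                # a reconfigure reboot may still be needed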
Re: [zfs-discuss] snv-76 panics on installation
Bill Moloney wrote:
> I have an Intel based server running dual P3 Xeons (Intel A46044-609,
> 1.26GHz) with a BIOS from American Megatrends Inc (AMIBIOS, SCB2
> production BIOS rev 2.0, BIOS build 0039) with 2GB of RAM
>
> when I attempt to install snv-76 the system panics during the initial
> boot from CD

Please post the panic stack (to the list, not to me alone), if possible, and as much other information as you have (i.e. at what step does the panic happen, etc.).

Where did you get the media from (is it really a CD, or a DVD)? Can you read/mount the CD when running an older build? If not, are there errors in the messages file? ...

HTH
Michael

--
Michael Schuster
Recursion, n.: see 'Recursion'
Re: [zfs-discuss] ZFS + DB + "fragments"
On Tue, 20 Nov 2007, Ross wrote:
>>> doing these writes now sounds like a
>>> lot of work. I'm guessing that needing two full-path
>>> updates to achieve this means you're talking about a
>>> much greater write penalty.
>>
>> Not all that much. Each full-path update is still
>> only a single write request to the disk, since all
>> the path blocks (again, possibly excepting the
>> superblock) are batch-written together, thus mostly
>> increasing only streaming bandwidth consumption.
>
> Ok, that took some thinking about. I'm pretty new to ZFS, so I've
> only just gotten my head around how CoW works, and I'm not used to
> thinking about files at this kind of level. I'd not considered that
> path blocks would be batch-written close together, but of course
> that makes sense.
>
> What I'd been thinking was that ordinarily files would get
> fragmented as they age, which would make these updates slower as
> blocks would be scattered over the disk, so a full-path update would
> take some time. I'd forgotten that the whole point of doing this is
> to prevent fragmentation...
>
> So a nice side effect of this approach is that if you use it, it
> makes itself more efficient :D

Here are a couple of resources that'll help you get up to speed with ZFS internals:

a) From the London OpenSolaris User Group (LOSUG) session, presented by Jarod Nash, TSC Systems Engineer, entitled "ZFS: Under The Hood":
ZFS-UTH_3_v1.1_LOSUG.pdf
zfs_data_structures_for_single_file.pdf (also referred to as "ZFS Internals Lite")

and

b) the ZFS on-disk Specification: ondiskformat0822.pdf

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
[zfs-discuss] snv-76 panics on installation
I have an Intel based server running dual P3 Xeons (Intel A46044-609, 1.26GHz) with a BIOS from American Megatrends Inc (AMIBIOS, SCB2 production BIOS rev 2.0, BIOS build 0039) with 2GB of RAM.

When I attempt to install snv-76 the system panics during the initial boot from CD.

I've been using this system for extensive testing with ZFS and have had no problems installing snv-68, 69 or 70, but I'm having this problem with snv-76.

Any information regarding this problem or a potential workaround would be appreciated.

Thx ... bill moloney
Re: [zfs-discuss] raidz2
Comment on retries below...

Paul Boven wrote:
> Hi Eric, everyone,
>
> Eric Schrock wrote:
>> There have been many improvements in proactively detecting failure,
>> culminating in build 77 of Nevada. Earlier builds:
>>
>> - Were unable to distinguish device removal from devices misbehaving,
>>   depending on the driver and hardware.
>>
>> - Did not diagnose a series of I/O failures as disk failure.
>>
>> - Allowed several (painful) SCSI retries and continued to queue up I/O,
>>   even if the disk was fatally damaged.
>>
>> Most classes of hardware would behave reasonably well on device removal,
>> but certain classes caused cascading failures in ZFS, all of which should
>> be resolved in build 77 or later.
>
> I seem to be having exactly the problems you are describing (see my
> postings with the subject 'zfs on a raid box'). So I would very much
> like to give b77 a try. I'm currently running b76, as that's the latest
> sxce that's available. Are the sources to anything beyond b76 already
> available? Would I need to build it, or bfu?
>
> I'm seeing zfs not making use of available hot-spares when I pull a
> disk, long and indeed painful SCSI retries, and very poor write
> performance on a degraded zpool - I hope to be able to test whether b77
> fares any better with this.

The SCSI retries are implemented at the driver level (usually sd), below ZFS. By default, the timeout (60s) and retry (3 or 5) counters are somewhat conservative and intended to apply to a wide variety of hardware, including slow CD-ROMs and ancient processors. Depending on your situation and business requirements, these may be tuned. There is a pretty good article on BigAdmin which describes tuning the FC side of the equation (ssd driver):
http://www.sun.com/bigadmin/features/hub_articles/tuning_sfs.jsp

Beware: making these tunables too small can lead to an unstable system. The article does a good job of explaining how interdependent the tunables are, so hopefully you can make wise choices.

-- richard
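For illustration, the sort of /etc/system entries the BigAdmin article discusses look roughly like this (tunable names and safe values vary between the sd and ssd drivers and between releases -- treat these as placeholders and confirm against the article and your driver before applying):

    * shorten the per-command timeout from the 60s default (fibre-channel ssd driver)
    set ssd:ssd_io_time = 20
    * same idea for the parallel-SCSI/SATA sd driver
    set sd:sd_io_time = 20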
Re: [zfs-discuss] raidz2
On Tue, Nov 20, 2007 at 11:02:55AM +0100, Paul Boven wrote:
>
> I seem to be having exactly the problems you are describing (see my
> postings with the subject 'zfs on a raid box'). So I would very much
> like to give b77 a try. I'm currently running b76, as that's the latest
> sxce that's available. Are the sources to anything beyond b76 already
> available? Would I need to build it, or bfu?

The sources, yes (you can pull them from the ON mercurial mirror). It looks like the latest SX:CE is still on build 76, so it doesn't seem like you can get a binary distro yet.

> I'm seeing zfs not making use of available hot-spares when I pull a
> disk, long and indeed painful SCSI retries, and very poor write
> performance on a degraded zpool - I hope to be able to test whether b77
> fares any better with this.

What hardware/driver are you using? Build 76 should have the ability to recognize removed devices via DKIOCGETSTATE and immediately transition to the REMOVED state instead of going through the SCSI retry logic (3x 60 seconds). Build 77 added a 'probe' operation on I/O failure that will try to read/write some basic data to the disk; if that fails, it will immediately mark the disk as FAULTED without having to wait for retries to fail and FMA diagnosis to offline the device.

- Eric

--
Eric Schrock, FishWorks    http://blogs.sun.com/eschrock
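Pulling those sources would be something along these lines (the repository path is from memory and may have moved -- check opensolaris.org for the current location of the ON gate mirror):

    hg clone http://hg.opensolaris.org/hg/onnv/onnv-gate onnv-gate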
[zfs-discuss] raidz2 testing
Is there a preferred method to test a raidz2? I would like to see the disks recover on their own after simulating a disk failure. I have a 4 disk configuration.

Brian.
Re: [zfs-discuss] ZFS + DB + "fragments"
On Nov 20, 2007 5:33 PM, can you guess? <[EMAIL PROTECTED]> wrote:
>> But the whole point of snapshots is that they don't
>> take up extra space on the disk. If a file (and
>> hence a block) is in every snapshot it doesn't mean
>> you've got multiple copies of it. You only have one
>> copy of that block, it's just referenced by many
>> snapshots.
>
> I used the wording "copies of a parent" loosely to mean "previous
> states of the parent that also contain pointers to the current state of
> the child about to be updated in place".

But children are never updated in place. When a new block is written to a leaf, new blocks are used for all the ancestors back to the superblock, and then the old ones are either freed or held on to by the snapshot.

> And in every earlier version of the parent that was updated for some
> *other* reason and still contains a pointer to the current child that
> someone using that snapshot must be able to follow correctly.

The snapshot doesn't get the 'current' child - it gets the one that was there when the snapshot was taken.

> No: every version of the parent that points to the current version of
> the child must be updated.

Even with clones, the 'parent' and the 'clone' are allowed to diverge - they contain different data. Perhaps I'm missing something. Excluding ditto blocks, when in ZFS would two parents point to the same child, and need to both be updated when the child is updated?

Will
Re: [zfs-discuss] Why did resilvering restart?
On Tue, Nov 20, 2007 at 11:10:20AM -0600, [EMAIL PROTECTED] wrote:
>
> [EMAIL PROTECTED] wrote on 11/20/2007 10:11:50 AM:
>
>> On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
>>> Resilver and scrub are broken and restart when a snapshot is created
>>> -- the current workaround is to disable snaps while resilvering;
>>> the ZFS team is working on the issue for a long term fix.
>>
>> But, no snapshot was taken. If so, zpool history would have shown
>> this. So, in short, _no_ ZFS operations are going on during the
>> resilvering. Yet, it is restarting.
>
> Does 2007-11-20.02:37:13 actually match the expected timestamp of
> the original zpool replace command before the first zpool status
> output listed below?

No. We ran some 'zpool status' commands after the last 'zpool replace'. The 'zpool status' output in the initial email is from this morning. The only ZFS commands we've been running are 'zfs list', 'zpool list tww', 'zpool status', or 'zpool status -v' after the last 'zpool replace'. The server is on GMT time.

> Is it possible that another zpool replace is further up on your
> pool history (i.e. it was rerun by an admin or automatically from some
> service)?

Yes, but a zpool replace for the same bad disk:

2007-11-20.00:57:40 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B800029996606584741C7C3d0
2007-11-20.02:35:22 zpool detach tww c0t600A0B800029996606584741C7C3d0
2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B8000299CCC06734741CD4Ed0

We accidentally removed c0t600A0B800029996606584741C7C3d0 from the array, hence the 'zpool detach'. The last 'zpool replace' has been running for 15h now.

> -Wade

--
albert chin ([EMAIL PROTECTED])
Re: [zfs-discuss] ZFS + DB + "fragments"
> But the whole point of snapshots is that they don't
> take up extra space on the disk. If a file (and
> hence a block) is in every snapshot it doesn't mean
> you've got multiple copies of it. You only have one
> copy of that block, it's just referenced by many
> snapshots.

I used the wording "copies of a parent" loosely to mean "previous states of the parent that also contain pointers to the current state of the child about to be updated in place".

> The thing is, the location of that block isn't saved
> separately in every snapshot either - the location is
> just stored in its parent.

And in every earlier version of the parent that was updated for some *other* reason and still contains a pointer to the current child that someone using that snapshot must be able to follow correctly.

> So moving a block is just a case of updating one parent.

No: every version of the parent that points to the current version of the child must be updated.

...

> If you think about it, that has to work for the old
> data since as I said before, ZFS already has this
> functionality. If ZFS detects a bad block, it moves
> it to a new location on disk. It can already do
> that without affecting any of the existing snapshots,
> so there's no reason to think we couldn't use the
> same code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a look-aside list of moved blocks (there shouldn't be that many of them), or by just leaving the bad block in the snapshot (if it's mirrored or parity-protected, it'll still be usable there unless a second failure occurs; if not, then it was lost anyway).

- bill
Re: [zfs-discuss] Why did resilvering restart?
[EMAIL PROTECTED] wrote on 11/20/2007 10:11:50 AM:
> On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
>> Resilver and scrub are broken and restart when a snapshot is created
>> -- the current workaround is to disable snaps while resilvering;
>> the ZFS team is working on the issue for a long term fix.
>
> But, no snapshot was taken. If so, zpool history would have shown
> this. So, in short, _no_ ZFS operations are going on during the
> resilvering. Yet, it is restarting.

Does 2007-11-20.02:37:13 actually match the expected timestamp of the original zpool replace command before the first zpool status output listed below? Is it possible that another zpool replace is further up on your pool history (i.e. it was rerun by an admin or automatically from some service)?

-Wade

> [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM:
>> # zpool history tww | tail -1
>> 2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B8000299CCC06734741CD4Ed0
Re: [zfs-discuss] ZFS + DB + "fragments"
But the whole point of snapshots is that they don't take up extra space on the disk. If a file (and hence a block) is in every snapshot it doesn't mean you've got multiple copies of it. You only have one copy of that block; it's just referenced by many snapshots.

The thing is, the location of that block isn't saved separately in every snapshot either - the location is just stored in its parent. So moving a block is just a case of updating one parent. So regardless of how many snapshots the parent is in, you only have to update one parent to point it at the new location for the *old* data. Then you save the new data to the old location and ensure the current tree points to that.

If you think about it, that has to work for the old data since, as I said before, ZFS already has this functionality. If ZFS detects a bad block, it moves it to a new location on disk. It can already do that without affecting any of the existing snapshots, so there's no reason to think we couldn't use the same code for a different purpose.

Ultimately, your old snapshots get fragmented, but the live data stays contiguous.
Re: [zfs-discuss] ZFS + DB + "fragments"
On Nov 19, 2007 10:08 PM, Richard Elling <[EMAIL PROTECTED]> wrote:
> James Cone wrote:
>> Hello All,
>>
>> Here's a possibly-silly proposal from a non-expert.
>>
>> Summarising the problem:
>> - there's a conflict between small ZFS record size, for good random
>>   update performance, and large ZFS record size for good sequential read
>>   performance
>
> Poor sequential read performance has not been quantified.

I think this is a good point. A lot of solutions are being thrown around, and the problems are only theoretical at the moment. Conventional solutions may not even be appropriate for something like ZFS.

The point that makes me skeptical is this: blocks do not need to be logically contiguous to be (nearly) physically contiguous. As long as you reallocate the blocks close to the originals, chances are that a scan of the file will end up being mostly physically contiguous reads anyway. ZFS's intelligent prefetching, along with the disk's track cache, should allow for good performance even in this case. ZFS may or may not already do this; I haven't checked.

Obviously, you won't want to keep a year's worth of snapshots, or run the pool near capacity. With a few minor tweaks, though, it should work quite well. Talking about fundamental ZFS design flaws at this point seems unnecessary, to say the least.

Chris
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Calum Benson wrote:
> You're right that they "can", and while that probably does write it off,
> I wonder how many really do. (And we could possibly do something clever
> like a semi-opaque overlay anyway, we may not have to replace the
> background entirely.)

Almost everyone I've seen using the file manager other than myself has done this :-) If you do a semi-opaque overlay, that's going to require lots of colour selection stuff - plus what if the background is a complex image (why people do this I don't know, but I've seen it done)?

>> An emblem is good for the case where you are looking from "above" a
>> dataset that is tagged for backup.
>>
>> An indicator in the status bar is good for when you are "in" a dataset
>> that is tagged for backup.
>
> Yep, all true. Also need to bear in mind that nowadays, with the
> (fairly) new nautilus treeview, you can potentially see both "in" and
> "above" at the same time, so any solution would have to work elegantly
> with that view too.

I would expect an emblem in the tree and a status bar indicator for the non-tree part.

--
Darren J Moffat
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 14:31, Christian Kelly wrote:
>
> Ah, I see. So, for phase 0, the 'Enable Automatic Snapshots' option
> would only be available for/work for existing ZFSes. Then at some later
> stage, create them on the fly.

Yes, that's the scenario for the mockups I posted, anyway... if the requirements are bogus, then of course we'll have to change them :)

My original mockup did allow you to create a pool/filesystem on the fly if required, but it felt like the wrong place to be doing that -- if you could understand the dialog to do that, you would probably know how to do it better on the command line anyway. Longer term, I guess we might want to ship some sort of ZFS management GUI that might be better suited to that sort of thing (maybe like the Nexenta app that Roman mentioned earlier, but I haven't really looked at that yet...).

Cheeri,
Calum.

--
CALUM BENSON, Usability Engineer          Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]                  GNOME Desktop Team
http://blogs.sun.com/calum                +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 15:04, Darren J Moffat wrote:
> Calum Benson wrote:
>> On 20 Nov 2007, at 13:35, Christian Kelly wrote:
>>> Take the example I gave before, where you have a pool called, say,
>>> pool1. In the pool you have two ZFSes: pool1/export and
>>> pool1/export/home. So, suppose the user chooses /export in nautilus
>>> and adds this to the backup list. Will the user be aware, from
>>> browsing through nautilus, that /export/home may or may not be
>>> backed up - depending on whether the -r (?) option is used.
>>
>> I'd consider that to be a fairly strong requirement, but it's not
>> something I particularly thought through for the mockups.
>> One solution might be to change the nautilus background for folders
>> that are being backed up, another might be an indicator in the status
>> bar, another might be emblems on the folder icons themselves.
>
> I think changing the background is a non starter since users can
> change the background already anyway.

You're right that they "can", and while that probably does write it off, I wonder how many really do. (And we could possibly do something clever like a semi-opaque overlay anyway; we may not have to replace the background entirely.) All just brainstorming at this stage though; other ideas welcome :)

> An emblem is good for the case where you are looking from "above" a
> dataset that is tagged for backup.
>
> An indicator in the status bar is good for when you are "in" a
> dataset that is tagged for backup.

Yep, all true. Also need to bear in mind that nowadays, with the (fairly) new nautilus treeview, you can potentially see both "in" and "above" at the same time, so any solution would have to work elegantly with that view too.

Cheeri,
Calum.

--
CALUM BENSON, Usability Engineer          Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]                  GNOME Desktop Team
http://blogs.sun.com/calum                +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems
Re: [zfs-discuss] Why did resilvering restart?
On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote: > Resilver and scrub are broken and restart when a snapshot is created > -- the current workaround is to disable snaps while resilvering, > the ZFS team is working on the issue for a long term fix. But, no snapshot was taken. If so, zpool history would have shown this. So, in short, _no_ ZFS operations are going on during the resilvering. Yet, it is restarting. > -Wade > > [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM: > > > On b66: > > # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \ > > c0t600A0B8000299CCC06734741CD4Ed0 > > < some hours later> > > # zpool status tww > > pool: tww > >state: DEGRADED > > status: One or more devices is currently being resilvered. The pool > will > > continue to function, possibly in a degraded state. > > action: Wait for the resilver to complete. > >scrub: resilver in progress, 62.90% done, 4h26m to go > > < some hours later> > > # zpool status tww > > pool: tww > >state: DEGRADED > > status: One or more devices is currently being resilvered. The pool > will > > continue to function, possibly in a degraded state. > > action: Wait for the resilver to complete. > >scrub: resilver in progress, 3.85% done, 18h49m to go > > > > # zpool history tww | tail -1 > > 2007-11-20.02:37:13 zpool replace tww > c0t600A0B8000299966059E4668CBD3d0 > > c0t600A0B8000299CCC06734741CD4Ed0 > > > > So, why did resilvering restart when no zfs operations occurred? I > > just ran zpool status again and now I get: > > # zpool status tww > > pool: tww > >state: DEGRADED > > status: One or more devices is currently being resilvered. The pool > will > > continue to function, possibly in a degraded state. > > action: Wait for the resilver to complete. > >scrub: resilver in progress, 0.00% done, 134h45m to go > > > > What's going on? > > > > -- > > albert chin ([EMAIL PROTECTED]) > > ___ > > zfs-discuss mailing list > > zfs-discuss@opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
And, just to add one more point: since pretty much everything the host writes to the controller eventually has to make it out to the disk drives, the long-term average write rate cannot exceed the rate at which the backend disk subsystem can absorb the writes, regardless of the workload. (An exception is if the controller can combine some overlapping writes.) Basically, just like putting water into a reservoir at twice the rate it is being withdrawn, the reservoir will eventually overflow! At least in this case the controller can limit the input from the host and avoid an actual data overflow situation. Drew Andrew Wilson wrote: What kind of workload are you running? If you are doing these measurements with some sort of "write as fast as possible" microbenchmark, once the 4 GB of nvram is full, you will be limited by backend performance (FC disks and their interconnect) rather than the host / controller bus. Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10 seconds, you will fill it up, even with the backend writing out data as fast as it can, in about 20 seconds. Once the nvram is full, you will only see the backend (e.g. 2 Gbit) rate. The reason these controller buffers are useful with real applications is that they smooth the bursts of writes that real applications tend to generate, thus reducing the latency of those writes and improving performance. They will then "catch up" during periods when few writes are being issued. But a typical microbenchmark that pumps out a steady stream of writes won't see this benefit. Drew Wilson Asif Iqbal wrote: >On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: > > >>On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: >> >> >>>On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: >>> >>> On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: >(Including storage-discuss) > >I have 6 6140s with 96 disks. Out of which 64 of them are Seagate >ST337FC (300GB - 10K RPM FC-AL) > > Those disks are 2Gb disks, so the tray will operate at 2Gb. >>>That is still 256MB/s. I am getting about 194MB/s >>> >>> >>2Gb fibre channel is going to max out at a data transmission rate >> >> > >But I am running 4GB fiber channels with 4GB NVRAM on a 6 tray of >300GB FC 10K rpm (2Gb/s) disks > >So I should get "a lot" more than ~ 200MB/s. Shouldn't I? > > > > >>around 200MB/s rather than the 256MB/s that you'd expect. Fibre >>channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data >>in 10 bits on the wire. So while 256MB/s is being transmitted on the >>connection itself, only 200MB/s of that is the data that you're >>transmitting. >> >>Chad Mynhier >> >> >> > > > > > ___ perf-discuss mailing list [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NFS performance considerations (Linux vs Solaris)
Hello all... I think we can all agree that "performance" is a big topic in NFS. So, when we talk about NFS and ZFS, we imagine a great combination/solution. But one is not dependent on the other; they are actually two quite distinct technologies. ZFS has a lot of features that we all know about, and "maybe" all of us want in an NFS share (maybe not). The point is: two technologies with different priorities. So, what I think is important is a "document" (here on the NFS/ZFS discuss lists) that lists and explains the ZFS features that have a "real" performance impact. I know that there is the solarisinternals wiki about ZFS/NFS integration, but what I think is really important is a comparison between Linux and Solaris/ZFS on the server side. That would be very useful for seeing, for example, what "consistency" I get with Linux and XFS, ext3, etc. at "that" performance, and "how" I can configure a similar NFS service on Solaris/ZFS. Here we have some information about it: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine but there is no comparison with Linux, which I think is important. What I mean is that the people who know a lot about the NFS protocol and about the filesystem features should make such a comparison (to facilitate adoption and users' comparisons). I think there are many users comparing oranges with apples. Another example (correct me if I am wrong): until kernel 2.4.20 (at least), the default export option for sync/async on Linux was "async" (on Solaris I think it was always "sync"). Another point was the "commit" operation in vers2, which was not implemented; the server just replied with an "OK", but the data was not on stable storage yet (here the ZIL and Roch's blog entry are excellent). That's it: I'm proposing the creation of a "matrix/table" with features and performance impact, as well as a comparison with other implementations and their implications. Thanks very much for your time, and sorry for the long post. Leal. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
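To make the sync/async point concrete, here is roughly how the two servers would be configured for *equivalent* semantics (a sketch; the path and client subnet are invented for illustration):

  # Linux /etc/exports -- spell "sync" out rather than relying on the
  # old pre-2.4.20 "async" default mentioned above:
  /export/data  192.168.1.0/24(rw,sync)

  # Solaris share(1M) equivalent -- there is no "async" export option
  # to relax the stable-storage guarantee:
  share -F nfs -o rw /export/data

Benchmarking a Linux "async" export against a Solaris/ZFS one is exactly the oranges-to-apples comparison described above: the async server may acknowledge writes that would be lost in a crash.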
Re: [zfs-discuss] Why did resilvering restart?
Resilver and scrub are broken and restart when a snapshot is created -- the current workaround is to disable snaps while resilvering, the ZFS team is working on the issue for a long term fix. -Wade [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM: > On b66: > # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \ > c0t600A0B8000299CCC06734741CD4Ed0 > < some hours later> > # zpool status tww > pool: tww >state: DEGRADED > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. >scrub: resilver in progress, 62.90% done, 4h26m to go > < some hours later> > # zpool status tww > pool: tww >state: DEGRADED > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. >scrub: resilver in progress, 3.85% done, 18h49m to go > > # zpool history tww | tail -1 > 2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 > c0t600A0B8000299CCC06734741CD4Ed0 > > So, why did resilvering restart when no zfs operations occurred? I > just ran zpool status again and now I get: > # zpool status tww > pool: tww >state: DEGRADED > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. >scrub: resilver in progress, 0.00% done, 134h45m to go > > What's going on? > > -- > albert chin ([EMAIL PROTECTED]) > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
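A rough sketch of that workaround, for anyone who wants to script it (this assumes your periodic snapshots come from a cron job or from an auto-snapshot service that you can pause; substitute the real mechanism on your system):

  # pause whatever takes periodic snapshots first, e.g.:
  # svcadm disable <your-auto-snapshot-instance>
  zpool replace tww <old-device> <new-device>
  # poll until the resilver finishes, then turn snapshots back on
  while zpool status tww | grep "resilver in progress" >/dev/null; do
      sleep 300
  done
  # svcadm enable <your-auto-snapshot-instance>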
[zfs-discuss] Why did resilvering restart?
On b66: # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \ c0t600A0B8000299CCC06734741CD4Ed0 < some hours later> # zpool status tww pool: tww state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress, 62.90% done, 4h26m to go < some hours later> # zpool status tww pool: tww state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress, 3.85% done, 18h49m to go # zpool history tww | tail -1 2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B8000299CCC06734741CD4Ed0 So, why did resilvering restart when no zfs operations occurred? I just ran zpool status again and now I get: # zpool status tww pool: tww state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress, 0.00% done, 134h45m to go What's going on? -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
Rats - I was right the first time: there's a messy problem with snapshots. The problem is that the parent of the child that you're about to update in place may *already* be in one or more snapshots because one or more of its *other* children was updated since each snapshot was created. If so, then each snapshot copy of the parent is pointing to the location of the existing copy of the child you now want to update in place, and unless you change the snapshot copy of the parent (as well as the current copy of the parent) the snapshot will point to the *new* copy of the child you are now about to update (with an incorrect checksum to boot). With enough snapshots, enough children, and bad enough luck, you might have to change the parent (and of course all its ancestors...) in every snapshot. In other words, Nathan's approach is pretty much infeasible in the presence of snapshots. Background defragmentation works as long as you move the entire region (which often has a single common parent) to a new location, which if the source region isn't excessively fragmented may not be all that expensive; it's probably not something you'd want to try at normal priority *during* an update to make Nathan's approach work, though, especially since you'd then wind up moving the entire region on every such update rather than in one batch in the background. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
What kind of workload are you running? If you are doing these measurements with some sort of "write as fast as possible" microbenchmark, once the 4 GB of nvram is full, you will be limited by backend performance (FC disks and their interconnect) rather than the host / controller bus. Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10 seconds, you will fill it up, even with the backend writing out data as fast as it can, in about 20 seconds. Once the nvram is full, you will only see the backend (e.g. 2 Gbit) rate. The reason these controller buffers are useful with real applications is that they smooth the bursts of writes that real applications tend to generate, thus reducing the latency of those writes and improving performance. They will then "catch up" during periods when few writes are being issued. But a typical microbenchmark that pumps out a steady stream of writes won't see this benefit. Drew Wilson Asif Iqbal wrote: On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: (Including storage-discuss) I have 6 6140s with 96 disks. Out of which 64 of them are Seagate ST337FC (300GB - 10K RPM FC-AL) Those disks are 2Gb disks, so the tray will operate at 2Gb. That is still 256MB/s. I am getting about 194MB/s 2Gb fibre channel is going to max out at a data transmission rate But I am running 4GB fiber channels with 4GB NVRAM on a 6 tray of 300GB FC 10K rpm (2Gb/s) disks So I should get "a lot" more than ~ 200MB/s. Shouldn't I? around 200MB/s rather than the 256MB/s that you'd expect. Fibre channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data in 10 bits on the wire. So while 256MB/s is being transmitted on the connection itself, only 200MB/s of that is the data that you're transmitting. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
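Drew's 20-second figure is easy to sanity-check (assuming ~400MB/s of payload coming in from a 4Gb host link and ~200MB/s draining out to the 2Gb backend; real rates will be a little lower once protocol overhead is counted):

  # MB of NVRAM divided by the net fill rate (inflow minus drain), in seconds
  echo $(( 4096 / (400 - 200) ))    # => 20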
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: > > On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > > > On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > > > (Including storage-discuss) > > > > > > > > > > I have 6 6140s with 96 disks. Out of which 64 of them are Seagate > > > > > ST337FC (300GB - 10K RPM FC-AL) > > > > > > > > Those disks are 2Gb disks, so the tray will operate at 2Gb. > > > > > > > > > > That is still 256MB/s. I am getting about 194MB/s > > > > 2Gb fibre channel is going to max out at a data transmission rate > > around 200MB/s rather than the 256MB/s that you'd expect. Fibre > > channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data > > in 10 bits on the wire. So while 256MB/s is being transmitted on the > > connection itself, only 200MB/s of that is the data that you're > > transmitting. > > But I am running 4GB fiber channels with 4GB NVRAM on a 6 tray of > 300GB FC 10K rpm (2Gb/s) disks > > So I should get "a lot" more than ~ 200MB/s. Shouldn't I? Here, I'm relying on what Louwtjie said above, that the tray itself is going to be limited to 2Gb/s because of the 2Gb/s FC disks. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Calum Benson wrote: > On 20 Nov 2007, at 13:35, Christian Kelly wrote: >> Take the example I gave before, where you have a pool called, say, >> pool1. In the pool you have two ZFSes: pool1/export and pool1/ >> export/home. So, suppose the user chooses /export in nautilus and >> adds this to the backup list. Will the user be aware, from browsing >> through nautilus, that /export/home may or may not be backed up - >> depending on whether the -r (?) option is used. > > I'd consider that to be a fairly strong requirement, but it's not > something I particularly thought through for the mockups. > > One solution might be to change the nautilus background for folders > that are being backed up, another might be an indicator in the status > bar, another might be emblems on the folder icons themselves. I think changing the background is a non starter since users can change the background already anyway. An emblem is good for the case where you are looking from "above" a dataset that is tagged for backup. An indicator in the status bar is good for when you are "in" a dataset that is tagged for backup. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: > On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > > On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > > (Including storage-discuss) > > > > > > > > I have 6 6140s with 96 disks. Out of which 64 of them are Seagate > > > > ST337FC (300GB - 10K RPM FC-AL) > > > > > > Those disks are 2Gb disks, so the tray will operate at 2Gb. > > > > > > That is still 256MB/s. I am getting about 194MB/s > > 2Gb fibre channel is going to max out at a data transmission rate But I am running 4Gb fibre channel links, with 4GB of NVRAM, on 6 trays of 300GB FC 10K RPM (2Gb/s) disks. So I should get "a lot" more than ~200MB/s. Shouldn't I? > around 200MB/s rather than the 256MB/s that you'd expect. Fibre > channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data > in 10 bits on the wire. So while 256MB/s is being transmitted on the > connection itself, only 200MB/s of that is the data that you're > transmitting. > > Chad Mynhier > -- Asif Iqbal PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] zpool io to 6140 is really slow
On Nov 20, 2007 1:48 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > > > That is still 256MB/s. I am getting about 194MB/s > > No, I don't think you can take 2Gbit / 8 bits per byte and say 256MB is > what you should get... > Someone with far more FC knowledge can comment here. There must be > some overhead in transporting data (as with regular SCSI) ... in the > same way ULTRA 320MB SCSI never yields close to 320 MB/s ... even > though it might seem so. > > > Adding a second loop by adding another non-active port I may have to > > rebuild the > > FS, no? > > No. Use MPxIO to help you out here ... Solaris will see the same LUNs > on each of the 2, 3 or 4 ports on the primary controller ... but with > multi-pathing switched on it will only give you 1 vhci LUN to work with. > > What I would do is export the zpool(s). Hook up more links to the > primary and enable scsi_vhci. Reboot and look for the new cX vhci > devices. > > zpool import should rebuild the pools from the multipath devices just fine. > > Interesting test though. > > > I am getting 194MB/s. Hmm, my 490 has 16G memory. I really wish I could benefit > > some > > from OS and controller RAM, at least for Oracle I/O > > Close to 200MB seems good from 1 x 2Gb. Should I not see a significant performance gain (I am not getting any) from the 2 x 2GB of NVRAM on my RAID controllers? > > Something else to try ... when creating hardware LUNs, one can assign > the LUN to either controller A or B (as preferred or owner). By doing > assignments one can use the secondary controller ... you are going to > then "stripe" over controllers .. as one way of looking at it. > > PS: Is this a direct connection? Switched fabric? > -- Asif Iqbal PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
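For the archives, the MPxIO dance Louwtjie describes looks roughly like this (a sketch for Sun-branded FC HBAs using the fp driver; the pool name is invented, and I'd read stmsboot(1M) and rehearse on scratch storage first):

  zpool export tank        # quiesce the pool before re-cabling
  stmsboot -e              # enable MPxIO; it will prompt for a reboot
  # after the reboot each LUN appears once, as a single scsi_vhci
  # device with a long WWN-based c#t...d0 name
  zpool import tank        # the pool reassembles on the multipathed devices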
Re: [zfs-discuss] ZFS + DB + "fragments"
> > doing these writes now sounds like a > > lot of work. I'm guessing that needing two full-path > > updates to achieve this means you're talking about a > > much greater write penalty. > > Not all that much. Each full-path update is still > only a single write request to the disk, since all > the path blocks (again, possibly excepting the > superblock) are batch-written together, thus mostly > increasing only streaming bandwidth consumption. Ok, that took some thinking about. I'm pretty new to ZFS, so I've only just gotten my head around how CoW works, and I'm not used to thinking about files at this kind of level. I'd not considered that path blocks would be batch-written close together, but of course that makes sense. What I'd been thinking was that ordinarily files would get fragmented as they age, which would make these updates slower as blocks would be scattered over the disk, so a full-path update would take some time. I'd forgotten that the whole point of doing this is to prevent fragmentation... So a nice side effect of this approach is that if you use it, it makes itself more efficient :D This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz DEGRADED state
> So there is no current way to specify the creation of > a 3 disk raid-z > array with a known missing disk? Can someone answer that? Or does the zpool command NOT accommodate the creation of a degraded raidz array? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
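Not directly, as far as I know -- but the trick that usually gets posted here is to stand in a sparse file for the missing disk and then offline it (a sketch only; sizes and device names are invented, and I'd rehearse it on scratch storage before trusting real data to it):

  mkfile -n 300g /var/tmp/fakedisk      # sparse stand-in, sized like the real disks
  zpool create tank raidz c1t0d0 c1t1d0 /var/tmp/fakedisk
  zpool offline tank /var/tmp/fakedisk  # pool now runs DEGRADED but usable
  rm /var/tmp/fakedisk                  # reclaim the little space it held
  # when the third disk arrives:
  # zpool replace tank /var/tmp/fakedisk c1t2d0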
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Calum Benson wrote: > Right, for Phase 0 the thinking was that you'd really have to manually > set up whatever pools and filesystems you required first. So in your > example, you (or, perhaps, the Indiana installer) would have had to > set up /export/home/chris/Documents as a ZFS filesystem in its own > right before you could start taking snapshots of it. > > Were we to stick with this general design, in later phases, creating a > new ZFS filesystem on the fly, and migrating the contents of the > existing folder into it, would hopefully happen behind the scenes > when you selected that folder to be backed up. (That could presumably > be quite a long operation, though, for folders with large contents.) > Ah, I see. So, for phase 0, the 'Enable Automatic Snapshots' option would only be available for/work for existing ZFSes. Then at some later stage, create them on the fly. > I have no problem looking at it from that angle if it turns out that's > what people want -- much of the UI would be fairly similar. But at the > same time, I don't necessarily always expect OSX users' requirements > to be the same as Solaris users' requirements -- I'd especially like to > hear from people who are already using Tim's snapshot and backup > services, to find out how they use it and what their needs are. Yes, absolutely, OSX users' requirements probably vary wildly from those of Solaris users. I guess I fall into what we might call the 'lazy' category of user ;) I'm aware of Tim's tool, don't use it though. -Christian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
Louwtjie Burger wrote: > Richard Elling wrote: > > > > >- COW probably makes that conflict worse > > > > > > > > > > This needs to be proven with a reproducible, real-world > workload before it > > makes sense to try to solve it. After all, if we cannot > measure where > > we are, > > how can we prove that we've improved? > > I agree, let's first find a reproducible example where "updates" > negatively impacts large table scans ... one that is rather simple (if > there is one) to reproduce and then work from there. I'd say it would be possible to define a reproducible workload that demonstrates this using the Filebench tool... I haven't worked with it much (maybe over the holidays I'll be able to do this), but I think a workload like: 1) create a large file (bigger than main memory) on an empty ZFS pool. 2) time a sequential scan of the file 3) random write i/o over say, 50% of the file (either with or without matching blocksize) 4) time a sequential scan of the file The difference between times 2 and 4 are the "penalty" that COW block reordering (which may introduce seemingly-random seeks between "sequential" blocks) imposes on the system. It would be interesting to watch seeksize.d's output during this run too. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
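Until someone writes the filebench profile, the same experiment can be approximated with dd (a sketch; the sizes, pool and file names are invented, the random-rewrite loop needs ksh or bash for $RANDOM, and firing off one dd per block is slow but adequate for a one-off test):

  zfs create tank/fragtest
  # 1) create a file larger than main memory (~16GB in 128K records here)
  dd if=/dev/zero of=/tank/fragtest/big bs=128k count=131072
  # 2) time a sequential scan
  time dd if=/tank/fragtest/big of=/dev/null bs=128k
  # 3) COW-rewrite ~50% of the records at random offsets
  i=0
  while [ $i -lt 65536 ]; do
      dd if=/dev/zero of=/tank/fragtest/big bs=128k count=1 \
          seek=$(( (RANDOM * 32768 + RANDOM) % 131072 )) conv=notrunc 2>/dev/null
      i=$(( i + 1 ))
  done
  # 4) time the sequential scan again; the slowdown is the COW penalty
  time dd if=/tank/fragtest/big of=/dev/null bs=128k

Watching seeksize.d during step 4, as suggested above, should show the extra seeks directly.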
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 13:35, Christian Kelly wrote: > > Take the example I gave before, where you have a pool called, say, > pool1. In the pool you have two ZFSes: pool1/export and > pool1/export/home. So, suppose the user chooses /export in nautilus and > adds this to the backup list. Will the user be aware, from browsing > through nautilus, that /export/home may or may not be backed up - > depending on whether the -r (?) option is used. I'd consider that to be a fairly strong requirement, but it's not something I particularly thought through for the mockups. One solution might be to change the nautilus background for folders that are being backed up, another might be an indicator in the status bar, another might be emblems on the folder icons themselves. Which approach works best would probably depend on whether we expect most of the folders people are browsing regularly to be backed up, or not backed up -- in general, you'd want any sort of indicator to show the less common state. Cheeri, Calum. -- CALUM BENSON, Usability Engineer Sun Microsystems Ireland mailto:[EMAIL PROTECTED] GNOME Desktop Team http://blogs.sun.com/calum +353 1 819 9771 Any opinions are personal and not necessarily those of Sun Microsystems ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
... > With regards sharing the disk resources with other > programs, obviously it's down to the individual > admins how they would configure this, Only if they have an unconstrained budget. but I would > suggest that if you have a database with heavy enough > requirements to be suffering noticable read > performance issues due to fragmentation, then that > database really should have it's own dedicated drives > and shouldn't be competing with other programs. You're not looking at it from a whole-system viewpoint (which if you're accustomed to having your own dedicated storage devices is understandable). Even if your database performance is acceptable, if it's performing 50x as many disk seeks as it would otherwise need to when scanning a table that's affecting the performance of *other* applications. > > Also, I'm not saying defrag is bad (it may be the > better solution here), just that if you're looking at > performance in this kind of depth, you're probably > experienced enough to have created the database in a > contiguous chunk in the first place :-) As I noted, ZFS may not allow you to ensure that and in any event if the database grows that contiguity may need to be reestablished. You could grow the db in separate files, each of which was preallocated in full (though again ZFS may not allow you to ensure that each is created contiguously on disk), but while databases may include such facilities as a matter of course it would still (all other things being equal) be easier to manage everything if it could just extend a single existing file (or one file per table, if they needed to be kept separate) as it needed additional space. > > I do agree that doing these writes now sounds like a > lot of work. I'm guessing that needing two full-path > updates to achieve this means you're talking about a > much greater write penalty. Not all that much. Each full-path update is still only a single write request to the disk, since all the path blocks (again, possibly excepting the superblock) are batch-written together, thus mostly increasing only streaming bandwidth consumption. ... > It may be that ZFS is not a good fit for this kind of > use, and that if you're really concerned about this > kind of performance you should be looking at other > file systems. I suspect that while it may not be a great fit now with relatively minor changes it could be at least an acceptable one. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 12:56, Christian Kelly wrote: > Hi Calum, > > heh, as it happens, I was tinkering with pygtk to see how difficult > this would be :) > > Supposing I have a ZFS on my machine called root/export/home which > is mounted on /export/home. Then I have my home dir as > /export/home/chris. Say I only want to snapshot and backup > /export/home/chris/Documents. I can't create a snapshot of > /export/home/chris/Documents as it is a directory, I have to create > a snapshot of the parent ZFS, in this case /export/home/. So there > isn't really the granularity that the attached spec implies. Someone > correct me if I'm wrong, but I just tried it and it didn't work. Right, for Phase 0 the thinking was that you'd really have to manually set up whatever pools and filesystems you required first. So in your example, you (or, perhaps, the Indiana installer) would have had to set up /export/home/chris/Documents as a ZFS filesystem in its own right before you could start taking snapshots of it. Were we to stick with this general design, in later phases, creating a new ZFS filesystem on the fly, and migrating the contents of the existing folder into it, would hopefully happen behind the scenes when you selected that folder to be backed up. (That could presumably be quite a long operation, though, for folders with large contents.) > I've had a bit of a look at 'Time Machine' and I'd be more in > favour of that style of backup. Just back up everything so I don't > have to think about it. My feeling is that picking individual > directories out just causes confusion. Think of it this way: how > much change is there on a daily basis on your desktop/laptop? Those > snapshots aren't going to grow very quickly. I have no problem looking at it from that angle if it turns out that's what people want -- much of the UI would be fairly similar. But at the same time, I don't necessarily always expect OSX users' requirements to be the same as Solaris users' requirements -- I'd especially like to hear from people who are already using Tim's snapshot and backup services, to find out how they use it and what their needs are. Cheeri, Calum. -- CALUM BENSON, Usability Engineer Sun Microsystems Ireland mailto:[EMAIL PROTECTED] GNOME Desktop Team http://blogs.sun.com/calum +353 1 819 9771 Any opinions are personal and not necessarily those of Sun Microsystems ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
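Behind the scenes, that migration would presumably boil down to something like the following (a sketch using Christian's dataset names; the dataset name for Documents is invented, and a real tool would have to preserve ACLs/attributes and cope with failures part-way through):

  mv /export/home/chris/Documents /export/home/chris/Documents.migrating
  zfs create root/export/home/documents     # hypothetical dataset name
  zfs set mountpoint=/export/home/chris/Documents root/export/home/documents
  cp -rp /export/home/chris/Documents.migrating/. /export/home/chris/Documents/
  rm -rf /export/home/chris/Documents.migrating

The cp is what makes it the potentially long operation mentioned above.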
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On Tue, 2007-11-20 at 13:35 +0000, Christian Kelly wrote: > What I'm suggesting is that the configuration presents a list of pools > and their ZFSes and that you have a checkbox, backup/don't backup sort > of an option. That's basically the (hacked-up) zenity GUI I have at the moment on my blog: download & install the packages and you'll see. I think getting that into a proper tree structure would help. Right now, there's a bug in my gui, such that with: [X] tank [ ] tank/timf [ ] tank/timf/Documents [ ] tank/timf/Music Selecting "tank" implicitly marks the other filesystems for backup because of the way zfs properties inherit. (load the above gui again having just selected tank, and you'll see the other filesystems being selected for you) Having said that, I like Calum's ideas - and am happy to defer the decision about the gui to someone a lot more qualified than I in this area :-) I think that when browsing directories in nautilus, it would be good to have some sort of "backup" or "snapshot" icon (à la the little padlock in secure web-browsing sessions) to let you know that this directory is being backed up and/or included in snapshots. cheers, tim -- Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops http://blogs.sun.com/timf ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
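For what it's worth, the inheritance problem is easy to demonstrate at the command line. Assuming the backup flag were stored as a ZFS user property (the property name here is invented for illustration):

  zfs set org.example:backup=true tank
  zfs get -r org.example:backup tank
  # NAME                 PROPERTY            VALUE  SOURCE
  # tank                 org.example:backup  true   local
  # tank/timf            org.example:backup  true   inherited from tank
  # tank/timf/Documents  org.example:backup  true   inherited from tank

So ticking "tank" really does tick everything beneath it, unless the tool explicitly sets the property to false on each child (or uses zfs inherit to undo it).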
Re: [zfs-discuss] ZFS + DB + "fragments"
Hmm... that's a pain if updating the parent also means updating the parent's checksum too. I guess the functionality is there for moving bad blocks, but since that's likely to be a rare occurrence, it wasn't something that would need to be particularly efficient. As regards sharing the disk resources with other programs, obviously it's down to the individual admins how they would configure this, but I would suggest that if you have a database with heavy enough requirements to be suffering noticeable read performance issues due to fragmentation, then that database really should have its own dedicated drives and shouldn't be competing with other programs. I'm not saying defrag is bad (it may be the better solution here), just that if you're looking at performance in this kind of depth, you're probably experienced enough to have created the database in a contiguous chunk in the first place :-) I do agree that doing these writes now sounds like a lot of work. I'm guessing that needing two full-path updates to achieve this means you're talking about a much greater write penalty. And that means you can probably expect a significant read penalty if you have any significant volume of writes at all, which would rather defeat the point. After all, if you have a low enough amount of writes to not suffer from this penalty, your database isn't going to be particularly fragmented. However, I'm now in over my depth. This needs somebody who knows the internal architecture of ZFS to decide whether it's feasible or desirable, and whether defrag is a good enough workaround. It may be that ZFS is not a good fit for this kind of use, and that if you're really concerned about this kind of performance you should be looking at other file systems. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
> Time Machine stores everything in the system by default, but you can > still exclude items that you don't want stored. And Time Machine > doesn't use ZFS. > Here we will use ZFS snapshots, and what they work with is file > systems. In Nevada the default file system is not ZFS, which means > some directories are not ZFS; so it seems you have to select individual > directories that are ZFS, and it's impossible to store everything > (some of it is not ZFS)... > What I'm suggesting is that the configuration presents a list of pools and their ZFSes and that you have a checkbox, backup/don't backup sort of an option. When you start having nested ZFSes it could get confusing as to what you are actually backing up if you start browsing down through the filesystem with the likes of nautilus. Take the example I gave before, where you have a pool called, say, pool1. In the pool you have two ZFSes: pool1/export and pool1/export/home. So, suppose the user chooses /export in nautilus and adds this to the backup list. Will the user be aware, from browsing through nautilus, that /export/home may or may not be backed up, depending on whether the -r (?) option is used? I guess what I'm saying is, how aware of the behavior of ZFS must the user be in order to use this backup system? -Christian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
... > My understanding of ZFS (in short: an upside down > tree) is that each block is referenced by it's > parent. So regardless of how many snapshots you take, > each block is only ever referenced by one other, and > I'm guessing that the pointer and checksum are both > stored there. > > If that's the case, to move a block it's just a case > of: > - read the data > - write to the new location > - update the pointer in the parent block Which changes the contents of the parent block (the change in the data checksum changed it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum as well) in *its* parent block, which thus also must be re-written... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'. The old version of each of these blocks need only be *saved* if a snapshot exists and it hasn't previously been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded). ... > PS. > > >1. You'd still need an initial defragmentation pass > to ensure that the file was reasonably piece-wise > contiguous to begin with. > > No, not necessarily. If you were using a zpool > configured like this I'd hope you were planning on > creating the file as a contiguous block in the first > place :) I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like to have some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file). ... > Any zpool with this option would probably be > dedicated to the database file and nothing else. In > fact, even with multiple databases I think I'd have a > single pool per database. It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity. Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here: 3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately since its location is not arbitrary you can't combine this update with either the transaction above or the transaction below). 4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). 
However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps. So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal priority rather than the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so. ... > Taking it a stage further, I wonder if this would > work well with the prioritized write feature request > (caching writes to a solid state disk)? > http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List > > That could potentially mean there's very little > slowdown: > - Read the original block > - Save that to solid state disk > - Write the new block in the original location > - Periodically stream writes from the solid state > disk to the main storage I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could presumably just discard the old copy (as a normal COW update would) rather than stage it anywhere at all. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Christian Kelly wrote: > Hi Calum, > > heh, as it happens, I was tinkering with pygtk to see how difficult this > would be :) > > Supposing I have a ZFS on my machine called root/export/home which is > mounted on /export/home. Then I have my home dir as /export/home/chris. > Say I only want to snapshot and backup /export/home/chris/Documents. I > can't create a snapshot of /export/home/chris/Documents as it is a > directory, I have to create a snapshot of the parent ZFS, in this case > /export/home/. So there isn't really the granularity that the attached > spec implies. Someone correct me if I'm wrong, but I just tried it and > it didn't work. > > I've had a bit of a look at 'Time Machine' and I'd be more in favour of > that style of backup. Just back up everything so I don't have to think > about it. My feeling is that picking individual directories out just > causes confusion. Think of it this way: how much change is there on a > daily basis on your desktop/laptop? Those snapshots aren't going to grow > very quickly. Time Machine stores everything in the system by default, but you can still exclude items that you don't want stored. And Time Machine doesn't use ZFS. Here we will use ZFS snapshots, and what they work with is file systems. In Nevada the default file system is not ZFS, which means some directories are not ZFS; so it seems you have to select individual directories that are ZFS, and it's impossible to store everything (some of it is not ZFS)... > > -Christian > > > > Calum Benson wrote: >> Hi all, >> >> We've been thinking a little about a more integrated desktop presence >> for Tim Foster's ZFS backup and snapshot services[1]. Here are some >> initial ideas about what a Phase 0 (snapshot only, not backup) user >> experience might look like... comments welcome. >> >> http://www.genunix.org/wiki/index.php/ZFS_Snapshot >> >> (I'm not subscribed to zfs-discuss, so please make sure either >> desktop-discuss or I remain cc'ed on any replies if you want me to >> see them...) >> >> Cheeri, >> Calum. >> >> [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people >> >> > > ___ > desktop-discuss mailing list > [EMAIL PROTECTED] -- Henry Zhang JDS Software Development, OPG Sun China Engineering & Research Institute Sun Microsystems, Inc. 10/F Chuang Xin Plaza, Tsinghua Science Park Beijing 100084, P.R. China Tel: +86 10 62673866 Fax: +86 10 62780969 eMail: [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Hi Calum, heh, as it happens, I was tinkering with pygtk to see how difficult this would be :) Supposing I have a ZFS on my machine called root/export/home which is mounted on /export/home. Then I have my home dir as /export/home/chris. Say I only want to snapshot and backup /export/home/chris/Documents. I can't create a snapshot of /export/home/chris/Documents as it is a directory, I have to create a snapshot of the parent ZFS, in this case /export/home/. So there isn't really the granularity that the attached spec implies. Someone correct me if I'm wrong, but I just tried it and it didn't work. I've had a bit of a look at 'Time Machine' and I'd be more in favour of that style of backup. Just back up everything so I don't have to think about it. My feeling is that picking individual directories out just causes confusion. Think of it this way: how much change is there on a daily basis on your desktop/laptop? Those snapshots aren't going to grow very quickly. -Christian Calum Benson wrote: > Hi all, > > We've been thinking a little about a more integrated desktop presence > for Tim Foster's ZFS backup and snapshot services[1]. Here are some > initial ideas about what a Phase 0 (snapshot only, not backup) user > experience might look like... comments welcome. > > http://www.genunix.org/wiki/index.php/ZFS_Snapshot > > (I'm not subscribed to zfs-discuss, so please make sure either > desktop-discuss or I remain cc'ed on any replies if you want me to > see them...) > > Cheeri, > Calum. > > [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people > > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
In that case, this may be a much tougher nut to crack than I thought. I'll be the first to admit that other than having seen a few presentations I don't have a clue about the details of how ZFS works under the hood, however... You mention that moving the old block means updating all its ancestors. I had naively assumed moving a block would be relatively simple, and would also update all the ancestors. My understanding of ZFS (in short: an upside-down tree) is that each block is referenced by its parent. So regardless of how many snapshots you take, each block is only ever referenced by one other, and I'm guessing that the pointer and checksum are both stored there. If that's the case, to move a block it's just a case of: - read the data - write to the new location - update the pointer in the parent block Please let me know if I'm misunderstanding ZFS here. The major problem with this is that I don't know if there's any easy way to identify the parent block from the child, or an efficient way to do this move. However, thinking about it, there must be. ZFS intelligently moves data if it detects corruption, so there must already be tools in place to do exactly what we need here. In which case, this is still relatively simple and much of the code already exists. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
... > - Nathan appears to have suggested a good workaround. > Could ZFS be updated to have a 'contiguous' setting > where blocks are kept together? This sacrifices > write performance for read. I had originally thought that this would be incompatible with ZFS's snapshot mechanism, but with a minor tweak it may not be. ... > - Bill seems to understand the issue, and added some > useful background (although in an entertaining but > rather condescending way). There is a bit of nearby history that led to that. ... > One point that I haven't seen raised yet: I believe > most databases will have had years of tuning based > around the assumption that their data is saved > contiguously on disk. They will be optimising their > disk access based on that and this is not something > we should ignore. Ah - nothing like real, experienced user input. I tend to agree with ZFS's general philosophy of attempting to minimize the number of knobs that need tuning, but this can lead to forgetting that higher-level software may have knobs of its own. My original assumption was that databases automatically attempted to leverage on-disk contiguity (which the more evolved ones certainly do when they're controlling the on-disk layout themselves and one might suspect try to do even when running on top of files by assuming that the file system is trying to preserve on-disk contiguity), but of course admins play a major role as well (e.g., in determining which indexes need not be created because sequential table scans can get the job done efficiently). ... > I definitely don't think defragmentation is the > solution (although that is needed in ZFS for other > scenarios). If your database is under enough read > strain to need the fix suggested here, your disks > definitely do not have the time needed to scan and > defrag the entire system. Well, it's only this kind of randomly-updated/sequentially-scanned data that needs much defragmentation in the first place. Data that's written once and then only read at worst needs a single defragmentation pass (if the original writes got interrupted by a lot of other update activity), data that's not read sequentially (e.g., indirect blocks) needn't be defragmented at all, nor need data that's seldom read and/or not very fragmented in the first place. > > It would seem to me that Nathan's suggestion right at > the start of the thread is the way to go. It > guarantees read performance for the database, and > would seem to be relatively easy to implement at the > zpool level. Yes it adds considerable overhead to > writes, but that is a decision database > administrators can make given the expected load. > > If I'm understanding Nathan right, saving a block of > data would mean: > - Reading the original block (may be cached if we're > lucky) > - Saving that block to a new location > - Saving the new data to the original location 1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with. 2. You can't move the old version of the block without updating all its ancestors (since the pointer to it changes). When you update this path to the old version, you need to suppress the normal COW behavior if a snapshot exists because it would otherwise maintain the old path pointing to the old data location that you're just about to over-write below. 
This presumably requires establishing the entire new path and deallocating the entire old path in a single transaction, but this may just be equivalent to a normal data block 'update' (that just doesn't happen to change any data in the block) when no snapshot exists. I don't *think* that there should be any new issues raised with other updates that may be combined in the same 'transaction', even if they may affect some of the same ancestral blocks. 3. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block: all the *additional* overhead occurred in the previous steps. Given that doing the update twice, as described above, only adds to the bandwidth consumed (steps 2 and 3 should be able to be combined in a single transaction), the only additional disk seek would be that required to re-read the original data if it wasn't cached. So you may well be correct that this approach would likely consume fewer resources than background defragmentation would (though, as noted above, you'd still need an initial defrag pass to get the file piece-wise contiguous to begin with). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS snapshot GUI
Hi all, We've been thinking a little about a more integrated desktop presence for Tim Foster's ZFS backup and snapshot services[1]. Here are some initial ideas about what a Phase 0 (snapshot only, not backup) user experience might look like... comments welcome. http://www.genunix.org/wiki/index.php/ZFS_Snapshot (I'm not subscribed to zfs-discuss, so please make sure either desktop-discuss or I remain cc'ed on any replies if you want me to see them...) Cheeri, Calum. [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people -- CALUM BENSON, Usability Engineer Sun Microsystems Ireland mailto:[EMAIL PROTECTED] GNOME Desktop Team http://blogs.sun.com/calum +353 1 819 9771 Any opinions are personal and not necessarily those of Sun Microsystems ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > (Including storage-discuss) > > > > > > I have 6 6140s with 96 disks. Out of which 64 of them are Seagate > > > ST337FC (300GB - 10K RPM FC-AL) > > > > Those disks are 2Gb disks, so the tray will operate at 2Gb. > > > > That is still 256MB/s. I am getting about 194MB/s 2Gb fibre channel is going to max out at a data transmission rate around 200MB/s rather than the 256MB/s that you'd expect. Fibre channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data in 10 bits on the wire. So while 256MB/s is being transmitted on the connection itself, only 200MB/s of that is the data that you're transmitting. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
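The arithmetic, for anyone who wants to plug in their own link speed (this ignores FC frame and protocol overhead, so real throughput lands a little lower still):

  # payload rate of a 2Gb/s link with 8b/10b encoding: each 10-bit code
  # group on the wire carries 8 data bits, i.e. one payload byte
  echo $(( 2000000000 / 10 / 1000000 ))    # => 200 MB/s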
Re: [zfs-discuss] ZFS + DB + "fragments"
My initial thought was that this whole thread may be irrelevant - anybody wanting to run such a database is likely to use a specialised filesystem optimised for it. But then I realised that for a database admin the integrity checking and other benefits of ZFS would be very tempting, but only if ZFS can guarantee equivalent performance to other filesystems. So, let me see if I understand this right: - Louwtjie is concerned that ZFS will fragment databases, potentially leading to read performance issues for some databases. - Nathan appears to have suggested a good workaround. Could ZFS be updated to have a 'contiguous' setting where blocks are kept together? This sacrifices write performance for read. - Richard isn't convinced there's a problem as he's not seen any data supporting this. I can see his point, but I don't agree that this is a non starter. For certain situations it could be very useful, and balancing read and write performance is an integral part of the choice of storage configuration. - Bill seems to understand the issue, and added some useful background (although in an entertaining but rather condescending way). Richard then went into a little more detail. I think he's pointing out here that while contiguous data is fastest if you consider a single disk, it is not necessarily the fastest approach when your data is spread across multiple disks. Instead he feels a 'diverse stochastic spread' is needed. I guess that means you want the data spread so all the disks can be used in parallel. I think I'm now seeing why Richard is asking for real data. I think he believes that ZFS may already be faster than or equal to a standard contiguous filesystem in this scenario. Richard seems to be using a random or statistical approach to this: if data is saved randomly, you're likely to be using all disks when reading data. I do see the point, and yes, data would be useful, but I think I agree with Bill on this. For reading data, while random locations are likely to be fast in terms of using multiple disks, that data is also likely to be spread and so is almost certain to result in more disk seeks. Whereas if you have contiguous data you can guarantee that it will be striped across the maximum possible number of disks, with the minimum number of seeks. As a database admin I would take guaranteed performance over probable performance any day of the week. Especially if I can be sure that performance will be consistent and will not degrade as the database ages. One point that I haven't seen raised yet: I believe most databases will have had years of tuning based around the assumption that their data is saved contiguously on disk. They will be optimising their disk access based on that, and this is not something we should ignore. Yes, until we have data to demonstrate the problem it's just theoretical. However that may be hard to obtain, and in the meantime I think the theory is sound, and the solution easy enough that it is worth tackling. I definitely don't think defragmentation is the solution (although that is needed in ZFS for other scenarios). If your database is under enough read strain to need the fix suggested here, your disks definitely do not have the time needed to scan and defrag the entire system. It would seem to me that Nathan's suggestion right at the start of the thread is the way to go. It guarantees read performance for the database, and would seem to be relatively easy to implement at the zpool level. 
Yes it adds considerable overhead to writes, but that is a decision database administrators can make given the expected load. If I'm understanding Nathan right, saving a block of data would mean: - Reading the original block (may be cached if we're lucky) - Saving that block to a new location - Saving the new data to the original location So you've got a 2-3x slowdown in write performance, but you guarantee read performance will at least match existing filesystems (with ZFS caching, it may exceed it). ZFS then works much better with all the existing optimisations done within the database software, and you still keep all the benefits of ZFS - full data integrity, snapshots, clones, etc... For many database admins, I think that would be an option they would like to have. Taking it a stage further, I wonder if this would work well with the prioritized write feature request (caching writes to a solid state disk)? http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List That could potentially mean there's very little slowdown: - Read the original block - Save that to solid state disk - Write the new block in the original location - Periodically stream writes from the solid state disk to the main storage In theory there's no need for the drive head to move at all between the read and the write, so this should only be fractionally slower than traditional ZFS writes. Yes, the data needs to be written out twice, but the second copy is a streaming write that can be deferred until the disks are quieter. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs on a raid box
Hi MP, MP wrote: >> but my issue is that >> not only the 'time left', but also the progress >> indicator itself varies >> wildly, and keeps resetting itself to 0%, not giving >> any indication that > > Are you sure you are not being hit by this bug: > > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667 > > i.e. scrub or resilver get's reset to 0% on a snapshot creation or deletion. >Cheers. I'm very sure of that: I've never done a snapshot on these, and I am the only user on the machine (it's not in production yet). Regards, Paul Boven. -- Paul Boven <[EMAIL PROTECTED]> +31 (0)521-596547 Unix/Linux/Networking specialist Joint Institute for VLBI in Europe - www.jive.nl VLBI - It's a fringe science ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs on a raid box
> but my issue is that > not only the 'time left', but also the progress > indicator itself varies > wildly, and keeps resetting itself to 0%, not giving > any indication that Are you sure you are not being hit by this bug: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667 i.e. scrub or resilver gets reset to 0% on a snapshot creation or deletion. Cheers. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz2
Hi Eric, everyone, Eric Schrock wrote: > There have been many improvements in proactively detecting failure, > culminating in build 77 of Nevada. Earlier builds: > > - Were unable to distinguish device removal from devices misbehaving, > depending on the driver and hardware. > > - Did not diagnose a series of I/O failures as disk failure. > > - Allowed several (painful) SCSI retries and continued to queue up I/O, > even if the disk was fatally damaged. > Most classes of hardware would behave reasonably well on device removal, > but certain classes caused cascading failures in ZFS, all which should > be resolved in build 77 or later. I seem to be having exactly the problems you are describing (see my postings with the subject 'zfs on a raid box'). So I would very much like to give b77 a try. I'm currently running b76, as that's the latest sxce that's available. Are the sources to anything beyond b76 already available? Would I need to build it, or bfu? I'm seeing zfs not making use of available hot-spares when I pull a disk, long and indeed painful SCSI retries and very poor write performance on a degraded zpool - I hope to be able to test if b77 fares any better with this. Regards, Paul Boven. -- Paul Boven <[EMAIL PROTECTED]> +31 (0)521-596547 Unix/Linux/Networking specialist Joint Institute for VLBI in Europe - www.jive.nl VLBI - It's a fringe science ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss