[zfs-discuss] ZFS and Storage

2006-06-26 Thread Mika Borner
Hi

Now that Solaris 10 06/06 is finally downloadable I have some questions
about ZFS.

-We have a big storage system supporting RAID5 and RAID1. At the moment
we only use RAID5 (for non-Solaris systems as well). We are thinking
about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID5
seems like overkill, an option would be to use RAID1 with RAID-Z. Then
again, this is a waste of space, as it needs more disks due to the
mirroring. Later on, we might be using asynchronous replication to
another storage system over the SAN, which wastes even more space. It looks
as if storage virtualization, as it stands today, just doesn't play nicely
with ZFS. What we really need is the ability to use JBODs.

-Does ZFS in the current version support LUN extension? With UFS, we
have to zero the VTOC and then adjust the new disk geometry. What does
this look like with ZFS?

-I've read the threads about ZFS and databases. Still, I'm not 100%
convinced about read performance. Doesn't the fragmentation of large
database files (a consequence of copy-on-write) hurt read performance?

-Does anybody have any experience with database cloning using the ZFS
mechanism? What factors influence performance when running the cloned
database in parallel?
-I really like the idea of keeping all the needed database files together,
to allow fast and consistent cloning.

Thanks

Mika


# mv Disclaimer.txt /dev/null






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-26 Thread Roch

So if you have a single thread doing open/write/close of 8K
files and get 1.25MB/sec, that tells me you have something
like a 6ms I/O latency, which also looks reasonable.
What does iostat -x svc_t (client side) say?

400ms seems high for the workload _and_ doesn't match my
formula, so I don't like it ;-)
A quick look at your script seems fine though, but something
just does not compute here.

Why this formula (which applies to any single-threaded NFS
client app working on small files)? Even if the open and
write parts are infinitely fast, on close(2) NFS must ensure
that the data is committed to disk. So at a minimum every
close(2) must wait one I/O latency. During that wait the
single-threaded client application will not initiate the
following open/write/close sequence. At best you get one file
output per I/O latency. The I/O latency is the one seen by the
client and includes the network part, but that should be small
compared to the physical I/O.
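
A quick sanity check of the numbers above (my arithmetic, not from the
original posts):

    throughput ~ file size / per-file latency = 8 KB / 6 ms ~ 1.3 MB/s

and, going the other way, 1.25 MB/sec with 8K files implies roughly
8 KB / 1.25 MB/s ~ 6.4 ms per file, consistent with the 6ms estimate.
A 400ms per-file latency would instead mean about 8 KB / 400 ms ~ 20 KB/s,
which is why that figure does not compute.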


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs list -o usage info missing 'name'

2006-06-26 Thread Gavin Maltby

Hi

Probably been reported a while back, but 'zfs list -o' does not
list the rather useful (and obvious) 'name' property, and neither does the
man page mention it at a quick read.  snv_42.

# zfs list -o
missing argument for 'o' option
usage:
list [-rH] [-o property[,property]...] [-t type[,type]...]
[filesystem|volume|snapshot] ...

The following properties are supported:

PROPERTY   EDIT  INHERIT   VALUES

type NO   NO   filesystem | volume | snapshot
creation NO   NO   date
used NO   NO   size
availableNO   NO   size
referenced   NO   NO   size
compressratioNO   NO   1.00x or higher if compressed
mounted  NO   NO   yes | no | -
origin   NO   NO   snapshot
quota   YES   NO   size | none
reservation YES   NO   size | none
volsize YES   NO   size
volblocksize NO   NO   512 to 128k, power of 2
recordsize  YES  YES   512 to 128k, power of 2
mountpoint  YES  YES   path | legacy | none
sharenfsYES  YES   on | off | share(1M) options
checksumYES  YES   on | off | fletcher2 | fletcher4 | sha256
compression YES  YES   on | off | lzjb
atime   YES  YES   on | off
devices YES  YES   on | off
execYES  YES   on | off
setuid  YES  YES   on | off
readonlyYES  YES   on | off
zoned   YES  YES   on | off
snapdir YES  YES   hidden | visible
aclmode YES  YES   discard | groupmask | passthrough
aclinherit  YES  YES   discard | noallow | secure | passthrough
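
For what it's worth, 'name' does appear to be accepted even though the
usage output above doesn't advertise it; something along these lines
(the dataset names and sizes below are made up for illustration):

# zfs list -o name,used,available,mountpoint
NAME        USED  AVAIL  MOUNTPOINT
tank        1.2G  65.8G  /tank
tank/home   800M  65.8G  /tank/home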

Gavin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Roch

About:

  -I've read the threads about ZFS and databases. Still, I'm not 100%
  convinced about read performance. Doesn't the fragmentation of large
  database files (a consequence of copy-on-write) hurt read performance?

I do need to get back to this thread. The way I am currently
looking at it is this:

ZFS will perform great at the transactional component (say, the
small (8K) O_DSYNC writes) because the ZIL will aggregate them
into fewer, larger I/Os and the block allocator will stream them
to the disk surface.

On the other hand, streaming reads will require good prefetch
code (under review) to get the read performance we want.


If the requirement balances random writes and read streaming,
then ZFS should be right there with the best filesystems. If the
critical requirement focuses exclusively on streaming reads of a
file that was written randomly and, in addition, the number of
spindles is limited, then that is not the sweet spot of ZFS. Read
performance should still scale with the number of spindles. And,
if the load can accommodate a reorder, a cp(1) of the file should
do wonders for the layout and restore top per-spindle
read-streaming performance.
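
To make the cp(1) suggestion concrete, a minimal sketch (the paths are
hypothetical, and the database would need to be offline or the datafile
otherwise quiesced first):

# cp /tank/db/datafile.dbf /tank/db/datafile.dbf.new
# mv /tank/db/datafile.dbf.new /tank/db/datafile.dbf

The copy is written with fresh, sequential block allocation, so the new
file's layout is streaming-friendly; the cost is temporary space for a
second copy of the file (a point raised later in this thread).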


-r


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: ZFS Wiki?

2006-06-26 Thread Jeff Victor

A lesson we learned with Solaris Zones applies here to ZFS.  Accomplishing
high-level goals, e.g. "prepare an appropriate environment for application XYZ
installation" (Zones) or "prepare an appropriate filesystem for application XYZ
data" (ZFS), is different from what it was before Solaris 10.  For Zones, a Sun
BluePrint, the Solaris Containers Technology Architecture Guide, was written to
begin to address this need.


Fortunately, with ZFS it will be easier to determine appropriate factors and
settings than it was for earlier filesystems.  However, documenting the lessons
learned in a wiki would be very valuable.

Nathanael Burton wrote:

Just some random thoughts on this...

One of the initial design criteria of ZFS is that it's simple. If it's not,
that was a bug...

If we need tutorials to use the zfs commands, has something missed the mark?

If the information that is needed to do the work is NOT in the man pages,
perhaps we could look to address that...

Personally, I'd prefer to read a manpage than scour the web for a tutorial
that may or may not be current.

hm... man zfs_tutorial? :)

Nathan.



Tutorial might have been the wrong word.  Man pages are good for quick
reference on specific commands, syntax, and basic functionality.  I understand
that ZFS was built to be simple but powerful.  Some users/admins have
trouble seeing the big picture... putting it all together.  This is where I
feel the power of a wiki, or any centralized documentation space related to
ZFS, could be of benefit.

There are also things that may not be explained in a man page, such as tying
other applications in with ZFS: NetBackup and ZFS(?), ZFS and zones, ZFS and
Oracle, etc.  Most of those topics wouldn't be in a man page (except the zones
one), but they are important topics that could be very useful.

-Nate




--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS Wiki?

2006-06-26 Thread Jeff Victor

Mike Gerdts wrote:

On 6/25/06, Nathan Kroenert [EMAIL PROTECTED] wrote:

Now, looking forward a bit, where does the ZFS integration with zones
documentation belong?  


Some of it will appear in the next update to the Sun BluePrint Solaris Containers 
Architecture Technology Guide.



How about real world replication strategies
with zfs send/receive, including appropriate utility scripts?
Converting UFS root to ZFS root?


If the information that is needed to do the work is NOT in the man
pages, perhaps we could look to address that...



All of the information is in man pages.  Oftentimes, though, stringing man
pages together into bigger concepts is too hard.  Hence the general fear of
man pages among UNIX newbies and some oldbies.


Should every sys admin who is maintaining an Oracle (or MySQL, or...) database be 
forced to go through the process of determining good combinations of ZFS settings? 
 (There aren't many settings, but there are a few.)  Will people learn that there 
are limitations to ZFS that are not documented?  Wouldn't a wiki be useful as a 
central repository of such knowledge?
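
As one concrete example of the kind of setting in question (a commonly
discussed tuning, not an official recommendation, and the dataset name is
made up): matching the dataset recordsize to the database block size before
loading data, e.g. for an 8 KB database block size:

# zfs create tank/oracle
# zfs set recordsize=8k tank/oracle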



--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Gregory Shaw


On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:


Hi

Now that Solaris 10 06/06 is finally downloadable I have some questions
about ZFS.

-We have a big storage system supporting RAID5 and RAID1. At the moment
we only use RAID5 (for non-Solaris systems as well). We are thinking
about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID5
seems like overkill, an option would be to use RAID1 with RAID-Z. Then
again, this is a waste of space, as it needs more disks due to the
mirroring. Later on, we might be using asynchronous replication to
another storage system over the SAN, which wastes even more space. It looks
as if storage virtualization, as it stands today, just doesn't play nicely
with ZFS. What we really need is the ability to use JBODs.



If you've got hardware RAID-5, why not just run regular (non-raid)
pools on top of the RAID-5?

I wouldn't go back to JBOD.  Hardware arrays offer a number of
advantages over JBOD:

- disk microcode management
- optimized access to storage
- large write caches
- RAID computation can be done in specialized hardware
- SAN-based hardware products allow sharing of storage among multiple
  hosts, which allows storage to be utilized more effectively.
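
A minimal sketch of what that would look like (the LUN device names are
hypothetical); the pool simply stripes across the array's RAID-5 LUNs
without adding ZFS-level redundancy:

# zpool create tank c4t0d0 c4t1d0
# zfs create tank/data

Note that ZFS will still detect checksum errors on such a pool, but without
its own redundancy it cannot repair them -- a point raised later in this
thread.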



-Does ZFS in the current version support LUN extension? With UFS, we
have to zero the VTOC and then adjust the new disk geometry. What does
this look like with ZFS?

I don't understand what you're asking.  What problem is solved by
zeroing the VTOC?



-I've read the threads about ZFS and databases. Still, I'm not 100%
convinced about read performance. Doesn't the fragmentation of large
database files (a consequence of copy-on-write) hurt read performance?

This is discussed elsewhere in the zfs discussion group.


-Does anybody have any experience with database cloning using the ZFS
mechanism? What factors influence performance when running the cloned
database in parallel?
-I really like the idea of keeping all the needed database files together,
to allow fast and consistent cloning.

Thanks

Mika


# mv Disclaimer.txt /dev/null









-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382[EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've won." - Linus Torvalds




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: where has all my space gone? (with zfs mountroot + b38)

2006-06-26 Thread Tabriz

James C. McPherson wrote:

James C. McPherson wrote:

Jeff Bonwick wrote:

6420204 root filesystem's delete queue is not running

The workaround for this bug is to issue the following command...
# zfs set readonly=off pool/fs_name
This will cause the delete queue to start up and should flush your queue.
Thanks for the update.  James, please let us know if this solves 
your problem.

yes, I've tried that several times and it didn't work for me at all.
One thing that worked a *little* bit was to set readonly=on, then
go in with mdb -kw and set the drained flag on root_pool to 0 and
then re-set readonly=off. But that only freed up about 2GB.


Here's the next installment in the saga. I bfu'd to include Mark's
recent putback, rebooted, re-ran the set readonly=off op on the
root pool and root filesystem, and waited. Nothing. Nada. Not a
sausage.

Here's my root filesystem delete head:


> ::fsinfo ! head -2
VFSP             FS      MOUNT
fbcaa4e0         zfs     /
> fbcaa4e0::print struct vfs vfs_data |::print struct zfsvfs z_delete_head

{
z_delete_head.z_mutex = {
_opaque = [ 0 ]
}
z_delete_head.z_cv = {
_opaque = 0
}
z_delete_head.z_quiesce_cv = {
_opaque = 0
}
z_delete_head.z_drained = 0x1
z_delete_head.z_draining = 0
z_delete_head.z_thread_target = 0
z_delete_head.z_thread_count = 0
z_delete_head.z_znode_count = 0x5ce4
z_delete_head.z_znodes = {
list_size = 0xc0
list_offset = 0x10
list_head = {
list_next = 0x9232ded0
list_prev = 0xfe820d2c16b0
}
}
}




I also went in with mdb -kw and set z_drained to 0, then re-set the
readonly flag... still nothing. Pool usage is now up to ~93%, and a
zdb run shows lots of leaked space too:

[snip bazillions of entries re leakage]

block traversal size 273838116352 != alloc 274123164672 (leaked 285048320)


bp count:               5392224
bp logical:    454964635136  avg: 84374
bp physical:   272756334592  avg: 50583  compression: 1.67
bp allocated:  273838116352  avg: 50783  compression: 1.66

SPA allocated: 274123164672  used: 91.83%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     3  48.0K      8K   24.0K      8K    6.00     0.00  L1 deferred free
     5  44.0K   14.5K   37.0K   7.40K    3.03     0.00  L0 deferred free

     8  92.0K   22.5K   61.0K   7.62K    4.09     0.00  deferred free
     1    512     512      1K      1K    1.00     0.00  object directory
     3  1.50K   1.50K   3.00K      1K    1.00     0.00  object array
     1    16K   1.50K   3.00K   3.00K   10.67     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size

     1    16K      1K   3.00K   3.00K   16.00     0.00  L1 bplist
     1    16K     16K     32K     32K    1.00     0.00  L0 bplist
     2    32K   17.0K   35.0K   17.5K    1.88     0.00  bplist
     -      -       -       -       -       -        -  bplist header
     -      -       -       -       -       -        -  SPA space map header
   140  2.19M    364K   1.06M   7.79K    6.16     0.00  L1 SPA space map
 5.01K  20.1M   15.4M   30.7M   6.13K    1.31     0.01  L0 SPA space map

 5.15K  22.2M   15.7M   31.8M   6.17K    1.42     0.01  SPA space map
     1  28.0K   28.0K   28.0K   28.0K    1.00     0.00  ZIL intent log
     2    32K      2K   6.00K   3.00K   16.00     0.00  L6 DMU dnode
     2    32K      2K   6.00K   3.00K   16.00     0.00  L5 DMU dnode
     2    32K      2K   6.00K   3.00K   16.00     0.00  L4 DMU dnode
     2    32K   2.50K   7.50K   3.75K   12.80     0.00  L3 DMU dnode
    15   240K   50.5K    152K   10.1K    4.75     0.00  L2 DMU dnode
   594  9.28M   3.88M   11.6M   20.1K    2.39     0.00  L1 DMU dnode
 68.7K  1.07G    274M    549M   7.99K    4.00     0.21  L0 DMU dnode
 69.3K  1.08G    278M    561M   8.09K    3.98     0.21  DMU dnode
     3  3.00K   1.50K   4.50K   1.50K    2.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     3  1.50K   1.50K   3.00K      1K    1.00     0.00  DSL directory child map
     2     1K      1K      2K      1K    1.00     0.00  DSL dataset snap map

     5  64.5K   7.50K   15.0K   3.00K    8.60     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS ACL
 2.82K  45.1M   2.93M   5.85M   2.08K   15.41     0.00  L2 ZFS plain file
  564K  8.81G    612M   1.19G   2.17K   14.76     0.47  L1 ZFS plain file
 4.40M   414G    253G    253G   57.5K    1.63    99.21  L0 ZFS plain file

 4.95M   422G    254G    254G   51.4K    1.67    99.68  ZFS plain file
     1    16K      1K   3.00K   3.00K   16.00     0.00  L2 ZFS directory
  

Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-26 Thread Neil Perrin



Robert Milkowski wrote On 06/25/06 04:12,:

Hello Neil,

Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP Chris,

NP The data will be written twice on ZFS using NFS. This is because NFS
NP on closing the file internally uses fsync to cause the writes to be
NP committed. This causes the ZIL to immediately write the data to the intent log.
NP Later the data is also committed as part of the pool's transaction group
NP commit, at which point the intent log blocks are freed.

NP It does seem inefficient to doubly write the data. In fact, for blocks
NP larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was fixed)
NP we write the data block and also an intent log record with the block pointer.
NP During txg commit we link this block into the pool tree. By experimentation
NP we found 32K to be the (current) cutoff point. As the nfsds write at most 32K,
NP they do not benefit from this.

Is 32KB easily tuned (mdb?)?


I'm not sure. NFS folk?


I guess not but perhaps.

And why only for blocks larger than zfs_immediate_write_sz?


When the data is large enough (currently 32K) it's more efficient to write
the block directly and additionally save the block pointer in a ZIL record.
Otherwise it's more efficient to copy the data into a larger log block,
potentially along with other writes.
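
A rough back-of-the-envelope view of that tradeoff (my numbers, for
illustration only):

    8K write, copied into the log:     ~8K to the log now + ~8K at txg commit,
                                       and many such records can share one
                                       larger log I/O.
    128K write, copied into the log:   ~128K + ~128K = ~256K written in total.
    128K write, logged as a pointer:   ~128K data block + a small log record;
                                       the block is simply linked into the
                                       pool tree at txg commit.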

--

Neil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] status question regarding sol10u2

2006-06-26 Thread Nicholas Senedzuk
I had the same problem.

On 6/26/06, Shannon Roddy [EMAIL PROTECTED] wrote:
 Noel Dellofano wrote:
  Solaris 10u2 was released today.  You can now download it from here:
  http://www.sun.com/software/solaris/get.jsp

 Seems the download links are dead except for x86-64.  No Sparc downloads.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] status question regarding sol10u2

2006-06-26 Thread Dennis Clarke

 Noel Dellofano wrote:
 Solaris 10u2 was released today.  You can now download it from here:

 http://www.sun.com/software/solaris/get.jsp

 Seems the download links are dead except for x86-64.  No Sparc downloads.


Everything works perfectly.

$ ls -1
sol-10-u2-ga-sparc-lang-iso.zip
sol-10-u2-ga-sparc-lang.iso
sol-10-u2-ga-sparc-v1-iso.zip
sol-10-u2-ga-sparc-v1.iso
sol-10-u2-ga-sparc-v2-iso.zip
sol-10-u2-ga-sparc-v2.iso
sol-10-u2-ga-sparc-v3-iso.zip
sol-10-u2-ga-sparc-v3.iso
sol-10-u2-ga-sparc-v4-iso.zip
sol-10-u2-ga-sparc-v4.iso
sol-10-u2-ga-sparc-v5-iso.zip
sol-10-u2-ga-sparc-v5.iso

I have the x86 CD-ROMs also.

A quick set of links is at the top of the page at Blastwave :

   www.blastwave.org




-- 
Dennis Clarke

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Solaris 10 6/06 now available for download

2006-06-26 Thread Larry Wake

Shannon Roddy wrote:

 Noel Dellofano wrote:
  Solaris 10u2 was released today.  You can now download it from here:
  http://www.sun.com/software/solaris/get.jsp

 Seems the download links are dead except for x86-64.  No Sparc downloads.


There were some problems getting the links set up on the Sun download
center, which should all be sorted out now.  Have at it...


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Philip Brown

Roch wrote:

And, if the load can accommodate a reorder, a cp(1) of the file
should do wonders for the layout and restore top per-spindle
read-streaming performance.



but there may not be filesystem space for double the data.
Sounds like there is a need for a zfs-defragment-file utility, perhaps?

Or, if you want to be politically cagey about the naming choice, perhaps

zfs-seq-read-optimize-file ?  :-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS and Storage

2006-06-26 Thread Nathanael Burton
 If you've got hardware RAID-5, why not just run regular (non-raid)
 pools on top of the RAID-5?

 I wouldn't go back to JBOD.  Hardware arrays offer a number of
 advantages over JBOD:
   - disk microcode management
   - optimized access to storage
   - large write caches
   - RAID computation can be done in specialized hardware
   - SAN-based hardware products allow sharing of storage among
     multiple hosts.  This allows storage to be utilized more effectively.
 
 

I'm a little confused by the first poster's message as well, but you do lose
some of the benefits of ZFS if you don't create your pools with either
mirroring or RAID-Z, such as the ability to self-heal detected data
corruption.  The array isn't going to catch that, because all it knows about
are blocks.
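
If the goal is to keep ZFS-level self-healing while still using the array,
one minimal sketch (the LUN names are hypothetical) is to let ZFS mirror two
LUNs, ideally from different arrays as suggested elsewhere in this thread:

# zpool create tank mirror c4t0d0 c5t0d0

With a ZFS mirror, a block that fails its checksum on one side can be read
from the other side and rewritten, at the cost of the extra space the
original poster was worried about.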

-Nate
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Olaf Manczak

Eric Schrock wrote:

On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:

You're using hardware raid.  The hardware raid controller will rebuild
the volume in the event of a single drive failure.  You'd need to keep
on top of it, but that's a given in the case of either hardware or
software raid.


True for total drive failure, but there are more failure modes than that.
With hardware RAID, there is no way for the RAID controller to know which
block was bad, and therefore it cannot repair the block.  With RAID-Z, we
have the integrated checksum and can do combinatorial analysis to know not
only which drive was bad, but what the data _should_ be, and can repair it
to prevent more corruption in the future.


Keep in mind that each disk data block is accompanied by a pretty
long error correction code (ECC), which allows for (a) verification
of data integrity and (b) repair of lost/misread bits (typically up to
about 10% of the block data).

Therefore, in case of single block errors there are several possible
situations:

- non-recoverable errors - the amount of correct bits in the combined
  data + ECC is insufficient - such errors are visible to the RAID
  controller, the controller can use a redundant copy of the data, and
  the controller can perform the repair

- recoverable errors - some bits can't be read correctly but they
  can be reconstructed  using ECC - these errors are not directly
  visible to either the RAID controller or ZFS. However, the disks
  keep the count of recoverable errors so disk scrubbers can identify
  disk areas with rotten blocks and force block relocation

- silent data corruption - it can happen in memory before the data
  was written to disk, it can occur in the disk cache, it can be caused
  by a bug in disk firmware. Here the disk controller can't do
  anything and the end-to-end checksums, which ZFS offers,
  are the only solution.

-- Olaf

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Bart Smaalders

Gregory Shaw wrote:

On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:

How would ZFS self heal in this case?




You're using hardware raid.  The hardware raid controller will rebuild
the volume in the event of a single drive failure.  You'd need to keep
on top of it, but that's a given in the case of either hardware or
software raid.

If you've got requirements for surviving an array failure, the
recommended solution in that case is to mirror between volumes on
multiple arrays.   I've always liked software raid (mirroring) in that
case, as no manual intervention is needed in the event of an array
failure.  Mirroring between discrete arrays is usually reserved for
mission-critical applications that cost thousands of dollars per hour in
downtime.



In other words, it won't.  You've spent the disk space, but
because you're mirroring in the wrong place (the raid array)
all ZFS can do is tell you that your data is gone.  With luck,
subsequent reads _might_ get the right data, but maybe not.

- Bart

--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Richard Elling

Olaf Manczak wrote:

Eric Schrock wrote:

On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:

You're using hardware raid.  The hardware raid controller will rebuild
the volume in the event of a single drive failure.  You'd need to keep
on top of it, but that's a given in the case of either hardware or
software raid.


True for total drive failure, but there are more failure modes than that.
With hardware RAID, there is no way for the RAID controller to know which
block was bad, and therefore it cannot repair the block.  With RAID-Z, we
have the integrated checksum and can do combinatorial analysis to know not
only which drive was bad, but what the data _should_ be, and can repair it
to prevent more corruption in the future.


Keep in mind that each disk data block is accompanied by a pretty
long error correction code (ECC), which allows for (a) verification
of data integrity and (b) repair of lost/misread bits (typically up to
about 10% of the block data).


AFAIK, typical disk ECC will correct 8 bytes.  I'd love for it to be
10% (51 bytes).  Do you have a pointer to such information?


Therefore, in case of single block errors there are several possible
situations:

- non-recoverable errors - the amount of correct bits in the combined
  data + ECC is insufficient - such errors are visible to the RAID
  controller, the controller can use a redundant copy of the data, and
  the controller can perform the repair

- recoverable errors - some bits can't be read correctly but they
  can be reconstructed  using ECC - these errors are not directly
  visible to either the RAID controller or ZFS. However, the disks
  keep the count of recoverable errors so disk scrubbers can identify
  disk areas with rotten blocks and force block relocation

- silent data corruption - it can happen in memory before the data
  was written to disk, it can occur in the disk cache, it can be caused
  by a bug in disk firmware. Here the disk controller can't do
  anything and the end-to-end checksums, which ZFS offers,
  are the only solution.


Another mode occurs when you use a format(1M)-like utility to scan
and repair disks.  For such utilities, if the data cannot be reconstructed
it is zero-filled.  If there was real data stored there, ZFS will detect
the damage, while the majority of other file systems will not.
For an array, one should not be able to readily access such utilities
and cause such corrective actions, but I would not bet the farm on it --
end-to-end error detection will always prevail.
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss