[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data

2006-09-15 Thread can you guess?
 On 9/13/06, Matthew Ahrens [EMAIL PROTECTED]
 wrote:
  Sure, if you want *everything* in your pool to be
 mirrored, there is no
  real need for this feature (you could argue that
 setting up the pool
  would be easier if you didn't have to slice up the
 disk though).
 
 Not necessarily.  Implementing this at the FS level
 will still allow
 the administrator to turn on copies for the entire
 pool, since the
 pool is technically also a FS and the property is
 inherited by child
 FS's.  Of course, this also allows the admin to turn
 off copies for the
 FS containing junk.

Implementing it at the directory and file levels would be even more flexible:  
redundancy strategy would no longer be tightly tied to path location, but 
directories and files could themselves still inherit defaults from the 
filesystem and pool when appropriate (but could be individually handled when 
desirable).
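
For concreteness, a minimal sketch of the pool-wide-with-exceptions scheme described in the quoted reply, assuming the proposed copies property ships roughly as discussed here (pool and dataset names are made up for illustration):

    # the property is inherited by every child filesystem unless overridden
    zfs set copies=2 mypool
    zfs create mypool/scratch
    zfs set copies=1 mypool/scratch    # the "junk" FS opts back out
    zfs get -r copies mypool           # show each dataset's value and its source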

I've never understood why redundancy was a pool characteristic in ZFS - and the 
addition of 'ditto blocks' and now this new proposal (both of which introduce 
completely new forms of redundancy to compensate for the fact that pool-level 
redundancy doesn't satisfy some needs) just makes me more skeptical about it.

(Not that I intend in any way to minimize the effort it might take to change 
that decision now.)

 
  It could be recommended in some situations.  If you
 want to protect
  against disk firmware errors, bit flips, part of
 the disk getting
  scrogged, then mirroring on a single disk (whether
 via a mirror vdev or
  copies=2) solves your problem.  Admittedly, these
 problems are probably
 less common than whole-disk failure, which
 mirroring on a single disk
  does not address.
 
 I beg to differ; from experience, the above errors
 are more common
 than whole-disk failures.  It's just that we do not
 notice the disks
 are developing problems, but panic when they finally
 fail completely.

It would be interesting to know whether that would still be your experience in 
environments that regularly scrub active data as ZFS does (assuming that said 
experience was accumulated in environments that don't).  The theory behind 
scrubbing is that every data area gets read often enough that it doesn't have 
time to deteriorate gradually to the point where it can't be read at all; early 
deterioration encountered during a scrub pass (or other access), while the data 
has only begun to become difficult to read, results in immediate revectoring 
(by the disk or, failing that, by the file system) to a healthier location.
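
For reference, a sketch of the scrub cycle being described, using the standard commands (pool name is illustrative):

    zpool scrub mypool       # read and verify every allocated block in the pool
    zpool status -v mypool   # report scrub progress and any checksum errors found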

Since ZFS-style scrubbing detects even otherwise-undetectable 'silent 
corruption' missed by the disk's own ECC mechanisms, that lower-probability 
event is also covered (though my impression is that the probability of even a 
single such sector may be significantly lower than that of whole-disk failure, 
especially in laptop environments).

All that being said, keeping multiple copies on a single disk of most metadata 
(the loss of which could lead to widespread data loss) definitely makes sense 
(especially given its typically negligible size), and it probably makes sense 
for some files as well.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Access to ZFS checksums would be nice and very useful feature

2006-09-15 Thread Ceri Davies
On Thu, Sep 14, 2006 at 05:08:18PM -0500, Nicolas Williams wrote:
 On Thu, Sep 14, 2006 at 10:32:59PM +0200, Henk Langeveld wrote:
  Bady, Brant RBCM:EX wrote:
  Part of the archiving process is to generate checksums (I happen to use
  MD5), and store them with other metadata about the digital object in
  order to verify data integrity and demonstrate the authenticity of the
  digital object over time.
  
  Wouldn't it be helpful if there were a utility to access/read the
  checksum data created by ZFS, and use it for those same purposes?
  
  Doesn't ZFS use block-level checksums?
 
 Yes, but the checksum is stored with the pointer.
 
 So then, for each file/directory there's a dnode, and that dnode has
 several block pointers to data blocks or indirect blocks, and indirect
 blocks have pointers to... and so on.

Does ZFS have block fragments?  If so, then updating an unrelated file
would change the checksum.

Ceri
-- 
That must be wonderful!  I don't understand it at all.
  -- Moliere


pgpzabNG9m5HW.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: zfs panic installing a brandz zone

2006-09-15 Thread Mark Maybee

Yup, it's almost certain that this is the bug you are hitting.

-Mark

Alan Hargreaves wrote:
I know, bad form replying to myself, but I am wondering if it might be 
related to


 6438702 error handling in zfs_getpage() can trigger page not 
locked


Which is marked as "fix in progress" with a target of the current build.

alan.

Alan Hargreaves wrote:

Folks, before I start delving too deeply into this crashdump, has 
anyone seen anything like it?


The background is that I'm running a non-debug open build of b49 and 
was in the process of running the zoneadm -z redlx install 


After a bit, the machine panics.  Looking at the crashdump, I'm down to
88MB free (out of a gig) and see the following stack.


fe8000de7800 page_unlock+0x3b(180218720)
fe8000de78d0 zfs_getpage+0x236(89b84d80, 12000, 2000, fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, fe808180a000, 1, 80826dc8)
fe8000de7950 fop_getpage+0x52(89b84d80, 12000, 2000, fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, fe8081818000, 1, 80826dc8)
fe8000de7a50 segmap_fault+0x1d6(801a6f38, fbc29b20, fe8081818000, 2000, 0, 1)
fe8000de7b30 segmap_getmapflt+0x67a(fbc29b20, 89b84d80, 12000, 2000, 1, 1)
fe8000de7bd0 lofi_strategy_task+0x14b(959d2400)
fe8000de7c60 taskq_thread+0x1a7(84453da8)
fe8000de7c70 thread_start+8()

%rax = 0x                 %r9  = 0x0300430e
%rbx = 0x000e             %r10 = 0x1000
%rcx = 0xfe8081819000     %r11 = 0x113709b0
%rdx = 0xfe8000de7c80     %r12 = 0x000180218720
%rsi = 0x00013000         %r13 = 0xfbc52160 pse_mutex+0x200
%rdi = 0xfbc52160 pse_mutex+0x200 %r14 = 0x4000
%r8  = 0x0200             %r15 = 0xfe8000de79d8

%rip = 0xfb8474fb page_unlock+0x3b
%rbp = 0xfe8000de7800
%rsp = 0xfe8000de77e0
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=of,df,IF,tf,sf,ZF,af,PF,cf

%cs = 0x0028    %ds = 0x0043    %es = 0x0043
%trapno = 0xe   %fs = 0x        fsbase = 0x8000
%err = 0x0      %gs = 0x01c3    gsbase = 0xfbc27b70

While the panic string says NULL pointer dereference, it appears that 
0x180218720 is not mapped.  The dereference looks like the first 
dereference in page_unlock(), which looks at pp->p_selock.


I can spend a little time looking at it, but was wondering if anyone 
had seen this kind of panic previously?


I have two identical crashdumps created in exactly the same way.

alan.





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] [Blade 150] ZFS: extreme low performance

2006-09-15 Thread Mathias F
Hi forum,

I'm currently a little playing around with ZFS on my workstation.
I created a standard mirrored pool over 2 disk-slices.

# zpool status
  pool: mypool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
mypoolONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t0d0s4  ONLINE   0 0 0
c0t2d0s4  ONLINE   0 0 0

Then I created a ZFS filesystem with no extra options:

# zfs create mypool/zfs01
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
mypool 106K  27,8G  25,5K  /mypool
mypool/zfs01  24,5K  27,8G  24,5K  /mypool/zfs01

When I now run mkfile on the new FS, the performance of the whole system 
drops to nearly zero:

# mkfile 5g test

last pid: 25286;  load avg:  3.54,  2.28,  1.29;   up 0+01:44:26
   16:16:24
66 processes: 61 sleeping, 3 running, 1 zombie, 1 on cpu
CPU states:  0.0% idle,  2.1% user, 97.9% kernel,  0.0% iowait,  0.0% swap
Memory: 512M phys mem, 65M free mem, 2050M swap, 2050M free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATETIMECPU COMMAND
 25285 root   1   84 1184K  752K run  0:09 66.28% mkfile


It seems that some kind of kernel activity while writing to ZFS blocks the 
system.
Is this a known problem? Do you need additional information?
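
Not part of the original report, but the stall could be characterized further with the stock Solaris observability tools while the mkfile is running (all standard commands):

    # run these in parallel with the mkfile to see where the time is going
    iostat -xn 5    # per-device throughput, service times and %busy
    vmstat 5        # free memory and scan rate (memory pressure)
    mpstat 5        # per-CPU breakdown of the ~98% kernel time shown above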

regards
Mathias
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: [Blade 150] ZFS: extreme low performance

2006-09-15 Thread Jürgen Keil
The disks in that Blade 100, are these IDE disks?

The performance problem is probably bug 6421427:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6421427


A fix for the issue was integrated into the OpenSolaris 20060904 source
drop (actually the closed-binary drop):

http://dlc.sun.com/osol/on/downloads/20060904/on-changelog-20060904.html

... but has been removed in the next update:

http://dlc.sun.com/osol/on/downloads/20060911/on-changelog-20060911.html
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Access to ZFS checksums would be nice and very useful feature

2006-09-15 Thread Luke Scharf

Luke Scharf wrote:
 It sounded to me like he wanted to implement Tripwire, but save some
 time and CPU power by querying the checksumming work that was already
 done by ZFS.

Never mind.  The e-mail client that I chose to use broke up the thread, 
and I didn't see that the issue had already been thoroughly discussed.


-Luke



smime.p7s
Description: S/MIME Cryptographic Signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Sol 10 x86_64 intermittent SATA device locks up server

2006-09-15 Thread Humberto Ramirez
What's the brand and model of the cards ?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Access to ZFS checksums would be nice and very useful feature

2006-09-15 Thread Nicolas Williams
On Fri, Sep 15, 2006 at 09:31:04AM +0100, Ceri Davies wrote:
 On Thu, Sep 14, 2006 at 05:08:18PM -0500, Nicolas Williams wrote:
  Yes, but the checksum is stored with the pointer.
  
  So then, for each file/directory there's a dnode, and that dnode has
  several block pointers to data blocks or indirect blocks, and indirect
  blocks have pointers to... and so on.
 
 Does ZFS have block fragments?  If so, then updating an unrelated file
 would change the checksum.

No.  It has variable-sized blocks.

A block pointer in ZFS is much more than just a block number.  Among
other things a block pointer has the checksum of the block it points to.
See the on-disk layout document for more info.

There is no way that updating one file could change another's checksum.

What does matter is that a ZFS checksum for a file, if it is to be obtained
in O(1) time, depends on the on-disk layout of the file, and anything that
changed that layout (today nothing would) would change the ZFS checksum of
the file.  So I think that ZFS checksums, if exposed, are best treated as a
file-change-detection optimization, not as an actual checksum of the file's
contents.
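
In other words, for the archival use case that started this thread, a layout-independent checksum still has to be computed at the application level.  A minimal sketch with Solaris digest(1); the file names are made up and algorithm availability depends on the release:

    # create and later verify a content checksum that does not depend on
    # how ZFS happens to lay the file out on disk
    digest -a sha256 /archive/object-0001.tif > /archive/object-0001.sha256
    digest -a sha256 /archive/object-0001.tif | diff - /archive/object-0001.sha256 \
        && echo "content unchanged"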

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data

2006-09-15 Thread Bill Moore
On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
 Implementing it at the directory and file levels would be even more
 flexible:  redundancy strategy would no longer be tightly tied to path
 location, but directories and files could themselves still inherit
 defaults from the filesystem and pool when appropriate (but could be
 individually handled when desirable).

The problem boils down to not having a way to express your intent that
works over NFS (where you're basically limited by POSIX) that you can
use from any platform (esp. ones where ZFS isn't installed).  If you
have some ideas, this is something we'd love to hear about.

 I've never understood why redundancy was a pool characteristic in ZFS
 - and the addition of 'ditto blocks' and now this new proposal (both
 of which introduce completely new forms of redundancy to compensate
 for the fact that pool-level redundancy doesn't satisfy some needs)
 just makes me more skeptical about it.

We have thought long and hard about this problem and even know how to
implement it (the name we've been using is "Metaslab Grids", which isn't
terribly descriptive, or, as Matt put it, "a bag o' disks").  There are
two main problems with it, though.  One is failures.  The problem is
that you want the set of disks implementing redundancy (mirror, RAID-Z,
etc.) to be spread across fault domains (controller, cable, fans, power
supplies, geographic sites) as much as possible.  There is no generic
mechanism to obtain this information and act upon it.  We could ask the
administrator to supply it somehow, but such a description takes effort,
is not easy, and is prone to error.  That's why we have the model right
now, where the administrator specifies how they want the disks spread
out across fault groups (vdevs).
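
As an aside, a minimal sketch of how that intent is expressed under the current model; the controller/device names are invented, the point being that each mirror vdev spans two controllers (two fault domains):

    # no single controller failure takes out both halves of a mirror
    zpool create tank \
        mirror c1t0d0 c2t0d0 \
        mirror c1t1d0 c2t1d0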

The second problem comes back to accounting.  If you can specify, on a
per-file or per-directory basis, what kind of replication you want, how
do you answer the statvfs() question?  I think the recent discussions
on this list illustrate the complexity and passion on both sides of the
argument.

 (Not that I intend in any way to minimize the effort it might take to
 change that decision now.)

The effort is not actually that great.  All the hard problems we needed
to solve in order to implement this were basically solved when we did
the RAID-Z code.  As a matter of fact, you can see it in the on-disk
specification as well.  In the DVA, you'll notice an 8-bit field labeled
GRID.  These are the bits that would describe, on a per-block basis,
what kind of redundancy we used.


--Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Bizzare problem with ZFS filesystem

2006-09-15 Thread Neil Perrin

It is highly likely you are seeing a duplicate of:

6413510 zfs: writing to ZFS filesystem slows down fsync() on
other files in the same FS

which was fixed recently in build 48 on Nevada.
The symptoms are very similar.  That is, an fsync from vi would, prior
to the bug being fixed, have to force out all other data through the
intent log.

Neil.


Anantha N. Srirama wrote on 09/13/06 15:58:
One more piece of information.  I was able to ascertain that the slowdown 
happens only when ZFS is used heavily, meaning lots of in-flight I/O.  This 
morning, when the system was quiet, my writes to the /u099 filesystem were 
excellent, but performance has since gone south as I reported earlier.


I am currently awaiting the completion of a write to /u099, well over 60 
seconds.  At the same time I was able to create/save files in /u001 without any 
problems.  The only difference between /u001 and /u099 is the size of the 
filesystem (256GB vs 768GB).

Per your suggestion I ran a 'zfs set' command and it completed after a wait of 
around 20 seconds while my file save from vi against /u099 is still pending!!!
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] no automatic clearing of zoned eh?

2006-09-15 Thread ozan s. yigit

s10u2: once zoned, always zoned?  i see that the zoned property is not
cleared after removing the dataset from a zone config or even after
uninstalling the entire zone... [right, i know how to clear it by
hand, but maybe i am missing a bit of magic in the otherwise anodyne
zonecfg et al.]
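
For anyone hitting the same thing, the by-hand cleanup oz alludes to is just the following (the dataset name is made up; run from the global zone after the zone is gone):

    zfs get zoned tank/zonedata        # still reports "on"
    zfs set zoned=off tank/zonedata    # clear it by hand
    zfs mount tank/zonedata            # remount in the global zone if desired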

oz
--
ozan s. yigit | [EMAIL PROTECTED]
don't be afraid to find the rhinoceros to pick fleas from.
 -- richard gabriel [patterns of software]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: reslivering, how long will it take?

2006-09-15 Thread Tim Cook
the status showed 19.46% the first time I ran it, then 9.46% the second.  The 
question I have is I added the new disk, but it's showing the following:

Device: c5d0
Storage Pool: fserv
Type: Disk
Device State: Faulted (cannot open)

The disk is currently unpartitioned and unformatted.  I was under the 
impression that ZFS was going to take care of all of that.  Do I need to set up 
partitioning and formatting before trying to add it to a pool?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: reslivering, how long will it take?

2006-09-15 Thread Bill Moore
On Fri, Sep 15, 2006 at 01:10:25PM -0700, Tim Cook wrote:
 the status showed 19.46% the first time I ran it, then 9.46% the
 second.  The question I have is I added the new disk, but it's showing
 the following:
 
 Device: c5d0
 Storage Pool: fserv
 Type: Disk
 Device State: Faulted (cannot open)

Did you run zpool replace fserv c5d0?  We're working on the
auto-replace when we detect a hot-plug, but it's not in yet.

 The disk is currently unpartitioned and unformatted.  I was under the
 impression that ZFS was going to take care of all of that.  Do I need to
 set up partitioning and formatting before trying to add it to a pool?

ZFS should take care of all that.
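
For the archive, the workflow Bill is describing amounts to (pool and device names as in this thread):

    zpool replace fserv c5d0    # attach the new c5d0 in place of the failed device
    zpool status -v fserv       # watch the resilver progress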


--Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: reslivering, how long will it take?

2006-09-15 Thread Tim Cook
hrmm... cannot replace c5d0 with c5d0: cannot replace a replacing device
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on production servers with SLA

2006-09-15 Thread David Bustos
Quoth Darren J Moffat on Fri, Sep 08, 2006 at 01:59:16PM +0100:
 Nicolas Dorfsman wrote:
  Regarding system partitions (/var, /opt, all mirrored + alternate 
  disk), what would be YOUR recommendations ?  ZFS or not ?
 
 /var for now must be UFS since Solaris 10 doesn't have ZFS root 
 support, and that means /, /etc, /var, and /usr.

Once 6354489 was fixed, I believe Stephen Hahn got zfs-on-/usr working.
That might be painful to upgrade, though.

 I've run systems with 
 /opt as a ZFS filesystem and it works just fine.  However note that the 
 Solaris installer puts stuff in /opt (for backwards compat reasons, 
 ideally it wouldn't) and that may cause issues with live upgrade or 
 require you to move that stuff onto your ZFS /opt datasets.

I also use zfs for /opt.  I have to unmount it before using Live
Upgrade, though, because it refuses to leave /opt on a separate
filesystem.  I suppose it's right, since the package database may refer
to files in /opt, but I haven't had any problems.
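
A minimal sketch of the /opt-on-ZFS setup being discussed; the pool name is made up, and the two-step create avoids relying on "zfs create -o", which may not be available on older bits:

    zfs create tank/opt
    zfs set mountpoint=/opt tank/opt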


David
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


RE: [zfs-discuss] Re: reslivering, how long will it take?

2006-09-15 Thread Tim Cook
Yes sir:

[EMAIL PROTECTED]:/
# zpool status -v fserv
  pool: fserv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 5.90% done, 27h13m to go
config:

NAMESTATE READ WRITE CKSUM
fserv   DEGRADED 0 0 0
  raidz1DEGRADED 0 0 0
replacing   DEGRADED 0 0 0
  c5d0s0/o  UNAVAIL  0 0 0  cannot open
  c5d0  ONLINE   0 0 0
c3d0ONLINE   0 0 0
c3d1ONLINE   0 0 0
c4d0ONLINE   0 0 0

errors: No known data errors


-Original Message-
From: Bill Moore [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 15, 2006 4:45 PM
To: Tim Cook
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Re: reslivering, how long will it take?

On Fri, Sep 15, 2006 at 01:26:21PM -0700, Tim Cook wrote:
 says it's online now so I can only assume it's working.  Doesn't seem
 to be reading from any of the other disks in the array though.  Can it
 resilver without traffic to any other disks?  /noob

Can you send the output of zpool status -v pool?


--Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Proposal: multiple copies of user

2006-09-15 Thread can you guess?
(I looked at my email before checking here, so I'll just cut-and-paste the 
email response in here rather than send it.  By the way, is there a way to view 
just the responses that have accumulated in this forum since I last visited - 
or just those I've never looked at before?)

Bill Moore wrote:
 On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
 Implementing it at the directory and file levels would be even more
 flexible:  redundancy strategy would no longer be tightly tied to path
 location, but directories and files could themselves still inherit
 defaults from the filesystem and pool when appropriate (but could be
 individually handled when desirable).
 
 The problem boils down to not having a way to express your intent that
 works over NFS (where you're basically limited by POSIX) that you can
 use from any platform (esp. ones where ZFS isn't installed).  If you
 have some ideas, this is something we'd love to hear about.

Well, one idea is that it seems downright silly to gate ZFS facilities 
on the basis of two-decade-old network file access technology:  sure, 
it's important to be able to *access* ZFS files using NFS, but does 
anyone really care if NFS can't express the full range of ZFS features - 
at least to the degree that they think such features should be 
suppressed as a result (rather than made available to local users plus any 
remote users employing a possibly future mechanism that *can* support them)?

That being said, you could always adopt the ReiserFS approach of 
allowing access to file/directory metadata via extended path 
specifications in environments like NFS where richer forms of 
interaction aren't available:  yes, it may feel a bit kludgey, but it gets the 
job done.
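
A purely hypothetical sketch of what such a path-based interface might look like; no such interface exists in ZFS or over NFS today, and the path suffix is invented solely to illustrate the ReiserFS-style "metadata via the path" idea mentioned above:

    # hypothetical syntax only -- not an existing ZFS or NFS feature
    cat /tank/docs/report.odt/...copies        # read this file's redundancy level
    echo 3 > /tank/docs/report.odt/...copies   # ask for three copies of this file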

And, of course, even if you did nothing to help NFS its users would 
still benefit from inheriting whatever arbitrarily fine-grained 
redundancy levels had been established via more comprehensive means: 
they just wouldn't be able to tweak redundancy levels themselves (any 
more, or any less, than they can do so today).

 
 I've never understood why redundancy was a pool characteristic in ZFS
 - and the addition of 'ditto blocks' and now this new proposal (both
 of which introduce completely new forms of redundancy to compensate
 for the fact that pool-level redundancy doesn't satisfy some needs)
 just makes me more skeptical about it.
 
 We have thought long and hard about this problem and even know how to
 implement it (the name we've been using is Metaslab Grids, which isn't
 terribly descriptive, or as Matt put it a bag o' disks).

Yes, 'a bag o' disks' - used intelligently at a higher level - is pretty much 
what I had in mind.

  There are
 two main problems with it, though.  One is failures.  The problem is
 that you want the set of disks implementing redundancy (mirror, RAID-Z,
 etc.) to be spread across fault domains (controller, cable, fans, power
 supplies, geographic sites) as much as possible.  There is no generic
 mechanism to obtain this information and act upon it.  We could ask the
 administrator to supply it somehow, but such a description takes effort,
 is not easy, and prone to error.  That's why we have the model right now
 where the administrator specifies how they want the disks spread out
 across fault groups (vdevs).

Without having looked at the code I may be missing something here. 
Even with your current implementation, if there's indeed no automated 
way to obtain such information the administrator has to exercise manual 
control over disk groupings if they're going to attain higher 
availability by avoiding other single points of failure, rather than just guarding 
against unrecoverable data loss from disk failure.  Once that 
information has been made available to the system, letting it make use 
of it at a higher level rather than just aggregating entire physical 
disks should not entail additional administrator effort.

I admit that I haven't considered the problem in great detail, since my 
bias is toward solutions that employ redundant arrays of inexpensive 
nodes to scale up rather than a small number of very large nodes (in part 
because a single large node itself can often be a single point of 
failure even if many of its subsystems carefully avoid being so in the 
manner that you suggest).  Each such small node has a relatively low 
disk count and little or no internal redundancy, and thus comprises its 
own little fault-containment environment, avoiding most such issues; as 
a plus, such node sizes mesh well with the bandwidth available from very 
inexpensive Gigabit Ethernet interconnects and switches (even when 
streaming data sequentially, such as video on demand) and allow 
fine-grained incremental system scaling (by the time faster 
interconnects become inexpensive, disk bandwidth should have increased 
enough that such a balance will still be fairly good).

Still, if you can group whole disks intelligently in a large system with 
respect to supplementing