Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Mika Borner
The vdev can handle dynamic LUN growth, but the underlying VTOC or EFI label
may need to be zero'd and reapplied if you set up the initial vdev on a slice.
If you introduced the entire disk to the pool you should be fine, but I
believe you'll still need to offline/online the pool.
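
A minimal sketch of that last step, assuming a hypothetical pool named tank
whose LUN has already been grown on the array side:

  zpool export tank   # release the devices so the label can be re-read
  zpool import tank   # on import, the vdev should pick up the larger LUN

(Pool and device names are placeholders; whether a relabel is still needed
depends on the driver and the array.)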

Fine, at least the vdev can handle this...

I asked about this feature in October and hoped that it would be
implemented when integrating ZFS into Sol10U2 ...

http://www.opensolaris.org/jive/thread.jspa?messageID=11646

Does anybody know when this feature is finally coming?
This would keep the number of LUNs on the host low, especially as
device names can be really ugly (long!).

//Mika

# mv Disclaimer.txt /dev/null



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Mika Borner
I'm a little confused by the first poster's message as well, but you
lose some benefits of ZFS if you don't create your pools with either
RAID1 or RAIDZ, such as data corruption detection.  The array isn't
going to detect that because all it knows about are blocks. 

That's the dilemma, the array provides nice features like RAID1 and
RAID5, but those are of no real use when using ZFS. 

The advantages of using ZFS on such an array are, for example, the sometimes
huge write cache, the use of consolidated storage, and, in SAN configurations,
cloning and sharing storage between hosts.

The price, of course, is additional administrative overhead (lots of
microcode updates, more components that can fail in between, etc.).

Also, in bigger companies there is usually a team of storage specialists
who mostly do not know about the applications running on top of the storage,
or do not care... (as in: here you have your bunch of gigabytes...)

//Mika

# mv Disclaimer.txt /dev/null



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS + NFS performance ?

2006-06-27 Thread Patrick

Hi,

I've just started using ZFS + NFS, and I was wondering if there is
anything I can do to optimise it for use as a mailstore? (small files,
lots of them, with lots of directories and high concurrent access)

So any ideas guys?

P
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Mika Borner
but there may not be filesystem space for double the data.
Sounds like there is a need for a zfs-defragment-file utility
perhaps?
Or if you want to be politically cagey about naming choice, perhaps,
zfs-seq-read-optimize-file ?  :-)

For data warehouse and streaming applications, a seq-read optimization
could bring additional performance. For normal databases this should be
benchmarked...

This brings me back to another question. We have a production database
that is cloned at every end of month for end-of-month processing
(currently with a feature of our storage array).

I'm thinking about a ZFS version of this task. Requirements: the
production database should not suffer from performance degradation
while running the clone in parallel. As ZFS does not clone all the
blocks, I wonder how much the production database will suffer from
sharing most of the data with the clone (concurrent access vs. caching).

Maybe we need a feature in ZFS to do a full clone (that is: copy all
blocks) inside the pool if performance is an issue, just like the
Quick Copy vs. Shadow Image features on HDS arrays...
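
For the space-efficient case, ZFS already covers this with snapshots and
clones, and a full physical copy can be approximated with send/receive.
A hedged sketch, with hypothetical pool and dataset names:

  zfs snapshot tank/proddb@eom                   # point-in-time snapshot
  zfs clone tank/proddb@eom tank/eomdb           # space-efficient clone (shares blocks)
  zfs send tank/proddb@eom | zfs receive tank/eomdb-full   # full copy of all blocks

The clone shares unmodified blocks with the production dataset, which is
exactly the concurrent-access concern above; the send/receive variant trades
disk space for isolation.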







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Casper . Dik


That's the dilemma, the array provides nice features like RAID1 and
RAID5, but those are of no real use when using ZFS. 


RAID5 is not a nice feature when it breaks.

A RAID controller cannot guarantee that all bits of a RAID5 stripe
are written when power fails; then you have data corruption and no
one can tell you what data was corrupted.  ZFS RAIDZ can.

The advantages  to use ZFS on such array are e.g. the sometimes huge
write cache available, use of consolidated storage and in SAN
configurations, cloning and sharing storage between hosts.

Are huge write caches really an advantage?  Or are you talking about huge
write caches with non-volatile storage?

The price comes of course in additional administrative overhead (lots
of microcode updates, more components that can fail in between, etc).

Also, in bigger companies there usually is a team of storage
specialist, that mostly do not know about the applications running on
top of it, or do not care... (like: here you have your bunch of
gigabytes...)

True enough 

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + NFS performance ?

2006-06-27 Thread grant beattie
On Tue, Jun 27, 2006 at 10:14:06AM +0200, Patrick wrote:

 Hi,
 
 I've just started using ZFS + NFS, and I was wondering if there is
 anything I can do to optimise it for use as a mailstore? (small files,
 lots of them, with lots of directories and high concurrent access)
 
 So any ideas guys?

check out this thread, which may answer some of your questions:

http://www.opensolaris.org/jive/thread.jspa?messageID=40617

sounds like your workload is very similar to mine. is all public
access via NFS?

also, check out this blog entry from Roch:

http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs

for small file workloads, setting recordsize to a value lower than the
default (128k) may prove useful.
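
A hedged example, assuming a hypothetical filesystem tank/mail holding the
maildirs:

  zfs set recordsize=8k tank/mail    # match the small-file write pattern
  zfs get recordsize tank/mail       # verify the setting

Only newly written files pick up the smaller recordsize; existing files keep
the block size they were written with.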

grant.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-27 Thread Roch

Chris Csanady writes:
  On 6/26/06, Neil Perrin [EMAIL PROTECTED] wrote:
  
  
   Robert Milkowski wrote On 06/25/06 04:12,:
Hello Neil,
   
Saturday, June 24, 2006, 3:46:34 PM, you wrote:
   
NP Chris,
   
NP The data will be written twice on ZFS using NFS. This is because NFS
NP on closing the file internally uses fsync to cause the writes to be
NP committed. This causes the ZIL to immediately write the data to the
NP intent log. Later the data is also committed as part of the pool's
NP transaction group commit, at which point the intent log blocks are freed.
   
NP It does seem inefficient to doubly write the data. In fact for blocks
NP larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 was
NP fixed) we write the data block and also an intent log record with the block
NP pointer. During txg commit we link this block into the pool tree. By
NP experimentation we found 32K to be the (current) cutoff point. As the nfsds
NP write at most 32K, they do not benefit from this.
   
Is 32KB easily tuned (mdb?)?
  
   I'm not sure. NFS folk?
  
  I think he is referring to the zfs_immediate_write_sz variable, but NFS
  will support larger block sizes as well.  Unfortunately, since the maximum
  IP datagram size is 64k, after headers are taken into account, the largest
  useful value is 60k.  If this is to be laid out as an indirect write, will
  it be written as 32k+16k+8k+4k blocks?  If so, this seems like it would be
  quite inefficient for RAID-Z, and writes would best be left at 32k.
  
  Chris


I think the 64K issue refers to UDP. That limits the max
block size that NFS may use. But with TCP mounts, NFS is not
bounded by this. It should be possible to adjust the NFS
blocksize up.

For this I think you need to adjust nfs4_bsize on the client:

echo nfs4_bsize/W131072 | mdb -kw

And it could also help to tune up the transfer size

echo nfs4_max_transfer_size/W131072 | mdb -kw
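
To check the current values before (and after) poking them, something along
these lines should work:

  echo nfs4_bsize/D | mdb -k                # print the current value in decimal
  echo nfs4_max_transfer_size/D | mdb -k

These are untuned kernel variables, so treat the whole exercise as
experimental and try it on a non-production client first.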

I also wonder if general-purpose NFS exports should not have their
recordsize set to 32K in order to match the default NFS bsize. But I
have not really looked at this performance aspect yet.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + NFS performance ?

2006-06-27 Thread Darren Reed

grant beattie wrote:


On Tue, Jun 27, 2006 at 10:14:06AM +0200, Patrick wrote:

 


Hi,

I've just started using ZFS + NFS, and I was wondering if there is
anything I can do to optimise it for use as a mailstore? (small files,
lots of them, with lots of directories and high concurrent access)

So any ideas guys?
   



check out this thread, which may answer some of your questions:

http://www.opensolaris.org/jive/thread.jspa?messageID=40617

sounds like your workload is very similar to mine. is all public
access via NFS?

also, check out this blog entry from Roch:

http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs

for small file workloads, setting recordsize to a value lower than the
default (128k) may prove useful.
 



So what about software development (like Solaris :-), where we've got
lots of small files that we might be editing (the biggest might be 128k),
but when it comes time to compile, we can be writing out megabytes of data?

Has anyone done a build of OpenSolaris over NFS served by
ZFS and compared it with a local ZFS build?
How do both of those compare with UFS?

Darren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + NFS performance ?

2006-06-27 Thread Patrick

Hi,


sounds like your workload is very similar to mine. is all public
access via NFS?


Well it's not 'public directly', courier-imap/pop3/postfix/etc... but
the maildirs are accessed directly by some programs for certain
things.


for small file workloads, setting recordsize to a value lower than the
default (128k) may prove useful.


When changing things like recordsize, can I do it on the fly on a
volume? (And if I can, what happens to the data already on the
volume?)

Also, another question: when turning compression on, does the data
already on the volume become compressed in the background, or is it
only new writes from then on?

P
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + NFS performance ?

2006-06-27 Thread grant beattie
On Tue, Jun 27, 2006 at 11:16:40AM +0200, Patrick wrote:

 sounds like your workload is very similar to mine. is all public
 access via NFS?
 
 Well it's not 'public directly', courier-imap/pop3/postfix/etc... but
 the maildirs are accessed directly by some programs for certain
 things.

yes, that's what I meant. a notable characteristic of most MTAs is that
they are fsync() intensive, which can have an impact on ZFS
performance. you will probably want to benchmark your IO pattern with
various different configurations.

 for small file workloads, setting recordsize to a value lower than the
 default (128k) may prove useful.
 
 When changing things like recordsize, can I do it on the fly on a
 volume? (And if I can, what happens to the data already on the
 volume?)

yes, as with most (all?) ZFS properties, the recordsize can be changed
on the fly. existing data is unchanged - the modified settings only
affect new writes.

 Also, another question: when turning compression on, does the data
 already on the volume become compressed in the background, or is it
 only new writes from then on?

as above, existing data remains unchanged.

it may be desirable to do such things in the background, because it
might be impractical or impossible to do it using regular filesystem
access without interrupting applications. the same applies to adding
new disks to an existing pool. I think there's an RFE for these sort
of operations.
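
A hedged sketch for the compression case, with a hypothetical filesystem
tank/mail:

  zfs set compression=on tank/mail              # new writes are compressed from now on
  zfs get compression,compressratio tank/mail   # check the property and the current ratio

Existing files stay as they are until they are rewritten through the
filesystem.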

grant.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Roch

Philip Brown writes:
  Roch wrote:
   And, if the load can accommodate a reorder, to get top per-spindle
   read-streaming performance, a cp(1) of the file should do wonders on
   the layout.
   
  
  but there may not be filesystem space for double the data.
  Sounds like there is a need for a zfs-defragment-file utility perhaps?
  
  Or if you want to be politically cagey about naming choice, perhaps,
  
  zfs-seq-read-optimize-file ?  :-)
  

Possibly, or maybe using fcntl()?

Now the goal is to take a file with scattered blocks and order them into
contiguous chunks. So this is contingent on the existence of regions of
free contiguous disk space, which will get more difficult as the storage
gets close to full.

-r





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Roch


Mika Borner writes:
  RAID5 is not a nice feature when it breaks.
  
  Let me correct myself...  RAID5 is a nice feature for systems without
  ZFS...
  
  Are huge write caches really a advantage?  Or are you taking about
  huge
  write caches with non-volatile storage?
  
  Yes, you are right. The huge cache is needed mostly because of the poor
  write performance of RAID5 (battery backed, of course)...
  
  
  // Mika
  

Having a certain amount of non-volatile cache is great for reducing the
latency of ZIL operations, which directly impacts some application
performance.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + NFS performance ?

2006-06-27 Thread grant beattie
On Tue, Jun 27, 2006 at 12:07:47PM +0200, Roch wrote:

for small file workloads, setting recordsize to a value lower than the
default (128k) may prove useful.
   
   When changing things like recordsize, can i do it on the fly on a
   volume ? ( and then if i can what happens to the data already on the
   volume ? )
 
 You do it on the fly for a given FS (recordsize is not a property of a
 ZVOL). Files that were larger than the previous recordsize will not change.
 Files that were smaller, and thus were stored as a single record, will
 continue to be stored as a single record until a write makes the file
 bigger than the current value of recordsize, at which point they are
 stored as multiple records of the new recordsize. Performance-wise I
 don't worry too much about these things.

ah, yes. the key here is "until a write makes the file bigger",
which would ~never happen given Maildir-format mail, as the files
are not modified after they are written.

they may be unlinked, renamed, or rewritten with a new name - but not
modified.

grant.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Robert Milkowski
Hello Nathanael,


NB I'm a little confused by the first poster's message as well, but
NB you lose some benefits of ZFS if you don't create your pools with
NB either RAID1 or RAIDZ, such as data corruption detection.  The
NB array isn't going to detect that because all it knows about are blocks.

Actually, ZFS will detect data corruption even if the pool is not redundant,
but it won't repair the data (metadata is protected with 2 and/or 3 copies
anyway).
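
A hedged illustration: even on a single-LUN pool, checksum failures show up
as CKSUM error counters in the pool status, e.g.

  zpool status -v tank   # per-vdev READ/WRITE/CKSUM counters; -v lists affected files

Without redundancy ZFS can only report the damage; with a mirror or RAID-Z
vdev it repairs it as well.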




-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] assertion failure when destroy zpool on tmpfs

2006-06-27 Thread Enda o'Connor - Sun Microsystems Ireland - Software Engineer




Hi
Looks like the same stack as 6413847, although that bug points more towards
hardware failure.

The stack below is from 5.11 snv_38, but this also seems to affect Update 2,
as per the above bug.

Enda

Thomas Maier-Komor wrote:

  Hi,

my colleague is just testing ZFS and created a zpool which had a backing store file on a TMPFS filesystem. After deleting the file, everything still worked normally. But destroying the pool caused an assertion failure and a panic. I know this is neither a real-life scenario nor a good idea. The assertion failure occurred on Solaris 10 Update 2.

Below is some mdb output, in case someone is interested in this.

BTW: great to have Solaris 10 update 2 with ZFS. I can't wait to deploy it.

Cheers,
Tom

  
  
::panicinfo

  
   cpu1
  thread  2a100ea7cc0
 message 
assertion failed: vdev_config_sync(rvd, txg) == 0, file: ../../common/fs/zfs/spa
.c, line: 2149
  tstate   4480001601
  g1  30037505c40
  g2   10
  g32
  g42
  g53
  g6   16
  g7  2a100ea7cc0
  o0  11eb1e8
  o1  2a100ea7928
  o2  306f5b0
  o3  30037505c50
  o4  3c7a000
  o5   15
  o6  2a100ea6ff1
  o7  105e560
  pc  104220c
 npc  1042210
   y   10 
  
  
::stack

  
  vpanic(11eb1e8, 13f01d8, 13f01f8, 865, 600026d4ef0, 60002793ac0)
assfail+0x7c(13f01d8, 13f01f8, 865, 183e000, 11eb000, 0)
spa_sync+0x190(60001f244c0, 3dd9, 600047f3500, 0, 2a100ea7cc4, 2a100ea7cbc)
txg_sync_thread+0x130(60001f9c580, 3dd9, 2a100ea7ab0, 60001f9c6a0, 60001f9c692, 
60001f9c690)
thread_start+4(60001f9c580, 0, 0, 0, 0, 0)
  
  
::status

  
  debugging crash dump vmcore.0 (64-bit) from ai
operating system: 5.11 snv_38 (sun4u)
panic message: 
assertion failed: vdev_config_sync(rvd, txg) == 0, file: ../../common/fs/zfs/spa
.c, line: 2149
dump content: kernel pages only
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Jeff Victor

Does it make sense to solve these problems piecemeal:

* Performance: ZFS algorithms and NVRAM
* Error detection: ZFS checksums
* Error correction: ZFS RAID1 or RAIDZ

Nathanael Burton wrote:

If you've got hardware raid-5, why not just run regular (non-raid) pools on
top of the raid-5?

I wouldn't go back to JBOD.  Hardware arrays offer a number of advantages to
JBOD: disk microcode management, optimized access to storage, large write
caches, RAID computation done in specialized hardware, and SAN-based
hardware products that allow sharing of storage among multiple hosts.  This
allows storage to be utilized more effectively.




I'm a little confused by the first poster's message as well, but you lose some
benefits of ZFS if you don't create your pools with either RAID1 or RAIDZ, such
as data corruption detection.  The array isn't going to detect that because all
it knows about are blocks.

--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Gregory Shaw
Yes, but the idea of using software RAID on a large server doesn't
make sense in modern systems.  If you've got a large database server
that runs a large Oracle instance, using CPU cycles for RAID is
counterproductive.  Add to that the need to manage the hardware
directly (drive microcode, drive brownouts/restarts, etc.) and the
idea of using JBOD in modern systems starts to lose value in a big way.

You will detect any corruption when doing a scrub.  It's not end-to-end,
but it's no worse than today with VxVM.


On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote:


If you've got hardware raid-5, why not just run regular (non-raid)
pools on top of the raid-5?

I wouldn't go back to JBOD.  Hardware arrays offer a number of
advantages to JBOD:
- disk microcode management
- optimized access to storage
- large write caches
- RAID computation can be done in specialized hardware
- SAN-based hardware products allow sharing of storage among
multiple hosts.  This allows storage to be utilized more effectively.



I'm a little confused by the first poster's message as well, but  
you lose some benefits of ZFS if you don't create your pools with  
either RAID1 or RAIDZ, such as data corruption detection.  The  
array isn't going to detect that because all it knows about are  
blocks.


-Nate


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382 [EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus  
Torvalds



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Gregory Shaw
Most controllers support a background-scrub that will read a volume  
and repair any bad stripes.  This addresses the bad block issue in  
most cases.


It still doesn't help when a double-failure occurs.   Luckily, that's  
very rare.  Usually, in that case, you need to evacuate the volume  
and try to restore what was damaged.


On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote:


On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:


You're using hardware raid.  The hardware raid controller will rebuild
the volume in the event of a single drive failure.  You'd need to keep
on top of it, but that's a given in the case of either hardware or
software raid.


True for total drive failure, but note there are more failure modes
than that.  With hardware RAID, there is no way for the RAID controller
to know which block was bad, and it therefore cannot repair the block.
With RAID-Z, we have the integrated checksum and can do combinatorial
analysis to know not only which drive was bad, but what the data
_should_ be, and can repair it to prevent more corruption in the  
future.


- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock


-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382 [EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus  
Torvalds



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Torrey McMahon

Bart Smaalders wrote:

Gregory Shaw wrote:

On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:

How would ZFS self heal in this case?




You're using hardware raid.  The hardware raid controller will rebuild
the volume in the event of a single drive failure.  You'd need to keep
on top of it, but that's a given in the case of either hardware or
software raid.

If you've got requirements for surviving an array failure, the
recommended solution in that case is to mirror between volumes on
multiple arrays.   I've always liked software raid (mirroring) in that
case, as no manual intervention is needed in the event of an array
failure.  Mirroring between discrete arrays is usually reserved for
mission-critical applications that cost thousands of dollars per hour in
downtime.



In other words, it won't.  You've spent the disk space, but
because you're mirroring in the wrong place (the raid array)
all ZFS can do is tell you that your data is gone.  With luck,
subsequent reads _might_ get the right data, but maybe not.


Careful here when you say "wrong place". There are many scenarios where 
mirroring in the hardware is the correct way to go, even when running ZFS 
on top of it.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Darren J Moffat

Peter Rival wrote:
storage arrays with the same arguments over and over without providing 
an answer to the customer problem doesn't do anyone any good.  So.  I'll 
restate the question.  I have a 10TB database that's spread over 20 
storage arrays that I'd like to migrate to ZFS.  How should I configure 
the storage array?  Let's at least get that conversation moving...


I'll answer your question with more questions:

What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ?

What of that doesn't work for you ?

What functionality of ZFS is it that you want to leverage ?

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Jeff Victor
Unfortunately, a storage-based RAID controller cannot detect errors which occurred 
between the filesystem layer and the RAID controller, in either direction - in or 
out.  ZFS will detect them through its use of checksums.


But ZFS can only fix them if it can access redundant bits.  It can't tell a 
storage device to provide the redundant bits, so it must use its own data 
protection system (RAIDZ or RAID1) in order to correct errors it detects.



Gregory Shaw wrote:
Most controllers support a background-scrub that will read a volume  and 
repair any bad stripes.  This addresses the bad block issue in  most cases.


It still doesn't help when a double-failure occurs.   Luckily, that's  
very rare.  Usually, in that case, you need to evacuate the volume  and 
try to restore what was damaged.


On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote:


On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:



You're using hardware raid.  The hardware raid controller will  rebuild
the volume in the event of a single drive failure.  You'd need to  keep
on top of it, but that's a given in the case of either hardware or
software raid.



True for total drive failure, but note there are more failure modes
than that.  With hardware RAID, there is no way for the RAID controller
to know which block was bad, and it therefore cannot repair the block.
With RAID-Z, we have the integrated checksum and can do combinatorial
analysis to know not only which drive was bad, but what the data
_should_ be, and can repair it to prevent more corruption in the  future.

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock



-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382 [EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus  
Torvalds



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Jeff Victor

Peter Rival wrote:


See, telling folks you should just use JBOD when they don't have JBOD 
and have invested millions to get to state they're in where they're 
efficiently utilizing their storage via a SAN infrastructure is just 
plain one big waste of everyone's time.  Shouting down the advantages of 
storage arrays with the same arguments over and over without providing 
an answer to the customer problem doesn't do anyone any good.  So.  I'll 
restate the question.  I have a 10TB database that's spread over 20 
storage arrays that I'd like to migrate to ZFS.  How should I configure 
the storage array?  Let's at least get that conversation moving...


In general, I'd say that if the storage has battery-backed cache, use RAID5 on the 
storage device - limit the amount of redundant data, but improve performance and 
achieve some data protection in fast special-purpose hardware.
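
A hedged sketch of what that looks like from the host, assuming two
hypothetical RAID5 LUNs presented by the array as c4t0d0 and c5t0d0:

  zpool create tank c4t0d0 c5t0d0           # dynamic stripe; the array handles redundancy
  zpool create tank mirror c4t0d0 c5t0d0    # or give up half the space so ZFS can self-heal

The first form detects corruption but cannot repair it; the second keeps
ZFS's self-healing at the cost of the extra copy.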



Just my $.02.

--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Gregory Shaw
Not at all.  ZFS is a quantum leap in Solaris filesystem/VM  
functionality.


However, I don't see a lot of use for RAID-Z (or Z2) in large
enterprise customer situations.  For instance, does ZFS enable Sun
to walk into an account and say "You can now replace all of your
high-end (EMC) disk with JBOD"?  I don't think many customers would
bite on that.


RAID-Z is an excellent feature; however, it doesn't address many of
the reasons for using high-end arrays:


- Exporting snapshots to alternate systems (for live database or  
backup purposes)

- Remote replication
- Sharing of storage among multiple systems (LUN masking and equivalent)
- Storage management (migration between tiers of storage)
- No-downtime failure replacement (the system doesn't even know)
- Clustering

I know that ZFS is still a work in progress, so some of the above may  
arrive in future versions of the product.


I see the RAID-Z[2] value in small-to-mid size systems where the  
storage is relatively small and you don't have high availability  
requirements.


On Jun 27, 2006, at 8:48 AM, Darren J Moffat wrote:

So everything you are saying seems to suggest you think ZFS was a  
waste of engineering time since hardware raid solves all the  
problems ?


I don't believe it does but I'm no storage expert and maybe I've  
drank too much cool aid.  I'm software person and for me ZFS is  
brilliant it is so much easier than managing any of the hardware  
raid systems I've dealt with.


--
Darren J Moffat


-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382 [EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus  
Torvalds



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Gregory Shaw
This is getting pretty picky.  You're saying that ZFS will detect any  
errors introduced after ZFS has gotten the data.  However, as stated  
in a previous post, that doesn't guarantee that the data given to ZFS  
wasn't already corrupted.


If you don't trust your storage subsystem, you're going to encounter
issues regardless of the software used to store data.  We'll have to
see if ZFS can 'save' customers in this situation.  I've found that
regardless of the storage solution in question you can't anticipate
all issues, and when a brownout or other ugly loss-of-service occurs,
you may or may not be intact, ZFS or no.


I've never seen a product that can deal with all possible situations.

On Jun 27, 2006, at 9:01 AM, Jeff Victor wrote:

Unfortunately, a storage-based RAID controller cannot detect errors  
which occurred between the filesystem layer and the RAID  
controller, in either direction - in or out.  ZFS will detect them  
through its use of checksums.


But ZFS can only fix them if it can access redundant bits.  It  
can't tell a storage device to provide the redundant bits, so it  
must use its own data protection system (RAIDZ or RAID1) in order  
to correct errors it detects.



Gregory Shaw wrote:
Most controllers support a background-scrub that will read a  
volume  and repair any bad stripes.  This addresses the bad block  
issue in  most cases.
It still doesn't help when a double-failure occurs.   Luckily,  
that's  very rare.  Usually, in that case, you need to evacuate  
the volume  and try to restore what was damaged.

On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote:

On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:



You're using hardware raid.  The hardware raid controller will rebuild
the volume in the event of a single drive failure.  You'd need to keep
on top of it, but that's a given in the case of either hardware or
software raid.



True for total drive failure, but note there are more failure modes
than that.  With hardware RAID, there is no way for the RAID controller
to know which block was bad, and it therefore cannot repair the block.
With RAID-Z, we have the integrated checksum and can do  
combinatorial

analysis to know not only which drive was bad, but what the data
_should_ be, and can repair it to prevent more corruption in the   
future.


- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock

-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382 [EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. -  
Linus  Torvalds

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
-- 

Jeff VICTOR  Sun Microsystemsjeff.victor @  
sun.com

OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/ 
zones/faq
-- 



-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382 [EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus  
Torvalds



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Casper . Dik

This is getting pretty picky.  You're saying that ZFS will detect any  
errors introduced after ZFS has gotten the data.  However, as stated  
in a previous post, that doesn't guarantee that the data given to ZFS  
wasn't already corrupted.

But there's a big difference between the time ZFS gets the data
and the time your typical storage system gets it.

And your typical storage system does not store any information which
allows it to detect all but the most simple errors.

Storage systems are complicated and have many failure modes at many
different levels.

- disks not writing data or writing data in incorrect location
- disks not reporting failures when they occur
- bit errors in disk write buffers causing data corruption
- storage array software with bugs
- storage array with undetected hardware errors
- data corruption in the path (such as switches which mangle
  packets but keep the TCP checksum working)


If you don't trust your storage subsystem, you're going to encounter  
issues regardless of the software use to store data.  We'll have to  
see if ZFS can 'save' customers in this situation.  I've found that  
regardless of the storage solution in question you can't anticipate  
all issues and when a brownout or other ugly loss-of-service occurs,  
you may or may not be intact, ZFS or no.

I've never seen a product that can deal with all possible situations.

ZFS attempts to deal with more problems than any of the current
existing solutions by giving end-to-end verification of the data.

One of the reasons why ZFS was created was a particular large customer
who had data corruption which occurred two years (!) before it was
detected.  The bad data had migrated and corrupted; the good data
was no longer available on backups (which weren't very relevant
anyway after such a long time).

ZFS tries to give one important guarantee: if the data is bad, we will
not return it.

One case in point is the person in MPK with a SATA controller which
corrupts memory; he didn't discover this using UFS (except for perhaps
a few strange events he noticed).  After switching to ZFS he started to
find corruption, so now he uses a self-healing ZFS mirror (or RAIDZ).

ZFS helps at the low end as much as it does at the high end.

I'll bet that ZFS will generate more calls about broken hardware
and fingers will be pointed at ZFS at first because it's the new
kid; it will be some time before people realize that the data was
rotting all along.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Nicolas Williams
On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote:
 This is getting pretty picky.  You're saying that ZFS will detect any  
 errors introduced after ZFS has gotten the data.  However, as stated  
 in a previous post, that doesn't guarantee that the data given to ZFS  
 wasn't already corrupted.

There will always be some place where errors can be introduced and go on
undetected.  But some parts of the system are more error prone than
others, and ZFS targets the most error prone of them: rotating rust.

For the rest, make sure you have ECC memory, that you're using secure
NFS (with krb5i or krb5p), and the probability of undetectable data
corruption errors should be much closer to zero than what you'd get with
other systems.

That said, there's a proposal to add end-to-end data checksumming over
NFSv4 (see the IETF NFSv4 WG list archives).  That proposal can't
protect meta-data, and it doesn't remove any one type of data corruption
error on the client side, but it does on the server side.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Joe Little

On 6/27/06, Erik Trimble [EMAIL PROTECTED] wrote:

Darren J Moffat wrote:

 Peter Rival wrote:

 storage arrays with the same arguments over and over without
 providing an answer to the customer problem doesn't do anyone any
 good.  So.  I'll restate the question.  I have a 10TB database that's
 spread over 20 storage arrays that I'd like to migrate to ZFS.  How
 should I configure the storage array?  Let's at least get that
 conversation moving...


 I'll answer your question with more questions:

 What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ?

 What of that doesn't work for you ?

 What functionality of ZFS is it that you want to leverage ?

It seems that the big thing we all want (relative to the discussion of
moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to
reliably detect that a given block is bad, and have ZFS compensate).
Now, how do we get this when using HW arrays, and not just treat them
like JBODs (which is impractical for large SAN and similar arrays that
are already configured)?

Since the best way to get this is to use a Mirror or RAIDZ vdev, I'm
assuming that the proper way to get benefits from both ZFS and HW RAID
is the following:

(1)  ZFS mirror of  HW stripes, i.e.  zpool create tank mirror
hwStripe1 hwStripe2
(2)  ZFS RAIDZ of HW mirrors, i.e. zpool create tank raidz hwMirror1,
hwMirror2
(3)  ZFS RAIDZ of  HW stripes, i.e. zpool create tank raidz hwStripe1,
hwStripe2

Mirrors of mirrors and RAID-Z of RAID5 are also possible, but I'm pretty
sure they're considerably less useful than the 3 above.

Personally, I can't think of a good reason to use ZFS with HW RAID5;
case (3) above seems to me to provide better performance with roughly
the same amount of redundancy (not quite true, but close).

I'd vote for (1) if you need high performance, at the cost of disk
space, (2) for maximum redundancy, and (3) as maximum space with
reasonable performance.
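
As a concrete (hypothetical) rendering of option (1), with two HW stripes
exported by the array as c4t0d0 and c5t0d0:

  zpool create tank mirror c4t0d0 c5t0d0   # ZFS mirror across two hardware stripes
  zpool status tank                        # the two LUNs appear as one mirrored vdev

The device names are placeholders for whatever the HBA/MPxIO stack presents.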


I'm making a couple of assumptions here:

(a)  you have the spare cycles on your hosts to allow for using ZFS
RAIDZ, which is a non-trivial cost (though not that big, folks).
(b)  your HW RAID controller uses NVRAM (or battery-backed cache), which
you'd like to be able to use to speed up writes
(c)  your HW RAID's NVRAM speeds up ALL writes, regardless of the
configuration of arrays in the HW
(d)  having your HW controller present individual disks to the machines
is a royal pain (way too many, the HW does other nice things with
arrays, etc)




The case for HW RAID5 with ZFS is easy: when you use iSCSI. You get
major performance degradation over iSCSI when trying to coordinate
writes and reads serially over iSCSI using RAID-Z. The sweet spot in
the iSCSI world is to let your targets do RAID5 or whatnot (RAID10,
RAID50, RAID6), and combine those into ZFS pools, mirrored or not.
There are other benefits to ZFS, including snapshots, easily managed
storage pools, and, with iSCSI, ease of switching head nodes with a
simple export/import.
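
A hedged sketch of that sweet spot, assuming two hypothetical iSCSI LUNs
(each a RAID5 volume on its target) visible to the host as c2t1d0 and c3t1d0:

  zpool create tank mirror c2t1d0 c3t1d0   # ZFS mirror across two RAID5 iSCSI targets

  # later, to move the pool to another head node:
  zpool export tank                        # on the old head
  zpool import tank                        # on the new head, once it sees the same LUNs

The export/import pair is the "ease of switching head nodes" mentioned above.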




Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Torrey McMahon

[EMAIL PROTECTED] wrote:
  

That's the dilemma, the array provides nice features like RAID1 and
RAID5, but those are of no real use when using ZFS. 




RAID5 is not a nice feature when it breaks.

A RAID controller cannot guarantee that all bits of a RAID5 stripe
are written when power fails; then you have data corruption and no
one can tell you what data was corrupted.  ZFS RAIDZ can.



That depends on the RAID controller. Some implementations use a log
*and* a battery backup. In some cases the battery is an embedded UPS of
sorts to make sure the power stays up long enough to take the writes
from the host and get them to disk.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Torrey McMahon
Your example would prove more effective if you added, "I've got ten
databases. Five on AIX, five on Solaris 8."


Peter Rival wrote:
I don't like to top-post, but there's no better way right now.  This 
issue has recurred several times and there have been no answers to it 
that cover the bases.  The question is, say I as a customer have a 
database, let's say it's around 8 TB, all built on a series of high 
end storage arrays that _don't_ support the JBOD everyone seems to 
want - what is the preferred configuration for my storage arrays to 
present LUNs to the OS for ZFS to consume?


Let's say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 - 
that spans the breadth of about as good as it gets.  What should I as 
a customer do?  Should I create RAID0 sets and let ZFS self-heal via 
its own mirroring or RAIDZ when a disk blows in the set?  Should I use 
RAID1 and eat the disk space used?  RAID5 and be thankful I have a 
large write cache - and then which type of ZFS pool should I create 
over it?


See, telling folks you should just use JBOD when they don't have 
JBOD and have invested millions to get to state they're in where 
they're efficiently utilizing their storage via a SAN infrastructure 
is just plain one big waste of everyone's time.  Shouting down the 
advantages of storage arrays with the same arguments over and over 
without providing an answer to the customer problem doesn't do anyone 
any good.  So.  I'll restate the question.  I have a 10TB database 
that's spread over 20 storage arrays that I'd like to migrate to ZFS.  
How should I configure the storage array?  Let's at least get that 
conversation moving...


- Pete

Gregory Shaw wrote:
Yes, but the idea of using software raid on a large server doesn't 
make sense in modern systems.  If you've got a large database server 
that runs a large oracle instance, using CPU cycles for RAID is 
counter productive.  Add to that the need to manage the hardware 
directly (drive microcode, drive brownouts/restarts, etc.) and the 
idea of using JBOD in modern systems starts to lose value in a big way.


You will detect any corruption when doing a scrub.  It's not 
end-to-end, but it's no worse than today with VxVM.


On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote:


If you've got hardware raid-5, why not just run regular (non-raid)
pools on top of the raid-5?

I wouldn't go back to JBOD.  Hardware arrays offer a number of
advantages to JBOD:
- disk microcode management
- optimized access to storage
- large write caches
- RAID computation can be done in specialized hardware
- SAN-based hardware products allow sharing of storage among
multiple hosts.  This allows storage to be utilized more effectively.



I'm a little confused by the first poster's message as well, but you 
lose some benefits of ZFS if you don't create your pools with either 
RAID1 or RAIDZ, such as data corruption detection.  The array isn't 
going to detect that because all it knows about are blocks.


-Nate






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Dale Ghent

Torrey McMahon wrote:

ZFS is great for the systems that can run it. However, any enterprise
datacenter is going to be made up of many, many hosts running many, many
OSes. In that world you're going to consolidate on large arrays and use
the features of those arrays where they cover the most ground. For
example, if I've got 100 hosts all running different OSes and apps and I can
perform my data replication and redundancy algorithms, in most cases
RAID, in one spot, then it will be much more cost-efficient to do it there.


Exactly what I'm pondering.

In the near to mid term, Solaris with ZFS can be seen as sort of a 
storage virtualizer where it takes disks into ZFS pools and volumes and 
then presents them to other hosts and OSes via iSCSI, NFS, SMB and so 
on. At that point, those other OSes can enjoy the benefits of ZFS.


In the long term, it would be nice to see ZFS (or its concepts) 
integrated as the LUN provisioning and backing store mechanism on 
hardware RAID arrays themselves, supplanting the traditional RAID 
paradigms that have been in use for years.


/dale
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Root password fix on zfs root filesystem

2006-06-27 Thread Ron Halstead
Currently, when the root password is forgotten / munged, I boot from the cdrom 
into a shell, mount the root filesystem on /mnt and edit /mnt/etc/shadow, 
blowing away the root password.
 
What is going to happen when the root filesystem is ZFS? Hopefully the same 
mechanism will be available.
 
Ron Halstead
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Root password fix on zfs root filesystem

2006-06-27 Thread Lori Alt

Ron Halstead wrote:

Currently, when the root password is forgotten / munged, I boot from the cdrom 
into a shell, mount the root filesystem on /mnt and edit /mnt/etc/shadow, 
blowing away the root password.
 
What is going to happen when the root filesystem is ZFS? Hopefully the same mechanism will be available.
 


A similar mechanism will do what you want.  The only difference
is that while booted from the cdrom, you would have to use
the zpool import command to import the root pool.  Then you
can mount the root dataset and modify it as needed.
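
A hedged sketch of that procedure (the pool and dataset names are
hypothetical, since ZFS root layouts are still being worked out):

  zpool import -R /mnt rootpool      # import the root pool under an alternate root
  zfs mount rootpool/rootfs          # mount the root dataset if it is not mounted automatically
  vi /mnt/etc/shadow                 # clear the root password field
  zpool export rootpool              # clean up before rebooting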

Lori
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Torrey McMahon

Jason Schroeder wrote:

Torrey McMahon wrote:


[EMAIL PROTECTED] wrote:



I'll bet that ZFS will generate more calls about broken hardware
and fingers will be pointed at ZFS at first because it's the new
kid; it will be some time before people realize that the data was
rotting all along.




Ehhh... I don't think so. Most of our customers have HW arrays that
have been scrubbing data for years and years, as well as apps on
top that have been verifying the data (Oracle, for example). Not to
mention there will be a bit of time before people move over to ZFS in
the high end.




Ahh... but there is the rub.  Today - you/we don't *really* know, do 
we?  Maybe there's bad juju blocks, maybe not.  Running ZFS, whether 
in a redundant vdev or not, will certainly turn the big spotlight on 
and give us the data that checksums matched, or they didn't.  



A spotlight on what? How is that data going to get into ZFS? The more I
think about this, the more I realize it's going to do little for existing
data sets. You're going to have to migrate that data from filesystem X
into ZFS first. From that point on ZFS has no idea if the data was bad
to begin with. If you can do an in-place migration then you might be
able to weed out some bad physical blocks/drives over time, but I assert
that the current disk scrubbing methodologies catch most of those.


Yes, it's great for new data sets where you started with ZFS. Sorry if I 
sound like I'm raining on the parade here folks. That's not the case, 
really, and I'm all for the great new features and EAU ZFS gives where 
applicable.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 6/06 now available for download

2006-06-27 Thread Shannon Roddy


 Solaris 10u2 was released today.  You can now download it from here:

 http://www.sun.com/software/solaris/get.jsp
   

Does anyone know if ZFS is included in this release?  One of my local
Sun reps said it did not make it into the u2 release, though I have
heard for ages that 6/06 would include it.

Thanks!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 6/06 now available for download

2006-06-27 Thread Prabahar Jeyaram
Indeed. ZFS is included in Solaris 10 U2.

-- Prabahar.

Shannon Roddy wrote:

 Solaris 10u2 was released today.  You can now download it from here:

 http://www.sun.com/software/solaris/get.jsp
   
 
 Does anyone know if ZFS is included in this release?  One of my local
 Sun reps said it did not make it into the u2 release, though I have
 heard for ages that 6/06 would include it.
 
 Thanks!
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 6/06 now available for download

2006-06-27 Thread Gary Combs




Yup, it's there!

Shannon Roddy said the following on 06/27/06 12:57:

Solaris 10u2 was released today.  You can now download it from here:

http://www.sun.com/software/solaris/get.jsp

Does anyone know if ZFS is included in this release?  One of my local
Sun reps said it did not make it into the u2 release, though I have
heard for ages that 6/06 would include it.

Thanks!

--
Gary Combs
Technical Marketing
Sun Microsystems, Inc.
3295 NW 211th Terrace
Hillsboro, OR 97124 US
Phone x32604 / +1 503 715 3517
Fax 503-715-3517
Email [EMAIL PROTECTED]

"The box said 'Windows 2000 Server or better', so I installed Solaris"


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 6/06 now available for download

2006-06-27 Thread Phillip Wagstrom -- Area SSE MidAmerica

Shannon Roddy wrote:

Solaris 10u2 was released today.  You can now download it from here:

http://www.sun.com/software/solaris/get.jsp
  


Does anyone know if ZFS is included in this release?  One of my local
Sun reps said it did not make it into the u2 release, though I have
heard for ages that 6/06 would include it.


Yes.

[EMAIL PROTECTED]:/home/pwags% more /etc/release
   Solaris 10 6/06 s10s_u2wos_09a SPARC
   Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
 Assembled 09 June 2006
[EMAIL PROTECTED]:/home/pwags% zpool list
NAME   SIZE    USED   AVAIL   CAP   HEALTH   ALTROOT
sse    1.06T   455G   633G    41%   ONLINE   -


Regards,
Phil

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Darren J Moffat

Nicolas Williams wrote:

On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote:
This is getting pretty picky.  You're saying that ZFS will detect any  
errors introduced after ZFS has gotten the data.  However, as stated  
in a previous post, that doesn't guarantee that the data given to ZFS  
wasn't already corrupted.


There will always be some place where errors can be introduced and go on
undetected.  But some parts of the system are more error prone than
others, and ZFS targets the most error prone of them: rotating rust.

For the rest, make sure you have ECC memory, that you're using secure
NFS (with krb5i or krb5p), and the probability of undetectable data
corruption errors should be much closer to zero than what you'd get with
other systems.


Another alternative is using IPsec with just AH.

For the benefit of those outside of Sun MPK17: both krb5i and IPsec AH
were used to diagnose and prove that we had a faulty router in a lab
that was causing very strange build errors.  TCP/IP alone didn't catch
the problems; sometimes they showed up with simple SCCS checksums and
sometimes we had compile errors.


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-27 Thread Darren J Moffat

Torrey McMahon wrote:

Darren J Moffat wrote:
So everything you are saying seems to suggest you think ZFS was a 
waste of engineering time since hardware raid solves all the problems ?


I don't believe it does but I'm no storage expert and maybe I've drank 
too much cool aid.  I'm software person and for me ZFS is brilliant it 
is so much easier than managing any of the hardware raid systems I've 
dealt with.



ZFS is great for the systems that can run it. However, any enterprise 
datacenter is going to be made up of many, many hosts running many, many 
OSes. In that world you're going to consolidate on large arrays and use 
the features of those arrays where they cover the most ground. For 
example, if I've got 100 hosts all running different OSes and apps, and I can 
perform my data replication and redundancy algorithms, in most cases 
RAID, in one spot, then it will be much more cost efficient to do it there.


but you still need a local file system on those systems in many cases.

So back to where we started, I guess: how to effectively use ZFS to 
benefit Solaris (and the other platforms it gets ported to) while still 
using hardware RAID, because you have no choice but to use it.
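
One common compromise, sketched here with invented LUN names: let the array 
keep doing its internal RAID, but present two LUNs (ideally from different 
arrays or controllers) and mirror them in ZFS, so that checksum errors can be 
self-healed rather than merely detected:

   zpool create tank mirror c4t0d0 c5t0d0
   zpool status tank

A single-LUN pool on an array still detects corruption via checksums, but 
without ZFS-level redundancy it can only report the damage, not repair it.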


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] disk evacuate

2006-06-27 Thread Dick Davies

Just wondered if there'd been any progress in this area?

Correct me if I'm wrong, but as it stands, there's no way
to remove a device you accidentally 'zpool add'ed without
destroying the pool.

On 12/06/06, Gregory Shaw [EMAIL PROTECTED] wrote:

Yes, if zpool remove works like you describe, it does the same
thing.  Is there a time frame for that feature?

Thanks!

On Jun 11, 2006, at 10:21 AM, Eric Schrock wrote:

 This only seems valuable in the case of an unreplicated pool.  We
 already have 'zpool offline' to take a device and prevent ZFS from
 talking to it (because it's in the process of failing, perhaps).  This
 gives you what you want for mirrored and RAID-Z vdevs, since
 there's no
 data to migrate anyway.

 We are also planning on implementing 'zpool remove' (for more than
 just
 hot spares), which would allow you to remove an entire toplevel vdev,
 migrating the data off of it in the process.  This would give you what
 you want for the case of an unreplicated pool.

 Does this satisfy the usage scenario you described?

 - Eric

 On Sun, Jun 11, 2006 at 07:52:37AM -0600, Gregory Shaw wrote:
 Pardon me if this scenario has been discussed already, but I haven't
 seen anything as yet.

 I'd like to request a 'zpool evacuate pool device' command.
 'zpool evacuate' would migrate the data from a disk device to other
 disks in the pool.

 Here's the scenario:

  Say I have a small server with 6x146g disks in a jbod
  configuration.   If I mirror the system disk with SVM (currently) and
  allocate the rest as a non-raidz pool, I end up with 4x146g in a pool
  of approximately 584gb capacity.

 If one of the disks is starting to fail, I would need to use 'zpool
 replace new-disk old-disk'.  However, since I have no more slots in
 the machine to add a replacement disk, I'm stuck.

 This is where a 'zpool evacuate pool device' would come in handy.
 It would allow me to evacuate the failing device so that it could be
 replaced and re-added with 'zpool add pool device'.

 What does the group think?
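
For context, a rough sketch of the commands being discussed; pool and device
names here are invented, and at this point 'zpool remove' only handles hot
spares, so the last command is proposed syntax rather than something that
works today.  For a failing disk inside a redundant vdev:

   zpool offline tank c0t3d0
   zpool replace tank c0t3d0 c0t9d0

The evacuation being requested would then look something like:

   zpool remove tank c0t3d0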




--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Al Hopper
On Tue, 27 Jun 2006, Gregory Shaw wrote:

 Yes, but the idea of using software raid on a large server doesn't
 make sense in modern systems.  If you've got a large database server
 that runs a large oracle instance, using CPU cycles for RAID is
 counter productive.  Add to that the need to manage the hardware
 directly (drive microcode, drive brownouts/restarts, etc.) and the
 idea of using JBOD in modern systems starts to lose value in a big way.

 You will detect any corruption when doing a scrub.  It's not end-to-
 end, but it's no worse than today with VxVM.

The initial impression I got, after reading the original post, is that its
author[1] does not grok something fundamental about ZFS and/or how it
works!  Or does not understand that there are many CPU cycles in a modern
Unix box that are never taken advantage of.

It's clear to me that ZFS provides considerable, never before available,
features and facilities, and that any impact that ZFS may have on CPU or
memory utilization will become meaningless over time, as the # of CPU
cores increase - along with their performance.  And that average system
memory size will continue to increase over time.

Perhaps the author is taking a narrow view that ZFS will *replace*
existing systems.  I don't think that this will be the general case.
Especially in a large organization where politics and turf wars will
dominate any technical discussions and implementation decisions will be
made by senior management who are 100% buzzword compliant (and have
questionable technical/engineering skills).  Rather it will provide the
system designer with a hugely powerful *new* tool to apply in system
design.  And will challenge the designer to use it creatively and
effectively.

There is no such thing as the universal screw-driver.  Every toolbox has
tens of screwdrivers and tool designers will continue to dream about
replacing them all with _one_ tool.

[1] Sorry Gregory.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] This may be a somewhat silly question ...

2006-06-27 Thread Dennis Clarke

... but I have to ask.

How do I back this up?

Here is my definition of a backup :

(1) I can copy all data and metadata onto some media in
a manner that verifies the integrity of the data and
metadata written.

(1.1) By verify I mean that the data written onto
  the media is read back and compared to the
  source and accuracy is assured.

(2) I can walk away with the media and be able to restore
the data onto bare metal with nothing other than Solaris
10 Update 2 ( or Nevada ) CDROM sets and reasonable hardware.

I have a copy of the Solaris ZFS Administration Guide which is some
document numbered 817-2271.  158 pages and well worth printing out I think.

Let's suppose that I have a pile of disks arranged in mirrors and everything
seems to be going along swimmingly thus :

# zpool status zfs0
  pool: zfs0
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
zfs0 ONLINE   0 0 0
  mirror ONLINE   0 0 0
c0t10d0  ONLINE   0 0 0
c1t10d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c0t11d0  ONLINE   0 0 0
c1t11d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c0t12d0  ONLINE   0 0 0
c1t12d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c0t9d0   ONLINE   0 0 0
c1t9d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c0t13d0  ONLINE   0 0 0
c1t13d0  ONLINE   0 0 0

errors: No known data errors
#

# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
zfs0  95.3G  70.8G  27.5K  /export/zfs
zfs0/backup   91.2G  70.8G  88.4G  /export/zfs/backup
zfs0/backup/pasiphae  2.77G  24.2G  2.77G  /export/zfs/backup/pasiphae
zfs0/lotus 786M  70.8G   786M  /opt/lotus
zfs0/zone 3.40G  70.8G  24.5K  /export/zfs/zone
zfs0/zone/common  24.5K  8.00G  24.5K  legacy
zfs0/zone/domino  24.5K  70.8G  24.5K  /opt/zone/domino
zfs0/zone/sugar   3.40G  12.6G  3.40G  /opt/zone/sugar

At this point I attach a tape drive to the machine :

# devfsadm -v -C -c tape
devfsadm[24247]: verbose: symlink /dev/rmt/0 ->
../../devices/[EMAIL PROTECTED],0/SUNW,[EMAIL PROTECTED],880/[EMAIL PROTECTED],0:
.
.
.
devfsadm[24247]: verbose: symlink /dev/rmt/0ubn ->
../../devices/[EMAIL PROTECTED],0/SUNW,[EMAIL PROTECTED],880/[EMAIL PROTECTED],0:ubn
# mt -f /dev/rmt/0lbn status
DLT4000 tape drive:
   sense key(0x6)= Unit Attention   residual= 0   retries= 0
   file no= 0   block no= 0
#

I then create a snapshot as per the documentation :

# zfs list zfs0
NAME   USED  AVAIL  REFER  MOUNTPOINT
zfs0  95.3G  70.8G  27.5K  /export/zfs
# date
Tue Jun 27 18:10:36 EDT 2006
# zfs snapshot [EMAIL PROTECTED]
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
zfs0  95.3G  70.8G  27.5K  /export/zfs
[EMAIL PROTECTED]  0  -  27.5K  -
zfs0/backup   91.2G  70.8G  88.4G  /export/zfs/backup
zfs0/backup/pasiphae  2.77G  24.2G  2.77G  /export/zfs/backup/pasiphae
zfs0/lotus 786M  70.8G   786M  /opt/lotus
zfs0/zone 3.40G  70.8G  24.5K  /export/zfs/zone
zfs0/zone/common  24.5K  8.00G  24.5K  legacy
zfs0/zone/domino  24.5K  70.8G  24.5K  /opt/zone/domino
zfs0/zone/sugar   3.40G  12.6G  3.40G  /opt/zone/sugar
#

And then I send that snapshot to tape :

# zfs send [EMAIL PROTECTED] > /dev/rmt/0mbn
#

That command ran for maybe 15 seconds.  I seriously doubt that 95GB of data
was written to tape and verified in that time although I'd like to see the
device and bus that can do it!  :-)

I'll destroy that snapshot and try something else here :

# zfs destroy [EMAIL PROTECTED]

Now perhaps the mystery is to try a different ZFS filesystem :

# date
Tue Jun 27 18:17:33 EDT 2006
# zfs snapshot zfs0/[EMAIL PROTECTED]:17Hrs

I'll check the tape drive that did something above although I have no idea
what.

# mt -f /dev/rmt/0mbn status
DLT4000 tape drive:
   sense key(0x0)= No Additional Sense   residual= 0   retries= 0
   file no= 1   block no= 0
#

Now I will send that stream to the tape :

# zfs send zfs0/[EMAIL PROTECTED]:17Hrs > /dev/rmt/0mbn

The tape is now doing something again and I don't know what.

I would like to think that when it is done I can walk to a totally new
machine and restore the ZFS filesystem zfs0/lotus with no issue, but I don't
see a verify step here anywhere and I really have no idea what will happen
when I hit the end of that tape.

I am very bothered that my 95GB zfs0 did not go to tape and I don't know why
not.  I think that my itty bitty 786MB zfs0/lotus is actually going to tape
right now ( lights are flashing ) 
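
For what it's worth, a stream written this way can only really be verified by
reading it back, either by receiving it into a scratch dataset or by comparing
checksums of the stream as generated and as read off the tape.  A rough
sketch, with invented dataset and snapshot names, and assuming the tape has
been positioned back to the right file first:

   zfs receive zfs0/restoretest < /dev/rmt/0mbn

or:

   zfs send zfs0/lotus@today | cksum
   dd if=/dev/rmt/0mbn bs=1024k | cksum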

Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread David Valin
Al Hopper wrote:
 On Tue, 27 Jun 2006, Gregory Shaw wrote:
 
 
Yes, but the idea of using software raid on a large server doesn't
make sense in modern systems.  If you've got a large database server
that runs a large oracle instance, using CPU cycles for RAID is
counter productive.  Add to that the need to manage the hardware
directly (drive microcode, drive brownouts/restarts, etc.) and the
idea of using JBOD in modern systems starts to lose value in a big way.

You will detect any corruption when doing a scrub.  It's not end-to-
end, but it's no worse than today with VxVM.
 
 
 The initial impression I got, after reading the original post, is that its
 author[1] does not grok something fundamental about ZFS and/or how it
 works!  Or does not understand that there are many CPU cycles in a modern
 Unix box that are never taken advantage of.


Just because there are idle cpu cycles does not mean it is ok for the
Operating System to use them.  If there is a valid reason for the OS to
consume those cycles then that is fine.  But every cycle that the OS
consumes is one less cycle that is available for the customer apps (be
it Oracle or whatever, and I spend a lot of my time trying to squeeze
those cycles out of the high end systems).  The job of the operating
system is to get the hell out of the way as quickly as possible so the user
apps can do their work.  That can mean offloading some of the work onto
smart arrays.   As someone once said to me, a customer does not buy
hardware to run an OS on, they buy it to accomplish some given piece of
work.


 
 It's clear to me that ZFS provides considerable, never before available,
 features and facilities, and that any impact that ZFS may have on CPU or
 memory utilization will become meaningless over time, as the # of CPU
 cores increase - along with their performance.  And that average system
 memory size will continue to increase over time.

This is true, will probably be true forever, and has been going on
ever since the first chip.  There has always been demand for more
power from end users.  However, just because we have available cycles
does not mean the OS should consume them.
 
 Perhaps the author is taking a narrow view that ZFS will *replace*
 existing systems.  I don't think that this will be the general case.
 Especially in a large organization where politics and turf wars will
 dominate any technical discussions and implementation decisions will be
 made by senior management who are 100% buzzword compliant (and have
 questionable technical/engineering skills).  Rather it will provide the
 system designer with a hugely powerful *new* tool to apply in system
 design.  And will challenge the designer to use it creatively and
 effectively.

It all depends on your needs.  The idea of ZFS providing RAID
capabilities is very appealing for systems that are desktop units
or small servers.  But where we are talking petabyte+ storage with 30+
gig/sec of I/O bandwidth capacity, I believe we will find the CPUs are
going to consume way too much to handle the I/O rate in such an
environment, at which point the work needs to be offloaded to smart
arrays (I have yet to do that experimentation).  You do not buy an 18-wheel
tractor trailer simply to move a lawnmower from job site to job
site; you buy an SUV, pickup truck or trailer.  Vice versa, you do not buy
a pickup truck to move a tracked excavator; you use a tractor trailer.

 There is no such thing as the universal screw-driver.  Every toolbox has
 tens of screwdrivers and tool designers will continue to dream about
 replacing them all with _one_ tool.
 

How true.  ZFS is one of many tools available.  However, the impression I
have been picking up here at various times is that a lot of people
view ZFS as the only tool in the toolbox, and thus everything looks
like a nail because all you have is a hammer.

If ZFS is providing better data integrity than the current storage
arrays, that sounds to me like an opportunity for the next generation
of intelligent arrays to become better.

Dave Valin

 [1] Sorry Gregory.
 
 Regards,
 
 Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
 OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-27 Thread Neil Perrin



Robert Milkowski wrote On 06/27/06 03:00,:

Hello Chris,

Tuesday, June 27, 2006, 1:07:31 AM, you wrote:

CC On 6/26/06, Neil Perrin [EMAIL PROTECTED] wrote:



Robert Milkowski wrote On 06/25/06 04:12,:


Hello Neil,

Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP Chris,

NP The data will be written twice on ZFS using NFS. This is because NFS,
NP on closing the file, internally uses fsync to cause the writes to be
NP committed. This causes the ZIL to immediately write the data to the
NP intent log. Later the data is also committed as part of the pool's
NP transaction group commit, at which point the intent log blocks are freed.

NP It does seem inefficient to doubly write the data. In fact for blocks
NP larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 was
NP fixed) we write the data block and also an intent log record with the
NP block pointer. During txg commit we link this block into the pool tree.
NP By experimentation we found 32K to be the (current) cutoff point. As the
NP nfsds write at most 32K, they do not benefit from this.

Is 32KB easily tuned (mdb?)?


I'm not sure. NFS folk?



CC I think he is referring to the zfs_immediate_write_sz variable, but

Exactly, I was asking about this not NFS.


Sorry for the confusion. The zfs_immediate_write_sz variable was meant for
internal use and not really intended for public tuning. However, yes, it could
be tuned dynamically at any time using mdb, or set in /etc/system.
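
For completeness, the usual mechanisms would look roughly like this; the value
(64K here) is only an example, and since this is an unsupported internal
tunable you should check the variable's actual size before patching a live
kernel.  In /etc/system (takes effect at the next boot):

   set zfs:zfs_immediate_write_sz = 0x10000

or on a running kernel with mdb, using /W for a 32-bit variable or /Z for a
64-bit one:

   echo 'zfs_immediate_write_sz/Z 0t65536' | mdb -kw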

--

Neil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supporting ~10K users on ZFS

2006-06-27 Thread eric kustarz

Steve Bennett wrote:


OK, I know that there's been some discussion on this before, but I'm not sure 
that any specific advice came out of it. What would the advice be for 
supporting a largish number of users (10,000 say) on a system that supports 
ZFS? We currently use vxfs and assign a user quota, and backups are done via 
Legato Networker.
 



Using lots of filesystems is definitely encouraged - as long as doing so 
makes sense in your environment.



From what little I currently understand, the general advice would seem to be to 
assign a filesystem to each user, and to set a quota on that. I can see this 
being OK for small numbers of users (up to 1000 maybe), but I can also see it 
being a bit tedious for larger numbers than that.


I just tried a quick test on Sol10u2:
   for x in 0 1 2 3 4 5 6 7 8 9;  do for y in 0 1 2 3 4 5 6 7 8 9; do
   zfs create testpool/$x$y; zfs set quota=1024k testpool/$x$y
   done; done
[apologies for the formatting - is there any way to preformat text on this 
forum?]
It ran OK for a minute or so, but then I got a slew of errors:
   cannot mount '/testpool/38': unable to create mountpoint
   filesystem successfully created, but not mounted

So, OOTB there's a limit that I need to raise to support more than approx 40 
filesystems (I know that this limit can be raised, I've not checked to see 
exactly what I need to fix). It does beg the question of why there's a limit 
like this when ZFS is encouraging use of large numbers of filesystems.
 



There is no 40 filesystem limit.  You most likely had a pre-existing 
file/directory in testpool with the same name as the filesystem you tried 
to create.


fsh-hake# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
testpool77K  7.81G  24.5K  /testpool
fsh-hake# echo hmm > /testpool/01
fsh-hake# zfs create testpool/01
cannot mount 'testpool/01': Not a directory
filesystem successfully created, but not mounted
fsh-hake#


If I have 10,000 filesystems, is the mount time going to be a problem?
I tried:
   for x in 0 1 2 3 4 5 6 7 8 9;  do for x in 0 1 2 3 4 5 6 7 8 9; do
   zfs umount testpool/001; zfs mount testpool/001
   done; done
This took 12 seconds, which is OK until you scale it up - even if we assume 
that mount and unmount take the same amount of time, so 100 mounts will take 6 
seconds, this means that 10,000 mounts will take 5 minutes. Admittedly, this is 
on a test system without fantastic performance, but there *will* be a much 
larger delay on mounting a ZFS pool like this over a comparable UFS filesystem.
 



So this really depends on why and when you're unmounting filesystems.  I 
suspect it won't matter much since you won't be unmounting/remounting 
your filesystems.
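
If boot-time mount cost is the real worry, it's easy enough to measure it
directly rather than extrapolate; on a test box (not one with filesystems in
use), something like:

   zfs umount -a
   time zfs mount -a

gives the figure for the whole pool in one go.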



I currently use Legato Networker, which (not unreasonably) backs up each 
filesystem as a separate session - if I continue to use this I'm going to have 
10,000 backup sessions on each tape backup. I'm not sure what kind of 
challenges restoring this kind of beast will present.

Others have already been through the problems with standard tools such as 'df' 
becoming less useful.
 



Is there a specific problem you had in mind regarding 'df'?


One alternative is to ditch quotas altogether - but even though disk is cheap, it's not 
free, and regular backups take time (and tapes are not free either!). In any case, 10,000 
undergraduates really will be able to fill more disks than we can afford to provision. We tried 
running a Windows fileserver back in the days when it had no support for per-user quotas; we did 
some ad-hockery that helped to keep track of the worst offenders (albeit after the event), but what 
really killed us was the uncertainty over whether some idiot would decide to fill all available 
space with vital research data (or junk, depending on your point of view).

I can see the huge benefits that ZFS quotas and reservations can bring, but I 
can also see that there is a possibility that there are situations where ZFS 
could be useful, but the lack of 'legacy' user-based quotas make it 
impractical. If the ZFS developers really are not going to implement user 
quotas is there any advice on what someone like me could do - at the moment I'm 
presuming that I'll just have to leave ZFS alone.
 



I wouldn't give up that easily... looks like 1 filesystem per user, and 
1 quota per filesystem does exactly what you want:

fsh-hake# zfs get -r -o name,value quota testpool
NAME             VALUE
testpool         none
testpool/ann     10M
testpool/bob     10M
testpool/john    10M

fsh-hake#

I'm assuming that you decided against 1 filesystem per user due to the 
supposed 40-filesystem limit, which isn't true.
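
To make that concrete, provisioning a filesystem plus a quota per login name
can be scripted; a rough sketch, with the parent dataset, the quota and the
users file all invented for the example:

   zfs create testpool/home
   while read user; do
       zfs create testpool/home/$user
       zfs set quota=500m testpool/home/$user
   done < users.txt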


eric


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supporting ~10K users on ZFS

2006-06-27 Thread Peter Tribble
On Tue, 2006-06-27 at 23:07, Steve Bennett wrote:
 From what little I currently understand, the general advice would 
 seem to be to assign a filesystem to each user, and to set a quota 
 on that. I can see this being OK for small numbers of users (up to
 1000 maybe), but I can also see it being a bit tedious for larger
 numbers than that.

I've seen this discussed; even recommended. I don't think, though
- given that zfs has been available in a supported version of Solaris
for about 24 hours or so - that we've yet got to the point of best
practice or recommendation.

That said, the idea of one filesystem per user does have its
attractions. With zfs - unlike other filesystems - it's feasible.
Whether it's sensible is another matter.

Still, you could give them a zone each as well...

(One snag is that for undergraduates, there isn't really an
intermediate level - department or research grant, for example -
that can be used as the allocation unit.)

 I just tried a quick test on Sol10u2:
 for x in 0 1 2 3 4 5 6 7 8 9;  do for y in 0 1 2 3 4 5 6 7 8 9; do
 zfs create testpool/$x$y; zfs set quota=1024k testpool/$x$y
 done; done
 [apologies for the formatting - is there any way to preformat text on this 
 forum?]
 It ran OK for a minute or so, but then I got a slew of errors:
 cannot mount '/testpool/38': unable to create mountpoint
 filesystem successfully created, but not mounted
 
 So, OOTB there's a limit that I need to raise to support more than
 approx 40 filesystems (I know that this limit can be raised, I've not
 checked to see exactly what I need to fix). It does beg the question
 of why there's a limit like this when ZFS is encouraging use of large
 numbers of filesystems.

Works fine for me. I've done this up to 16000 or so (not with current
bits, that was last year).

 If I have 10,000 filesystems, is the mount time going to be a problem?
 I tried:
 for x in 0 1 2 3 4 5 6 7 8 9;  do for x in 0 1 2 3 4 5 6 7 8 9; do
 zfs umount testpool/001; zfs mount testpool/001
 done; done
 This took 12 seconds, which is OK until you scale it up - even if we assume
 that mount and unmount take the same amount of time,

It's not quite symmetric; I think umount is a fraction slower
(it has to check if the filesystem is in use, amongst other
things), but the estimate is probably accurate enough.

 so 100 mounts will take 6 seconds, this means that 10,000 mounts
 will take 5 minutes. Admittedly, this is on a test system without
 fantastic performance, but there *will* be a much larger delay
 on mounting a ZFS pool like this over a comparable UFS filesystem.

My test last year got to 16000 filesystems on a 1G server before
it went ballistic and all operations took infinitely long. I had
clearly run out of physical memory.

5 minutes doesn't sound too bad to me. It's an order of
magnitude quicker than it took to initialize ufs quotas
before ufs logging was introduced.

 One alternative is to ditch quotas altogether - but even though
 disk is cheap, it's not free, and regular backups take time
 (and tapes are not free either!). In any case, 10,000
 undergraduates really will be able to fill more disks than
 we can afford to provision. 

Last year, before my previous employer closed down, we
switched off user disk quotas for 20,000 researchers. The world
didn't end. The disks didn't fill up. All the work we had to
do managing user quotas vanished. The number of calls to the
helpdesk to sort out stupid problems due to applications running
out of disk space plummeted down to zero.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supporting ~10K users on ZFS

2006-06-27 Thread Neil Perrin



[EMAIL PROTECTED] wrote On 06/27/06 17:17,:

We have over 1 filesystems under /home in strongspace.com and it works fine.

I forget but there was a bug or there was an improvement made around nevada
build 32 (we're currently at 41) that made the initial mount on reboot
significantly faster.  Before that it was around 10-15 minutes.  I wonder if
that improvement didn't make it into sol10U2?

That fix (bug 6377670) made it into build 34 and S10_U2.



-Jason

Sent via BlackBerry from Cingular Wireless  


-Original Message-
From: eric kustarz [EMAIL PROTECTED]
Date: Tue, 27 Jun 2006 15:55:45 
To:Steve Bennett [EMAIL PROTECTED]

Cc:zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Supporting ~10K users on ZFS


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-27 Thread Gregory Shaw
On Jun 27, 2006, at 3:30 PM, Al Hopper wrote:

No insult taken.  I was trying to point out that many customers don't
have 'free' cpu cycles, and that every little bit you take from their
machine for subsystem control is that much real work the system will
not be doing.

I think of the statement "many cpu cycles in modern unix boxes that are
never taken advantage of" in the same vein as the monitoring vendors:
"It's just another agent.  It won't take more than 5% of the box."

I think we'll let the customer decide on the above.  I've encountered
both situations: customers with large boxes with plenty of headroom,
and customers that run 100% all day, every day, and have no cycles that
aren't dedicated to real work.

When I read, as an ex-customer (e.g. not with Sun), that I've got to
sacrifice cpu cycles in a software upgrade, it says to me that the
upgrade will result in a slower system.

-Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382              [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382               [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss