Re: [zfs-discuss] ZFS and SAN

2008-02-11 Thread Richard Elling
Christophe Rolland wrote:
 Hi all
 we are considering using ZFS for various storage needs (DB, etc). Most features are 
 great, especially the ease of use.
 Nevertheless, a few questions :

 - we are using SAN disks, so most JBOD recommendations don't apply, but I did 
 not find many experiences with zpools of a few terabytes on LUNs... anybody ?
   

Many X4500 customers have many TBytes of storage under ZFS (JBOD).

 - we cannot remove a device from a pool, so there is no way of correcting the 
 attachment of a 200 GB LUN to a 6 TB pool on which Oracle runs ... am I the 
 only one worrying ? 
   

There is no way to prevent you from running rm -rf / either.
In a real production environment, using best practices, you would
never type such commands interactively -- you would always script them and
test them in a test environment prior to rolling into production.

 - on a Sun Cluster, LUNs are seen on both nodes. Can we prevent mistakes like 
 creating a pool on already assigned LUNs ? For example, Veritas wants a 
 force flag. With ZFS I can do :
 node1: zpool create X lun1 lun2
 node2: zpool create Y lun1 lun2
 and then, results are unexpected, but pool X will never switch again ;-) 
 resource and zone are dead.
   

We've had some informal discussions on how to do this.
Currently, zpool and other commands which manage disk
partitioning (e.g. format) use libdiskmgt calls to determine
whether the slices or partitions are in use on the local machine.
For shared storage, we won't know whether another machine might
be using a slice.  For Solaris Cluster, we could write an
extended protocol that checks with the other nodes in the
cluster for usage.  However, even this does not work for the
general case, such as a SAN or a SAN with heterogeneous
nodes.
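
In the meantime, a manual check from the second node helps.  A rough
sketch, following Christophe's example (the pool and LUN names are
illustrative):

   node2# zpool import     # scans visible devices and should list any pool
                           # (e.g. X) already created on them from elsewhere
   node2# zpool status     # pools currently active on this node

If zpool import shows a pool sitting on lun1/lun2, import it (or leave
it alone) rather than creating over it with -f.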
 -- richard

 - what could be some interesting tools to test I/O performance ? Did someone run 
 iozone and publish a baseline, modifications, and the corresponding results ?

 well, anyway, thanks to zfs team :D
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-11 Thread William Fretts-Saxton
It does.  The file size is limited to the original creation size, which is 65k 
for files with 1 data sample.

Unfortunately, I have zero experience with dtrace and only a little with truss. 
 I'm relying on the dtrace scripts from people on this thread to get by for now!
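
For the archives, a generic one-liner to see what size writes an
application is issuing looks roughly like this (the "java" execname is
just a placeholder for whatever the application actually runs as):

   dtrace -n 'syscall::write:entry /execname == "java"/ { @bytes = quantize(arg2); }'

Hit Ctrl-C after a while and it prints a power-of-two histogram of the
requested write sizes.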
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and SAN

2008-02-11 Thread Christophe Rolland
Hi Robert,

thanks for the answer.

 You are not the only one. It's been discussed on the ZFS developers list...
yes, I checked this on the whole list.
so, let's wait for the feature.

 Actually it should complain, and the -f (force)
on the active node, yes.
but if we want to reuse the LUNs on the other node, there is no warning.

 CR - what could be some interesting tools to test I/O
 Check out filebench (included with recent SXCE).
I'll try it.

thanks for your answer
christophe
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and SAN

2008-02-11 Thread Robert Milkowski
Hello Christophe,

Friday, February 1, 2008, 7:55:31 PM, you wrote:

CR Hi all
CR we are considering using ZFS for various storage needs (DB, etc). Most
CR features are great, especially the ease of use.
CR Nevertheless, a few questions :

CR - we are using SAN disks, so most JBOD recommendations don't
CR apply, but I did not find many experiences with zpools of a few terabytes on 
LUNs... anybody ?

Just works.

CR - we cannot remove a device from a pool, so no way of correcting
CR the attachment of a 200 GB LUN to a 6 TB pool on which Oracle runs
CR ... am I the only one worrying ? 

You are not the only one. It's been discussed on the ZFS developers list...


CR - on a Sun Cluster, LUNs are seen on both nodes. Can we prevent
CR mistakes like creating a pool on already assigned LUNs ? For
CR example, Veritas wants a force flag. With ZFS I can do :
CR node1: zpool create X lun1 lun2
CR node2: zpool create Y lun1 lun2
CR and then, results are unexpected, but pool X will never switch
CR again ;-) resource and zone are dead.

Actually it should complain, and the -f (force) option will have to
be used to do something like the above.

CR - what could be some interesting tools to test I/O performance ? Did
CR someone run iozone and publish a baseline, modifications and the corresponding 
results ?

Check out filebench (included with recent SXCE).
Also check the list archives and various blogs (including mine) for some
ZFS benchmarks.



--
Best regards,
 Robert Milkowski    mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] UFS on zvol Cache Questions...

2008-02-11 Thread Roch - PAE

Priming the cache for ZFS should work, at least right after boot
when freemem is large; any block read will make it into the
cache.  Post boot, once memory is already primed with something else
(what?), it gets more difficult for both UFS and ZFS to
guess what to keep in their caches.

Did you try priming ZFS after boot ?

Next, you seem to suffer because your sequential writes to
log files appear to displace the more useful DB files from the
ARC (I'd be interested to see if this still occurs after
you've primed the ZFS cache after boot).

Note that if your logfile write rate is huge (dd-like), then ZFS
cache management will suffer, but that is well on its way to
being fixed.  For DS, though, I would think that the log rate
would be more reasonable and that your storage is able to
keep up.  That gives ZFS cache management a fighting chance
to keep the reused data in preference to the sequential writes.

If the default behavior is not working for you, we'll need
to consider the ARC behavior in this case.  I don't see why
it should not work out of the box.  But manual control will
also come in the form of this DIO-like feature:

6429855  Need way to tell ZFS that caching is a lost cause

While we try to solve your problem out of the box,
you might also run a background process that keeps priming
the cache at a low I/O rate.  Not a great workaround, but it
should be effective.
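
A minimal sketch of such a primer (paths, block size and sleep
intervals are placeholders; tune them so it stays well below the real
I/O load):

   #!/bin/ksh
   # Re-read the DB files at a low rate to keep them warm in the ARC.
   while true
   do
           for f in `find /db -name '*.db3'`
           do
                   dd if=$f of=/dev/null bs=128k
                   sleep 5        # throttle between files
           done
           sleep 300              # pause between full passes
   done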


-r



Brad Diggs writes:
  Hello Darren,
  
  Please find responses in line below...
  
  On Fri, 2008-02-08 at 10:52 +, Darren J Moffat wrote:
   Brad Diggs wrote:
I would like to use ZFS but with ZFS I cannot prime the cache
and I don't have the ability to control what is in the cache 
(e.g. like with the directio UFS option).
   
   Why do you believe you need that at all ?  
  
  My application is directory server.  The #1 resource that 
  directory needs to make maximum utilization of is RAM.  In 
  order to do that, I want to control every aspect of RAM
  utilization both to safely use as much RAM as possible AND
  avoid contention among things trying to use RAM.
  
  Let's consider the following example.  A customer has a 
  50M entry directory.  The sum of the data (db3 files) is
  approximately 60GB.  However, there is another 2GB for the
  root filesystem, 30GB for the changelog, 1GB for the 
  transaction logs, and 10GB for the informational logs.
  
  The system on which directory server will run has only 
  64GB of RAM.  The system is configured with the following
  partitions:
  
    FS      Used(GB)   Description
    /        2         root
    /db     60         directory data
    /logs   41         changelog, txn logs, and info logs
    swap    10         system swap
  
  I prefer to keep the directory db cache and entry caches
  relatively small.  So the db cache is 2GB and the entry 
  cache is 100M.  This leaves roughly 63GB of RAM for my 60GB
  of directory data and Solaris. The only way to ensure that
  the directory data (/db) is the only thing in the filesystem
  cache is to set directio on / (root) and /logs.
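  
  For reference, with UFS that is just the forcedirectio mount option; a
  sketch with made-up device names (real vfstab entries will differ):
  
    /dev/dsk/c1t0d0s5  /dev/rdsk/c1t0d0s5  /logs  ufs  2  yes  forcedirectio
  
  or, on a live system:
  
    mount -o remount,forcedirectio /logs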
  
   What do you do to prime the cache with UFS 
  
  cd ds_instance_dir/db
  # read each db3 file once to pull it into the filesystem cache
  for i in `find . -name '*.db3'`
  do
     dd if=${i} of=/dev/null
  done
  
   and what benefit do you think it is giving you ?
  
  Priming the directory server data into the filesystem cache 
  reduces ldap response time for directory data in the
  filesystem cache.  This could mean the difference between
  a sub-millisecond response time and a response time on the order of
  tens or hundreds of ms depending on the underlying storage
  speed.  For telcos in particular, minimal response time is 
  paramount.
  
  Another common scenario is when we do benchmark bakeoffs
  with another vendor's product.  If the data isn't pre-
  primed, then ldap response time and throughput will be
  artificially degraded until the data is primed into either
  the filesystem or directory (db or entry) cache.  Priming
  via ldap operations can take many hours or even days 
  depending on the number of entries in the directory server.
  However, priming the same data via dd takes minutes to hours
  depending on the size of the files.  
  
  As you know in benchmarking scenarios, time is the most limited
  resource that we typically have.  Thus, priming via dd is much
  preferred.
  
  Lastly, in order to achieve optimal use of available RAM, we
  use directio for the root (/) and other non-data filesystems.
  This makes certain that the only data in the filesystem cache
  is the directory data.
  
   Have you tried just using ZFS and found it doesn't perform as you need 
   or are you assuming it won't because it doesn't have directio ?
  
  We have done extensive testing with ZFS and love it.  The three 
  areas lacking for our use cases are as follows:
   * No ability to control what is in cache. e.g. no directio
   * No absolute ability to apply an upper boundary to the amount
 of RAM consumed by ZFS.  I know that the arc cache has a 
 control that 

Re: [zfs-discuss] Hardware RAID vs. ZFS RAID

2008-02-11 Thread Andy Lubel
 With my (COTS) LSI 1068 and 1078 based controllers I get consistently

 better performance when I export all disks as jbod (MegaCli - 
 CfgEachDskRaid0).

   
 Is that really 'all disks as JBOD'? or is it 'each disk as a single 
 drive RAID0'?

single disk raid0:
./MegaCli -CfgEachDskRaid0 Direct -a0


It may not sound different on the surface, but I asked in another thread,
and others confirmed, that if your RAID card has a battery-backed cache,
giving ZFS many single-drive RAID0s is much better than JBOD (using the
'nocacheflush' option may even improve it more).
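
For reference, I believe the knob being referred to is the
zfs_nocacheflush tunable on recent Nevada/Solaris 10 builds.  It is
system-wide, and only sane when every pool device really does sit
behind a battery-backed write cache:

   set zfs:zfs_nocacheflush = 1      # in /etc/system, takes effect at reboot

Treat that as a sketch; check the ZFS Evil Tuning Guide for the exact
tunable name on your particular release.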

My understanding is that it's kind of like the best of both worlds.
You get the higher number of spindles and vdevs for ZFS to manage, ZFS
gets to do the redundancy, and the HW RAID cache gives virtually instant
acknowledgement of writes, so that ZFS can be on its way.

So I think many RAID0s is not always the same as JBOD.  That's not to
say that true JBOD doesn't still have an advantage over HW RAID; I
don't know that for sure.

I have tried mixing hardware RAID and ZFS RAID, but from a performance or
redundancy standpoint it just doesn't make sense to add those layers of
complexity.  In this case I'm building nearline storage, so there isn't
even a battery attached, and I have disabled any caching on the
controller.  I have a Sun SAS HBA on the way, which is what I would
ultimately use for my JBOD attachment.


But I think there is a use for HW RAID in ZFS configs, which isn't what
I'd always heard.
 I have really learned not to do it this way with raidz and raidz2:

 #zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0  
 c3t13d0 c3t14d0 c3t15d0
   
Why?  I know creating raidz's with more than 9-12 devices is discouraged,
but that doesn't cross that threshold.
Is there a reason you'd split 8 disks up into 2 groups of 4?  What
experience led you to this?
(Just so I don't have to repeat it. ;) )

I don't know why, but with most setups I have tested (8- and 16-drive
configs), dividing into 4 disks per raidz vdev (and 5 for a raidz2) performs
better.  Take a look at my simple dd test (filebench results as soon as
I can figure out how to get it working properly with Solaris 10).

=

8 x 500GB SATA disk system with LSI 1068 (MegaRAID ELP) - no BBU


-
bash-3.00# zpool history
History for 'pool0-raidz':
2008-02-11.16:38:13 zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0
c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

bash-3.00# zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
pool0-raidz   117K  3.10T  42.6K  /pool0-raidz


bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo0 bs=8192
count=131072;time sync
131072+0 records in
131072+0 records out

real    0m1.768s
user    0m0.080s
sys     0m1.688s

real    0m3.495s
user    0m0.001s
sys     0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo0
of=/pool0-raidz/rw-test.lo0 bs=8192; time sync
131072+0 records in
131072+0 records out

real    0m6.994s
user    0m0.097s
sys     0m2.827s

real    0m1.043s
user    0m0.001s
sys     0m0.013s

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo1 bs=8192
count=655360;time sync
655360+0 records in
655360+0 records out

real    0m24.064s
user    0m0.402s
sys     0m8.974s

real    0m1.629s
user    0m0.001s
sys     0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo1
of=/pool0-raidz/rw-test.lo1 bs=8192; time sync
655360+0 records in
655360+0 records out

real    0m40.542s
user    0m0.476s
sys     0m16.077s

real    0m0.617s
user    0m0.001s
sys     0m0.013s
bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/dev/null bs=8192; time
sync
131072+0 records in
131072+0 records out

real    0m3.443s
user    0m0.084s
sys     0m1.327s

real    0m0.013s
user    0m0.001s
sys     0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/dev/null bs=8192; time
sync
655360+0 records in
655360+0 records out

real    0m15.972s
user    0m0.413s
sys     0m6.589s

real    0m0.013s
user    0m0.001s
sys     0m0.012s
---

bash-3.00# zpool history
History for 'pool0-raidz':
2008-02-11.17:02:16 zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0
c2t3d0
2008-02-11.17:02:51 zpool add pool0-raidz raidz c2t4d0 c2t5d0 c2t6d0
c2t7d0

bash-3.00# zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
pool0-raidz   110K  2.67T  36.7K  /pool0-raidz

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo0 bs=8192
count=131072;time sync
131072+0 records in
131072+0 records out

real    0m1.835s
user    0m0.079s
sys     0m1.687s

real    0m2.521s
user    0m0.001s
sys     0m0.013s

bash-3.00# time dd if=/pool0-raidz/w-test.lo0
of=/pool0-raidz/rw-test.lo0 bs=8192; time sync
131072+0 records in
131072+0 records out

real    0m2.376s
user    0m0.084s
sys     0m2.291s

real    0m2.578s
user    0m0.001s
sys     0m0.013s

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo1 bs=8192
count=655360;time sync
655360+0 records in
655360+0 records out

real    0m19.531s
user    0m0.404s
sys     0m8.731s

real    0m2.255s
user    0m0.001s
sys

[zfs-discuss] 3ware support

2008-02-11 Thread Johan Kooijman
Good morning all,

can anyone confirm that 3ware RAID controllers are indeed not working
under Solaris/OpenSolaris? I can't seem to find them in the HCL.

We're now using a 3ware 9550SX as a SATA RAID controller. The
original plan was to disable all of its RAID functions and use just the
SATA controller functionality for ZFS deployment.

If 3ware is indeed not supported, I have to buy a new controller. Any
specific controller/brand you can recommend for Solaris?

-- 
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 76 201 1179
E [EMAIL PROTECTED]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scrub halts

2008-02-11 Thread Lida Horn
The latest changes to the sata and marvell88sx modules
have been put back into Solaris Nevada and should be
available in the next build (build 84).  Hopefully,
those of you who use them will find the changes helpful.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]

2008-02-11 Thread Jim Dunham

Enrico,


Is there any forecast to improve the efficiency of the replication
mechanisms of ZFS ? Fishwork - new NAS release 


I would take some time to talk with the customer and understand exactly
what their expectations are for replication.  I would not base my
decision on the cost of replicating 10 bytes, regardless of how
inefficient it may be.


These two documents should help:

http://www.sun.com/storagetek/white-papers/data_replication_strategies.pdf
http://www.sun.com/storagetek/white-papers/enterprise_continuity.pdf

Two key metrics of replication are:

Recovery Point Objective (RPO) is the amount of data lost (or less),
measured as a unit of time.  Once-a-day backups yield a 24 hour RPO, hourly
snapshots yield a ~1 hour RPO, asynchronous replication yields
zero seconds to a few minutes RPO, and synchronous replication means
zero seconds RPO.


Recovery Time Objective (RTO) is the amount of time after a failure
until normal operations are restored.  Tape backups could take minutes
to hours; local snapshots could be nearly instantaneous, assuming the
local site survived the failure.  Remote snapshots or replicas could take
minutes, hours or days, depending on the amount of data to
resynchronize, as impacted by network bandwidth and latency.


Availability Suite has a unique feature in this last area, called
on-demand pull.  Assuming that the primary site's volumes are lost, after
they have been re-provisioned, a reverse update can be initiated.
Besides the background resilvering in the reverse direction being
active, eventually restoring all lost data, on-demand pull performs
synchronous replication of data blocks on demand, as needed by the
filesystem, database or application.  Although the performance will be
less than synchronous replication, the RTO is quite low.  This type of
recovery is analogous to losing one's entire email account and having
recovery initiated, but with selected email able to be opened as needed
before the entire volume is restored, using on-demand requests to
satisfy data blocks for the relevant email requests.


Jim





Considering the solution we are offering to our customer (5 remote
sites replicating to one central data-center) with ZFS (the cheapest
solution), I should assume
3 times the network load of a solution based on SNDR/AVS, and 3 times
the storage space too... correct?

Is there any documentation on that?
Thanks

Richard Elling wrote:

Enrico Rampazzo wrote:

Hello
I'm offering a solution based on our disks where replication and
storage management should be made using only ZFS...
The test changes a few bytes in one file (10 bytes) and checks how
many bytes the source sends to the target.
The customer tried the replication between 2 volumes... They
compared ZFS replication with TrueCopy replication and came to the
following conclusions:

 1. ZFS uses a block bigger than HDS true copy



ZFS uses dynamic block sizes.  Depending on the configuration and
workload, just a few disk blocks will change, or a bunch of redundant
metadata might change.  In either case, changing the ZFS recordsize
will make little, if any, change.

 2. TrueCopy sends 32 KBytes while ZFS sends 100K or more when changing
only 10 file bytes

Can we configure ZFS to improve replication efficiencies ?



By default, ZFS writes two copies of metadata. I would not recommend
reducing this because it will increase your exposure to faults.  What may
be happening here is that a 10 byte write may cause a metadata change
resulting in a minimum of three 512 byte physical blocks being
changed. The metadata copies are spatially diverse, so you may see
these three
blocks starting at non-contiguous boundaries.  If Truecopy sends only
32kByte blocks (speculation), then the remote transfer will be 96kBytes
for 3 local, physical block writes.

OTOH, ZFS will coalesce writes.  So you may be able to update a
number of files yet still only replicate 96kBytes through Truecopy.
YMMV.

Since the customer is performing replication, I'll assume they are very
interested in data protection, so keeping the redundant metadata is a
good idea. The customer should also be aware that replication at the
application level is *always* more efficient than replicating somewhere
down the software stack where you lose data context.
-- richard


The solution should consider 5 remote sites replicating to one
central data-center. Considering the ZFS block overhead, the
customer is thinking of buying a solution based on traditional storage
arrays like HDS entry-level arrays (or our 2530/2540). If so, with
ZFS the network traffic and storage space become big problems for
the customer's infrastructure.

Is there any documentation explaining the internal ZFS replication
mechanism to address the customer's doubts? Thanks


Do we need AVS in our solution to solve the problem ?


Thanks



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] Real time mirroring

2008-02-11 Thread justin
Have you looked at AVS? (http://opensolaris.org/os/project/avs/)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-11 Thread johansen
 Is deleting the old files/directories in the ZFS file system
 sufficient or do I need to destroy/recreate the pool and/or file
 system itself?  I've been doing the former.

The former should be sufficient; it's not necessary to destroy the pool.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-11 Thread William Fretts-Saxton
I ran this dtrace script and got no output.  Any ideas?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Real time mirroring

2008-02-11 Thread Ross
Well, 5 minutes after posting that, the resilver completed.  However, despite it 
saying that the resilver completed with 0 errors ten minutes ago, the device 
still shows as unavailable and my pool is still degraded.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Real time mirroring

2008-02-11 Thread Ross
Found my first problems with this today.  The ZFS mirror appears to work fine, 
but if you disconnect one of the iSCSI targets it hangs for 5 mins or more.  
I'm also seeing very concerning behaviour when attempting to re-attach the 
missing disk.

My test scenario is:
 - Two 35GB iSCSI targets are being shared using ZFS shareiscsi=on
 - They are imported to a 3rd Solaris box and used to create a mirrored ZFS pool
 - I use that to mount a NFS share, and connected to that with VMware ESX server

My first test was to clone a virtual machine onto the new volume.  That 
appeared to work fine, so I decided to test the mirroring.  I started another 
clone operation then powered down one of the iSCSI targets.  Well, the clone 
operation seemed to hang as soon as I did that, so I ran zpool status to see 
what was going on.  The news wasn't good:  That hung too.

Nothing happened in either window for a good 5 minutes, then ESX popped up with 
an error saying the virtual disk is either corrupted or not a supported 
format, and at the exact same time the zpool status command completed, but 
showing that all the drives were still ONLINE.

I immediately re-ran zpool status, now it reported that one iSCSI was now 
offline and the pool was running in a degraded state.

So, for some reason it's taken 5 minutes for the iSCSI device to go offline, 
it's locked up ZFS for that entire time, and ZFS reported the wrong status the 
first time around too.

The only good news is that now that ZFS is in a degraded state I can start the 
clone operation again and it completes fine with just half of the mirror 
available.

Next, I powered on the missing server, checked format < /dev/null to ensure 
the drives had re-connected, and used zpool online to re-attach the missing 
disk.  So far it's taken over an hour to attempt to resilver files from a 10 
minute copy, and the progress report is up and down like a yo-yo.  The progress 
reporting from ZFS so far has been:
 - 2.25% done, 0h13m to go
 - 7.20% done, 0h12m to go
 - 6.14% done, 0h8m to go    (odd, how does it go down?)
 ...
 - 78.50% done, 0h2m to go
 - 41.67% done, 0h8m to go   (huh?)
 ...
 - 72.45% done, 0h3m to go
 - 42.42% done, 0h9m to go

Getting concerned now, I'm actually wondering if this is ever going to 
complete, and I have no idea if these problems are ZFS or iSCSI related.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI target using ZFS filesystem as backing

2008-02-11 Thread Darren J Moffat
Ross wrote:
 Bleh, found out why they weren't appearing.  I was just creating a regular 
 ZFS filesystem and setting shareiscsi=on.  If you create a volume it works 
 fine...
 
 I wonder if that's something that could do with being added to the 
 documentation for shareiscsi?  I can see now that all the examples of how to 
 use it are using the zfs create -V command, but can't find anything that 
 explicitly states that shareiscsi needs a fixed size volume.
 
 Should ZFS generate an error if somebody tries to set shareiscsi=on for a 
 filesystem that doesn't support that property?

My initial reaction was yes; however, there is a case where you want to 
set shareiscsi=on for a filesystem.  Setting it on a filesystem allows 
it to be inherited by any volumes created below that point in the 
hierarchy.

Let's take this fictional, but reasonable, dataset hierarchy.

tank/volumes/template/solaris
tank/volumes/template/linux
tank/volumes/template/windows
tank/volumes/archive/
tank/volumes/active/host-abc
tank/volumes/active/host-xyz


tank is the pool name.
volumes is a dataset (with canmount=false if you like)
template, archive, active are also datasets (again canmount=false)

The actual volumes are: solaris, linux, windows, host-abc, host-xyz

So where do we turn on iSCSI sharing?  It could be done at the 
individual volume layer, or it could be done up at the volumes dataset 
layer, e.g.:

zfs set shareiscsi=on tank/volumes/template/solaris
zfs set shareiscsi=on tank/volumes/template/linux
zfs set shareiscsi=on tank/volumes/template/windows
...

or just do:
zfs set shareiscsi=on tank/volumes

Aside: having canmount=false on tank/volumes may or may not be a good 
idea but it depends on the local deployment.
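
With the property set at tank/volumes, any volume created below that
point picks it up automatically.  A rough sketch (the size and volume
name are just examples):

zfs create -V 10g tank/volumes/active/host-new   # inherits shareiscsi=on
zfs get -r shareiscsi tank/volumes               # shows where each value is inherited from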


-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]

2008-02-11 Thread Enrico Rampazzo
Is there any forecast to improve the efficiency of the replication 
mechanisms of ZFS ? Fishwork - new NAS release 

Considering the solution we are offering to our customer (5 remote 
sites replicating to one central data-center) with ZFS (the cheapest 
solution), I should assume
3 times the network load of a solution based on SNDR/AVS, and 3 times the 
storage space too... correct?

Is there any documentation on that?
Thanks

Richard Elling wrote:
 Enrico Rampazzo wrote:
 Hello
 I'm offering a solution based on our disks where replication and 
 storage management should be made using only ZFS...
 The test changes a few bytes in one file (10 bytes) and checks how 
 many bytes the source sends to the target.
 The customer tried the replication between 2 volumes... They compared 
 ZFS replication with TrueCopy replication and came to the following 
 conclusions:

   1. ZFS uses a block bigger than HDS true copy
   

 ZFS uses dynamic block sizes.  Depending on the configuration and
 workload, just a few disk blocks will change, or a bunch of redundant
 metadata might change.  In either case, changing the ZFS recordsize
 will make little, if any, change.

   2. TrueCopy sends 32 KBytes while ZFS sends 100K or more when changing only 10
  file bytes

 Can we configure ZFS to improve replication efficiencies ?
   

 By default, ZFS writes two copies of metadata. I would not recommend
 reducing this because it will increase your exposure to faults.  What may
 be happening here is that a 10 byte write may cause a metadata change
 resulting in a minimum of three 512 byte physical blocks being 
 changed. The metadata copies are spatially diverse, so you may see 
 these three
 blocks starting at non-contiguous boundaries.  If Truecopy sends only
 32kByte blocks (speculation), then the remote transfer will be 96kBytes
 for 3 local, physical block writes.

 OTOH, ZFS will coalesce writes.  So you may be able to update a
 number of files yet still only replicate 96kBytes through Truecopy.
 YMMV.

 Since the customer is performing replication, I'll assume they are very
 interested in data protection, so keeping the redundant metadata is a
 good idea. The customer should also be aware that replication at the
 application level is *always* more efficient than replicating somewhere
 down the software stack where you lose data context.
 -- richard

 The solution should consider 5 remote sites replicating to one 
 central data-center. Considering the ZFS block overhead, the 
 customer is thinking of buying a solution based on traditional storage 
 arrays like HDS entry-level arrays (or our 2530/2540). If so, with 
 ZFS the network traffic and storage space become big problems for 
 the customer's infrastructure.

 Is there any documentation explaining the internal ZFS replication 
 mechanism to address the customer's doubts? Thanks
   
 Do we need AVS in our solution to solve the problem ?
  
 Thanks
   

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scrub halts

2008-02-11 Thread Will Murnane
On Feb 12, 2008 4:45 AM, Lida Horn [EMAIL PROTECTED] wrote:
 The latest changes to the sata and marvell88sx modules
 have been put  back to Solaris Nevada and should be
 available in the next build (build 84).  Hopefully,
 those of you who use it will find the changes helpful.
I have indeed found it beneficial.  I installed the new drivers on two
machines, both of which were intermittently giving errors about device
resets.  One card did this so often that I believed the card was
faulty and I would have to replace either the card or the motherboard.

Since installing the new drivers I've had no issues whatsoever with
drives on either box.  I ran zpool scrubs continuously on the flaky
box, replaced a disk with another one, and copied data about in an
attempt to replicate the bus errors I had previously seen, to no
avail.  The other box has been similarly stable, as far as I can tell;
I see no messages in the logs and the users haven't complained when I
asked them.

Thank you for the work you've put into improving the state of these
drivers; I meant to email you earlier this week and mention the great
strides they have made, but other things took precedence.  That, to my
mind, is the primary evolution these drivers have made: I don't have
to worry about my HBAs any more.

Thanks!
Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Real time mirroring

2008-02-11 Thread Ross
Well, I got it working, but not in a tidy way.  I'm running HA-ZFS here, so I 
moved the ZFS pool over to the other node in the cluster.  That had exactly the 
same problems, however: the iSCSI disks were unavailable.

Then I found an article from November 2006 
(http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html), saying that 
the iSCSI initiator won't reconnect until you reboot.  I rebooted one node of 
the cluster, then swapped ZFS back over to it and voila!  Fully working 
mirrored storage again.

So I guess it's an iSCSI initiator problem in that it doesn't reconnect 
properly to a rebooted target, but it's not a particularly stable solution at 
this stage.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI target using ZFS filesystem as backing

2008-02-11 Thread Ross
Bleh, found out why they weren't appearing.  I was just creating a regular ZFS 
filesystem and setting shareiscsi=on.  If you create a volume it works fine...

I wonder if that's something that could do with being added to the 
documentation for shareiscsi?  I can see now that all the examples of how to 
use it are using the zfs create -V command, but can't find anything that 
explicitly states that shareiscsi needs a fixed size volume.

Should ZFS generate an error if somebody tries to set shareiscsi=on for a 
filesystem that doesn't support that property?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is swap still needed on c0d0s1 to get crash dumps?

2008-02-11 Thread Roman Morokutti
Thank you for your info. So with dumpadm I can
manage crash dumps. And if ZFS is not capable
of handling those dumps, who cares -- I will just
create an extra slice for that purpose. No
problem.
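
For my own notes, the commands would be something like this (the slice
is just the one from the subject line; use whatever slice you dedicate
to dumps):

   dumpadm                            # show the current dump configuration
   dumpadm -d /dev/dsk/c0d0s1         # point crash dumps at a dedicated slice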

Roman
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss