Re: [zfs-discuss] ZFS and powerpath

2007-07-23 Thread Moore, Joe
Brian Wilson wrote:
 On Jul 16, 2007, at 6:06 PM, Torrey McMahon wrote:
  Darren Dunham wrote:
  My previous experience with powerpath was that it rode below the Solaris
  device layer.  So you couldn't cause trespass by using the wrong device.
  It would just go to powerpath, which would choose the link to use on its
  own.
 
  Is this not true or has it changed over time?
  I haven't looked at PowerPath for some time, but it used to be the
  opposite: the powerpath node sat on top of the actual device paths.

  One of the selling points of mpxio is that it doesn't have that problem
  (at least for devices it supports).  Most multipath software had that
  same limitation.
 
 
 I agree, it's not true.  I don't know how long it hasn't been true, but
 for the last year and a half I've been implementing PowerPath on Solaris
 8, 9, and 10, and the way to make it work is to point whatever disk tool
 you're using at the emcpower device.  The other paths are there because
 Leadville finds and creates them (if you're using Leadville), but
 PowerPath isn't doing anything to make them redundant; it gives you the
 emcpower device, plus the emcp etc. drivers to front-end them, as a
 multipathed device (the emcpower device).  It DOES choose which path to
 use, for all I/O going through the emcpower device.  In a situation where
 you lose paths while I/O is moving, you'll see SCSI errors down one path,
 then the next, then the next, as PowerPath gets fed the SCSI error and
 tries the next device path.  If you use those actual device paths, you're
 not actually getting a device that PowerPath is multipathing for you
 (i.e. it does not dig in beneath the scsi driver).

I'm afraid I have to disagree with you: I'm using the
/dev/dsk/c2t$WWNdXs2 devices quite happily with powerpath handling
failover for my clariion.

# powermt version
EMC powermt for PowerPath (c) Version 4.4.0 (build 274)
# powermt display dev=58
Pseudo name=emcpower58a
CLARiiON ID=APM00051704678 [uscicsap1]
Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1:
/oracle/Q02/saparch]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP A

==========================================================================================
------------------- Host -------------------            - Stor -   -- I/O Path --   -- Stats ---
###  HW Path                         I/O Paths                     Interf.   Mode    State   Q-IOs Errors
==========================================================================================
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016130202E48d58s0   SP A1     active  alive       0      0
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016930202E48d58s0   SP B1     active  alive       0      0
# fsck /dev/dsk/c2t5006016130202E48d58s0
** /dev/dsk/c2t5006016130202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups

FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n

144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0%
fragmentation)
# fsck /dev/dsk/c2t5006016930202E48d58s0
** /dev/dsk/c2t5006016930202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups

FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n

144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0%
fragmentation)

### So at this point, I can look down either path and get to my data.
Now I kill one of the two paths via SAN zoning, run cfgadm -c configure c2,
and powermt check reports that the path to SP A is now dead.  I'm still able
to fsck the dead path:
# cfgadm -c configure c2
# powermt check
Warning: CLARiiON device path c2t5006016130202E48d58s0 is currently
dead.
Do you want to remove it (y/n/a/q)? n
# powermt display dev=58
Pseudo name=emcpower58a
CLARiiON ID=APM00051704678 [uscicsap1]
Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1:
/oracle/Q02/saparch]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP B

==========================================================================================
------------------- Host -------------------            - Stor -   -- I/O Path --   -- Stats ---
###  HW Path                         I/O Paths                     Interf.   Mode    State   Q-IOs Errors
==========================================================================================
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016130202E48d58s0   SP A1     active  dead        0      1
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016930202E48d58s0   SP B1     active  alive       0      0
# fsck /dev/dsk/c2t5006016130202E48d58s0
** /dev/dsk/c2t5006016130202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** 

Re: [zfs-discuss] ZFS send needs optimalization

2007-07-23 Thread Robert Milkowski
Hello Łukasz,

Monday, July 23, 2007, 1:19:16 PM, you wrote:

Ł ZFS send is very slow.
Ł The dmu_sendbackup function traverses the dataset in one thread, and in the
Ł traverse callback function (backup_cb) we wait for data in arc_read called
Ł with the ARC_WAIT flag.

Ł I want to parallelize zfs send to make it faster.  dmu_sendbackup could
Ł allocate a buffer to be used for buffering output.  A few threads could
Ł traverse the dataset, and a few threads would be used for async read
Ł operations.

Ł I think it could speed up the zfs send operation 10x.

Ł What do you think about it?

I guess you should check with Matthew Ahrens, as IIRC he's working on
'zfs send -r' and possibly some other improvements to zfs send. The
question is what code changes Matthew has made so far (they haven't been
integrated AFAIK), and you could possibly work from there. Or perhaps he's
already working on this as well...

Now, if the pool resides on lots of disks then I guess this should speed up
zfs send considerably, at least in some cases (lots of small files,
written/deleted/created randomly).

Then it would be great if you could implement something and share some
results with us, to see if there's actually a performance gain.
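
BTW, a crude userland approximation of the buffering part of that idea (not
the in-kernel change you're describing; the dataset names, host name and the
1 MB block size below are only placeholders) is to re-block the stream between
send and recv, which adds a little buffering between the traverse side and the
receive side:

   zfs send pool/fs@snap | dd obs=1048576 | ssh otherhost zfs recv -d backup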

Also, I guess you'll have to write all transactions to the other end
(zfs recv) in the same order they were created on disk, or not?


ps. Lukasz - nice to see you here more and more :)


-- 
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com



[zfs-discuss] Sharemgr Test Suite Released on OpenSolaris.org

2007-07-23 Thread Jim Walker
The Sharemgr test suite is available on OpenSolaris.org.  
  
The source tarball, binary package and baseline can be downloaded from the test
consolidation download center at:
http://dlc.sun.com/osol/test/downloads/current

The source code can be viewed in the Solaris Test Collection (STC) 2.0 source
tree at:
http://cvs.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share

The SUNWstc-tetlite package must be installed prior to executing a Sharemgr
test run. More information on the Sharemgr test suite is available in the
Sharemgr README file at:
http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share/README
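
For example, installation typically comes down to a couple of pkgadd runs (the
package file paths below are placeholders, and SUNWstc-sharemgr is only a guess
at the suite's own package name -- check the README for the real one):

   pkgadd -d /var/tmp/SUNWstc-tetlite.pkg
   pkgadd -d /var/tmp/SUNWstc-sharemgr.pkg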

Any questions about the Sharemgr test suite can be sent to testing discuss at:
http://www.opensolaris.org/os/community/testing/discussions 

Cheers,  
Jim
 
 


[zfs-discuss] ZFS and NFS Mounting - Missing Permissions

2007-07-23 Thread Scott Adair
Hi

I'm trying to set up a new NFS server and wish to use Solaris and ZFS. I have a
ZFS filesystem set up to handle the users' home directories, and have set up
sharing:

   # zfs list
   NAME   USED  AVAIL  REFER  MOUNTPOINT
   data   896K  9.75G  35.3K  /data
   data/home  751K  9.75G  38.0K  /data/home
   data/home/bob 32.6K  9.75G  32.6K  /data/home/bob
   data/home/joe  647K  9.37M   647K  /data/home/joe
   data/home/paul32.6K  9.75G  32.6K  /data/home/paul

   # zfs get sharenfs data/home 
   NAME PROPERTY   VALUE  SOURCE
   data/homesharenfs   rw local 

And these directories are owned by the user

   # ls -l /data/home
   total 12
   drwxrwxr-x   2 bob  sigma  2 Jul 23 08:47 bob
   drwxrwxr-x   2 joe  sigma  4 Jul 23 11:31 joe
   drwxrwxr-x   2 paul sigma  2 Jul 23 08:47 paul

I have the top-level directory shared (/data/home). When I mount this on the
client PC (Ubuntu), I lose all the permissions and can't see any of the files:

   [EMAIL PROTECTED]:/nfs/home# ls -l
   total 6
   drwxr-xr-x 2 root root 2 2007-07-23 08:47 bob
   drwxr-xr-x 2 root root 2 2007-07-23 08:47 joe 
   drwxr-xr-x 2 root root 2 2007-07-23 08:47 paul

   [EMAIL PROTECTED]:/nfs/home# ls -l joe
   total 0

However, when I mount each directory manually, it works.. 

   [EMAIL PROTECTED]:~# mount torit01sx:/data/home/joe /scott

   [EMAIL PROTECTED]:~# ls -l /scott
   total 613
   -rwxrwxrwx 1 joe sigma 612352 2007-07-23 11:32 file

Any ideas? When I try the same thing with a UFS-based filesystem, it works as
expected:

   [EMAIL PROTECTED]:/# mount torit01sx:/export/home /scott

   [EMAIL PROTECTED]:/# ls -l scott
   total 1
   drwxr-xr-x 2 joe sigma 512 2007-07-23 12:25 joe

Any help would be greatly appreciated.. Thanks in advance

Scott
 
 


Re: [zfs-discuss] ZFS send needs optimalization

2007-07-23 Thread Matthew Ahrens
Robert Milkowski wrote:
 Hello Łukasz,
 
 Monday, July 23, 2007, 1:19:16 PM, you wrote:
 
 Ł ZFS send is very slow.
 Ł The dmu_sendbackup function traverses the dataset in one thread, and in the
 Ł traverse callback function (backup_cb) we wait for data in arc_read called
 Ł with the ARC_WAIT flag.

That's correct.

 Ł I want to parallelize zfs send to make it faster.  dmu_sendbackup could
 Ł allocate a buffer to be used for buffering output.  A few threads could
 Ł traverse the dataset, and a few threads would be used for async read
 Ł operations.
 
 Ł I think it could speed up the zfs send operation 10x.
 
 Ł What do you think about it?

You're right that we need to issue more I/Os in parallel -- see 6333409
"traversal code should be able to issue multiple reads in parallel".

However, it may be much more straightforward to just issue prefetches 
appropriately, rather than attempt to coordinate multiple threads.  That 
said, feel free to experiment.

 I guess you should check with Matthew Ahrens, as IIRC he's working on
 'zfs send -r' and possibly some other improvements to zfs send. The
 question is what code changes Matthew has made so far (they haven't been
 integrated AFAIK), and you could possibly work from there. Or perhaps he's
 already working on this as well...

Unfortunately I am not working on this bug as part of my zfs send -r 
changes.  But I plan to work on it (unless you get to it first!) later this 
year as part of the pool space reduction changes.

 Also, I guess you'll have to write all transactions to the other end
 (zfs recv) in the same order they were created on disk, or not?

Nope, that's (one of) the beauties of zfs send.

--matt


Re: [zfs-discuss] more love for databases

2007-07-23 Thread eric kustarz

On Jul 22, 2007, at 7:39 PM, JS wrote:

 Is there a way to take advantage of this in Sol10/u03?

 sorry, variable 'zfs_vdev_cache_max' is not defined in the 'zfs' module

That tunable/hack will be available in s10u4:
http://bugs.opensolaris.org/view_bug.do?bug_id=6472021

Wait about a month and it should be officially out...
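
For reference, once you're on a build that includes it, the knob is set the
same way as the other zfs tunables -- in /etc/system, or live with mdb (the
value 4096 below is purely an illustration, not a recommendation):

   set zfs:zfs_vdev_cache_max=4096

   # echo 'zfs_vdev_cache_max/W 0t4096' | mdb -kw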

eric



Re: [zfs-discuss] ZFS and NFS Mounting - Missing Permissions

2007-07-23 Thread Richard Elling
Scott Adair wrote:
 Hi
 
 I'm trying to set up a new NFS server and wish to use Solaris and ZFS. I have
 a ZFS filesystem set up to handle the users' home directories, and have set
 up sharing:
 
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
data   896K  9.75G  35.3K  /data
data/home  751K  9.75G  38.0K  /data/home
data/home/bob 32.6K  9.75G  32.6K  /data/home/bob
data/home/joe  647K  9.37M   647K  /data/home/joe
data/home/paul32.6K  9.75G  32.6K  /data/home/paul
 
# zfs get sharenfs data/home 
NAME PROPERTY   VALUE  SOURCE
data/homesharenfs   rw local 
 
 And these directories are owned by the user
 
# ls -l /data/home
total 12
drwxrwxr-x   2 bob  sigma  2 Jul 23 08:47 bob
drwxrwxr-x   2 joe  sigma  4 Jul 23 11:31 joe
drwxrwxr-x   2 paul sigma  2 Jul 23 08:47 paul
 
 I have the top-level directory shared (/data/home). When I mount this on the
 client PC (Ubuntu), I lose all the permissions and can't see any of the
 files:

/data/home is a different file system than /data/home/joe.  NFS shares do not
cross file system boundaries.  You'll need to share /data/home/joe, too.
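
One way to check and do that (reusing Scott's hostname; the client-side mount
points are only an example): verify that each child filesystem is shared --
sharenfs is an inherited property, so they normally will be -- and then mount
each child explicitly on the Ubuntu client:

   # zfs get -r sharenfs data/home

   client# mount torit01sx:/data/home/bob /nfs/home/bob
   client# mount torit01sx:/data/home/joe /nfs/home/joe
   client# mount torit01sx:/data/home/paul /nfs/home/paul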
  -- richard

[EMAIL PROTECTED]:/nfs/home# ls -l
total 6
drwxr-xr-x 2 root root 2 2007-07-23 08:47 bob
drwxr-xr-x 2 root root 2 2007-07-23 08:47 joe 
drwxr-xr-x 2 root root 2 2007-07-23 08:47 paul
 
[EMAIL PROTECTED]:/nfs/home# ls -l joe
total 0
 
 However, when I mount each directory manually, it works.. 
 
[EMAIL PROTECTED]:~# mount torit01sx:/data/home/joe /scott
 
[EMAIL PROTECTED]:~# ls -l /scott
total 613
-rwxrwxrwx 1 joe sigma 612352 2007-07-23 11:32 file
 
 Any ideas? When I try the same thing with a UFS-based filesystem, it works as
 expected:
 
[EMAIL PROTECTED]:/# mount torit01sx:/export/home /scott
 
[EMAIL PROTECTED]:/# ls -l scott
total 1
drwxr-xr-x 2 joe sigma 512 2007-07-23 12:25 joe
 
 Any help would be greatly appreciated.. Thanks in advance
 
 Scott
  
  


Re: [zfs-discuss] General recommendations on raidz groups of different sizes

2007-07-23 Thread Richard Elling
Richard Elling wrote:
 Haudy Kazemi wrote:
 How would one calculate system reliability estimates here? One is a 
 RAIDZ set of 6 disks, the other a set of 8. The reliability of each 
 RAIDZ set by itself isn't too hard to calculate, but put together, 
 especially since they're different sizes, I don't know.
 
 We just weigh them accordingly.  MTTDL for a 6-disk set will be better
 than for an 8-disk set, though that seems to be counter-intuitive for
 some folks.  Let me see if I can put some numbers together later this
 week...

OK, after some math we can get some idea...

Using the MTTDL[1] model for a default disk (500 GBytes, 800k hours MTBF,
24 hours logistical response, 60 GBytes/hr resync) we get:

config                          MTTDL[1] (yrs)

6-disk raidz                            75,319
8-disk raidz                            40,349
2x 6-disk raidz                         37,659
6-disk raidz + 8-disk raidz             26,274
2x 8-disk raidz                         20,175

As you would expect, the MTTDL for the 6-disk scenario is better than for
the 8-disk scenario.  So it follows that the MTTDL for a pair of 6-disk
raidz sets is better than for a pair of 8-disk raidz sets and the 6+8
scenario is in between.

The MTTDL[1] model is described at:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
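
For anyone who wants to reproduce the numbers: assuming the usual single-parity
form MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR), with MTTR = 24 h logistics +
(500 GB / 60 GB/h) resync, and independent top-level sets combining as
1 / (1/a + 1/b), the 6-disk row and the 6+8 row fall out directly:

   echo '800000^2 / (6 * 5 * (24 + 500/60)) / 8760' | bc -l    (~75319 years)
   echo '1 / (1/75319 + 1/40349)' | bc -l                      (~26274 years)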

  -- richard


Re: [zfs-discuss] ETA of device evacuation?

2007-07-23 Thread Mark Ashley

Hi Louwtjie,

(CC'd to the list as an FYI to others)

The biggest gotcha is that the SE6140s have a 12-byte SCSI command descriptor 
block, and thus can only present 2TB LUNs to the host.  That's not an issue 
with ZFS, however, since you can just tack the LUNs together and grow your 
pool that way.  See the attached PNG; that's how we're doing it.  You'd have 
one ZFS file system on top of the pool for your customer's setup.
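
In zpool terms that's just a pool striped across the sub-2TB LUNs, grown by
adding more LUNs as you go (the pool, filesystem and device names below are
made up):

   # zpool create dbpool c6t0d0 c6t0d1 c6t0d2
   # zfs create dbpool/customer
     ... later, when another 2TB LUN has been carved out on an array ...
   # zpool add dbpool c6t0d3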


SE6140 limitations:
   Maximum Volumes Per Array 2,048
   Maximum Volumes Per RAID group 256
   Maximum Volume size 2 TB (minus 12 GB)  
   Maximum Drives in RAID group 30 
   Maximum RAID group Size 20 TB   
   Storage Partitions Yes
   Maximum Total Partitions 64
   Maximum Hosts per Partition 256
   Maximum Volumes per Partition 256
   Maximum Number of Global Hot Spares 15

The above limits might matter if you thought you'd just have one fat LUN 
coming from your SE6140; you can't do it that way.  But, as shown in the 
picture, you can use ZFS to do all of that for you.  If you keep all of your 
LUNs at exactly 2000 GB when you create them, you can later mirror and then 
detach an array's LUNs one by one until you can remove the array, as sketched 
below.
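
That migration is an ordinary ZFS mirror/detach cycle, one LUN at a time
(device names are again made up; c6t0d2 is a LUN on the array being freed,
c7t0d0 an equal-sized LUN elsewhere):

   # zpool attach dbpool c6t0d2 c7t0d0
   # zpool status dbpool        (wait for the resilver to complete)
   # zpool detach dbpool c6t0d2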


It'll be nice when ZFS has the ability to natively remove LUNs from a 
pool, expected in several months apparently.


Don't try to install the SE6140 software on Solaris 11 unless you're good at 
porting.  It's possible (our NFS server is Sol 11 b64a) but it's not end-user 
friendly; Solaris 9 or 10 is fine.  We needed the Sol 11 b64a version for the 
ZFS iSCSI abilities, which are fixed in that release.


When setting up the SE6140s, I found the serial ports didn't function 
with the supplied cables, at least on our equipment. Be proactive and 
wire all the SE6140s into one management network and put a DHCP server 
on there that allocates IPs according to the MAC addresses on the 
controller cards. Then, and not before, go and register the arrays in 
the Common Array Manager software (and download the latest (May 2007) version 
from Sun first, too).  Trying to change an array's IP once it's in the CAM 
setup is nasty.


From what you describe you can do it all with just one array. All of 
ours are 750GB SATA disk SE6140s, expandable to 88TB per array. Our 
biggest is one controller and one expansion tray so we have lots of 
headroom.


You lose up to 20% of your raw capacity to the SE6140 RAID5 volume setup and 
the ZFS overheads, so keep that in mind when scoping your solution.


In your /etc/system, put in the tuning:

set zfs:zil_disable=1
set zfs:zfs_nocacheflush=1

Our NFS database clients also have these tunings:

VCS mount opts (or /etc/vfstab for you)
MountOpt = rw,bg,hard,intr,proto=tcp,vers=3,rsize=32768,wsize=32768,forcedirectio


ce.conf mods:
name=pci108e,abba parent=/[EMAIL PROTECTED],70 unit-address=1 
adv_autoneg_cap=1 adv_1000fdx_cap=1 accept_jumbo=1 adv_pause_cap=1;


This gets about 400MBytes/s all together, running to a T2000 NFS server. 
That's pretty much the limits of the hardware so we're happy with that :)


We've yet to look at MPxIO and load balancing across controllers.  Plus I'm 
not sure I've tuned the file systems for Oracle block sizes; depending on your 
solution, that probably isn't an issue for you.


We like the ability to do ZFS snapshots and clones: we can copy an entire DB 
setup and create a clone in about ten seconds, where it used to take hours 
using the EMCs.
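
For anyone who hasn't tried it, that clone is literally two commands (dataset
names made up):

   # zfs snapshot dbpool/oradata@refresh
   # zfs clone dbpool/oradata@refresh dbpool/oradata-clone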


Cheers,
Mark.



After reading your post ... I was wondering whether you would give
some input/advice on a certain configuration I'm working on.

A (potential) customer is considering using a server (probably a Sun
Galaxy) connected to 2 switches and lots (lots!) of 6140s.

- One large filesystem
- 70TB
- No downtime growth/expansion

Since it seems that you have several 6140's under ZFS control ... any
problems/comments for me?

Thank you.

On 7/19/07, Mark Ashley [EMAIL PROTECTED] wrote:

Hi folks,

One of the things I'm really hanging out for is the ability to evacuate
the data from a zpool device onto the other devices and then remove the
device. Without mirroring it first etc. The zpool would of course shrink
in size according to how much space you just took away.

Our situation is we have a number of SE6140 arrays attached to a host
with a total of 35TB. Some arrays are owned by other projects but are on
loan for a while. I'd like to make one very large pool from the (maximum
2TB! wtf!) LUNs from the SE6140s and once our DBAs are done with the
workspace, remove the LUNs and free up the SE6140 arrays so their owners
can begin to use them.

At the moment once a device is in a zpool, it's stuck there. That's a
problem. What sort of time frame are we looking at until it's possible
to remove LUNs from zpools?

ta,
Mark. 


inline: SE6140_to_ZFS.png