Re: [zfs-discuss] Petabyte pool?
> hakan...@ohsu.edu said:
>> I get a little nervous at the thought of hooking all that up to a single
>> server, and am a little vague on how much RAM would be advisable, other
>> than "as much as will fit" (:-). Then again, I've been waiting for
>> something like pNFS/NFSv4.1 to be usable for gluing together multiple NFS
>> servers into a single global namespace, without any sign of that happening
>> anytime soon.

> richard.ell...@gmail.com said:
> NFS v4 or DFS (or even clever sysadmin + automount) offers single namespace
> without needing the complexity of NFSv4.1, lustre, glusterfs, etc.

Been using NFSv4 since it showed up in Solaris-10 FCS, and it is true that
I've been clever enough (without automount -- I like my computers to be as
deterministic as possible, thank you very much :-) for our NFS clients to
see a single directory-tree namespace which abstracts away the actual
server/location of a particular piece of data.

However, we find it starts getting hard to manage when a single project
(think "directory node") needs more space than their current NFS server will
hold. Or perhaps what you're getting at above is even more clever than I
have been to date, and is eluding me at the moment. I did see someone
mention "NFSv4 referrals" recently, maybe that would help.

Plus, believe it or not, some of our customers still insist on having the
server name in their path hierarchy for some reason, like /home/mynfs1/,
/home/mynfs2/, and so on. Perhaps I've just not been persuasive enough
yet (:-).

richard.ell...@gmail.com said:
> Don't forget about backups :-)

I was hoping I could get by with telling them to buy two of everything.

Thanks and regards,

Marion
Re: [zfs-discuss] Petabyte pool?
>>> Ray said:
>>> Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
>>> to a couple of LSI SAS switches.
>>
>> Marion said:
>> How many HBA's in the R720?
>
> Ray said:
> We have qty 2 LSI SAS 9201-16e HBA's (Dell resold[1]).

Sounds similar in approach to the Aberdeen product another sender referred
to, with SAS switch layout:
  http://www.aberdeeninc.com/images/1-up-petarack2.jpg

One concern I had is that I compared our SuperMicro JBOD with 40x 4TB drives
in it, connected via a dual-port LSI SAS 9200-8e HBA, to the same pool
layout on a 40-slot server with 40x SATA drives in it. But the server uses
no SAS expanders, instead using SAS-to-SATA octopus cables to connect the
drives directly to three internal SAS HBA's (2x 9201-16i's, 1x 9211-8i).

What I found was that the internal pool was significantly faster for both
sequential and random I/O than the pool on the external JBOD. My conclusion
was that I would not want to exceed ~48 drives on a single 8-port SAS HBA.
So I thought that running the I/O of all your hundreds of drives through
only two HBA's would be a bottleneck.

LSI's specs say 4800 MBytes/sec for an 8-port SAS HBA, but 4000 MBytes/sec
for that card in an x8 PCIe-2.0 slot. Sure, the newer 9207-8e is rated at
8000 MBytes/sec in an x8 PCIe-3.0 slot, but it still has only the same 8 SAS
ports going at 4800 MBytes/sec.

Yes, I know the disks probably can't go that fast. But in my tests above,
the internal 40-disk pool measures 2000 MBytes/sec sequential reads and
writes, while the external 40-disk JBOD measures at 1500 to 1700 MBytes/sec.
Not a lot slower, but significantly slower, so I do think the number of
HBA's makes a difference.

At the moment, I'm leaning toward piling six, eight, or ten HBA's into a
server, preferably one with dual IOH's (thus two PCIe busses), and
connecting dual-path JBOD's in that manner. I hadn't looked into SAS
switches much, but they do look more reliable than daisy-chaining a bunch of
JBOD's together. I just haven't seen how to get more bandwidth through them
to a single host.

Regards,

Marion
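P.S. In case anyone wants to check my arithmetic, the figures above come
from rough back-of-the-envelope reckoning like this (rounded, ignoring
protocol overhead, so treat it as a sketch rather than gospel):

  8 SAS-2 ports x 600 MBytes/sec/port (6Gbit/s after 8b/10b)  ~ 4800 MBytes/sec
  PCIe 2.0 x8:  8 lanes x ~500 MBytes/sec/lane                ~ 4000 MBytes/sec
  PCIe 3.0 x8:  8 lanes x ~985 MBytes/sec/lane                ~ 8000 MBytes/sec

So on a 9200-8e the PCIe-2.0 slot is the limit, and on a 9207-8e the eight
SAS ports become the limit again.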
Re: [zfs-discuss] Petabyte pool?
rvandol...@esri.com said:
> We've come close:
>
> admin@mes-str-imgnx-p1:~$ zpool list
> NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
> datapool   978T   298T   680T  30%  1.00x  ONLINE  -
> syspool    278G   104G   174G  37%  1.00x  ONLINE  -
>
> Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
> to a couple of LSI SAS switches.

Thanks Ray,

We've been looking at those too (we've had good luck with our MD1200's).
How many HBA's in the R720?

Thanks and regards,

Marion
[zfs-discuss] Petabyte pool?
Greetings,

Has anyone out there built a 1-petabyte pool? I've been asked to look into
this, and was told "low performance" is fine, workload is likely to be
write-once, read-occasionally, archive storage of gene sequencing data.
Probably a single 10Gbit NIC for connectivity is sufficient.

We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".

I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other than
"as much as will fit" (:-). Then again, I've been waiting for something like
pNFS/NFSv4.1 to be usable for gluing together multiple NFS servers into a
single global namespace, without any sign of that happening anytime soon.

So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?

Thanks and regards,

Marion
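P.S. Here's the back-of-the-envelope I was working from, in case anyone
wants to poke holes in it. The vdev layout is just an assumption for the
sake of the arithmetic (4x 11-disk raidz3 plus a spare per 45-slot chassis);
other layouts will move the numbers around a bit:

  45 slots x 4TB                   = 180 TB raw per chassis
  4 vdevs x (11 - 3 parity) x 4TB  = 128 TB usable per chassis (before ZFS overhead)
  1000 TB / 128 TB                 ~  8 chassis for a "marketing" petabyte
  1024 TiB ~ 1126 TB; 1126 / 128   ~  9-10 chassis for a power-of-two petabyte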
Re: [zfs-discuss] mpt_sas multipath problem?
j...@opensolaris.org said:
> Output from 'prtconf -v' would help, as would a cogent description of what
> you are looking at to determine that MPxIO isn't working.

Sorry James, I must've made a cut-and-paste-o and left out my description of
the symptom. That being, 40 new drives show up as 80 new disk devices at
the OS level (in "format", in "cfgadm -alv", in "ls /dev/dsk" and in
"prtconf -Dv" listings).

Adding the drives' string to a white-list in scsi_vhci.conf got us going,
thanks to Richard's reminder. I do have before and after prtconf listings,
if anyone is interested.

Regards,

Marion
Re: [zfs-discuss] mpt_sas multipath problem?
richard.ell...@gmail.com said:
> Sometimes the mpxio detection doesn't work properly. You can try to
> whitelist them, https://www.illumos.org/issues/644

And I said:
> Thanks Richard, I was hoping I hadn't just made up my vague memory of such
> functionality. We'll give it a try.

That did the trick. I added these lines to /kernel/drv/scsi_vhci.conf, at
the end of the file:

  scsi-vhci-failover-override =
      "WD WD4001FYYG-01SL3", "f_sym";   # WD RE 4TB SAS HDD

A reboot was involved, as I wasn't able to coax the system into re-reading
the scsi_vhci.conf file using "update_drv scsi_vhci", nor by unplugging and
replugging the JBOD's SAS cables, "cfgadm -c unconfigure c49", etc.

I'm off to exercise it with filebench tomorrow.

Thanks and regards,

Marion
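P.S. For anyone else trying this, a quick way to confirm mpxio has actually
claimed the drives after the reboot is something like the following (the
long LU device name is just a made-up example; substitute one of yours):

  # mpathadm list lu        # each drive should appear once, with a path count of 2
  # mpathadm show lu /dev/rdsk/c0t5000CCA01AB12345d0s2

If the drives still show up twice in "format" and only once-per-path in
mpathadm, the override string didn't match and needs adjusting.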
Re: [zfs-discuss] mpt_sas multipath problem?
> On Jan 7, 2013, at 1:20 PM, Marion Hakanson wrote:
> Greetings,
> We're trying out a new JBOD here. Multipath (mpxio) is not working, and we
> could use some feedback and/or troubleshooting advice.
> . . .

richard.ell...@gmail.com said:
> Sometimes the mpxio detection doesn't work properly. You can try to
> whitelist them, https://www.illumos.org/issues/644

Thanks Richard, I was hoping I hadn't just made up my vague memory of such
functionality. We'll give it a try.

Regards,

Marion
[zfs-discuss] mpt_sas multipath problem?
Greetings,

We're trying out a new JBOD here. Multipath (mpxio) is not working, and we
could use some feedback and/or troubleshooting advice.

The OS is oi151a7, running on an existing server with a 54TB pool of
internal drives. I believe the server hardware is not relevant to the JBOD
issue, although the internal drives do appear to the OS with multipath
device names (despite the fact that these internal drives are cabled up in a
single-path configuration). If anything, this does confirm that multipath
is enabled in mpt_sas.conf via the mpxio-disable="no" directive (internal
HBA's are LSI SAS, 2x 9201-16i and 1x 9211-8i).

The JBOD is a SuperMicro 847E26-RJBOD1, with the front backplane
daisy-chained to the rear backplane (both expanders). Each of the two
expander chains is connected to one port of an LSI SAS 9200-8e HBA. So far,
all this hardware has appeared as working for others and well-supported, and
this 9200-8e is running the -IT firmware, version 15.0.0.0. The drives are
40x of the WD4001FYYG SAS 4TB variety, firmware VR02.

The spot-checks I've done so far seem to show that both device instances of
a drive show up in "prtconf -Dv" with identical serial numbers and identical
"devid" and "guid" values, so I'm not sure what might be missing to allow
mpxio to recognize them as the same device.

Has anyone out there got this type of hardware working? In a multipath
configuration? Suggestions on mdb or dtrace code I can use to debug? Are
there "secrets" to the internal daisy-chain cabling that our vendor is not
aware of?

Thanks and regards,

Marion
Re: [zfs-discuss] cannot replace X with Y: devices have different sector alignment
tron...@gmail.com said:
> That said, I've already migrated far too many times already. I really,
> really don't want to migrate the pool again, if it can be avoided. I've
> already migrated from raidz1 to raidz2 and then from raidz2 to mirror
> vdevs. Then, even though I already had a mix of 512b and 4k discs in the
> pool, when I bought new 3TB discs, I couldn't add them to the pool, and I
> had to set up a new pool with ashift=12. In retrospect, I should have
> built the new pool without the 2TB drives, and had I known what I do now,
> I would definately have done that.

Are you sure you can't find 3TB/4TB drives with 512b sectors? If you can
believe the "User Sectors Per Drive" specifications, these WD disks do:
  WD4000FYYZ, WD3000FYYZ

Those are the SATA part-numbers; there are SAS equivalents. I also found
the Hitachi UltraStar 7K3000 and 7K4000 drives claim to support 512-byte
sector sizes.

Sure, they're expensive, but what enterprise-grade drives aren't? And, they
might solve your problem.

Regards,

Marion
Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar
richard.ell...@richardelling.com said:
> We are starting to see a number of SAS HDDs that prefer logical-block to
> round-robin. I see this with late model Seagate and Toshiba HDDs.
>
> There is another, similar issue with recognition of multipathing by the
> scsi_vhci driver. Both of these are being tracked as
> https://www.illumos.org/issues/644 and there is an alternate scsi_vhci.conf
> file posted in that bugid.

Interesting, I just last week had a Toshiba come from Dell as a replacement
for a Seagate 2TB SAS drive; on Solaris-10, the Toshiba insisted on showing
up as 2 drives, so mpxio was not recognizing it. Fortunately I was able to
swap the drive for a Seagate, but I'll stash away a copy of the
scsi_vhci.conf entry for the future.

> We're considering making logical-block the default (as in above bugid) and
> we have not discovered a reason to keep round-robin. If you know of any
> reason why round-robin is useful, please add to the bugid.

Should be fine. When I first ran into this a couple years ago, I did a lot
of tests and found logical-block to be slower than "none" (with those
Seagate 2TB SAS drives in Dell MD1200's), but not a whole lot slower. I
vaguely recall that round-robin was better for highly random, small I/O
(IOPS-intensive) workloads.

I got the best results by manually load-balancing half the drives to one
path and half the drives to the other path. But I decided it was not worth
the effort. Maybe if there was a way to automatically do that (with a
relatively static result)...

Of course, this was all tested on Solaris-10, so your mileage may vary.

Regards,

Marion
Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar
a...@blackandcode.com said:
> I'm spec'ing out a Thumper-esque solution and having trouble finding my
> favorite Hitachi Ultrastar 2TB drives at a reasonable post-flood price.
> The Seagate Constellations seem pretty reasonable given the market
> circumstances but I don't have any experience with them. Anybody using
> these in their ZFS systems and have you had good luck?

We have a lot of 2TB and 3TB Seagates here, they work fine. Most of ours
are the Nearline-SAS variety, in Dell MD1200 enclosures, used on Windows &
Linux behind PERC H800 RAID cards, and on Solaris-10 and OpenIndiana behind
LSI SAS HBA's. We do have one new server with a pile of 2TB SATA Seagate's
as well, so far working fine.

The only caveat I've found is that the Nearline SAS Seagates go really slow
with the Solaris default multipath load-balancing setting (round-robin).
Set it to "none" or some large block value and they go fast. This issue
doesn't appear when used with the PERC H800's.

Regards,

Marion
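P.S. For the archives, the load-balancing knob I'm talking about lives in
/kernel/drv/scsi_vhci.conf. A minimal sketch of the global change we make
(per-device-type overrides are also possible via the stanzas described in
that file's comments, and a reboot is the sure way to make it take effect):

  # /kernel/drv/scsi_vhci.conf
  # default is load-balance="round-robin"; we use:
  load-balance="none";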
Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
p...@kraus-haus.org said:
> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
> Achieving 500MB/sec. with 8KB files and lots of random accesses is really
> hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
> 100MB+ files is much easier.
> . . .
> For ZFS, performance is proportional to the number of vdevs NOT the
> number of drives or the number of drives per vdev. See
> https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
> for some testing I did a while back. I did not test sequential read as
> that is not part of our workload.
> . . .
> I understand why the read performance scales with the number of vdevs,
> but I have never really understood _why_ it does not also scale with the
> number of drives in each vdev. When I did my testing with 40 dribves, I
> expected similar READ performance regardless of the layout, but that was
> NOT the case.

In your first paragraph you make the important point that "performance" is
too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above,
you go back to using "performance" in its ambiguous form. I assume that by
"performance" you are mostly focussing on random-read performance.

My experience is that sequential read performance _does_ scale with the
number of drives in each vdev. Both sequential and random write performance
also scales in this manner (note that ZFS tends to save up small, random
writes and flush them out in a sequential batch).

Small, random read performance does not scale with the number of drives in
each raidz[123] vdev because of the dynamic striping. In order to read a
single logical block, ZFS has to read all the segments of that logical
block, which have been spread out across multiple drives, in order to
validate the checksum before returning that logical block to the
application. This is why a single vdev's random-read performance is
equivalent to the random-read performance of a single drive.

p...@kraus-haus.org said:
> The recommendation is to not go over 8 or so drives per vdev, but that is
> a performance issue NOT a reliability one. I have also not been able to
> duplicate others observations that 2^N drives per vdev is a magic number
> (4, 8, 16, etc). As you can see from the above, even a 40 drive vdev works
> and is reliable, just (relatively) slow :-)

Again, the "performance issue" you describe above is for the random-read
case, not sequential. If you rarely experience small-random-read workloads,
then raidz* will perform just fine. We often see 2000 MBytes/sec sequential
read (and write) performance on a raidz3 pool consisting of 3, 12-disk
vdev's (using 2TB drives).

However, when a disk fails and must be resilvered, that's when you will run
into the slow performance of the small, random read workload. This is why I
use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives, especially
of the 1TB+ size. That way if it takes 200 hours to resilver, you've still
got a lot of redundancy in place.

Regards,

Marion
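P.S. A rough rule-of-thumb illustration of the random-read point, assuming
~150 random IOPS per 7200RPM drive (the numbers are made up for
illustration, not measurements):

  1 vdev of 40-disk raidz3:    ~1 x 150  =  ~150 random-read IOPS
  5 vdevs of 8-disk raidz2:    ~5 x 150  =  ~750 random-read IOPS
  20 vdevs of 2-way mirrors:  ~20 x 150  = ~3000 random-read IOPS (or better,
                               since reads can be served from either side)

Sequential throughput, on the other hand, scales more with the total number
of data drives than with the number of vdevs.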
Re: [zfs-discuss] permissions
capcas...@gmail.com said:
> I have a file that I can't delete, change permissions or owner. ls -v does
> not show any acl's on the file not even those for normal unix rw etc.
> permissions from ls -l show -rwx-- chmod gived an error of not owner for
> the owner !! and for root just says can't change or not owner depending on
> the mode i am trying to set.

In addition to Richard's suggestion, a couple things come to mind:

(1) Ability to delete a file (or rename it) depends on the permissions of
    the directory which contains it, not on the file itself.

(2) If you're doing the delete/chown on an NFS client, ownerships could be
    different than expected if UID mapping is broken (NFSv4 with a
    mismatched domain), or if remote root is being mapped to "nobody" on
    the NFS server. Similar issues could happen for a CIFS client.

Regards,

Marion
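P.S. Regarding item (2) above, a quick way to compare the NFSv4 ID-mapping
domain on client and server is something like this (from memory, so check
the man pages; on older Solaris 10 releases the setting lives in
/etc/default/nfs as NFSMAPID_DOMAIN instead):

  # sharectl get -p nfsmapid_domain nfs    # run on both client and server; they should match

The classic symptom of a mismatch is files showing up on the client as owned
by "nobody" even though the server shows sane owners.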
Re: [zfs-discuss] L2ARC, block based or file based?
mattba...@gmail.com said:
> We're looking at buying some additional SSD's for L2ARC (as well as
> additional RAM to support the increased L2ARC size) and I'm wondering if we
> NEED to plan for them to be large enough to hold the entire file or if ZFS
> can cache the most heavily used parts of a single file.
>
> After watching arcstat (Mike Harsch's updated version) and arc_summary, I'm
> still not sure what to make of it. It's rare that the l2arc (14Gb) hits
> double digits in %hit whereas the ARC (3Gb) is frequently >80% hit.

I'm not sure of the answer to your initial question (file-based vs
block-based), but I may have an explanation for the stats you're seeing.
We have a system here with 96GB of RAM and also the Sun F20 flash
accelerator card (96GB), most of which is used for L2ARC.

Note that data is not written into the L2ARC until it is evicted from the
ARC (e.g. when something newer or more frequently used needs ARC space).
So, my interpretation of the high hit rates on the in-RAM ARC, and low hit
rates on the L2ARC, is that the working set of data fits mostly in RAM, and
the system seldom needs to go to the L2ARC for more.

Regards,

Marion
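P.S. If you want to check that interpretation on your own box, the raw
kstats underneath those tools are easy to eyeball directly (a quick sketch;
the statistic names are as I remember them from the arcstats kstat):

  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max
  # kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  # kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses zfs:0:arcstats:l2_size

If "misses" is tiny compared to "hits", there just isn't much traffic left
over for the L2ARC to serve, no matter how big it is.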
Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
lmulc...@marinsoftware.com said:
> . . .
> The MySQL server is:
> Dell R710 / 80G Memory with two daisy chained MD1220 disk arrays - 22 Disks
> each - 600GB 10k RPM SAS Drives Storage Controller: LSI, Inc. 1068E (JBOD)
>
> I have also seen similar symptoms on systems with MD1000 disk arrays
> containing 2TB 7200RPM SATA drives.
>
> The only thing of note that seems to show up in the /var/adm/messages file
> on this MySQL server is:
>
> Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING:
>   /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
> Oct 31 18:24:51 mslvstdp02r mpt request inquiry page 0x89 for SATA
>   target:58 failed! Oc
> . . .

Have you got the latest firmware on your LSI 1068E HBA's? These have been
known to have lockups/timeouts when used with SAS expanders (disk
enclosures) with incompatible firmware revisions, and/or with older mpt
drivers.

The MD1220 is a 6Gbit/sec device. You may be better off with a matching
HBA -- Dell has certainly told us the MD1200-series is not intended for use
with the 3Gbit/sec HBA's. We're doing fine with the LSI SAS 9200-8e, for
example, when connecting to Dell MD1200's with the 2TB "nearline SAS" disk
drives.

Last, are you sure it's memory-related? You might keep an eye on
"arcstat.pl" output and see what the ARC sizes look like just prior to
lockup. Also, maybe you can look up instructions on how to force a crash
dump when the system hangs -- one of the experts around here could tell a
lot from a crash dump file.

Regards,

Marion
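P.S. On the "force a crash dump" suggestion, the pieces I'd look at are
roughly these, but treat it as a from-memory sketch and double-check the
tunables against the docs for your Solaris release before touching a
production box:

  # dumpadm                                 # confirm dump device is big enough and savecore is on
  # echo "set snooping=1" >> /etc/system    # deadman timer: panic (and dump) instead of hanging
  # reboot                                  # the tunable takes effect at boot

With something like that in place, a hard hang has a chance of turning into
a panic plus a crash dump that one of the mdb folks can dig through.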
Re: [zfs-discuss] Cannot format 2.5TB ext disk (EFI)
kitty@oracle.com said:
> It wouldn't let me
> # zpool create test_pool c5t0d0p0
> cannot create 'test_pool': invalid argument for this pool operation

Try without the "p0", i.e. just:

  # zpool create test_pool c5t0d0

Regards,

Marion
Re: [zfs-discuss] X4540 no next-gen product?
jp...@cam.ac.uk said:
> I can't speak for this particular situation or solution, but I think in
> principle you are wrong. Networks are fast. Hard drives are slow. Put a
> 10G connection between your storage and your front ends and you'll have the
> bandwidth[1]. Actually if you really were hitting 1000x8Mbits I'd put 2,
> but that is just a question of scale. In a different situation I have
> boxes which peak at around 7 Gb/s down a 10G link (in reality I don't need
> that much because it is all about the IOPS for me). That is with just
> twelve 15k disks. Your situation appears to be pretty ideal for storage
> hardware, so perfectly achievable from an appliance.

Depending on usage, I disagree with your bandwidth and latency figures
above. An X4540, or an X4170 with J4000 JBOD's, has more bandwidth to its
disks than 10Gbit ethernet. You would need three 10GbE interfaces between
your CPU and the storage appliance to equal the bandwidth of a single 8-port
3Gb/s SAS HBA (five of them for 6Gb/s SAS).

It's also the case that the Unified Storage platform doesn't have enough
bandwidth to drive more than four 10GbE ports at their full speed:
  http://dtrace.org/blogs/brendan/2009/09/22/7410-hardware-update-and-analyzing-the-hypertransport/

We have a customer (internal to the university here) that does high
throughput gene sequencing. They like a server which can hold the large
amounts of data, do a first pass analysis on it, and then serve it up over
the network to a compute cluster for further computation. Oracle has
nothing in their product line (anymore) to meet that need.

They ended up ordering an 8U chassis w/40x 2TB drives in it, and are willing
to pay the $2k/yr retail ransom to Oracle to run Solaris (ZFS) on it, at
least for the first year. Maybe OpenIndiana next year, we'll see.

Bye Oracle...

Regards,

Marion
Re: [zfs-discuss] Sun T3-2 and ZFS on JBODS
sigbj...@nixtra.com said:
> I will do some testing on the loadbalance on/off. We have nearline SAS
> disks, which does have dual path from the disk, however it's still just
> 7200rpm drives.
>
> Are you using SATA, SAS or SAS-nearline in your array? Do you have
> multiple SAS connections to your arrays, or do you use a single connection
> per array only?

We have four Dell MD1200's connected to three Solaris-10 systems. Three of
the MD1200's have nearline-SAS 2TB 7200RPM drives, and one has SAS 300GB
15000RPM drives. All the MD1200's are connected with dual SAS modules to a
dual-port HBA on their respective servers (one setup is with two MD1200's
daisy-chained, but again using dual SAS modules & cables).

Both types of drives suffer super-slow writes (but reasonable reads) when
loadbalance=roundrobin is in effect. E.g. 280 MB/sec sequential reads, and
28 MB/sec sequential writes, for the 15kRPM SAS drives I tested last week.
We don't see this extreme slowness on our dual-path Sun J4000 JBOD's, but
those all have SATA drives (with the dual-port interposers inside the drive
sleds).

Regards,

Marion
Re: [zfs-discuss] Sun T3-2 and ZFS on JBODS
sigbj...@nixtra.com said:
> I've played around with turning on and off mpxio on the mpt_sas driver,
> disabling increased the performance from 30MB / sec, but it's still far
> from the original performance. I've attached some dumps of zpool iostat
> before and after reinstallation.

I find "zpool iostat" is less useful in telling what the drives are doing
than "iostat -xn 1". In particular, the latter will give you an idea of how
many operations are queued per drive, and how long it's taking the drives to
handle those operations, etc.

On our Solaris-10 systems (U8 and U9), if mpxio is enabled, you really want
to set loadbalance=none. The default (round-robin) makes some of our JBOD's
(Dell MD1200) go really slow for writes. I see you have tried with mpxio
disabled, so your issue may be different.

You don't say what you're doing to generate your test workload, but there
are some workloads which will speed up a lot if the ZIL is disabled. Maybe
that or some other /etc/system tweaks were in place on the original system.
Also use "format -e" and its "write_cache" commands to see if the drives'
write caches are enabled or not.

Regards,

Marion
Re: [zfs-discuss] SIL3114 and sparc solaris 10
nat...@tuneunix.com said:
> I can confirm that on *at least* 4 different cards - from different board
> OEMs - I have seen single bit ZFS checksum errors that went away
> immediately after removing the 3114 based card.
>
> I stepped up to the 3124 (pci-x up to 133mhz) and 3132 (pci-e) and have
> never looked back.
>
> I now throw any 3114 card I find into the bin at the first available
> opportunity as they are a pile of doom waiting to insert an exploding
> garden gnome into the unsuspecting chest cavity of your data.

Maybe I've just been lucky. I have a 3114 card configured with two ports
internal and two external (E-SATA). There is a ZFS pool configured as a
mirror of a 1TB drive on the E-SATA port in an external dock, and a 1TB
drive on a motherboard SATA port. It's been running like this for a couple
of years, with weekly scrubs, and has so far had no errors. The system is a
32-bit x86 running Solaris-10U6.

My 3114 card came with RAID firmware, and I re-flashed it to non-RAID, as
others have mentioned.

Regards,

Marion
Re: [zfs-discuss] Drive i/o anomaly
matt.connolly...@gmail.com said:
> After putting the drive online (and letting the resilver complete) I took
> the slow drive (c8t1d0 western digital green) offline and the system ran
> very nicely.
>
> It is a 4k sector drive, but I thought zfs recognised those drives and
> didn't need any special configuration...?

That's a nice confirmation of the cost of not doing anything special (:-).

I hear the problem may be due to 4k drives which report themselves as 512b
drives, for boot/BIOS compatibility reasons. I've also seen various ways to
force 4k alignment, and check what the "ashift" value is in your pool's
drives, etc. Google "solaris zfs 4k sector align" will lead the way.

Regards,

Marion
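P.S. The quick check for what your pool actually ended up with (ashift=9
means 512-byte alignment, ashift=12 means 4k) is something like:

  # zdb -C yourpool | grep ashift

substituting your pool's name, of course.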
Re: [zfs-discuss] Drive i/o anomaly
matt.connolly...@gmail.com said:
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     1.2   36.0  153.6   4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
>     0.0  113.4    0.0   7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
>     0.2  106.4    4.1   7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.4   73.2   25.7   9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
>     0.0  226.6    0.0  24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
>     0.2  127.6    3.4  12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0   44.2    0.0   5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
>     0.2   76.0    4.8   9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
>     0.0   16.6    0.0   2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0    0.2    0.0     25.6  0.0  0.0    0.3    2.3   0   0 c12d0
>     0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
>     0.0   11.0    0.0   1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
> . . .

matt.connolly...@gmail.com said:
> I expect that the c8t0d0 WD Green is the lemon here and for some reason is
> getting stuck in periods where it can write no faster than about 2MB/s.
> Does this sound right?

No, it's the opposite. The drive sitting at 100%-busy, c8t1d0, while the
other drive is idle, is the sick one. It's slower than the other, has 9.0
operations waiting (queued) to finish. The other one is idle because it has
already finished the write activity and is waiting for the slow one in the
mirror to catch up.

If you run "iostat -xn" without the interval argument, i.e. so it prints out
only one set of stats, you'll see the average performance of the drives
since last reboot. If the "asvc_t" figure is significantly larger for one
drive than the other, that's a way to identify the one which has been slower
over the long term.

> Secondly, what I wonder is why it is that the whole file system seems to
> hang up at this time. Surely if the other drive is doing nothing, a web
> page can be served by reading from the available drive (c8t1d0) while the
> slow drive (c8t0d0) is stuck writing slow.

The available drive is c8t0d0 in this case. However, if ZFS is in the
middle of a txg (ZFS transaction) commit, it cannot safely do much with the
pool until that commit finishes. You can see that ZFS only lets 10
operations accumulate per drive (used to be 35), i.e. 9.0 in the "wait"
column, and 1.0 in the "actv" column, so it's kinda stuck until the drive
gets its work done.

Maybe the drive is failing, or maybe it's one of those with large sectors
that are not properly aligned with the on-disk partitions.

Regards,

Marion
Re: [zfs-discuss] reliable, enterprise worthy JBODs?
tmcmah...@yahoo.com said:
> Interesting. Did you switch to the load-balance option?

Yes, I ended up with "load-balance=none". Here's a thread about it in the
storage-discuss mailing list:
  http://opensolaris.org/jive/thread.jspa?threadID=130975&tstart=90

Regards,

Marion
Re: [zfs-discuss] reliable, enterprise worthy JBODs?
p...@bolthole.com said:
> Any other suggestions for (large-)enterprise-grade, supported JBOD hardware
> for ZFS these days? Either fibre or SAS would be okay.

As others have said, it depends on your definition of "enterprise-grade".

We're using Dell's MD1200 SAS JBOD's with Solaris-10 and ZFS. Ours have the
Seagate 2TB "Nearline SAS" drives. The 3Gbit Sun SAS HBA, using mpt driver,
works fine, although with a stream of harmless warning messages about
"unknown event 10 received". The newer 6Gbit Sun/Oracle SAS HBA, using
mpt_sas driver, works well without that issue.

The only special tuning I had to do was turn off round-robin load-balancing
in the mpxio configuration. The Seagate drives were incredibly slow when
running in round-robin mode, very speedy without.

Regards,

Marion
Re: [zfs-discuss] ZFS slows down over a couple of days
Stephan, The "vmstat" shows you are not actually short of memory; The "pi" and "po" columns are zero, so the system is not having to do any paging, and it seems unlike the system is slow directly because of RAM shortage. With the ARC, it's not unusual for vmstat to show little free memory, but the system will give up that RAM when an application asks for it. You can tell if this is happening a lot by: echo "::arc" | mdb -k | grep throttle If the value of "memory_throttle_count" is large, that will indicate that apps are often asking the kernel to give up ARC memory. Also, as you said, the "iostat" figures look idle. You can tell more using "iostat -xn 1", which will give service times & percent-busy figures for the actual devices. It could be that something about the networking involved is what is actually slow. You could find out if it's a local bottleneck by trying some simple I/O tests on the server itself, maybe: dd if=/dev/zero of=/file/in/zpool bs=1024k and watching what iostat shows, etc. Another test is to try a network-only test, maybe using "ttcp" between the server and a client. This could tell you if it's network or storage that's causing the slow-down. If you don't have "ttcp", something silly like, on a client running: dd if=/dev/zero bs=1024k | ssh -c blowfish server "dd of=/dev/null bs=1024k" You can watch network throughput on the server using: dladm show-link -s -i 1 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz recovery
z...@lordcow.org said:
> For example when I 'dd if=/dev/zero of=/dev/ad6', or physically remove the
> drive for awhile, then 'online' the disk, after it resilvers I'm typically
> left with the following after scrubbing:
>
> r...@file:~# zpool status
>   pool: pool
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Fri Dec 10 23:45:56 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     7
>
> errors: No known data errors
>
> http://www.sun.com/msg/ZFS-8000-9P lists my above actions as a cause for
> this state and rightfully doesn't think them serious. When I 'clear' the
> errors though and offline/fault another drive, and then reboot, the array
> faults. That tells me ad6 was never fully integrated back in. Can I tell
> the array to re-add ad6 from scratch? 'detach' and 'remove' don't work for
> raidz. Otherwise I need to use 'replace' to get out of this situation.

After you "clear" the errors, do another "scrub" before trying anything
else. Once you get a complete scrub with no new errors (and no checksum
errors), you should be confident that the damaged drive has been fully
re-integrated into the pool.

Regards,

Marion
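P.S. In other words, the sequence I'd follow (using your pool and device
names) is roughly:

  # zpool clear pool ad6      # reset the error counters
  # zpool scrub pool          # re-verify every block against its checksum
  # zpool status -v pool      # wait for the scrub to finish; counters should stay at zero

Only after a clean scrub would I trust that vdev again.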
Re: [zfs-discuss] ZFS with STK raid card w battery
replic...@gmail.com said:
> One other question, how can I ensure that the controller's cache is really
> being used? (arcconf doesn't seem to show much). Since ZFS would flush the
> data as soon as it can, I am curious to see if the caching is making a
> difference or not.

Share out a dataset on the pool over NFS to a remote client. On the client,
unpack a tar archive onto the NFS dataset, timing how long it takes. Do
this once with the cache set to "write-through" (which basically disables
the write cache), and again with it set to "write-back".

Regards,

Marion
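P.S. Something along these lines, run from the NFS client (the mount point
and archive name are just placeholders; any archive with lots of small files
will do):

  # mount -o vers=3 server:/pool/testfs /mnt/testfs
  # cd /mnt/testfs
  # time tar xf /var/tmp/some-source-tree.tar    # repeat once per cache setting

The synchronous, metadata-heavy workload of a tar unpack over NFS is exactly
the case where a battery-backed write cache shows up in the timings.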
Re: [zfs-discuss] Performance issues with iSCSI under Linux
rewar...@hotmail.com said:
> ok... we're making progress. After swapping the LSI HBA for a Dell H800
> the issue disappeared. Now, I'd rather not use those controllers because
> they don't have a JBOD mode. We have no choice but to make individual
> RAID0 volumes for each disk, which means we need to reboot the server every
> time we replace a failed drive. That's not good...

Earlier you said you had eliminated the ZIL as an issue, but one difference
between the Dell H800 and the LSI HBA is that the H800 has an NV cache (if
you have the battery backup present).

A very simple test would be, when things are running slow, to try disabling
the ZIL temporarily and see if that makes things go fast.

Regards,

Marion
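P.S. On Solaris 10 of that era, the quick-and-dirty way to flip the ZIL off
and on for a test was the zil_disable tunable. From memory (so verify
before trusting it, and never leave it off in production -- you lose
synchronous-write semantics for NFS and databases):

  # echo zil_disable/W0t1 | mdb -kw    # disable ZIL
  # (unmount and re-mount the test filesystem; the flag is checked at mount time)
  # ... run the slow workload again ...
  # echo zil_disable/W0t0 | mdb -kw    # re-enable, then remount again

If the workload suddenly gets dramatically faster, you're bound on
synchronous writes, and a fast slog (or NV write cache) is the real fix.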
[zfs-discuss] possible ZFS-related panic?
Folks,

Has anyone seen a panic traceback like the following? This is Solaris-10u7
on a Thumper, acting as an NFS server. The machine was up for nearly a
year, I added a dataset to an existing pool, set compression=on for the
first time on this system, loaded some data in there (via "rsync"), then
mounted it to the NFS client. The first data was written by the client
itself in a 10pm cron-job, and the system crashed at 10:02pm as below:

  panic[cpu2]/thread=fe8000f5cc60: page_sub: bad arg(s): pp 872b5610, *ppp 0

  fe8000f5c470 unix:mutex_exit_critical_size+20219 ()
  fe8000f5c4b0 unix:page_list_sub_pages+161 ()
  fe8000f5c510 unix:page_claim_contig_pages+190 ()
  fe8000f5c600 unix:page_geti_contig_pages+44b ()
  fe8000f5c660 unix:page_get_contig_pages+c2 ()
  fe8000f5c6f0 unix:page_get_freelist+1a4 ()
  fe8000f5c760 unix:page_create_get_something+95 ()
  fe8000f5c7f0 unix:page_create_va+2a1 ()
  fe8000f5c850 unix:segkmem_page_create+72 ()
  fe8000f5c8b0 unix:segkmem_xalloc+60 ()
  fe8000f5c8e0 unix:segkmem_alloc_vn+8a ()
  fe8000f5c8f0 unix:segkmem_alloc+10 ()
  fe8000f5c9c0 genunix:vmem_xalloc+315 ()
  fe8000f5ca20 genunix:vmem_alloc+155 ()
  fe8000f5ca90 genunix:kmem_slab_create+77 ()
  fe8000f5cac0 genunix:kmem_slab_alloc+107 ()
  fe8000f5caf0 genunix:kmem_cache_alloc+e9 ()
  fe8000f5cb00 zfs:zio_buf_alloc+1d ()
  fe8000f5cb50 zfs:zio_compress_data+ba ()
  fe8000f5cba0 zfs:zio_write_compress+78 ()
  fe8000f5cbc0 zfs:zio_execute+60 ()
  fe8000f5cc40 genunix:taskq_thread+bc ()
  fe8000f5cc50 unix:thread_start+8 ()

  syncing file systems... done
  . . .

Unencumbered by more than a gut feeling, I disabled compression on the
dataset, and we've gotten through two nightly runs of the same NFS client
job without crashing, but of course we would technically have to wait for
nearly a year before we've exactly replicated the original situation (:-).

Unfortunately the dump-slice was slightly too small, we were just short of
enough space to capture the whole 10GB crash dump. I did get savecore to
write something out, and I uploaded it to the Oracle support site, but it
gives "scat" too much indigestion to be useful to the engineer I'm working
with. They have not found any matching bugs so far, so I thought I'd ask a
slightly wider audience here.

Thanks and regards,

Marion
Re: [zfs-discuss] VM's on ZFS - 7210
markwo...@yahoo.com said:
> So the question is with a proper ZIL SSD from SUN, and a RAID10... would I
> be able to support all the VM's or would it still be pushing the limits of
> a 44 disk pool?

If it weren't a closed 7000-series appliance, I'd suggest running the
"zilstat" script. It should make it clear whether (and by how much) you
would benefit from the Logzilla addition in your current raidz
configuration. Maybe there's some equivalent in the builtin FishWorks
analytics which can give you the same information.

Marion
Re: [zfs-discuss] Performance Testing
p...@kraus-haus.org said:
> Based on these results, and our capacity needs, I am planning to go with 5
> disk raidz2 vdevs.

I did similar tests with a Thumper in 2008, with X4150/J4400 in 2009, and
more recently comparing X4170/J4400 and X4170/MD1200:
  http://acc.ohsu.edu/~hakansom/thumper_bench.html
  http://acc.ohsu.edu/~hakansom/j4400_bench.html
  http://acc.ohsu.edu/~hakansom/md1200_loadbal_bench.html

On the Thumper, we went with 7x(4D+2P) raidz2, and as a general-purpose NFS
server performance has been fantastic except (as expected without any NV
ZIL) for the very rare "lots of small synchronous I/O" workloads (like
extracting a tar archive via an NFS client). In fact, our experience with
the above has led us to go with 6x(5D+2P) on our new X4170/J4400 NFS server.
The difference between this config and 7x(4D+2P) on the same hardware is
pretty small, and both are faster than the Thumper.

> Since we have five J4400, I am considering using one disk
> in each of the five arrays per vdev, so that a complete failure of a J4400
> does not cause any loss of data. What is the general opinion of that
> approach

We did something like this on the Thumper, with one disk on each of the
internal HBA's. Since our new system has only two J4400's, we didn't try to
cover this type of failure.

> and does anyone know how to map the MPxIO device name back to a physical
> drive ?

You can use CAM to view the mapping of physical drives to device names (with
or without MPxIO enabled). That's the most human-friendly way that I've
found.

If you're using Oracle/Sun LSI HBA's (mpt), a "raidctl -l" will list out
device names like 0.0.0, 0.1.0, and so on. That middle digit does seem to
correspond with the physical slot number in the J4400's, at least initially.
Unfortunately (for this purpose), if you move drives around, the "raidctl"
names follow the drives to their new locations, as do the Solaris device
names (verified by "dd if=/dev/dsk/... of=/dev/null" and watching the
blinkenlights). Also, with multiple paths, devices will show up with two
different names in "raidctl -l", so it's a bit of a pain to make sense of it
all. So, just use CAM.

Regards,

Marion
Re: [zfs-discuss] Please trim posts
doug.lin...@merchantlink.com said:
> Apparently, before Outlook there WERE no meetings, because it's clearly
> impossible to schedule one without it.

Don't tell my boss, but I use Outlook for the scheduling, and fetchmail plus
procmail to download email out of Exchange and into my favorite email
client. Thankfully, Exchange listens to incoming SMTP when I need to send
messages.

> And please don't mail me with your favorite OSS solution. I've tried them
> all. None of them integrate with Exchange *smoothly* and *cleanly*.
> They're all workarounds and kludges that are as annoying in the end as
> Outlook.

Hmm, what I'm doing doesn't _integrate_ with Exchange; it just bypasses it
for the email portion of my needs.

Non-OSS: Mac OS X 10.6 claims to integrate with Exchange, although I have
not yet tried it myself.

Regards,

Marion
Re: [zfs-discuss] one more time: pool size changes
frank+lists/z...@linetwo.net said:
> I remember, and this was a few years back but I don't see why it would be
> any different now, we were trying to add drives 1-2 at a time to
> medium-sized arrays (don't buy the disks until we need them, to hold onto
> cash), and the Netapp performance kept going down down down. We eventually
> had to borrow an array from Netapp to copy our data onto to rebalance.
> Netapp told us explicitly, make sure to add an entire shelf at a time (and
> a new raid group, obviously, don't extend any existing group).

The advent of aggregates fixed that problem. Used to be that a raid-group
belonged to only one volume. Now multiple flex-vols (even tiny ones) share
all the spindles (and parity drives) on their aggregate, and you can
rebalance after adding drives without having to manually move/copy existing
data. Pretty slick, if you can afford the price.

Regards,

Marion
Re: [zfs-discuss] one more time: pool size changes
frank+lists/z...@linetwo.net said:
> Well in that case it's invalid to compare against Netapp since they can't
> do it either (seems to be the consensus on this list). Neither zfs nor
> Netapp (nor any product) is really designed to handle adding one drive at a
> time. Normally you have to add an entire shelf, and if you're doing that
> it's better to add a new vdev to your pool.

This is incorrect (and another poster has pointed this out). NetApp can add
a single drive (or more) to a raid-group, and has been able to do so since
before they had dual-parity, aggregates, flex-vols, and rebalancing.

BTW, the rebalance after growing an aggregate is not automatic (as of
OnTAP-7.3 anyway). You invoke a command manually on each volume that you
care about, and the rebalance runs in the background until finished.

Regards,

Marion
Re: [zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
rvandol...@esri.com said:
> We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02)
> (onboard LSI 1068E (firmware 1.28.02.00) and a SuperMicro SAS-846EL1 (Rev
> 1.1) backplane.
> . . .
> The system is fully patched Solaris 10 U8, and the mpt driver is
> version 1.92:

Since you're running on Solaris-10 (and its mpt driver), have you tried the
firmware that Sun recommends for their own 1068E-based HBA's? There are a
couple of versions depending on your usage, but they're all earlier revs
than the 1.28.02.00 you have:
  http://www.lsi.com/support/sun/sg_xpci8sas_e_sRoHS.html

Regards,

Marion
Re: [zfs-discuss] j4500 cache flush
bene...@yahoo.com said:
> Marion - Do you happen to know which SAS hba it applys to?

Here's the article:
  http://sunsolve.sun.com/search/document.do?assetkey=1-66-248487-1

The title is "Write-Caching on JBOD SATA Drive is Erroneously Enabled by
Default When Connected to Non-RAID SAS HBAs". By the way, you can use
"raidctl" to view/manage firmware on these.

Regards,

Marion
Re: [zfs-discuss] j4500 cache flush
erik.trim...@sun.com said:
> All J4xxx systems are really nothing more than huge SAS expanders hooked to
> a bunch of disks, so cache flush requests will either come from ZFS or any
> attached controller. Note that I /think/ most non-RAID controllers don't
> initiate their own cache flush requests.

Docs for the non-RAID HBA's sold by Sun say that with proper (recent)
firmware, at power-up the HBA will disable write caches on the disks
themselves (this refers to the LSI 1068-based HBA's, anyway). There was a
Sun Alert issued for early revisions which failed to disable disk caches,
resulting in data loss at power loss for cache-unaware software.

Solaris/OpenSolaris ZFS will then enable the write caches once it knows it
has control of whole disks, and issues flushes to the drives as appropriate.

Regards,

Marion
Re: [zfs-discuss] Who is using ZFS ACL's in production?
car...@taltos.org said:
> NetApp does _not_ expose an ACL via NFSv3, just old school POSIX
> mode/owner/group info. I don't know how NetApp deals with chmod, but I'm
> sure it's documented.

The answer is, "It depends."

If the NetApp volume is NTFS-only permissions, then chmod from the Unix/NFS
side doesn't work, and you can only manipulate permissions from Windows
clients. If it's a "mixed" security-style volume, chmod from the Unix/NFS
side will delete the NTFS ACL's, and the SMB clients will see faked-up ACL's
that match the new POSIX permissions. Whichever side made the most recent
change will be in effect.

Newer OnTAP versions have an optional setting which overrides this effect of
chmod if NFSv4 is in effect on mixed-security volumes, and instead tries to
mirror the ACL's as identically as possible to both kinds of clients. Poor
old NFSv3 and older clients still see gibberish POSIX permissions, but the
least privilege available in ACL's is enforced by the filer.

BTW, our experience has been that NFSv4 on NetApp does not work very well,
and NetApp support folks have advised us to not use it in order to avoid
crashing the filer. They of course blame the various incompatible NFSv4
client implementations out there...

Regards,

Marion
Re: [zfs-discuss] Who is using ZFS ACL's in production?
hen...@acm.org said:
> I've been surveying various forums looking for other places using ZFS ACL's
> in production to compare notes and see how if at all they've handled some
> of the issues we've found deploying them.
>
> So far, I haven't found anybody using them in any substantial way, let
> alone trying to leverage them to allow a very large user population to have
> highly flexible control over access to their data.
>
> Anyone here that has a non-negligible ACL deployment that would be
> interested in discussing it?

We've been using them here for a couple of years now. Personally, I'd say
if you set one ACL, you're already in "non-negligible" territory. It's not
easy to get them right, and usually the hardest task is in figuring out what
the users want, so we don't use them unless the users' needs cannot be met
using traditional Unix/POSIX permissions.

The only way we've been able to do this effectively is by scripting it so
it's repeatable (and documented), and using inheritance to propagate them to
any new items which are added to shared areas. The scripting also (sorta)
covers the problem that most backup and file transfer utilities are not
capable of backing up and restoring the NFSv4-style ACL's on ZFS.

So, let the discussion ensue...

Regards,

Marion
Re: [zfs-discuss] Interrupt sharing
d...@dd-b.net said:
> I know from startup log messages that I've got several interrupts being
> shared. I've been wondering how serious this is. I don't have any
> particular performance problems, but then again my cpu and motherboard are
> from 2006 and I'd like to extend their service life, so using them more
> efficiently isn't a bad idea. Plus it's all a learning experience :-).

Mine's from 2004, and I've been going through the same adjustments here.

> While I see the relevance to diagnosing performance problems, for my case,
> is there likely to be anything I can do about interrupt assignments? Or is
> this something that, if it's a problem, is an unfixable problem (short of
> changing hardware)? I think there's BIOS stuff to shuffle interrupt
> assignments some, but do changes at that level survive kernel startup, or
> get overwritten?

Experience with my motherboard is that even when you switch the BIOS
"Plug-n-Play OS" setting between "No" and "Yes", Solaris-10 doesn't seem to
change where it maps any devices. Probably a removal of the
/etc/path_to_inst file and reconfiguration reboot would be required, but
even that won't move devices required for booting. Also, the onboard
devices (like your nv_sata, ehci, etc.) are not likely to move around at
all. Only things that could be moved to different PCI/PCI-X/PCIe slots are
likely to move.

Ran across this note:
  http://blogs.sun.com/sming56/entry/interrupts_output_in_mdb

I found it pretty time-consuming just mapping the OS's device instance
numbers to the physical devices. Taking the device instance numbers from
"intrstat" or "echo '::interrupts -d' | mdb -k" and digging through the
output of "prtconf -Dv" and/or boot-up /var/adm/messages stuff was pretty
tedious.

Check out what mine looks like, in particular the case where four devices
share the same interrupt -- the two onboard SATA ports, onboard ethernet,
and one slow-mode USB port (Intel ICH5 chipset). There doesn't appear to be
a thing you can do about this sharing. The system's never seemed slow,
though I do try to avoid using that particular USB port.

  # echo '::interrupts -d' | mdb -k
  IRQ  Vector IPL Bus   Type  CPU Share APIC/INT# Driver Name(s)
  1    0x41   5   ISA   Fixed 0   1     0x0/0x1   i8042#0
  6    0x43   5   ISA   Fixed 0   1     0x0/0x6   fdc#0
  9    0x81   9   PCI   Fixed 0   1     0x0/0x9   acpi_wrapper_isr
  12   0x42   5   ISA   Fixed 0   1     0x0/0xc   i8042#0
  15   0x44   5   ISA   Fixed 0   1     0x0/0xf   ata#1
  16   0x82   9   PCI   Fixed 0   3     0x0/0x10  uhci#3, uhci#0, nvidia#0
  17   0x86   9   PCI   Fixed 0   1     0x0/0x11  audio810#0
  18   0x85   9   PCI   Fixed 0   4     0x0/0x12  pci-ide#1, e1000g#0, uhci#2, pci-ide#1
  19   0x84   9   PCI   Fixed 0   1     0x0/0x13  uhci#1
  22   0x40   5   PCI   Fixed 0   1     0x0/0x16  pci-ide#2
  23   0x83   9   PCI   Fixed 0   1     0x0/0x17  ehci#0
  160  0xa0   0         IPI   ALL 0     -         poke_cpu
  192  0xc0   13        IPI   ALL 1     -         xc_serv
  208  0xd0   14        IPI   ALL 1     -         kcpc_hw_overflow_intr
  209  0xd1   14        IPI   ALL 1     -         cbe_fire
  210  0xd3   14        IPI   ALL 1     -         cbe_fire
  240  0xe0   15        IPI   ALL 1     -         xc_serv
  241  0xe1   15        IPI   ALL 1     -         apic_error_intr
  #

Regards,

Marion
Re: [zfs-discuss] Poor ZIL SLC SSD performance
felix.buenem...@googlemail.com said:
> I think I'll try one of thise inexpensive battery-backed PCI RAM drives
> from Gigabyte and see how much IOPS they can pull.

Another poster, Tracy Bernath, got decent ZIL IOPS from an OCZ Vertex unit.
Dunno if that's sufficient for your purposes, but it looked pretty good for
the money.

Marion
Re: [zfs-discuss] Identifying firmware version of SATA controller (LSI)
rvandol...@esri.com said:
> I'm trying to figure out where I can find the firmware on the LSI
> controller... are the bootup messages the only place I could expect to see
> this? prtconf and prtdiag both don't appear to give firmware information.
> . . .
> Solaris 10 U8 x86.

The "raidctl" command is your friend; useful for updating firmware if you
choose to do so, as well. You can also find the revisions in the output of
"prtconf -Dv", search for "firm" in the long list.

Regards,

Marion
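P.S. Roughly what I mean (controller numbers will differ on your box, and
I'm going from memory on the exact flags, so check the raidctl man page
first):

  # raidctl -l                    # list controllers, e.g. "Controller: 0"
  # raidctl -l 0                  # details for controller 0, including firmware rev
  # prtconf -Dv | grep -i firm    # the same info, buried in the device tree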
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
fjwc...@gmail.com said:
> Yes, if I was to re-do the hardware config for these servers, using what I
> know now, I would do things a little differently:
> . . .
> - find a case with more than 24 drive bays (any way to get a Thumper
>   without the extra hardware/software?) ;)
> . . .

It's called the Sun Storage J4500 array. Well, until Oracle gets around to
changing the name anyway...

Regards,

Marion
Re: [zfs-discuss] Zpool creation best practices
mijoh...@gmail.com said:
> I've never had a lun go bad but bad things do happen. Does anyone else use
> ZFS in this way? Is this an unrecommended setup?

We used ZFS like this on a Hitachi array for 3 years. Worked fine, not one
bad block/checksum error detected. Still using it on an old Sun 6120 array,
too.

> It's too late to change my
> setup, but in the future when I'm planning new systems, should I consider
> the effort to allow zfs fully control all the disks?

Well, you should certainly consider all the alternatives you can afford.
Our customers happen to like cheap bulk storage, so we have a Thumper, and a
few SAS-connected Sun J4000 SATA JBOD's. But our grant-funded researchers
may not be a "typical" customer mix...

Regards,

Marion
Re: [zfs-discuss] getting decent NFS performance
erik.trim...@sun.com said: > The suggestion was to make the SSD on each machine an iSCSI volume, and add > the two volumes as a mirrored ZIL into the zpool. I've mentioned the following before: For a poor-person's slog which gives decent NFS performance, we have had good results with allocating a slice on (e.g.) an X4150's internal disk, behind the internal Adaptec RAID controller. Said controller has only 256MB of NVRAM, but it made a big difference with NFS performance (look for the "tar unpack" results at the bottom of the page): http://acc.ohsu.edu/~hakansom/j4400_bench.html You can always replace them when funding for your Zeus SSD's comes in (:-). Regards, -- Marion Hakanson OHSU Advanced Computing Center ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
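In case it's useful, that slog setup amounts to something like the following (device/slice names are hypothetical; s7 here would be a small slice on each internal drive behind the NVRAM-backed controller):
  zpool add tank log mirror c1t0d0s7 c1t1d0s7
  zpool status tank     # the mirrored log vdev shows up under "logs"
This needs pool version 7 or later (Solaris-10 U6 and up), and on the releases we're running you can't remove a log vdev once it has been added, so plan accordingly.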
Re: [zfs-discuss] [storage-discuss] ZFS on JBOD storage, mpt driver issue - server not responding
m...@cybershade.us said: > So at this point this looks like an issue with the MPT driver or these SAS > cards (I tested two) when under heavy load. I put the latest firmware for the > SAS card from LSI's web site - v1.29.00 without any changes, server still > locks. > > Any ideas, suggestions how to fix or workaround this issue? The adapter is > suppose to be enterprise-class. We have three of these HBA's, used as follows:
  X4150, J4400, Solaris-10U7-x86, mpt patch 141737-01
  V245, J4200, Solaris-10U7, mpt patch 141736-05
  X4170, J4400, Solaris-10U8-x86, mpt/kernel patch 141445-09
None of these systems are suffering the issues you describe. All of their SAS HBA's are running the latest Sun-supported firmware I could find for these HBA's, which is v1.26.03.00 (BIOS 6.24.00), in LSI firmware update 14.2.2 at: http://www.lsi.com/support/sun/ In that package is also a v1.27.03.00 firmware for use when connecting to the F5100 flash accelerator, but it's clearly labelled as only for use with that device. Anyway, I of course don't know if you've already tried the v1.26.03.00 firmware in your situation, but I wanted to at least report that we are using this combination on Solaris-10 without experiencing the timeout issues you are having. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (home NAS) zfs and spinning down of drives
jimkli...@cos.ru said: > Thanks for the link, but the main concern in spinning down drives of a ZFS > pool is that ZFS by default is not so idle. Every 5 to 30 seconds it closes > a transaction group (TXG) which requires a synchronous write of metadata to > disk. You know, it's just going to depend on your usage. On my home machine (Solaris-10U6 with U8-level patches), the drives are set to spin down after 30 minutes of idle time. I'm not certain if the root pool spins down, but the drives in the 2nd mirrored pool do spin down. This pool contains my Solaris home directory and the Samba-connected datasets for backups of other computers. It is true that I have to make sure Thunderbird and Firefox are not running in order to idle the home directory. Then the drives spin down and seem to stay that way until I wake up the display by moving the mouse or accessing the keyboard. They will also spin up when a nightly backup kicks off on one of the other systems, or if I SSH-in from work to check something. I don't do anything special other than stopping Thunderbird and Firefox when I leave the computer. I just select "Lock Screen" from the Gnome Launch menu, the screen-lock window pops up, and the display goes into power-save mode shortly after. I don't think there's anything magic about ZFS with regard to keeping the drives busy. The fancy power-saving stuff was done by Green-Bytes; There they modified ZFS to do the meta-data updates onto Flash-based SSD's separate from the rest of the usual pool drives. That way things like ZIL activity did not have to spin up a large number of data drives just to make small metadata updates, etc. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
zfs...@jeremykister.com said: > unfortunately, fdisk won't help me at all: > # fdisk -E /dev/rdsk/c12t1d0p0 > # zpool create -f testp c12t1d0 > invalid vdev specification > the following errors must be manually repaired: > /dev/dsk/c3t11d0s0 is part of active ZFS pool dbzpool. Please see zpool(1M). Hmm. Did you do the "devfsadm -Cv" as someone else suggested? I think I would do that both before and after the "fdisk -E". Then give the "dd" treatment again, with the EFI-style partition label in place. I've sometimes had to do the "dd" treatment with both VTOC and EFI labels on the same drive in order to make ZFS forget it had ever been used in a pool. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
>I said: >> You'll need to give the same "dd" treatment to the end of the disk as well; >> ZFS puts copies of its labels at the beginning and at the end. Oh, and zfs...@jeremykister.com said: > im not sure what you mean here - I thought p0 was the entire disk in x86 - > and s2 was the whole disk in the partition. what else should i overwrite? Sorry, yes, you did get the whole slice overwritten. Most people just add a "count=10" or something similar, to overwrite the beginning of the drive, but your invocation would overwrite the whole thing. If the disk is going to be part of a whole-disk zpool, I like to make sure there is not an old VTOC-style partition table on there. That can be done either via some "format -e" commands, or with "fdisk -E", to put an EFI label on there. Anyway, I agree with the desire for "zpool" to be able to do this itself, with less possibility of human error in partitioning, etc. Glad to hear there's already an RFE filed for it. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
zfs...@jeremykister.com said: > # format -e c12t1d0 selecting c12t1d0 [disk formatted] /dev/dsk/c3t11d0s0 is > part of active ZFS pool dbzpool. Please see zpool(1M). > > It is true that c3t11d0 is part of dbzpool. But why is solaris upset about > c3t11 when i'm working with c12t1 ?? So i checked the device links, and all > looks fine: > . . . Could it be that c12t1d0 was at some time in the past (either in this machine or another machine) known as c3t11d0, and was part of a pool called "dbzpool"? > i tried: > fdisk -B /dev/rdsk/c12t1d0 > dd if=/dev/zero of=/dev/rdsk/c12t1d0p0 bs=1024k > dd if=/dev/zero of=/dev/rdsk/c12t1d0s2 bs=1024k > > but Solaris still has some association between c3t11 and c12t1. You'll need to give the same "dd" treatment to the end of the disk as well; ZFS puts copies of its labels at the beginning and at the end. Oh, and you can "fdisk -E /dev/rdsk/c12t1d0" to convert to a single, whole-disk EFI partition (non-VTOC style). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
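For the end-of-disk part, something like this is what I have in mind (the block count shown is just a placeholder -- pull the real total from "prtvtoc /dev/rdsk/c12t1d0s2" or format's "verify" first):
  # total disk size in 512-byte blocks, e.g. from prtvtoc
  BLOCKS=976773168
  # zero the first and last 2MB or so, which covers the four ZFS labels
  dd if=/dev/zero of=/dev/rdsk/c12t1d0p0 bs=512 count=4096
  dd if=/dev/zero of=/dev/rdsk/c12t1d0p0 bs=512 oseek=`expr $BLOCKS - 4096`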
Re: [zfs-discuss] Sniping a bad inode in zfs?
da...@elemental.org said: > Normally on UFS I would just take the 'nuke it from orbit' route and use clri > to wipe the directory's inode. However, clri doesn't appear to be zfs aware > (there's not even a zfs analog of clri in /usr/lib/fs/ zfs), and I don't > immediately see an option in zdb which would help cure this. Well, it might make things worse, but have you tried /usr/sbin/unlink ? I'm on Solaris-10, so don't know if that's still part of OpenSolaris. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
opensolaris-zfs-disc...@mlists.thewrittenword.com said: > Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given > the choice between insurance and performance, I'll take insurance, though it > depends on your use case. We're using 5-disk RAIDZ2 vdevs. > . . . > Would love to hear other opinions on this. Hi again Albert, On our Thumper, we use 7x 6-disk raidz2's (750GB drives). It seems a good compromise between capacity, IOPS, and data protection. Like you, we are afraid of the possibility of a 2nd disk failure during resilvering of these large drives. Our usage is a mix of disk-to-disk-to-tape backups, archival, and multi-user (tens of users) NFS/SFTP service, in roughly that order of load. We have had no performance problems with this layout. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
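For reference, the shape of that layout is roughly the following, truncated to two of the seven vdevs and with made-up device names (on a Thumper the six disks of each vdev are spread across six different controllers):
  zpool create tank \
    raidz2 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
    raidz2 c0t2d0 c1t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
    spare c0t0d0 c1t0d0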
Re: [zfs-discuss] "zfs send..." too slow?
knatte_fnatte_tja...@yahoo.com said: > Is rsync faster? As I have understood it, "zfs send.." gives me an exact > replica, whereas rsync doesnt necessary do that, maybe the ACL are not > replicated, etc. Is this correct about rsync vs "zfs send"? It is true that rsync (as of 3.0.5, anyway) does not preserve NFSv4/ZFS ACL's. It also cannot handle ZFS snapshots. On the other hand, you can run multiple rsync's in parallel; You can only do that with zfs send/recv if you have multiple, independent ZFS datasets that can be done in parallel. So which one goes faster will depend on your situation. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
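By "in parallel" I mean nothing fancier than something like this (dataset names, snapshot name, and destination host are all hypothetical; each stream is independent, so they can overlap):
  zfs send tank/projA@migrate | ssh otherhost zfs recv -d newpool &
  zfs send tank/projB@migrate | ssh otherhost zfs recv -d newpool &
  wait
or the rsync equivalent, one process per top-level directory:
  rsync -aH /tank/projA/ otherhost:/newpool/projA/ &
  rsync -aH /tank/projB/ otherhost:/newpool/projB/ &
  wait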
Re: [zfs-discuss] strange results ...
jel+...@cs.uni-magdeburg.de said: > 2nd) Never had a Sun STK RAID INT before. Actually my intention was to create > a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way zpool mirror > with the 4 remaining disks. However, the controller seems not to support > JBODs :( - which is also bad, since we can't simply put those disks into > another machine with a different controller without data loss, because the > controller seems to use its own format under the hood. Yes, those Adaptec/STK internal RAID cards are annoying for use with ZFS. You also cannot replace a failed disk without using the STK RAID software to configure the new disk as a standalone volume (before "zpool replace"). Fortunately you probably don't need to boot into the BIOS-level utility, I think you can use the Adaptec StorMan utilities from within the OS, if you remembered to install them. > Also the 256MB > BBCache seems to be a little bit small for ZIL even if one would know, how to > configure it ... Unless you have an external (non-NV cached) pool on the same server, you wouldn't gain anything from setting up a separate ZIL in this case. All your internal drives have NV cache without doing anything special. > So what would you recommend? Creating 2 appropriate STK INT arrays and using > both as a single zpool device, i.e. without ZFS mirror devs and 2nd copies? Here's what we did: Configure all internal disks as standalone volumes on the RAID card. All those volumes have the battery-backed cache enabled. The first two 146GB drives got sliced in two: the first half of each disk became the boot/root mirror pool. The 2nd half was used for a separate-ZIL mirror, applied to an external SATA pool. Our remaining internal drives were configured into a mirrored ZFS pool for database transaction logs. No need for a separate ZIL there, since the internal drives effectively have NV cache as far as ZFS is concerned. Yes, the 256MB cache is small, but if it fills up, it is backed by the 10kRPM internal SAS drives, which should have decent latency when compared to external SATA JBOD drives. And even this tiny NV cache makes a huge difference when used on an NFS server: http://acc.ohsu.edu/~hakansom/j4400_bench.html Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zpool without any redundancy
>I wrote: >> Is anyone else tired of seeing the word redundancy? (:-) matthias.ap...@lanlabor.com said: > Only in a perfect world (tm) ;-) > IMHO there is no such thing as "too much redundancy". In the real world the > possibilities of redundancy are only limited by money, Sigh. I was just joking about how many times the word showed up in all of our postings. http://www.imdb.com/title/tt1436296/ Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zpool without any redundancy
mmusa...@east.sun.com said: > What benefit are you hoping zfs will provide in this situation? Examine > your situation carefully and determine what filesystem works best for you. > There are many reasons to use ZFS, but if your configuration isn't set up to > take advantage of those reasons, then there's a disconnect somewhere. How about if your config can only take advantage of _some_ of those reasons to use ZFS? There are plenty of benefits to using ZFS on a single bare hard drive, and those benefits apply to using it on an expensive SAN array. It's up to each individual to decide if adding redundancy is worthwhile or not. I'm not saying ZFS is perfect. And, ZFS is indeed better when it can make use of redundancy. But ZFS has lost data even with such redundancy, so having it does not confer magical protection from all disasters. Anyway, here's a note describing our experience with this situation: We've been using ZFS here on two hardware RAID fiberchannel arrays, with no ZFS-level redundancy, starting September-2006 -- roughly 6TB of data, checksums enabled, weekly scrubs, regular tape backups. So far there has been not one checksum error detected on these arrays. We've had dumb SAN connectivity losses, complete power failures on arrays, FC switches, and/or file servers, and so on, but no loss of data. Before ZFS, we used a combination of SAM-QFS and UFS filesystems on the same arrays, and ZFS has proved much easier to manage, reducing data loss due to human errors in volume and space management. The checksum feature makes filesystems without it into second-class offerings, in my opinion. Is anyone else tired of seeing the word redundancy? (:-) Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
webcl...@rochester.rr.com said: > To verify data, I cannot depend on existing tools since diff is not large > file aware. My best idea at this point is to calculate and compare MD5 sums > of every file and spot check other properties as best I can. Ray, I recommend that you use rsync's "-c" to compare copies. It reads all the source files, computes a checksum for them, then does the same for the destination and compares checksums. As far as I know, the only thing that rsync can't do in your situation is the ZFS/NFSv4 ACL's. I've used it to migrate many TB's of data. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
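Concretely, the invocation I have in mind is along these lines (paths are hypothetical; "-n" makes it a dry run so nothing gets modified, "-c" forces full-file checksums on both sides):
  rsync -n -a -c -v /old/pool/data/ /new/pool/data/
Anything it would have copied shows up in the output, which for a clean verification pass should be an empty list.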
Re: [zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
David Stewart wrote: > How do I identify which drive it is? I hear each drive spinning (I listened > to them individually) so I can't simply select the one that is not spinning. You can try reading from each raw device, and looking for a blinky-light to identify which one is active. If you don't have individual lights, you may be able to hear which one is active. The "dd" command should do. Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
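Something as simple as this will do it (device name is hypothetical -- repeat for each member of the raidz and watch or listen for the one that stays quiet):
  dd if=/dev/rdsk/c5t2d0p0 of=/dev/null bs=1024k count=1000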
Re: [zfs-discuss] periodic slow responsiveness
rswwal...@gmail.com said: > Yes, but if it's on NFS you can just figure out the workload in MB/s and use > that as a rough guideline. I wonder if that's the case. We have an NFS server without NVRAM cache (X4500), and it gets huge MB/sec throughput on large-file writes over NFS. But it's painfully slow on the "tar extract lots of small files" test, where many, tiny, synchronous metadata operations are performed. > I did a smiliar test with a 512MB BBU controller and saw no difference with > or without the SSD slog, so I didn't end up using it. > > Does your BBU controller ignore the ZFS flushes? I believe it does (it would be slow otherwise). It's the Sun StorageTek internal SAS RAID HBA. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
j...@jamver.id.au said: > For a predominantly NFS server purpose, it really looks like a case of the > slog has to outperform your main pool for continuous write speed as well as > an instant response time as the primary criterion. Which might as well be a > fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of > them. I wonder if you ran Richard Elling's "zilstat" while running your workload. That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic. I did some filebench and "tar extract over NFS" tests of J4400 (500GB, 7200RPM SATA drives), with and without slog, where slog was using the internal 2.5" 10kRPM SAS drives in an X4150. These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7. We saw slight differences on filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present. Granted, the latter was with only one NFS client, so likely did not fill NVRAM. Pretty good results for a poor-person's slog, though: http://acc.ohsu.edu/~hakansom/j4400_bench.html Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.). None of the filers I've seen have ever had tens of GB of NVRAM. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving volumes to new controller
vidar.nil...@palantir.no said: > I'm trying to move disks in a zpool from one SATA-kontroller to another. Its > 16 disks in 4x4 raidz. Just to see if it could be done, I moved one disk from > one raidz over to the new controller. Server was powered off. > . . . > zpool replace storage c10t7d0 c11t0d0 > /dev/dsk/c11t0d0s0 is part of active ZFS pool storage. Please see zpool(1M). > . . . > I've tried several things now (fumbling around in the dark :-)). I tried to > delete all partitions and relabel the disk, with no other results than above. > . . . To recover from this situation, you'll need to erase enough blocks of the disk to get rid of the ZFS pool info. You could do this a number of ways, but probably the simplest is: dd if=/dev/zero of=/dev/rdsk/c11t0d0 bs=512 count=100 You may also need to give the same treatment to the last several blocks of the disk, where redundant ZFS labels may still be present. > Both controllers are "raid controllers", and I haven't found any way to make > them presents the disks directly to opensolaris. So I have made 1 volume for > each drive (the raid5 implementation is rather slow, and they have no > battery). Maybe this is the source of the problems? I don't think so. If the two RAID controllers were not compatible, I doubt that ZFS would see the pool info on the disk that you already moved. By the way, after you've got the above issue fixed, if you can power the server off, you might be able to move all the drives at once, without any resilvering. Just "zpool export storage" before moving the drives, then afterwards "zpool import" should be able to find the pool in the new location. Note that this export/import approach probably won't work if the two RAID controllers are not compatible with each other. Some RAID controllers can be re-flashed with non-RAID firmware, so that might simplify things. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
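In other words, roughly this sequence (pool name from your example):
  zpool export storage
  # power down, move all the drives to the new controller, boot up
  zpool import              # with no arguments, just lists importable pools and their new device names
  zpool import storage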
Re: [zfs-discuss] RAIDZ versus mirrroed
rswwal...@gmail.com said: > It's not the stripes that make a difference, but the number of controllers > there. > > What's the system config on that puppy? The "zpool status -v" output was from a Thumper (X4500), slightly edited, since in our real-world Thumper, we use c6t0d0 in c5t4d0's place in the "optimal" layout I posted, because c5t4d0 is used in the boot-drive mirror. See the following for our 2006 Thumper benchmarks, which appear to bear out Richard Elling's RaidOptimizer analysis: http://acc.ohsu.edu/~hakansom/thumper_bench.html While I'm at it, filebench numbers from a recent J4400-based database server deployment, with some "slog vs no-slog" comparisons (sorry, no SSD's available here yet): http://acc.ohsu.edu/~hakansom/j4400_bench.html Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ versus mirrroed
rswwal...@gmail.com said: > There is another type of failure that mirrors help with and that is > controller or path failures. If one side of a mirror set is on one > controller or path and the other on another then a failure of one will not > take down the set. > > You can't get that with RAIDZn. You can if you have a stripe of RAIDZn's, and enough controllers (or paths) to go around. The raidz2 below should be able to survive the loss of two controllers, shouldn't it? Regards, Marion
$ zpool status -v
  pool: zp1
 state: ONLINE
 scrub: scrub completed after 7h9m with 0 errors on Mon Sep 14 13:39:03 2009
config:

        NAME        STATE     READ WRITE CKSUM
        bulk_zp01   ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
        spares
          c0t0d0    AVAIL
          c1t0d0    AVAIL
          c4t0d0    AVAIL
          c7t0d0    AVAIL

errors: No known data errors
$
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding SAS/SATA Backplanes and Connectivity
asher...@versature.com said: > And, on that subject, is there truly a difference between Seagate's line-up > of 7200 RPM drives? They seem to now have a bunch: > . . . > Other manufacturers seem to have similar lineups. Is the difference going to > matter to me when putting a mess of them into a SAS JBOD with an expander? There are differences even within the lineup of Sun-supplied SATA drives. Some support multipathing, and some do not. Even some that are reported (in Sun docs) to support it, do not. http://opensolaris.org/jive/thread.jspa?threadID=107049&tstart=30 http://opensolaris.org/jive/thread.jspa?threadID=107057&tstart=15 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
bfrie...@simple.dallas.tx.us said: > No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) > currently have a software-imposed read bottleneck which places a limit on > how well systems will perform on this simple sequential read benchmark. > After a certain point (which is unfortunately not very high), throwing more > hardware at the problem does not result in any speed improvement. This is > demonstrated by Scott Lawson's little two disk mirror almost producing the > same performance as our much more exotic setups. Apologies for reawakening this thread -- I was away last week. Bob, have you tried changing your benchmark to be multithreaded? It occurs to me that maybe a single cpio invocation is another bottleneck. I've definitely experienced the case where a single bonnie++ process was not enough to max out the storage system. I'm not suggesting that the bug you're demonstrating is not real. It's clear that subsequent runs on the same system show the degradation, and that points out a problem. Rather, I'm thinking that maybe the timing comparisons between low-end and high-end storage systems on this particular test are not revealing the whole story. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does zpool clear delete corrupted files
jlo...@ssl.berkeley.edu said: > What's odd is we've checked a few hundred files, and most of them don't > seem to have any corruption. I'm thinking what's wrong is the metadata for > these files is corrupted somehow, yet we can read them just fine. I wish I > could tell which ones are really bad, so we wouldn't have to recreate them > unnecessarily. They are mirrored in various places, or can be recreated > via reprocessing, but recreating/ restoring that many files is no easy task. You know, this sounds similar to what happened to me once when I did a "zpool offline" to half of a mirror, changed a lot of stuff in the pool (like adding 20GB of data to an 80GB pool), then "zpool online", thinking ZFS might be smart enough to sync up the changes that had happened since detaching. Instead, a bunch of bad files were reported. Since I knew nothing was wrong with the half of the mirror that had never been offlined, I just did a "zpool detach" of the formerly offlined drive, "zpool clear" to clear the error counts, "zpool scrub" to check for integrity, then "zpool attach" to cause resilver to start from scratch. If this describes your situation, I guess the tricky part for you is to now decide which half of your mirror is the good half. There's always "rsync -n -v -a -c ..." to compare copies of files that happen to reside elsewhere. Slow but safe. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
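For what it's worth, the recovery sequence I described amounts to this (device names hypothetical; c1t2d0 being the half that had been offlined, c1t1d0 the known-good half):
  zpool detach tank c1t2d0            # drop the stale half of the mirror
  zpool clear tank                    # reset the error counters
  zpool scrub tank                    # verify the remaining copy
  zpool attach tank c1t1d0 c1t2d0     # re-attach and let it resilver from scratch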
Re: [zfs-discuss] storage & zilstat assistance
bfrie...@simple.dallas.tx.us said: > Your IOPS don't seem high. You are currently using RAID-5, which is a poor > choice for a database. If you use ZFS mirrors you are going to unleash a > lot more IOPS from the available spindles. RAID-5 may be poor for some database loads, but it's perfectly adequate for this one (small data warehouse, sequential writes, and so far mostly sequential reads as well). So far the RAID-5 LUN has not been a problem, and it doesn't look like the low IOPS are because of the hardware, rather the database/application just isn't demanding more. Please correct me if I've come to the wrong conclusion here > I am not familiar with zilstat. Presumaby the '93' is actually 930 ops/ > second? I think you answered your question in your second post. But for others, the "93" is the total ops over the reporting interval. In this case, the interval was 10 seconds, so 9.3 ops/sec. > I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS drives > arranged as six mirrors (2540 is configured like a JBOD). While I don't run > a database, I have run an IOPS benchmark with random writers (8K blocks) and > see a peak of 3708 ops/sec. With a SATA model you are not likely to see > half of that. Thanks for the 2540 numbers you posted. There's a SAS 2530 here with the same 300GB 15kRPM drives, and as you said, it's fast. But it looks so far like the SATA model, even with less than half the IOPS, will be more than enough for our workload. I'm pretty convinced that the SATA 2540 will be sufficient. What I'm not sure of is if the cheaper J4200 without SSD would be sufficient. I.e., are we generating enough synchronous traffic that lack of NVRAM cache will cause problems? One thing zilstat doesn't make obvious (to me) is the latency effects of a separate log/ZIL device. I guess I could force our old array's cache into write-through mode and see what happens to the numbers. Judging by our experience with NFS servers using this same array, I'm reluctant to try. Thanks and regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] storage & zilstat assistance
Greetings, We have a small Oracle project on ZFS (Solaris-10), using a SAN-connected array which is in need of replacement. I'm weighing whether to recommend a Sun 2540 array or a Sun J4200 JBOD as the replacement. The old array and the new ones all have 7200RPM SATA drives. I've been watching the workload on the current storage using Richard Elling's handy zilstat tool, and could use some more eyes/brains than just mine for making sense of the results. There are three pools; One is on a mirrored pair of internal 2.5" SAS 10kRPM drives, which holds some database logs; The 2nd is a RAID-5 LUN on the old SAN array (6 drives), which holds database tables & indices; The 3rd is a mirrored pair of SAN drives, holding log replicas, archives, and RMAN backup files. I've included inline below an edited "zpool status" listing to show the ZFS pools, a listing of "zilstat -l 30 10" showing ZIL traffic for each of the three pools, and a listing of "iostat -xn 10" for the relevant devices, all during the same time period. Note that the time these stats were taken was a bit atypical, in that an RMAN backup was taking place, which was the source of the read (over)load on the "san_sp2" pool devices. So, here are my conclusions, and I'd like a sanity check since I don't have a lot of experience with interpreting ZIL activity just yet. (1) ZIL activity is not very heavy. Transaction logs on the internal drives, which have no NVRAM cache, appear to generate low enough levels of traffic that we could get by without an SSD ZIL if a JBOD solution is chosen. We can keep using the internal drive pool after the old SAN array is replaced. (2) During RMAN backups, ZIL activity gets much heavier on the affected SAN pool. We see a low-enough average rate (maybe 200 KBytes/sec), with the occasional peak of as much as 1 to 2 MBytes/sec. The 100%-busy figures here are for "regular" read traffic, not ZIL. (3) Probably to be safe, we should go with the 2540 array, which does have a small NVRAM cache, even though it is a fair bit more expensive than the J4200 JBOD solution. Adding a Logzilla SSD to the J4200 is way more expensive than the 2540 with its NVRAM cache, and an 18GB Logzilla is probably overkill for this workload. I guess one question I'd add is: The "ops" numbers seem pretty small. Is it possible to give enough spindles to a pool to handle that many IOP's without needing an NVRAM cache? I know latency comes into play at some point, but are we at that point?
Thanks and regards, Marion
===
 pool: int_mp1
config:
        NAME          STATE     READ WRITE CKSUM
        int_mp1       ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0

 pool: san_sp1
config:
        NAME                                              STATE     READ WRITE CKSUM
        san_sp1                                           ONLINE       0     0     0
          c3t4849544143484920443630303133323230303430d0  ONLINE       0     0     0

 pool: san_sp2
config:
        NAME                                              STATE     READ WRITE CKSUM
        san_sp2                                           ONLINE       0     0     0
          c3t4849544143484920443630303133323230303033d0  ONLINE       0     0     0
===
# zilstat -p san_sp1 -l 30 10
   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
    108992      10899      108992     143360      14336      143360     5      1       2       2
         0          0           0          0          0           0     0      0       0       0
     33536       3353       16768      40960       4096       20480     2      0       2       0
    134144      13414       50304     163840      16384       61440     8      0       8       0
     16768       1676       16768      20480       2048       20480     1      0       1       0
         0          0           0          0          0           0     0      0       0       0
    134144      13414      134144     221184      22118      221184     2      0       0       2
    134848      13484      117376     233472      23347      143360     9      0       8       1
^C
# zilstat -p san_sp2 -l 30 10
   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
   1126264     112626      318592    1658880     165888      466944    56      0      50       6
     67072       6707       25152     114688      11468       53248     6      0       6       0
     61120       6112       16768      86016       8601       20480     7      3       4       0
    193216      19321       83840     258048      25804      114688    14      0      14       0
   1563584     156358     1043776    1916928     191692     1282048    96      3      93       0
     50304       5030       1
Re: [zfs-discuss] [on-discuss] Reliability at power failure?
udip...@gmail.com said: > dick at nagual.nl wrote: >> Maybe because on the fifth day some hardware failure occurred? ;-) > > That would be which? The system works and is up and running beautifully. > OpenSolaris, as of now. Running beautifully as long as the power stays on? Is it hard to believe hardware might glitch at power-failure (or power-on-after-failure)? > Ah, you're hinting at a rare hardware glitch as underlying problem? AFAIU, > it is a proclaimed feature of ZFS that writes are atomic, out and over Not only does ZFS advertise atomic updates, it also _depends_ on them, and checks for them having happened, likely more so than other filesystems. Is it hard to believe that ZFS is exercising and/or checking up on your hardware in ways that Linux does not do? > Uwe, > who is a big fan of a ZFS that fulfills all of its promises. Snapshots and > luupgrade have yet to fail me on it. And a few other beautiful things. It is > the reliability that makes me wonder if UFS/FFS/ext3 are not better choices > in this respect. Blaming standard, off-the-shelf hardware as 'too cheap' is a > too slippery slope, btw. Sorry to hear you're still having this issue. I can only offer anecdotal experience: Running Solaris-10 here, non-mirrored ZFS root/boot since last December (other ZFS filesystems, mirrored and non-mirrored, for 2 years prior), on a standard off-the-shelf PC, slightly more than 5 years old. This system has been through multiple power-failures, never with any corruption. Same goes for a 2-yr-old Dell desktop PC at work, with mirrored ZFS root/boot; Multiple power failures, never any reported checksum errors or other corruption. We also have Solaris-10 systems at work, non-ZFS-boot, but with ZFS running without redundancy on non-Sun fiberchannel RAID gear. These have had power failures and other SAN outages without causing corruption of ZFS filesystems. We have experienced a number of times where systems failed to boot after power-failure, due to boot-archive being out of date. Not corrupted, just out of date. Annoying and inconvenient for production systems, but nothing at all to do with ZFS. So, I personally have not found ZFS to be any less reliable in the presence of power failures than Solaris-10/UFS or Linux on the same hardware. I wonder what it is that's unique or rare about your situation, that OpenSolaris and/or ZFS is uncovering? I also wonder how hard it might be to make ZFS resilient to whatever unique/rare circumstances you have, as compared to finding/fixing/avoiding those circumstances. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
mi...@cc.umanitoba.ca said: > What would I look for with mpstat? Look for a CPU (thread) that might be 100% utilized; Also look to see if that CPU (or CPU's) has a larger number in the "ithr" column than all other CPU's. The idea here is that you aren't getting much out of the T2000 if only one (or a few) of its 32 CPU's is working hard. On our T2000's running Solaris-10 (Update 4, I believe), the default kernel settings do not enable interrupt-fanout for the network interfaces. So you can end up with all four of your e1000g's being serviced by the same CPU. You can't get even one interface to handle more than 35-45% of a gigabit if that's the case, but proper tuning has allowed us to see 90MByte/sec each, on multiple interfaces simultaneously. Note I'm not suggesting this explains your situation. But even if you've addressed this particular issue, you could still have some other piece of your stack which ends up bottlenecked on a single CPU, and mpstat can show if that's happening. Oh yes, "intrstat" can also show if hardware device interrupts are being spread among multiple CPU's. On the T2000, it's recommended that you set things up so only one thread per core is allowed to handle interrupts, freeing the others for application-only work. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris / ZFS at the low-end
bh...@freaks.com said: > Even with a very weak CPU the system is close to saturating the PCI bus for > reads with most configurations. Nice little machine. I wonder if you'd get some of the bonnie numbers increased if you ran multiple bonnie's in parallel. Even though the sequential throughput is near 100MB/sec, using both CPU cores might push more random IOP's than a single-threaded bonnie can go. This was certainly the case on an UltraSPARC-T1 (1GHz) here -- not known for single-threaded speed, but good multithreaded throughput. I ran three bonnie++'s together, using "-p 3" to initialize a semaphore, and "-y" on the three measurement runs to synchronize their startup. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
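In case it helps, those parallel bonnie++ runs looked something like this (directories and sizes are made up; "-p 3" sets up the semaphore, and each "-y" run waits on it so all three start together):
  bonnie++ -p 3
  bonnie++ -y -d /pool/bench1 -s 8192 -n 0 > b1.out 2>&1 &
  bonnie++ -y -d /pool/bench2 -s 8192 -n 0 > b2.out 2>&1 &
  bonnie++ -y -d /pool/bench3 -s 8192 -n 0 > b3.out 2>&1 &
  wait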
Re: [zfs-discuss] [perf-discuss] ZFS performance issue - READ is slow as hell...
james.ma...@sun.com said: > I'm not yet sure what's broken here, but there's something pathologically > wrong with the IO rates to the device during the ZFS tests. In both cases, > the wait queue is getting backed up, with horrific wait queue latency > numbers. On the read side, I don't understand why we're seeing 4-5 seconds of > zero disk activity on the read test in between bursts of a small number of > reads. We observed such long pauses (with zero disk activity) with a disk array that was being fed more operations than it could handle (FC queue depth). The array was not losing ops, but the OS would fill the device's queue and then the OS would completely freeze on any disk-related activity for the affected LUN's. All zpool or zfs commands related to those pools would be unresponsive during those periods, until the load slowed down enough such that the OS wasn't ahead of the array. This was with Solaris-10 here, not OpenSolaris or SXCE, but I suspect the principle would still apply. Naturally, the original poster may have a very different situation, so take the above as you wish. Maybe Dtrace can help: http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io http://blogs.sun.com/chrisg/entry/latency_bubbles_follow_up http://blogs.sun.com/chrisg/entry/that_we_should_make Note that using the above references, Dtrace showed that we had some FC operations which took 60 or even 120 seconds to complete. Things got much better here when we zeroed in on two settings: (a) set FC queue depth for the device to match its backend capacity (4). (b) turn off sorting of the queue by the OS/driver (latency evened out). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
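If you don't want to wade through those blog entries, the core of the measurement is a one-liner in this spirit (it quantizes per-I/O latency in microseconds; run it during the stalls and look for outliers up in the seconds range):
  dtrace -n 'io:::start { s[arg0] = timestamp; } io:::done /s[arg0]/ { @["I/O latency (us)"] = quantize((timestamp - s[arg0]) / 1000); s[arg0] = 0; }'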
Re: [zfs-discuss] Bad SWAP performance from zvol
casper@sun.com said: > I've upgraded my system from ufs to zfs (root pool). > By default, it creates a zvol for dump and swap. > . . . > So I removed the zvol swap and now I have a standard swap partition. The > performance is much better (night and day). The system is usable and I > don't know the job is running. > > Is this expected? If you're using Solaris-10U6 to migrate, the early revisions of liveupgrade would create swap and dump zvols that have some different properties than what S10U6 Jumpstart creates. On x86 here, the swap zvol ends up with 4k volblocksize when you Jumpstart install, but liveupgrade sets it to 8k (which does not match the system page-size of 4k). The other difference I noticed was the dump zvol from Jumpstart install has 128k volblocksize, but early S10U6 liveupgrade set it to 8k, which makes crash-dumps incredibly slow (should you have one). I know that subsequent LU patches have fixed the dump zvol volblocksize, but am not sure if the swap zvol has been updated in a LU patch. Sorry I can't report on whether zvol swap is slower than UFS swap slice for us here; None of our ZFS-root systems have done any significant swapping/paging as far as I can tell.
$ zfs get volsize,referenced rpool/swap
NAME        PROPERTY    VALUE  SOURCE
rpool/swap  volsize     4.00G  -
rpool/swap  referenced  105M   -
$
Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
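If anyone wants to check or fix an LU-created swap zvol, the dance is roughly the following (volblocksize can't be changed after creation, hence the destroy/recreate; the 4G size matches my system but adjust to taste):
  zfs get volblocksize rpool/swap rpool/dump
  swap -d /dev/zvol/dsk/rpool/swap
  zfs destroy rpool/swap
  zfs create -V 4G -b 4k rpool/swap
  swap -a /dev/zvol/dsk/rpool/swap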
Re: [zfs-discuss] Virutal zfs server vs hardware zfs server
n...@jnickelsen.de said: > As far as I know the situation with ATI is that, while ATI supplies > well-performing binary drivers for MS Windows (of course) and Linux, there is > no such thing for other OSs. So OpenSolaris uses standardized interfaces of > the graphics hardware, which have comparatively low bandwidth. > . . . > But there are things that really are a pain, e. g. web pages that constantly > blend one picture into the other, for instance http://www.strato.de/ . While > you would not notice that, usually, this page makes my laptop really slow, > such that it requires significant effort even to find and press the button to > close the window. Wow, this is getting pretty far afield from a ZFS discussion. Hopefully others will find this a helpful tidbit: I just found some xorg.conf settings which greatly alleviate this issue on my Solaris-10-x86 machine with an ATI Radeon 9200 graphics adapter. In the "Device" section, try one of the following:
Option "AccelMethod" "EXA"   # default is "XAA"
Or:
Option "XaaNoOffscreenPixmaps" "on"
Seriously, it's almost like having a new PC. Either option makes the "100% CPU while fading rotating images" go away; Personally, I prefer the 2nd option, as I found the 1st method led to slightly slower redrawing of windows (e.g. when you switch between GNOME desktops), but that will depend on what else you're doing. But yes, nVidia cards are much, much better supported in Solaris. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on SAN?
bfrie...@simple.dallas.tx.us said: > A 12-disk pool that I built a year ago is still working fine with absolutely > no problems at all. Another two disk pool built using cheap large USB > drives has been running for maybe eight months, with no problems. We have non-redundant ZFS pools on an HDS 9520V array, and also a Sun 6120 array, some of them running for two years now (S10U3, S10U4, S10U5, both SPARC and x86), up to 4TB in size. We have experienced SAN zoning mistakes, complete power loss to arrays, servers, and/or SAN switches, etc., with no pool corruption or data loss. We have not even seen one block checksum error detected by ZFS on these arrays (we have seen one such error on our X4500 in the past 6 months). Note that the only available pool failure mode in the presence of a SAN I/O error for these OS's has been to panic/reboot, but so far when the systems have come back, data has been fine. We also do tape backups of these pools, of course. Regards, -- Marion Hakanson OHSU Advanced Computing Center ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Introducing zilstat
The zilstat tool is very helpful, thanks! I tried it on an X4500 NFS server, while extracting a 14MB tar archive, both via an NFS client, and locally on the X4500 itself. Over NFS, said extract took ~2 minutes, and showed peaks of 4MB/sec buffer-bytes going through the ZIL. When run locally on the X4500, the extract took about 1 second, with zilstat showing all zeroes. I wonder if this is a case where that ZIL bypass kicks in for >32K writes, in the local tar extraction. Does zilstat's underlying dtrace include these bypass-writes in the totals it displays? I think if it's possible to get stats on this bypassed data, I'd like to see it as another column (or set of columns) in the zilstat output. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
d...@yahoo.com said: > Any recommendations for an SSD to work with an X4500 server? Will the SSDs > used in the 7000 series servers work with X4500s or X4540s? The Sun System Handbook (sunsolve.sun.com) for the 7210 appliance (an X4540-based system) lists the "logzilla" device with this fine print: PN#371-4192 Solid State disk drives can only be installed in slots 3 and 11. Makes me wonder if they would work in our X4500 NFS server. Our ZFS pool is already deployed (Solaris-10), but we have four hot spares -- two of which could be given up in favor of a mirrored ZIL. An OS upgrade to S10U6 would give the separate-log functionality, if the drivers, etc. supported the actual SSD device. I doubt we'll go out and buy them before finding out if they'll actually work -- it would be a real shame if they didn't, though. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hybrid Pools - Since when?
richard.ell...@sun.com said: > L2ARC arrived in NV at the same time as ZFS boot, b79, November 2007. It was > not back-ported to Solaris 10u6. You sure? Here's output on a Solaris-10u6 machine:
cyclops 4959# uname -a
SunOS cyclops 5.10 Generic_137138-09 i86pc i386 i86pc
cyclops 4960# zpool upgrade -v
This system is currently running ZFS pool version 10.

The following versions are supported:

VER  DESCRIPTION
---
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/N
Where 'N' is the version number.
cyclops 4961#
Note, I haven't tried adding a cache device yet. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To separate /var or not separate /var, that is the question....
vincent_b_...@yahoo.com said: > Just wondering if (excepting the existing zones thread) there are any > compelling arguments to keep /var as it's own filesystem for your typical > Solaris server. Web servers and the like. Well, it's been considered a "best practice" for servers for a lot of years to keep /var/ as a separate filesystem: (1) You can use special mount options, such as "nosuid", which improves security. E.g. world-writable areas (/var/tmp) cannot be seeded with a trojan or other privilege-escalating attack. (2) You can limit the size, preventing a non-privileged process from using up all the system's disk space. If you don't believe me, go read Sun's own Blueprints books/articles. Personally, I'd like to place a limit on /var/core/; That's the only consistent "out of disk space" cause I've seen on our Solaris-10 systems, and that happens whether /var/ is separate or not. Maybe /var/crash/ as well. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
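On a ZFS-root box, the /var/core limit is easy enough to get without making all of /var separate -- something like this (the dataset name is whatever fits your pool layout, and the coreadm step is only needed if cores aren't already landing there):
  zfs create -o mountpoint=/var/core -o quota=2g rpool/varcore
  coreadm -g /var/core/core.%f.%p -e global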
Re: [zfs-discuss] zpool replace - choke point
[EMAIL PROTECTED] said: > Thanks for the tips. I'm not sure if they will be relevant, though. We > don't talk directly with the AMS1000. We are using a USP-VM to virtualize > all of our storage and we didn't have to add anything to the drv > configuration files to see the new disk (mpxio was already turned on). We > are using the Sun drivers and mpxio and we didn't require any tinkering to > see the new LUNs. Yes, the fact that the USP-VM was recognized automatically by Solaris drivers is a good sign. I suggest that you check to see what queue-depth and disksort values you ended up with from the automatic settings: echo "*ssd_state::walk softstate |::print -t struct sd_lun un_throttle" \ | mdb -k The "ssd_state" would be "sd_state" on an x86 machine (Solaris-10). The "un_throttle" above will show the current max_throttle (queue depth); Replace it with "un_min_throttle" to see the min, and "un_f_disksort_disabled" to see the current queue-sort setting. The HDS docs for 9500 series suggested 32 as the max_throttle to use, and the default setting (Solaris-10) was 256 (hopefully with the USP-VM you get something more reasonable). And while 32 did work for us, i.e. no operations were ever lost as far as I could tell, the array back-end -- the drives themselves, and the internal SATA shelf connections, have an actual queue depth of four for each array controller. The AMS1000 has the same limitation for SATA shelves, according to our HDS engineer. In short, Solaris, especially with ZFS, functions much better if it does not try to send more FC operations to the array than the actual physical devices can handle. We were actually seeing NFS client operations hang for minutes at a time when the SAN-hosted NFS server was making its ZFS devices busy -- and this was true even if clients were using different devices than the busy ones. We do not see these hangs after making the described changes, and I believe this is because the OS is no longer waiting around for a response from devices that aren't going to respond in a reasonable amount of time. Yes, having the USP between the host and the AMS1000 will affect things; There's probably some huge cache in there somewhere. But unless you've got cache of hundreds of GB in size, at some point a resilver operation is going to end up running at the speed of the actual back-end device. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool replace - choke point
[EMAIL PROTECTED] said: > I think we found the choke point. The silver lining is that it isn't the > T2000 or ZFS. We think it is the new SAN, an Hitachi AMS1000, which has > 7200RPM SATA disks with the cache turned off. This system has a very small > cache, and when we did turn it on for one of the replacement LUNs we saw a > 10x improvement - until the cache filled up about 1 minute later (was using > zpool iostat). Oh well. We have experience with a T2000 connected to the HDS 9520V, predecessor to the AMS arrays, with SATA drives, and it's likely that your AMS1000 SATA has similar characteristics. I didn't see if you're using Sun's drivers to talk to the SAN/array, but we are using Solaris-10 (and Sun drivers + MPXIO), and since the Hitachi storage isn't automatically recognized (sd/ssd, scsi_vhci), it took a fair amount of tinkering to get parameters adjusted to work well with the HDS storage. The combination that has given us best results with ZFS is: (a) Tell the array to ignore SYNCHRONIZE_CACHE requests from the host. (b) Balance drives within each AMS disk shelf across both array controllers. (c) Set the host's max queue depth to 4 for the SATA LUN's (sd/ssd driver). (d) Set the host's disable_disksort flag (sd/ssd driver) for HDS LUN's. Here's the reference we used for setting the parameters in Solaris-10: http://wikis.sun.com/display/StorageDev/Parameter+Configuration Note that the AMS uses read-after-write verification on SATA drives, so you only have half the IOP's for writes that the drives are capable of handling. We've found that small RAID volumes (e.g. a two-drive mirror) are unbelievably slow, so you'd want to go toward having more drives per RAID group, if possible. Honestly, if I recall correctly what I saw in your "iostat" listings earlier, your situation is not nearly as "bad" as with our older array. You don't seem to be driving those HDS LUN's to the extreme busy states that we have seen on our 9520V. It was not unusual for us to see LUN's at 100% busy, 100% wait, with 35 ops total in the "actv" and "wait" columns, and I don't recall seeing any 100%-busy devices in your logs. But getting the FC queue-depth (max-throttle) setting to match what the array's back-end I/O can handle greatly reduced the long "zpool status" and other I/O-related hangs that we were experiencing. And disabling the host-side FC queue-sorting greatly improved the overall latency of the system when busy. Maybe it'll help yours too. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs boot - U6 kernel patch breaks sparc boot
[EMAIL PROTECTED] said: > I thought to look at df output before rebooting, and there are PAGES & PAGES > like this: > >/var/run/.patchSafeModeOrigFiles/usr/platform/FJSV,GPUZC-M/lib/libcpc.so.1 7597264 85240 7512024 2%/usr/platform/FJSV,GPUZC-M/lib/libcpc.so.1 > . . . > Hundreds of mountpoints, what's it doing in there? That's normal, for deferred-activation patches (like this jumbo kernel patch). They are loopback mounts which are supposed to keep any kernel-specific things from being affected by something that would otherwise change the running kernel. Using liveupgrade for patches is quite a bit cleaner, in my opinion, if you have that option. It seems to do a good job of updating grub on all bootable drives as well (as of S10U6, anyway). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
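The liveupgrade flavor of patching goes roughly like this (the BE name is arbitrary, and <patch-dir>/<patch-id> are placeholders for wherever you unpacked the patches and which ones you're applying; check luupgrade(1M) for the exact -t usage on your release):
  lucreate -n patched
  luupgrade -t -n patched -s <patch-dir> <patch-id> [<patch-id> ...]
  luactivate patched
  init 6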
Re: [zfs-discuss] unable to ludelete BE with ufs
[EMAIL PROTECTED] said: > # ludelete beA > ERROR: cannot open 'pool00/zones/global/home': dataset does not exist > ERROR: cannot mount mount point device > > ERROR: failed to mount file system on > > ERROR: unmounting partially mounted boot environment file systems > ERROR: cannot mount boot environment by icf file > ERROR: Cannot mount BE . > Unable to delete boot environment. > . . . > Big Mistake... For ZFS boot I need space for a seperate zfs root pool. So > whilst booted under beB I backup my pool00 data, destroy pool00, re-create > pool00 (a little differently, thus the error it would seem) but hold out one > of the drives and use it to create a rpool00 root pool. Then I > . . . I made this same mistake. If you "grep pool00 /etc/lu/ICF.1" you'll see filesystems beA expects to be mounted in beA; Some of those it may expect to be able to share between the current BE and beA. The way to fix things is to create a temporary pool "pool00"; This need not be on an actual disk, it could be hosted in a file or a slice, etc. Then create those datasets in the temporary pool, and try the "ludelete beA" again. Note that if the problem datasets are supposed to be shared between current BE and beA, you'll need them mounted on the original paths in the current BE, because "ludelete" will use loopback mounts to attach them into beA during the deletion process. I guess the moral of the story is that you should ludelete any old BE's before you alter the filesystems/datasets that it mounts. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
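In my case the fix looked something like this (the file-backed pool is just a throwaway to satisfy the ICF file; the dataset name comes from whatever "grep pool00 /etc/lu/ICF.1" shows is missing):
  mkfile 512m /var/tmp/pool00.img
  zpool create pool00 /var/tmp/pool00.img
  zfs create -p pool00/zones/global/home
  ludelete beA
  zpool destroy pool00
  rm /var/tmp/pool00.img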
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
[EMAIL PROTECTED] said: > but Marion's is not really possible at all, and won't be for a while with > other groups' choice of storage-consumer platform, so it'd have to be > GlusterFS or some other goofy fringe FUSEy thing or not-very-general crude > in-house hack. Well, of course the magnitude of fringe factor is in the eye of the beholder. I didn't intend to make pNFS seem like a done deal. I don't quite yet think of OpenSolaris as a "done deal" either, still using Solaris-10 here in production, but since this is an OpenSolaris mailing list I should be more careful. Anyway, from looking over the wiki/blog info, apparently the sticking point with pNFS may be client-side availability -- there's only Linux and (Open)Solaris NFSv4.1 clients just yet. Still, pNFS claims to be backwards compatible with NFS v3 clients: If you point a traditional NFS client at the pNFS metadata server, the MDS is supposed to relay the data from the backend data servers. [EMAIL PROTECTED] said: > It's a shame that Lustre isn't available on Solaris yet either. Actually, that may not be so terribly fringey, either. Lustre and Sun's Scalable Storage product can make use of Thumpers: http://www.sun.com/software/products/lustre/ http://www.sun.com/servers/cr/scalablestorage/ Apparently it's possible to have a Solaris/ZFS data-server for Lustre backend storage: http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU I see they do not yet have anything other than Linux clients, so that's a limitation. But you can share out a Lustre filesystem over NFS, potentially from multiple Lustre clients. Maybe via CIFS/samba as well. Lastly, I've considered the idea of using Shared-QFS to glue together multiple Thumper-hosted ISCSI LUN's. You could add shared-QFS clients (acting as NFS/CIFS servers) if the client load needed more than one. Then SAM-FS would be a possibility for backup/replication. Anyway, I do feel that none of this stuff is quite "there" yet. But my experience with ZFS on fiberchannel SAN storage, that sinking feeling I've had when a little connectivity glitch resulted in a ZFS panic, makes me wonder if non-redundant ZFS on an ISCSI SAN is "there" yet, either. So far none of our lost-connection incidents resulted in pool corruption, but we have only 4TB or so. Restoring that much from tape is feasible, but even if Gray's 150TB of data can be recreated, it would take weeks to reload it. If it's decided that the clustered-filesystem solutions aren't feasible yet, the suggestion I've seen that I liked the best was Richard's, with a bad-boy server SAS-connected to multiple J4500's. But since Gray's project already has the X4500's, I guess they'd have to find another use for them (:-). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
[EMAIL PROTECTED] said: > In general, such tasks would be better served by T5220 (or the new T5440 :-) > and J4500s. This would change the data paths from: > client T5220 X4500 disks to > client T5440 disks > > With the J4500 you get the same storage density as the X4500, but with SAS > access (some would call this direct access). You will have much better > bandwidth and lower latency between the T5440 (server) and disks while still > having the ability to multi-head the disks. The There's an odd economic factor here, if you're in the .edu sector: The Sun Education Essentials promotional price list has the X4540 priced lower than a bare J4500 (not on the promotional list, but with a standard EDU discount). We have a project under development right now which might be served well by one of these EDU X4540's with a J4400 attached to it. The spec sheets for J4400 and J4500 say you can chain together enough of them to make a pool of 192 drives. I'm unsure about the bandwidth of these daisy-chained SAS interconnects, though. Any thoughts as to how high one might scale an X4540-plus-J4x00 solution? How does the X4540's internal disk bandwidth compare to that of the (non-RAID) SAS HBA? Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
[EMAIL PROTECTED] said: > It's interesting how the speed and optimisation of these maintenance > activities limit pool size. It's not just full scrubs. If the filesystem is > subject to corruption, you need a backup. If the filesystem takes two months > to back up / restore, then you need really solid incremental backup/restore > features, and the backup needs to be a cold spare, not just a > backup---restoring means switching the roles of the primary and backup > system, not actually moving data. I'll chime in here: I'm uncomfortable with such a huge ZFS pool, and also with the ZFS-over-iSCSI-on-ZFS approach. There just seem to be too many moving parts depending on each other, any one of which can make the entire pool unavailable. For the stated usage of the original poster, I think I would aim toward turning each of the Thumpers into an NFS server, configuring the head-node as a pNFS/NFSv4.1 metadata server, and letting all the clients speak parallel-NFS to the "cluster" of file servers. You'll end up with a huge logical pool, but a Thumper outage should result only in loss of access to the data on that particular system. The work of scrub/resilver/replication can be divided among the servers rather than all living on a single head node. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] x4500 vs AVS ?
[EMAIL PROTECTED] said: > We did ask our vendor, but we were just told that AVS does not support > x4500. You might have to use the open-source version of AVS, but it's not clear if that requires OpenSolaris or if it will run on Solaris-10. Here's a description of how to set it up between two X4500's: http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS noob question
[EMAIL PROTECTED] said: > I took a snapshot of a directory in which I hold PDF files related to math. > I then added a 50MB pdf file from a CD (Oxford Math Reference; I strongly > recommend this to any math enthusiast) and did "zfs list" to see the size of > the snapshot (sheer curiosity). I don't have compression turned on for this > filesystem. However, it seems that the 50MB PDF took up only 64K. How is > that possible? Is ZFS such a good filesystem, that it shrinks files to a > mere fraction of their size? If I understand correctly, you were expecting the snapshot to grow in size because you made a change to the current filesystem, right? Since the new file did not exist when the snapshot was taken, the snapshot knows nothing about the new file's data blocks. A snapshot only needs to hold onto blocks that existed at the time it was taken and have since changed, e.g. blocks in files that get modified or removed. When you add a new file, the only pre-existing blocks that change are the directory (metadata) blocks that end up holding the new file's entry. That could indeed be about 64K worth of changed blocks. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
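A quick way to see this behaviour for yourself (the dataset and file names here are invented for illustration):

   # zfs snapshot tank/math@before
   # cp /cdrom/reference.pdf /tank/math/
   # sync
   # zfs list -o name,used,refer tank/math tank/math@before
   # rm /tank/math/some-old-file.pdf
   # sync
   # zfs list -o name,used,refer tank/math tank/math@before

After the copy, the snapshot's USED stays tiny (just the changed directory blocks). After the delete, the removed file's blocks get charged to the snapshot's USED, because the snapshot is now the only thing keeping them alive.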
Re: [zfs-discuss] ZFS with Traditional SAN
[EMAIL PROTECTED] said: > That's the one that's been an issue for me and my customers - they get billed > back for GB allocated to their servers by the back end arrays. To be more > explicit about the 'self-healing properties' - To deal with any fs > corruption situation that would traditionally require an fsck on UFS (SAN > switch crash, multipathing issues, cables going flaky or getting pulled, > server crash that corrupts fs's) ZFS needs some disk redundancy in place so > it has parity and can recover. (raidz, zfs mirror, etc) Which means to use > ZFS a customer has to pay more to get the back end storage redundancy they > need to recover from anything that would cause an fsck on UFS. I'm not > saying it's a bad implementation or that the gains aren't worth it, just that > cost-wise, ZFS is more expensive in this particular bill-back model. If your back-end array implements RAID-0, you need not suffer the extra expense. Allocate one RAID-0 LUN per physical drive, then use ZFS to make raidz or mirrored pools as appropriate. To add to the other anecdotes on this thread: We have non-redundant ZFS pools on SAN storage, in production use for about a year, replacing some SAM-QFS filesystems which were formerly on the same arrays. We have had the "normal" ZFS panics occur in the presence of I/O errors (SAN zoning mistakes, cable issues, switch bugs), and had no ZFS corruption or data loss as a result. We run S10U4 and S10U5, both SPARC and x86. MPxIO works fine, once you have the OS and arrays configured properly. Note that I'd by far prefer to have ZFS-level redundancy, but our equipment doesn't support a useful RAID-0, and our customers want cheap storage. But we also charge them for tape backups. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
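A rough sketch of that one-LUN-per-drive layout (the c#t#d# names are placeholders for whatever single-drive RAID-0 LUNs the array actually presents):

   # zpool create tank \
       raidz c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 \
       raidz c6t5d0 c6t6d0 c6t7d0 c6t8d0 c6t9d0
   # zpool status tank

With parity held at the ZFS layer, the situations that would force an fsck on UFS can usually be self-healed in place instead of requiring a restore.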
Re: [zfs-discuss] SSD update
[EMAIL PROTECTED] said: > Seriously, I don't even care about the cost. Even with the smallest > capacity, four of those gives me 128GB of write cache supporting 680MB/s and > 40k IOPS. Show me a hardware raid controller that can even come close to > that. Four of those will strain even 10GB/s Infiniband. I had my sights set lower. Our Thumper has four hot-spare drives right now. I'd take one or two of those out and replace them with one or two 80GB SSD's, upgrade to S10U6 when available, and set them up as a separate log device. That would get rid of the horrible NFS latencies that come from the NFS-vs-ZIL interaction. It would only take a tiny SSD for an NFS ZIL, really. We have an old array with 1GB cache, and telling that to ignore cache-flush requests from ZFS made a huge difference in NFS latency. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
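A sketch of that swap, assuming c5t4d0 is the hot-spare slot that receives the SSD (device and pool names are invented; separate log devices need the newer pool version that arrives with S10U6, hence the upgrade first):

   # zpool remove tank c5t4d0
   (physically replace the spare drive with the SSD, then)
   # zpool add tank log c5t4d0
   # zpool status tank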
Re: [zfs-discuss] resilver in progress - which disk is inconsistent?
[EMAIL PROTECTED] said: > AFAIK there is no way to tell resilvering to pause, so I want to detach the > inconsistent disk and attach it again tonight, when it won't affect users. To > do that I need to know which disk is inconsistent, but zpool status does not > show me any info in regard. > > Is there any way to identify which disk is inconsistent? I know this is too late to help you now, but... Doesn't "zpool status -v" do what you want? Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS jammed while busy
[EMAIL PROTECTED] said: > I'm curious about your array configuration above... did you create your > RAIDZ2 as one vdev or multiple vdev's? If multiple, how many? On mine, I > have all 10 disks set up as one RAIDZ2 vdev which is supposed to be near the > performance limit... I'm wondering how much I would gain by splitting it into > two vdev's for the price of losing 1.5TB (2 disks) worth of storage. You've probably already seen/heard this, but I haven't seen it mentioned in this thread. The consensus is, and measurements seem to confirm, that splitting it into two vdev's will double your available IOPS for small, random read loads on raidz/raidz2. Here are some references and examples: http://blogs.sun.com/roch/entry/when_to_and_not_to http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance1 http://acc.ohsu.edu/~hakansom/thumper_bench.html Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
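For concreteness, a sketch of the two layouts being compared (750GB drives, invented device names):

   One 10-disk raidz2 vdev -- space of 8 disks, small random reads at roughly single-disk IOPS:
   # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                              c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0

   Two 5-disk raidz2 vdevs -- space of 6 disks, roughly twice the random-read IOPS:
   # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                       raidz2 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0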
Re: [zfs-discuss] zfs device busy
[EMAIL PROTECTED] said: > I am having trouble destroying a zfs file system (device busy) and fuser > isn't telling me who has the file open: > . . . > This situation appears to occur every night during a system test. The only > peculiar operation on the errant file system is that another system NFS > mounts it with vers=2 in a non-global zone, and then halts that zone. I > haven't been able to reproduce the problem outside the test. If you have a filesystem shared out (exported) on an NFS server, you'll get this kind of behavior. No client need have it mounted. You must first do an "unshare /files/custfs/cust12/2053699a" in your example before trying to unmount or destroy it. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
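The dataset name below is a guess based on that mountpoint; either form of the unshare should clear the busy state before the destroy:

   # unshare /files/custfs/cust12/2053699a
   (or: zfs unshare files/custfs/cust12/2053699a)
   # zfs destroy files/custfs/cust12/2053699a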
Re: [zfs-discuss] scrub performance
[EMAIL PROTECTED] said: > It is also interesting to note that this system is now making negative > progress. I can understand the remaining time estimate going up with time, > but what does it mean for the % complete number to go down after 6 hours of > work? Sorry I don't have any helpful experience in this area. It occurs to me that perhaps you are detecting a gravity wave of some sort -- Thumpers are pretty heavy, and thus may be more affected than the average server. Or the guys at SLAC have, unbeknownst to you, somehow accelerated your Thumper to near the speed of light. (:-) Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] nfs over zfs
[EMAIL PROTECTED] said: > i am a little new to zfs so please excuse my ignorance. i have a poweredge > 2950 running Nevada B82 with an Apple Xraid attached over a fiber hba. they > are formatted to JBOD with the pool configured as follows: > . . . > i have a filesystem (tpool4/seplog) shared over nfs. creating files locally > seems to be fine but writing files over nfs seem to be extremely slow on one > of the clients(os x) it reports over 3hours to copy a 500MB file. also > during the copy when i issue a zpool iostat -v 5 the response time increases > for the command. i have also noticed that none of the led's on the drives > flicker. If you haven't already, tell the Xraid to ignore cache-flush requests from the host: http://www.opensolaris.org/jive/thread.jspa?threadID=11641 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] filebench for Solaris 10?
[EMAIL PROTECTED] said: > This is what I get with the filebench-1.1.0_x86_pkg.tar.gz from SourceForge: > > # pkgadd -d . > pkgadd: ERROR: no packages were found in > > # ls > install/ pkginfo pkgmap reloc/ > . . . Um, "cd .." and "pkgadd -d ." again. The package is the actual directory that you unpacked. Note the instructions for unpacking confused me a bit as well. I had expected to "pkgadd -d . filebench", but pkgadd is smart enough to scan the entire "-d" directory for packages. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
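In other words (the package directory name here is a guess; it's whichever directory holds pkginfo, pkgmap, install/, reloc/):

   # cd ..
   # ls
   filebench/
   # pkgadd -d .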
Re: [zfs-discuss] 'du' is not accurate on zfs
[EMAIL PROTECTED] said: > It may not be relevant, but I've seen ZFS add weird delays to things too. I > deleted a file to free up space, but when I checked no more space was > reported. A second or two later the space appeared. Run the "sync" command before you do the "du". That flushes the ARC and/or ZIL out to disk, after which you'll get accurate results. I do the same when timing how long it takes to create a file -- time the file creation plus the sync to see how long it takes to get the data to nonvolatile storage. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
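For example (dataset and file names invented):

   # rm /tank/scratch/big.iso
   # sync
   # du -sh /tank/scratch

and, for the timing case mentioned above:

   # time sh -c 'cp big.iso /tank/scratch/big.iso; sync'

Without the sync, du may not yet reflect blocks that were just written or just freed.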
Re: [zfs-discuss] five megabytes per second with Microsoft iSCSI initiator (2.06)
[EMAIL PROTECTED] said: > I'm creating a zfs volume, and sharing it with "zfs set shareiscsi=on > poolname/volume". I can access the iSCSI volume without any problems, but IO > is terribly slow, as in five megabytes per second sustained transfers. > > I've tried creating an iSCSI target stored on a UFS filesystem, and get the > same slow IO. I've tried every level of RAID available in ZFS with the same > results. Apologies if you've already done so, but try testing your network (without iSCSI and storage). You can use "ttcp" from blastwave.org on the Solaris side, and PCATTCP on the Windows side. That should tell you if your TCP/IP stacks and network hardware are in good condition. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
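A minimal sanity check along those lines (the host name is a placeholder, and the exact options vary a little between ttcp builds, so check the usage text):

   on the Solaris box (receiver):     # ttcp -r -s
   on the Windows box (transmitter):  > PCATTCP -t -s solaris-host

If raw TCP throughput is also stuck near 5 MBytes/sec, the problem is in the network or TCP stacks rather than in iSCSI or ZFS.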
Re: [zfs-discuss] filebench for Solaris 10?
[EMAIL PROTECTED] said: > Some of us are still using Solaris 10 since it is the version of Solaris > released and supported by Sun. The 'filebench' software from SourceForge > does not seem to install or work on Solaris 10. The 'pkgadd' command > refuses to recognize the package, even when it is set to Solaris 2.4 mode. I've installed and run filebench (version 1.1.0) from the SourceForge packages on Solaris-10 here, both SPARC and x86_64, with no problems. Looks like I downloaded it 23-Jan-2008. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write throttling
[EMAIL PROTECTED] said: > I also tried using O_DSYNC, which stops the pathological behaviour but makes > things pretty slow - I only get a maximum of about 20MBytes/sec, which is > obviously much less than the hardware can sustain. I may misunderstand this situation, but while you're waiting for the new code from Sun, you might try O_DSYNC and at the same time tell the 6140 to ignore cache-flush requests from the host. That should get you running at spindle-speed: http://blogs.digitar.com/jjww/?itemid=44 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss