Re: [zfs-discuss] Running on Dell hardware?

2010-10-23 Thread Henrik Johansen

'Tim Cook' wrote:

[... snip ... ]


Dell requires Dell branded drives as of roughly 8 months ago.  I don't
think there was ever an H700 firmware released that didn't require
this.  I'd bet you're going to waste a lot of money to get a drive the
system refuses to recognize.


This should no longer be an issue as Dell has abandoned that practice
because of customer pressure.


--Tim





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Running on Dell hardware?

2010-10-14 Thread Henrik Johansen

'Edward Ned Harvey' wrote:

From: Henrik Johansen [mailto:hen...@scannet.dk]

The 10g models are stable - the R905's in particular are real workhorses.


You would generally consider all your machines stable now?
Can you easily pdsh to all those machines?


Yes - the only problem child has been 1 R610 (the other 2 that we have
in production have not shown any signs of trouble)


kstat | grep current_cstate ; kstat | grep supported_max_cstates

I'd really love to see whether a current_cstate higher than
supported_max_cstates is an accurate indicator of system instability.


Here's a little sample from different machines : 


R610 #1

current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  0
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2

R610 #2

current_cstate  3
current_cstate  0
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2

PE2900

current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1

PER905

current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
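
For anyone who wants to repeat Ed's check without eyeballing raw kstat
output, here is a rough, untested sketch that flags CPUs whose
current_cstate exceeds supported_max_cstates (it assumes the usual
"module:instance:name:statistic value" format from kstat -p):

#!/bin/sh
# Untested sketch - flag CPUs where current_cstate > supported_max_cstates.
kstat -p cpu_info:::current_cstate | while read stat cur; do
        inst=`echo $stat | cut -d: -f2`
        max=`kstat -p cpu_info:$inst::supported_max_cstates | awk '{print $2}'`
        if [ -n "$max" ] && [ "$cur" -gt "$max" ]; then
                echo "cpu $inst: current_cstate $cur > supported_max_cstates $max"
        fi
done

Run it via pdsh or similar if you want to sweep a whole batch of
machines at once.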

Re: [zfs-discuss] Running on Dell hardware?

2010-10-13 Thread Henrik Johansen

'Edward Ned Harvey' wrote:

I have a Dell R710 which has been flaky for some time.  It crashes
about once per week.  I have literally replaced every piece of hardware
in it, and reinstalled Sol 10u9 fresh and clean.

I am wondering if other people out there are using Dell hardware, with
what degree of success, and in what configuration?


We are running (Open)Solaris on lots of 10g servers (PE2900, PE1950, PE2950,
R905) and some 11g (R610 and soon some R815) with both PERC and non-PERC
controllers and lots of MD1000's.

The 10g models are stable - the R905's in particular are real workhorses.

We have had only one 11g server (R610) which caused trouble. The box
froze at least once a week - after replacing almost the entire box I
switched from the old iscsitgt to COMSTAR and the box has been stable
since. Go figure ...

I might add that none of these machines use the onboard Broadcom NICs.


The failure seems to be related to the PERC 6/i.  For some period around
the time of the crash, the system still responds to ping, and anything
currently in memory or running from remote storage continues to
function fine.  But new processes that require the local storage
... such as inbound ssh etc., or even a physical login at the console
... those are all hosed.  And eventually the system stops responding to
ping.  As soon as the problem starts, the only recourse is a power cycle.

I can't seem to reproduce the problem reliably, but it does happen
regularly.  Yesterday it happened several times in one day, but
sometimes it will go 2 weeks without a problem.

Again, just wondering what other people are using, and experiencing.
To see if any more clues can be found to identify the cause.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future of OpenSolaris

2010-02-22 Thread Henrik Johansen

On 02/22/10 12:00 PM, Michael Ramchand wrote:

I think Oracle have been quite clear about their plans for OpenSolaris.
They have publicly said they plan to continue to support it and the
community.

They're just a little distracted right now because they are in the
process of on-boarding many thousand Sun employees, and trying to get
them feeling happy, comfortable and at home in their new surroundings so
that they can start making money again.

The silence means that you're in a queue and they forgot to turn the
hold music on. Have patience. :-)


Well - one thing that makes me feel a bit uncomfortable is the fact 
that you can no longer buy OpenSolaris Support subscriptions.


Almost every trace of it has vanished from the Sun/Oracle website and a 
quick call to our local Sun office confirmed that they apparently no 
longer sell them.



On 02/22/10 09:22, Eugen Leitl wrote:

Oracle's silence is starting to become a bit ominous. What are
the future options for zfs, should OpenSolaris be left dead
in the water by Suracle? I have no insight into who the core
zfs developers are (have any been fired by Sun even prior to
the merger?), and who's paying them. Assuming a worst case
scenario, what would be the best candidate for a fork? Nexenta?
Debian already included FreeBSD as a kernel flavor into its
fold, it seems Nexenta could be also a good candidate.

Maybe anyone in the know could provide a short blurb on what
the state is, and what the options are.








--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future of OpenSolaris

2010-02-22 Thread Henrik Johansen

On 02/22/10 03:35 PM, Jacob Ritorto wrote:

On 02/22/10 09:19, Henrik Johansen wrote:

On 02/22/10 02:33 PM, Jacob Ritorto wrote:

On 02/22/10 06:12, Henrik Johansen wrote:

Well - one thing that makes me feel a bit uncomfortable is the fact
that you can no longer buy OpenSolaris Support subscriptions.

Almost every trace of it has vanished from the Sun/Oracle website and a
quick call to our local Sun office confirmed that they apparently no
longer sell them.


I was actually very startled to see that since we're using it in
production here. After digging through the web for hours, I found that
OpenSolaris support is now included in Solaris support. This is a win
for us because we never know if a particular box, especially a dev box,
is going to remain Solaris or OpenSolaris for the duration of a support
purchase and now we're free to mix and mingle. If you refer to the
Solaris support web page (png attached if the mailing list allows),
you'll see that OpenSolaris is now officially part of the deal and is no
longer being treated as a second class support offering.


That would be *very* nice indeed. I have checked the URL in your
screenshot but I am getting a different result (png attached).

Oh well - I'll just have to wait and see.


Confirmed your finding, Henrik.  This is a showstopper for us as the
higherups are already quite leery of Sun/Oracle and the future of
Solaris.  I'm calling Oracle to see if I can get some answers.  The SUSE
folks recently took a big chunk of our UNIX business here and
OpenSolaris was my main tool in battling that.  For us, the loss of
OpenSolaris and its support likely indicates the end of Solaris altogether.


Well - I too am reluctant to put more OpenSolaris boxes into production 
until this matter has been resolved.


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris

2010-02-22 Thread Henrik Johansen

On 02/22/10 09:52 PM, Tim Cook wrote:



On Mon, Feb 22, 2010 at 2:21 PM, Jacob Ritorto jacob.rito...@gmail.com
mailto:jacob.rito...@gmail.com wrote:


Since it seems you have absolutely no grasp of what's happening here,


Coming from the guy proclaiming the sky is falling without actually
having ANY official statement whatsoever to back up that train of thought.

perhaps it would be best for you to continue to sit idly by and let
this happen.  Thanks for helping out with the crude characterisations,
though.


Idly let what happen?  The unconfirmed death of opensolaris that you've
certified for us all without any actual proof?


Well - the lack of support subscriptions *is* a death sentence for 
OpenSolaris in many companies and I believe that this is what the OP 
complained about.




Do you understand that the OpenSolaris page has a sunset in
it and the Solaris page doesn't?


I understand previous versions of every piece of software Oracle sells
have Sunset pages, yes.  If you read the page I sent you, it clearly
states that every release of Opensolaris gets 5 years of support from
GA.  That doesn't mean they aren't releasing another version.  That
doesn't mean they're ending the opensolaris project.  That doesn't mean
they are no longer selling support for it.  Had you actually read the
link I posted, you'd have figured that out.

Sun provides contractual support on the OpenSolaris OS for up to five
years from the product's first General Availability (GA) date as
described http://www.sun.com/service/eosl/eosl_opensolaris.html.
OpenSolaris Package Updates are released approximately every 6 months.
OpenSolaris Subscriptions entitle customers during the term of the
Customer's Subscription contract to receive support on their current
version of OpenSolaris, as well as receive individual Package Updates
and OpenSolaris Support Repository Package Updates when made
commercially available by Sun. Sun may require a Customer to download
and install Package Updates or OpenSolaris OS Updates that have been
released since Customer's previous installation of OpenSolaris,
particularly when fixes have already been

  Have you spent enough (any) time
trying to renew your contracts only to see that all mentions of
OpenSolaris have been deleted from the support pages over the last few
days?


Can you tell me which Oracle rep you've spoken to who confirmed the
cancellation of Opensolaris?  It's funny, nobody I've talked to seems to
have any idea what you're talking about.  So please, a name would be
wonderful so I can direct my inquiry to this as-of-yet unnamed source.


I spoke to our local Oracle sales office last week because I 
wanted to purchase a new OpenSolaris support contract - I was informed 
that this was no longer possible and that Oracle is unable to provide 
paid support for OpenSolaris at this time.




  This, specifically, is what has been yanked out from under me
and my company.  This represents years of my and my team's effort and
investment.


Again, without some sort of official word, nothing has changed...


I take the official Oracle website to be rather ... official ?

Let's recap, shall we?

a) Almost every trace of OpenSolaris Support subscriptions vanished from 
the official website within the last 14 days.


b) An Oracle sales rep informed me personally last week that I could no 
longer purchase support subscriptions for OpenSolaris.


Please, do me a favor and call your local Oracle rep and ask for an 
Opensolaris Support subscription quote and let us know how it goes ...




It says right here those contracts are for both solaris AND opensolaris.

http://www.sun.com/service/subscriptions/index.jsp

Click Sun System Service Plans
http://www.sun.com/service/serviceplans/sunspectrum/index.jsp:
http://www.sun.com/service/serviceplans/sunspectrum/index.jsp


  Sun System Service Plans for Solaris

Sun System Service Plans for the Solaris Operating System provide
integrated hardware and* Solaris OS (or OpenSolaris OS)* support service
coverage to help keep your systems running smoothly. This single price,
complete system approach is ideal for companies running Solaris on Sun
hardware.



Sun System Service Plans != (Open)Solaris Support subscriptions


But thank you for the scare, Chicken Little.





--Tim



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (200 disks)

2010-01-29 Thread Henrik Johansen

On 01/28/10 11:13 PM, Lutz Schumann wrote:

While thinking about ZFS as the next generation filesystem without
limits I am wondering if the real world is ready for this kind of
incredible technology ...

I'm actually speaking of hardware :)

ZFS can handle a lot of devices. Once the import bug
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6761786)
is fixed it should be able to handle a lot of disks.


That was fixed in build 125.


I want to ask the ZFS community and users what large scale deployments
are out there.  How many disks ? How much capacity ? Single pool or
many pools on a server ? How does resilver work in those
environments ? How do you back up ? What is the experience so far ?
Major headaches ?

It would be great if large scale users would share their setups and
experiences with ZFS.


The largest ZFS deployment that we have currently comprises 22 
Dell MD1000 enclosures (330 x 750 GB Nearline SAS disks). We have 3 head 
nodes and use one zpool per node, built from rather narrow (5+2) 
RAIDZ2 vdevs. This setup is exclusively used for storing backup data.


Resilver times could be better - I am sure that this will improve once 
we upgrade from S10u9 to 2010.03.


One of the things that I am missing in ZFS is the ability to prioritize 
background operations like scrub and resilver. All our disks are idle 
during daytime and I would love to be able to take advantage of this, 
especially during resilver operations.


This setup has been running for about a year with no major issues so 
far. The only hiccups we've had were all HW related (no fun in firmware 
upgrading 200+ disks).



Will you ? :) Thanks, Robert



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (200 disks)

2010-01-29 Thread Henrik Johansen

On 01/29/10 07:36 PM, Richard Elling wrote:

On Jan 29, 2010, at 12:45 AM, Henrik Johansen wrote:

On 01/28/10 11:13 PM, Lutz Schumann wrote:

While thinking about ZFS as the next generation filesystem
without limits I am wondering if the real world is ready for this
kind of incredible technology ...

I'm actually speaking of hardware :)

ZFS can handle a lot of devices. Once the import bug
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6761786)
is fixed it should be able to handle a lot of disks.


That was fixed in build 125.


I want to ask the ZFS community and users what large scale
deployments are out there.  How many disks ? How much capacity ?
Single pool or many pools on a server ? How does resilver work in
those environments ? How do you back up ? What is the experience
so far ? Major headaches ?

It would be great if large scale users would share their setups
and experiences with ZFS.


The largest ZFS deployment that we have currently comprises
22 Dell MD1000 enclosures (330 x 750 GB Nearline SAS disks). We have
3 head nodes and use one zpool per node, built from rather narrow
(5+2) RAIDZ2 vdevs. This setup is exclusively used for storing
backup data.


This is an interesting design.  It looks like a good use of hardware
and redundancy for backup storage. Would you be able to share more of
the details? :-)


Each head node (Dell PE 2900's) has 3 PERC 6/E controllers (LSI 1078 
based) with 512 MB cache each.


The PERC 6/E supports both load-balancing and path failover so each 
controller has 2 SAS connections to a daisy chained group of 3 MD1000 
enclosures.


The RAIDZ2 vdev layout was chosen because it gives a reasonable 
performance vs space ratio and it maps nicely onto the 15-disk MD1000's 
( 2 x (5+2) + 1 ).
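
Purely as an illustration of that layout (device names below are made
up, not our actual ones), one MD1000's worth of disks ends up in the
pool roughly like this:

# Sketch only - controller/target names are hypothetical.
# Each 15-disk MD1000 = two (5+2) RAIDZ2 vdevs plus one hot spare.
zpool create backup01 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
    raidz2 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
    spare c2t14d0

# Subsequent enclosures are appended to the same pool:
zpool add backup01 \
    raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 \
    raidz2 c3t7d0 c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 \
    spare c3t14d0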


There is room for improvement in the design (fewer disks per controller, 
faster PCI Express slots, etc) but performance is good enough for our 
current needs.




Resilver times could be better - I am sure that this will improve
once we upgrade from S10u9 to 2010.03.


Nit: Solaris 10 u9 is 10/03 or 10/04 or 10/05, depending on what you
read. Solaris 10 u8 is 11/09.


One of the things that I am missing in ZFS is the ability to
prioritize background operations like scrub and resilver. All our
disks are idle during daytime and I would love to be able to take
advantage of this, especially during resilver operations.


Scrub I/O is given the lowest priority and is throttled. However, I
am not sure that the throttle is in Solaris 10, because that source
is not publicly available. In general, you will not notice a resource
cap until the system utilization is high enough that the cap is
effective.  In other words, if the system is mostly idle, the scrub
consumes the bulk of the resources.


That's not what I am seeing - resilver operations crawl even when the 
pool is idle.



This setup has been running for about a year with no major issues
so far. The only hiccups we've had were all HW related (no fun in
firmware upgrading 200+ disks).


ugh. -- richard




--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-08-27 Thread Henrik Johansen


Ross Walker wrote:

On Aug 27, 2009, at 4:30 AM, David Bond david.b...@tag.no wrote:


Hi,

I was directed here after posting in CIFS discuss (as I first  
thought that it could be a CIFS problem).


I posted the following in CIFS:

When using iometer from Windows against the file share on OpenSolaris  
snv_101 and snv_111 I get pauses every 5 seconds of around 5 seconds  
(maybe a little less) where no data is transferred. When data is  
transferred it is at a fair speed and gets around 1000-2000 IOPS with  
1 thread (depending on the work type). The maximum read response  
time is 200ms and the maximum write response time is 9824ms, which  
is very bad - an almost 10 second delay in being able to send data  
to the server.
This has been experienced on 2 test servers; the same servers have  
also been tested with Windows Server 2008 and they haven't shown this  
problem (the share performance was slightly lower than CIFS, but it  
was consistent, and the average access time and maximums were very  
close).



I just noticed that if the server hasn't hit its target ARC size, the  
pauses are only maybe 0.5 seconds, but as soon as it hits its ARC  
target, the IOPS drop to around 50% of what they were and then there  
are the longer pauses of around 4-5 seconds, and after every pause  
the performance slows even more. So it appears it is definitely  
server side.


This is with 100% random I/O with a spread of 33% write / 66% read, 2KB  
blocks, over a 50GB file, no compression, and a 5.5GB target ARC size.




Also I have just run some tests with different IO patterns: 100%  
sequential writes produce a consistent 2100 IOPS, except when  
it pauses for maybe 0.5 seconds every 10-15 seconds.


100% random writes produce around 200 IOPS with a 4-6 second pause  
roughly every 10 seconds.


100% sequential reads produce around 3700 IOPS with no pauses, just  
random peaks in response time (only 16ms) after about 1 minute of  
running, so nothing to complain about.


100% random reads produce around 200 IOPS, with no pauses.

So it appears that writes cause the problem - what is causing these  
very long write delays?


A network capture shows that the server doesn't respond to the write  
from the client when these pauses occur.


Also, when using iometer, the initial file creation doesn't have any  
pauses, so it might only happen when modifying files.


Any help on finding a solution to this would be really appreciated.


What version? And system configuration?

I think it might be the issue where ZFS/ARC write caches more than the  
underlying storage can handle writing in a reasonable time.


There is a parameter to control how much is write cached, I believe it  
is zfs_write_override.


You should be able to disable the write throttle mechanism altogether
with the undocumented zfs_no_write_throttle tunable.

I never got around to testing this though ...
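
If anyone does want to experiment with it, such knobs normally go into
/etc/system and take effect after a reboot - a sketch only, entirely at
your own risk given that the tunable is undocumented:

# /etc/system (untested sketch - zfs_no_write_throttle is undocumented)
set zfs:zfs_no_write_throttle = 1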



-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:

On Aug 4, 2009, at 8:36 PM, Carson Gaspar car...@taltos.org wrote:


Ross Walker wrote:

I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential  
write). It's a Dell PERC 6/e with 512MB onboard.

...
there, dedicated slog device with NVRAM speed. It would be even  
better to have a pair of SSDs behind the NVRAM, but it's hard to  
find compatible SSDs for these controllers, Dell currently doesn't  
even support SSDs in their RAID products :-(


Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support  
recently.


Yes, but the LSI support of SSDs is on later controllers.


Sure that's not just a firmware issue ?

My PERC 6/E seems to support SSD's : 


# ./MegaCli -AdpAllInfo -a2 | grep -i ssd
Enable Copyback to SSD on SMART Error   : No
Enable SSD Patrol Read  : No
Allow SSD SAS/SATA Mix in VD : No
Allow HDD/SSD Mix in VD  : No


Controller info : 
   Versions


Product Name: PERC 6/E Adapter
Serial No   : 
FW Package Build: 6.0.3-0002

Mfg. Data

Mfg. Date   : 06/08/07
Rework Date : 06/08/07
Revision No : 
Battery FRU : N/A


Image Versions in Flash:

FW Version : 1.11.82-0473
BIOS Version   : NT13-2
WebBIOS Version: 1.1-32-e_11-Rel
Ctrl-R Version : 1.01-010B
Boot Block Version : 1.00.00.01-0008


I currently have 2 x Intel X25-E (32 GB) as dedicated slogs and 1 x
Intel X25-M (80 GB) for the L2ARC behind a PERC 6/i on my Dell R905
testbox.

So far there have been no problems with them.
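
For the record, attaching devices in those roles is a one-liner per
role - a sketch with illustrative names (the pool name and the L2ARC
device name below are hypothetical):

# Sketch - pool name and c7t4d0 are hypothetical examples.
zpool add testpool log c7t2d0 c7t3d0     # the two X25-E's as slog devices
zpool add testpool cache c7t4d0          # the X25-M as L2ARC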



-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:
On Aug 4, 2009, at 10:22 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Tue, 4 Aug 2009, Ross Walker wrote:
Are you sure that it is faster than an SSD?  The data is indeed  
pushed closer to the disks, but there may be considerably more  
latency associated with getting that data into the controller  
NVRAM cache than there is into a dedicated slog SSD.


I don't see how; as the SSD is behind a controller, it still must  
make it to the controller.


If you take a look at 'iostat -x' output you will see that the  
system knows about a queue for each device.  If it was any other  
way, then a slow device would slow down access to all of the other  
devices.  If there is concern about lack of bandwidth (PCI-E?) to  
the controller, then you can use a separate controller for the SSDs.


It's not bandwidth. Though with a lot of mirrors that does become a  
concern.


Well the duplexing benefit you mention does hold true. That's a  
complex real-world scenario that would be hard to benchmark in  
production.


But easy to see the effects of.


I actually meant to say, hard to bench out of production.

Tests done by others show a considerable NFS write speed advantage  
when using a dedicated slog SSD rather than a controller's NVRAM  
cache.


I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential  
write). It's a Dell PERC 6/e with 512MB onboard.


I get 47.9 MB/s (60.7 MB/s peak) here too (also with 512MB NVRAM),  
but that is not very good when the network is good for 100 MB/s.   
With an SSD, some other folks here are getting essentially network  
speed.


In testing with ram disks I was only able to get a max of around  
60MB/s with 4k block sizes, with 4 outstanding.


I can do 64k blocks now and get around 115MB/s.


I just ran some filebench microbenchmarks against my 10 Gbit testbox
which is a Dell R905, 4 x 2.5 Ghz AMD Quad Core CPU's and 64 GB RAM.

My current pool is comprised of 7 mirror vdevs (SATA disks), 2 Intel
X25-E as slogs and 1 Intel X25-M for the L2ARC.

The pool is a MD1000 array attached to a PERC 6/E using 2 SAS cables.

The nic's are ixgbe based.

Here are the numbers : 

Randomwrite benchmark - via 10Gbit NFS : 
IO Summary: 4483228 ops, 73981.2 ops/s, (0/73981 r/w) 578.0mb/s, 44us cpu/op, 0.0ms latency


Randomread benchmark - via 10Gbit NFS :
IO Summary: 7663903 ops, 126467.4 ops/s, (126467/0 r/w) 988.0mb/s, 5us cpu/op, 
0.0ms latency

The real question is if these numbers can be trusted - I am currently
preparing new test runs with other software to be able to do a
comparison. 

There is still bus and controller plus SSD latency. I suppose one  
could use a pair of disks as a slog mirror, enable NVRAM just for  
those and let the others do write-through with their disk caches


But this encounters the problem that when the NVRAM becomes full  
then you hit the wall of synchronous disk write performance.  With  
the SSD slog, the write log can be quite large and disk writes are  
then done in a much more efficient ordered fashion similar to non- 
sync writes.


Yes, you have a point there.

So, what SSD disks do you use?

-Ross


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:

On Aug 4, 2009, at 10:17 PM, James Lever j...@jamver.id.au wrote:



On 05/08/2009, at 11:41 AM, Ross Walker wrote:


What is your recipe for these?


There wasn't one! ;)

The drive I'm using is a Dell badged Samsung MCCOE50G5MPQ-0VAD3.


So the key is that the drive needs to have the Dell badging to work?

I called my rep about getting a Dell badged SSD and he told me they  
didn't support those in MD series enclosures and so they were  
unavailable.


If the Dell branded SSD's are Samsungs then you might want to search
the archives - if I remember correctly there were mentions of
less-than-desired performance using them but I cannot recall the
details.



Maybe it's time for a new account rep.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Joseph L. Casale wrote:

Quick snippet from zpool iostat :

  mirror     1.12G   695G      0      0      0      0
    c8t12d0      -      -      0      0      0      0
    c8t13d0      -      -      0      0      0      0
  c7t2d0        4K  29.0G      0  1.56K      0   200M
  c7t3d0        4K  29.0G      0  1.58K      0   202M

The disks on c7 are both Intel X25-E 


Henrik,
So the SATA disks are in the MD1000 behind the PERC 6/E - how
have you configured/attached the 2 SSD slogs and the L2ARC drive? If
I understand you, you have used 14 of the 15 slots in the MD, so
I assume you have the 3 SSD's in the R905; what controller are
they running on?


The internal PERC 6/i controller - but I've had them on the PERC 6/E
during other test runs since I have a couple of spare MD1000's at hand. 


Both controllers work well with the SSD's.


Thanks!
jlc
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best controller card for 8 SATA drives ?

2009-06-23 Thread Henrik Johansen

Erik Ableson wrote:
The problem I had was with the single raid 0 volumes (miswrote RAID 1  
on the original message)


This is not a straight to disk connection and you'll have problems if  
you ever need to move disks around or move them to another controller.


Would you mind explaining exactly what issues or problems you had ? I
have moved disks around several controllers without problems. You must
remember, however, to create the RAID 0 LUN through LSI's MegaRAID CLI
tool and/or to clear any foreign config before the controller will
expose the disk(s) to the OS.
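
For reference, the sequence looks roughly like this (MegaCli syntax from
memory - the adapter number and enclosure:slot pair are examples, so
check them against your own setup):

# Rough sketch - adapter and enclosure:slot numbers are examples.
./MegaCli -CfgForeign -Clear -aALL     # clear any foreign config first
./MegaCli -CfgLdAdd -r0 [32:4] -a0     # single-drive RAID 0 LUN for slot 4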

The only real problem that I can think of is that you cannot use the
autoreplace functionality of recent ZFS versions with these controllers.

I agree that the MD1000 with ZFS is a rocking, inexpensive setup (we  
have several!) but I'd recommend using a SAS card with a true JBOD  
mode for maximum flexibility and portability. If I remember correctly,  
I think we're using the Adaptec 3085. I've pulled 465MB/s write and  
1GB/s read off the MD1000 filled with SATA drives.


Cordialement,

Erik Ableson

+33.6.80.83.58.28
Envoyé depuis mon iPhone

On 23 juin 2009, at 21:18, Henrik Johansen hen...@scannet.dk wrote:


Kyle McDonald wrote:

Erik Ableson wrote:


Just a side note on the PERC labelled cards: they don't have a  
JBOD mode so you _have_ to use hardware RAID. This may or may not  
be an issue in your configuration but it does mean that moving  
disks between controllers is no longer possible. The only way to  
do a pseudo JBOD is to create broken RAID 1 volumes which is not  
ideal.



It won't even let you make single drive RAID 0 LUNs? That's a shame.


We currently have 90+ disks that are created as single drive RAID 0  
LUNs

on several PERC 6/E (LSI 1078E chipset) controllers and used by ZFS.

I can assure you that they work without any problems and perform very
well indeed.

In fact, the combination of PERC 6/E and MD1000 disk arrays has worked
so well for us that we are going to double the number of disks during
this fall.

The lack of portability is disappointing. The trade-off though is  
battery backed cache if the card supports it.


-Kyle



Cordialement,

Erik Ableson

+33.6.80.83.58.28
Envoyé depuis mon iPhone

On 23 juin 2009, at 04:33, Eric D. Mudama edmud...@bounceswoosh.org 
 wrote:


 On Mon, Jun 22 at 15:46, Miles Nordin wrote:
 edm == Eric D Mudama edmud...@bounceswoosh.org writes:

  edm We bought a Dell T610 as a fileserver, and it comes with an
  edm LSI 1068E based board (PERC6/i SAS).

 which driver attaches to it?

 pciids.sourceforge.net says this is a 1078 board, not a 1068  
board.


 please, be careful.  There's too much confusion about these  
cards.


 Sorry, that may have been confusing.  We have the cheapest storage
 option on the T610, with no onboard cache.  I guess it's called the
 Dell SAS6i/R while they reserve the PERC name for the ones with
 cache.  I had understood that they were basically identical except
 for the cache, but maybe not.

 Anyway, this adapter has worked great for us so far.


 snippet of prtconf -D:


 i86pc (driver name: rootnex)
pci, instance #0 (driver name: npe)
pci8086,3411, instance #6 (driver name: pcie_pci)
pci1028,1f10, instance #0 (driver name: mpt)
sd, instance #1 (driver name: sd)
sd, instance #6 (driver name: sd)
sd, instance #7 (driver name: sd)
sd, instance #2 (driver name: sd)
sd, instance #4 (driver name: sd)
sd, instance #5 (driver name: sd)


 For this board the mpt driver is being used, and here's the  
prtconf

 -pv info:


  Node 0x1f
assigned-addresses:
81020010..fc00..0100.83020014..

 df2ec000..4000.8302001c.
 .df2f..0001
reg:
0002.....01020010....0100.03020014....4000.0302001c.

 ...0001
compatible: 'pciex1000,58.1028.1f10.8' +  
'pciex1000,58.1028.1f10'  + 'pciex1000,58.8' + 'pciex1000,58' +  
'pciexclass,01' +  'pciexclass,0100' +  
'pci1000,58.1028.1f10.8' +  'pci1000,58.1028.1f10' +  
'pci1028,1f10' + 'pci1000,58.8' +  'pci1000,58' + 'pciclass, 
01' + 'pciclass,0100'

model:  'SCSI bus controller'
power-consumption:  0001.0001
devsel-speed:  
interrupts:  0001
subsystem-vendor-id:  1028
subsystem-id:  1f10
unit-address:  '0'
class-code:  0001
revision-id:  0008
vendor-id:  1000
device-id:  0058
pcie-capid-pointer:  0068
pcie-capid-reg:  0001
name:  'pci1028,1f10'


 --eric


 --
 Eric D. Mudama
 edmud...@mail.bounceswoosh.org

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Large zpool design considerations

2008-07-04 Thread Henrik Johansen
Chris Cosby wrote:
I'm going down a bit of a different path with my reply here. I know that all
shops and their need for data are different, but hear me out.

1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
insane. Perhaps it's time to look at your backup strategy not from a hardware
perspective, but from a data retention perspective. Do you really need that
much data backed up? There has to be some way to get the volume down. If
not, you're at 100TB in just slightly over 4 years (assuming the 25% growth
factor). If your data is critical, my recommendation is to go find another
job and let someone else have that headache.

Well, we are talking about backup for ~900 servers that are in
production. Our retention period is 14 days for stuff like web servers,
and 3 weeks for SQL and such. 

We could deploy deduplication but it makes me a wee bit uncomfortable to
blindly trust our backup software.

2) 40TB of backups is, at the best possible price, 50 x 1TB drives (for spares
and such) - $12,500 for raw drive hardware. Enclosures add some money, as do
cables and such. For mirroring, 90 x 1TB drives is $22,500 for the raw drives.
In my world, I know yours is different, but the difference between a $100,000
solution and a $75,000 solution is pretty negligible. The short description
here: you can afford to do mirrors. Really, you can. Any of the parity
solutions out there, I don't care what your strategy, is going to cause you
more trouble than you're ready to deal with.

Good point. I'll take that into consideration.

I know these aren't solutions for you, it's just the stuff that was in my
head. The best possible solution, if you really need this kind of volume, is
to create something that never has to resilver. Use some nifty combination
of hardware and ZFS, like a couple of somethings that has 20TB per container
exported as a single volume, mirror those with ZFS for its end-to-end
checksumming and ease of management.

That's my considerably more than $0.02

On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn 
[EMAIL PROTECTED] wrote:

 On Thu, 3 Jul 2008, Don Enrique wrote:
 
  This means that I could potentially lose 40TB+ of data if three
  disks within the same RAIDZ-2 vdev should die before the resilvering
  of at least one disk is complete. Since most disks will be filled I
  do expect rather long resilvering times.

 Yes, this risk always exists.  The probability of three disks
 independently dying during the resilver is exceedingly low. The chance
 that your facility will be hit by an airplane during resilver is
 likely higher.  However, it is true that RAIDZ-2 does not offer the
 same ease of control over physical redundancy that mirroring does.
 If you were to use 10 independent chassis and split the RAIDZ-2
 uniformly across the chassis then the probability of a similar
 calamity impacting the same drives is driven by rack or facility-wide
 factors (e.g. building burning down) rather than shelf factors.
 However, if you had 10 RAID arrays mounted in the same rack and the
 rack falls over on its side during resilver then hope is still lost.

 I am not seeing any options for you here.  ZFS RAIDZ-2 is about as
 good as it gets and if you want everything in one huge pool, there
 will be more risk.  Perhaps there is a virtual filesystem layer which
 can be used on top of ZFS which emulates a larger filesystem but
 refuses to split files across pools.

 In the future it would be useful for ZFS to provide the option to not
 load-share across huge VDEVs and use VDEV-level space allocators.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes

-- 
Med venlig hilsen / Best Regards

Henrik Johansen
[EMAIL PROTECTED]


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Henrik Johansen
[Richard Elling] wrote:
 Don Enrique wrote:
 Hi,

 I am looking for some best practice advice on a project that I am working on.

 We are looking at migrating ~40TB backup data to ZFS, with an annual data 
 growth of
 20-25%.

 Now, my initial plan was to create one large pool comprised of X RAIDZ-2 
 vdevs ( 7 + 2 )
 with one hotspare per 10 drives and just continue to expand that pool as 
 needed.

 Between calculating the MTTDL and performance models i was hit by a rather 
 scary thought.

 A pool comprised of X vdevs is no more resilient to data loss than the 
 weakest vdev since loss
 of a vdev would render the entire pool unusable.
   

 Yes, but a raidz2 vdev using enterprise class disks is very reliable.

That's nice to hear.

 This means that I could potentially lose 40TB+ of data if three disks 
 within the same RAIDZ-2
 vdev should die before the resilvering of at least one disk is complete. 
 Since most disks
 will be filled I do expect rather long resilvering times.

 We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project 
 with as much hardware
 redundancy as we can get ( multiple controllers, dual cabling, I/O 
 multipathing, redundant PSUs,
 etc.)
   

 nit: SATA disks are single port, so you would need a SAS implementation
 to get multipathing to the disks.  This will not significantly impact the
 overall availability of the data, however.  I did an availability  
 analysis of
 thumper to show this.
 http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Yeah, I read your blog. Very informative indeed. 

I am using SAS HBA cards and SAS enclosures with SATA disks so I should
be fine.

 I could use multiple pools but that would make data management harder - which 
 in itself is a lengthy
 process in our shop.

 The MTTDL figures seem OK so how much do I need to worry ? Anyone have
 experience with this kind of setup ?
   

 I think your design is reasonable.  We'd need to know the exact
 hardware details to be able to make more specific recommendations.
 -- richard

Well, my choice of hardware is kind of limited by 2 things :

1. We are a 100% Dell shop.
2. We already have lots of enclosures that I would like to reuse for my project.

The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards, the enclosures are
Dell MD1000 diskarrays.



-- 
Med venlig hilsen / Best Regards

Henrik Johansen
[EMAIL PROTECTED]


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs data corruption

2008-04-24 Thread johansen
 I'm just interested in understanding how zfs determined there was data
 corruption when I have checksums disabled and there were no
 non-retryable read errors reported in the messages file.

If the metadata is corrupt, how is ZFS going to find the data blocks on
disk?

   I don't believe it was a real disk read error because of the
   absence of evidence in /var/adm/messages.

It's not safe to jump to this conclusion.  Disk drivers that support FMA
won't log error messages to /var/adm/messages.  As more support for I/O
FMA shows up, you won't see random spew in the messages file any more.
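
If you want to see what the I/O stack actually reported, the FMA logs
are the place to look, e.g.:

fmdump -eV | more     # raw error telemetry (ereports)
fmadm faulty          # any diagnosed faults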

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-11 Thread johansen
 Is deleting the old files/directories in the ZFS file system
 sufficient or do I need to destroy/recreate the pool and/or file
 system itself?  I've been doing the former.

The former should be sufficient, it's not necessary to destroy the pool.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread johansen
 -Still playing with 'recsize' values but it doesn't seem to be doing
 much...I don't think I have a good understanding of what exactly is being
 written...I think the whole file might be overwritten each time
 because it's in binary format.

The other thing to keep in mind is that the tunables like compression
and recsize only affect newly written blocks.  If you have a bunch of
data that was already laid down on disk and then you change the tunable,
this will only cause new blocks to have the new size.  If you experiment
with this, make sure all of your data has the same blocksize by copying
it over to the new pool once you've changed the properties.
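
In other words, something along these lines (dataset names and the 8k
value are just examples):

# Sketch - names/values are examples.  recordsize only affects blocks
# written after the property change, so rewrite the data afterwards.
zfs set recordsize=8k tank/newfs
cp -rp /tank/oldfs/. /tank/newfs/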

 -Setting zfs_nocacheflush, though got me drastically increased
 throughput--client requests took, on average, less than 2 seconds
 each!
 
 So, in order to use this, I should have a storage array, w/battery
 backup, instead of using the internal drives, correct?

zfs_nocacheflush should only be used on arrays with a battery backed
cache.  If you use this option on a disk, and you lose power, there's no
guarantee that your write successfully made it out of the cache.

A performance problem when flushing the cache of an individual disk
implies that there's something wrong with the disk or its firmware.  You
can disable the write cache of an individual disk using format(1M).  When you
do this, ZFS won't lose any data, whereas enabling zfs_nocacheflush can
lead to problems.

I'm attaching a DTrace script that will show the cache-flush times
per-vdev.  Remove the zfs_nocacheflush tuneable and re-run your test
while using this DTrace script.  If one particular disk takes longer
than the rest to flush, this should show us.  In that case, we can
disable the write cache on that particular disk.  Otherwise, we'll need
to disable the write cache on all of the disks.

The script is attached as zfs_flushtime.d

Use format(1M) with the -e option to adjust the write_cache settings for
SCSI disks.
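
From memory, the cache menu looks roughly like this (exact prompts may
differ slightly between releases):

# format -e
(select the disk from the menu)
format> cache
cache> write_cache
write_cache> display
Write Cache is enabled
write_cache> disable
write_cache> quit
cache> quit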

-j
#!/usr/sbin/dtrace -Cs
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the License).
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets [] replaced with your own identifying
 * information: Portions Copyright [] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#define DKIOC   (0x04 << 8)
#define DKIOCFLUSHWRITECACHE    (DKIOC|34)

fbt:zfs:vdev_disk_io_start:entry
/(args[0]->io_cmd == DKIOCFLUSHWRITECACHE) && (self->traced == 0)/
{
        self->traced = args[0];
        self->start = timestamp;
}

fbt:zfs:vdev_disk_ioctl_done:entry
/args[0] == self->traced/
{
        @a[stringof(self->traced->io_vd->vdev_path)] =
            quantize(timestamp - self->start);
        self->start = 0;
        self->traced = 0;
}

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mdb ::memstat including zfs buffer details?

2007-11-12 Thread johansen
I don't think it should be too bad (for ::memstat), given that (at
least in Nevada), all of the ZFS caching data belongs to the zvp
vnode, instead of kvp.

ZFS data buffers are attached to zvp; however, we still keep metadata in
the crashdump.  At least right now, this means that cached ZFS metadata
has kvp as its vnode.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fileserver performance tests

2007-10-08 Thread johansen
 statfile1 988ops/s   0.0mb/s  0.0ms/op   22us/op-cpu
 deletefile1   991ops/s   0.0mb/s  0.0ms/op   48us/op-cpu
 closefile2997ops/s   0.0mb/s  0.0ms/op4us/op-cpu
 readfile1 997ops/s 139.8mb/s  0.2ms/op  175us/op-cpu
 openfile2 997ops/s   0.0mb/s  0.0ms/op   28us/op-cpu
 closefile1   1081ops/s   0.0mb/s  0.0ms/op6us/op-cpu
 appendfilerand1   982ops/s  14.9mb/s  0.1ms/op   91us/op-cpu
 openfile1 982ops/s   0.0mb/s  0.0ms/op   27us/op-cpu
 
 IO Summary:   8088 ops 8017.4 ops/s, (997/982 r/w) 155.6mb/s,508us 
 cpu/op,   0.2ms

 I expected to see some higher numbers really...
 a simple time mkfile 16g lala gave me something like 280Mb/s.

mkfile isn't an especially realistic test for performance.  You'll note
that the fileserver workload is performing stats, deletes, closes,
reads, opens, and appends.  Mkfile is a write benchmark.  You might
consider trying the singlestreamwrite benchmark, if you're looking for
a single-threaded write performance test.
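
If you want to try it, invoking that workload looks something like the
following (the directory is an example, and variable names depend on the
workload file shipped with your filebench version):

filebench> load singlestreamwrite
filebench> set $dir=/tank/fbtest
filebench> run 60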

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-05 Thread johansen
 But note that, for ZFS, the win with direct I/O will be somewhat
 less.  That's because you still need to read the page to compute
 its checksum.  So for direct I/O with ZFS (with checksums enabled),
 the cost is W:LPS, R:2*LPS.  Is saving one page of writes enough to
 make a difference?  Possibly not.

It's more complicated than that.  The kernel would be verifying
checksums on buffers in a user's address space.  For this to work, we
have to map these buffers into the kernel and simultaneously arrange for
these pages to be protected from other threads in the user's address
space.  We discussed some of the VM gymnastics required to properly
implement this back in January:

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-January/thread.html#36890

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/WAFL lawsuit

2007-09-06 Thread johansen-osdev
It's Columbia Pictures vs. Bunnell:

http://www.eff.org/legal/cases/torrentspy/columbia_v_bunnell_magistrate_order.pdf

The Register syndicated a Security Focus article that summarizes the
potential impact of the court decision:

http://www.theregister.co.uk/2007/08/08/litigation_data_retention/


-j

On Thu, Sep 06, 2007 at 08:14:56PM +0200, [EMAIL PROTECTED] wrote:
 
 
 It really is a shot in the dark at this point, you really never know what
 will happen in court (take the example of the recent court decision that
 all data in RAM be held for discovery ?!WHAT, HEAD HURTS!?).  But at the
 end of the day,  if you waited for a sure bet on any technology or
 potential patent disputes you would not implement anything, ever.
 
 
 Do you have a reference for all data in RAM must be held?  I guess we
 need to build COW RAM as well.
 
 Casper
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely long creat64 latencies on higly utilized zpools

2007-08-15 Thread johansen-osdev
You might also consider taking a look at this thread:

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041760.html

Although I'm not certain, this sounds a lot like the other pool
fragmentation issues.

-j

On Wed, Aug 15, 2007 at 01:11:40AM -0700, Yaniv Aknin wrote:
 Hello friends,
 
 I've recently seen a strange phenomenon with ZFS on Solaris 10u3, and was 
 wondering if someone may have more information.
 
 The system uses several zpools, each a bit under 10T, each containing one zfs 
 with lots and lots of small files (way too many, about 100m files and 75m 
 directories).
 
 I have absolutely no control over the directory structure and believe me I 
 tried to change it.
 
 Filesystem usage patterns are create and read, never delete and never rewrite.
 
 When volumes approach 90% usage, and under medium/light load (zpool iostat 
 reports 50mb/s and 750iops reads), some creat64 system calls take over 50 
 seconds to complete (observed with 'truss -D touch'). When doing manual 
 tests, I've seen similar times on unlink() calls (truss -D rm). 
 
 I'd like to stress this happens on /some/ of the calls, maybe every 100th 
 manual call (I scripted the test), which (along with normal system 
 operations) would probably be every 10,000th or 100,000th call.
 
 Other system parameters (memory usage, loadavg, process number, etc) appear 
 nominal. The machine is an NFS server, though the crazy latencies were 
 observed both local and remote.
 
 What would you suggest to further diagnose this? Has anyone seen trouble with 
 high utilization and medium load? (with or without insanely high filecount?)
 
 Many thanks in advance,
  - Yaniv
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] is send/receive incremental

2007-08-08 Thread johansen-osdev
You can do it either way.  Eric Kustarz has a good explanation of how to
set up incremental send/receive on your laptop.  The description is on
his blog:

http://blogs.sun.com/erickustarz/date/20070612

The technique he uses is applicable to any ZFS filesystem.
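
The short version is an initial full send followed by incrementals,
along these lines (pool, dataset and host names are placeholders):

# One-time full copy:
zfs snapshot tank/data@base
zfs send tank/data@base | ssh backuphost zfs receive backup/data

# Nightly incremental thereafter (send -i from the previous snapshot):
zfs snapshot tank/data@tonight
zfs send -i tank/data@base tank/data@tonight | \
    ssh backuphost zfs receive backup/data

Only the blocks changed since the previous snapshot cross the wire.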

-j

On Wed, Aug 08, 2007 at 04:44:16PM -0600, Peter Baumgartner wrote:
 
I'd like to send a backup of my filesystem offsite nightly using zfs
send/receive. Are those done incrementally so only changes move or
would a full copy get shuttled across everytime?
--
Pete

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] si3124 controller problem and fix (fwd)

2007-07-17 Thread johansen-osdev
In an attempt to speed up progress on some of the si3124 bugs that Roger
reported, I've created a workspace with the fixes for:

   6565894 sata drives are not identified by si3124 driver
   6566207 si3124 driver loses interrupts.

I'm attaching a driver which contains these fixes as well as a diff of
the changes I used to produce them.

I don't have access to a si3124 chipset, unfortunately.

Would somebody be able to review these changes and try the new driver on
a si3124 card?

Thanks,

-j

On Tue, Jul 17, 2007 at 02:39:00AM -0700, Nigel Smith wrote:
 You can see the  status of bug here:
 
 http://bugs.opensolaris.org/view_bug.do?bug_id=6566207
 
 Unfortunately, it's showing no progress since 20th June.
 
 This fix really could do to be in place for S10u4 and snv_70.
 Thanks
 Nigel Smith
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


si3124.tar.gz
Description: application/tar-gz

--- usr/src/uts/common/io/sata/adapters/si3124/si3124.c ---

Index: usr/src/uts/common/io/sata/adapters/si3124/si3124.c
--- /ws/onnv-clone/usr/src/uts/common/io/sata/adapters/si3124/si3124.c  Mon Nov 
13 23:20:01 2006
+++ 
/export/johansen/si-fixes/usr/src/uts/common/io/sata/adapters/si3124/si3124.c   
Tue Jul 17 14:37:17 2007
@@ -22,11 +22,11 @@
 /*
  * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
-#pragma ident  "@(#)si3124.c   1.4 06/11/14 SMI"
+#pragma ident  "@(#)si3124.c   1.5 07/07/17 SMI"
 
 
 
 /*
  * SiliconImage 3124/3132 sata controller driver
@@ -381,11 +381,11 @@
 
 extern struct mod_ops mod_driverops;
 
 static  struct modldrv modldrv = {
        &mod_driverops, /* driverops */
-       "si3124 driver v1.4",
+       "si3124 driver v1.5",
sictl_dev_ops, /* driver ops */
 };
 
 static  struct modlinkage modlinkage = {
MODREV_1,
@@ -2808,10 +2808,13 @@
        si_portp = si_ctlp->sictl_ports[port];
        mutex_enter(&si_portp->siport_mutex);

        /* Clear Port Reset. */
        ddi_put32(si_ctlp->sictl_port_acc_handle,
+           (uint32_t *)PORT_CONTROL_SET(si_ctlp, port),
+           PORT_CONTROL_SET_BITS_PORT_RESET);
+       ddi_put32(si_ctlp->sictl_port_acc_handle,
            (uint32_t *)PORT_CONTROL_CLEAR(si_ctlp, port),
            PORT_CONTROL_CLEAR_BITS_PORT_RESET);
 
/*
 * Arm the interrupts for: Cmd completion, Cmd error,
@@ -3509,16 +3512,16 @@
port);
 
        if (port_intr_status & INTR_COMMAND_COMPLETE) {
                (void) si_intr_command_complete(si_ctlp, si_portp,
                    port);
-       }
-
+       } else {
                /* Clear the interrupts */
                ddi_put32(si_ctlp->sictl_port_acc_handle,
                    (uint32_t *)(PORT_INTERRUPT_STATUS(si_ctlp, port)),
                    port_intr_status & INTR_MASK);
+       }
 
/*
 * Note that we did not clear the interrupt for command
 * completion interrupt. Reading of slot_status takes care
 * of clearing the interrupt for command completion case.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: [storage-discuss] NCQ performance

2007-05-29 Thread johansen-osdev
 When sequential I/O is done to the disk directly there is no performance
 degradation at all.  

All filesystems impose some overhead compared to the rate of raw disk
I/O.  It's going to be hard to store data on a disk unless some kind of
filesystem is used.  All the tests that Eric and I have performed show
regressions for multiple sequential I/O streams.  If you have data that
shows otherwise, please feel free to share.

 [I]t does not take any additional time in ldi_strategy(),
 bdev_strategy(), mv_rw_dma_start().  In some instance it actually
 takes less time.   The only thing that sometimes takes additional time
 is waiting for the disk I/O.

Let's be precise about what was actually observed.  Eric and I saw
increased service times for the I/O on devices with NCQ enabled when
running multiple sequential I/O streams.  Everything that we observed
indicated that it actually took the disk longer to service requests when
many sequential I/Os were queued.

-j


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
 *sata_hba_list::list sata_hba_inst_t satahba_next | ::print 
 sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | 
 ::grep .!=0 | ::print sata_cport_info_t cport_devp.cport_sata_drive | 
 ::print -a sata_drive_info_t satadrv_features_support satadrv_settings 
 satadrv_features_enabled

 This gives me "mdb: failed to dereference symbol: unknown symbol
 name".

You may not have the SATA module installed.  If you type:

::modinfo !  grep sata

and don't get any output, your sata driver is attached some other way.

My apologies for the confusion.

-K
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
At Matt's request, I did some further experiments and have found that
this appears to be particular to your hardware.  This is not a general
32-bit problem.  I re-ran this experiment on a 1-disk pool using a 32
and 64-bit kernel.  I got identical results:

64-bit
==

$ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   20.1
user0.0
sys 1.2

62 Mb/s

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   19.0
user0.0
sys 2.6

65 Mb/s

32-bit
==

/usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   20.1
user0.0
sys 1.7

62 Mb/s

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   19.1
user0.0
sys 4.3

65 Mb/s

-j

On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
 Marko Milisavljevic wrote:
 now lets try:
 set zfs:zfs_prefetch_disable=1
 
 bingo!
 
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  609.0    0.0 77910.0    0.0  0.0  0.8    0.0    1.4   0  83 c0d0
 
 only 1-2 % slower then dd from /dev/dsk. Do you think this is general
 32-bit problem, or specific to this combination of hardware?
 
 I suspect that it's fairly generic, but more analysis will be necessary.
 
 Finally, should I file a bug somewhere regarding prefetch, or is this
 a known issue?
 
 It may be related to 6469558, but yes please do file another bug report. 
  I'll have someone on the ZFS team take a look at it.
 
 --matt
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
Marko,
Matt and I discussed this offline some more and he had a couple of ideas
about double-checking your hardware.

It looks like your controller (or disks, maybe?) is having trouble with
multiple simultaneous I/Os to the same disk.  It looks like prefetch
aggravates this problem.

When I asked Matt what we could do to verify that it's the number of
concurrent I/Os that is causing performance to be poor, he had the
following suggestions:

set zfs_vdev_{min,max}_pending=1 and run with prefetch on, then
iostat should show 1 outstanding io and perf should be good.

or turn prefetch off, and have multiple threads reading
concurrently, then iostat should show multiple outstanding ios
and perf should be bad.
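
For the first variant, something like the following works as a rough
sketch -- zfs_vdev_{min,max}_pending are private tunables, so their names
and meaning may differ between builds:

    # temporary, via mdb
    echo 'zfs_vdev_min_pending/W 1' | mdb -kw
    echo 'zfs_vdev_max_pending/W 1' | mdb -kw

    # or persistently, in /etc/system:
    #   set zfs:zfs_vdev_min_pending = 1
    #   set zfs:zfs_vdev_max_pending = 1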

Let me know if you have any additional questions.

-j

On Wed, May 16, 2007 at 11:38:24AM -0700, [EMAIL PROTECTED] wrote:
 At Matt's request, I did some further experiments and have found that
 this appears to be particular to your hardware.  This is not a general
 32-bit problem.  I re-ran this experiment on a 1-disk pool using a 32
 and 64-bit kernel.  I got identical results:
 
 64-bit
 ==
 
 $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
 10000+0 records in
 10000+0 records out
 
 real   20.1
 user0.0
 sys 1.2
 
 62 Mb/s
 
 # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
 10000+0 records in
 10000+0 records out
 
 real   19.0
 user0.0
 sys 2.6
 
 65 Mb/s
 
 32-bit
 ==
 
 /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
 10000+0 records in
 10000+0 records out
 
 real   20.1
 user0.0
 sys 1.7
 
 62 Mb/s
 
 # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
 10000+0 records in
 10000+0 records out
 
 real   19.1
 user0.0
 sys 4.3
 
 65 Mb/s
 
 -j
 
 On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
  Marko Milisavljevic wrote:
  now lets try:
  set zfs:zfs_prefetch_disable=1
  
  bingo!
  
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   609.0    0.0 77910.0    0.0  0.0  0.8    0.0    1.4   0  83 c0d0
  
  only 1-2 % slower then dd from /dev/dsk. Do you think this is general
  32-bit problem, or specific to this combination of hardware?
  
  I suspect that it's fairly generic, but more analysis will be necessary.
  
  Finally, should I file a bug somewhere regarding prefetch, or is this
  a known issue?
  
  It may be related to 6469558, but yes please do file another bug report. 
   I'll have someone on the ZFS team take a look at it.
  
  --matt
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-15 Thread johansen-osdev
 Each drive is freshly formatted with one 2G file copied to it. 

How are you creating each of these files?

Also, would you please include a the output from the isalist(1) command?

 These are snapshots of iostat -xnczpm 3 captured somewhere in the
 middle of the operation.

Have you double-checked that this isn't a measurement problem by
measuring zfs with zpool iostat (see zpool(1M)) and verifying that
outputs from both iostats match?
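For example, run the two side by side during the dd and compare the
bandwidth they report (the pool name is whatever you created):

    # terminal 1: per-device view
    iostat -xnczpm 3

    # terminal 2: ZFS's view of the same devices
    zpool iostat -v mypool 3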

 single drive, zfs file
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  258.3    0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1
 
 Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s /
 r/s gives 256K, as I would imagine it should.

Not sure.  If we can figure out why ZFS is slower than raw disk access
in your case, it may explain why you're seeing these results.

 What if we read a UFS file from the PATA disk and ZFS from SATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
  224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0
 
 Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a
 number of times, not a fluke.

This could be cache interference.  ZFS and UFS use different caches.

How much memory is in this box?

 I have no idea what to make of all this, except that it ZFS has a problem
 with this hardware/drivers that UFS and other traditional file systems,
 don't. Is it a bug in the driver that ZFS is inadvertently exposing? A
 specific feature that ZFS assumes the hardware to have, but it doesn't? Who
 knows!

This may be a more complicated interaction than just ZFS and your
hardware.  There are a number of layers of drivers underneath ZFS that
may also be interacting with your hardware in an unfavorable way.

If you'd like to do a little poking with MDB, we can see the features
that your SATA disks claim they support.

As root, type mdb -k, and then at the  prompt that appears, enter the
following command (this is one very long line):

*sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t 
satahba_dev_port | ::array void* 32 | ::print void* | ::grep .!=0 | ::print 
sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t 
satadrv_features_support satadrv_settings satadrv_features_enabled

This should show satadrv_features_support, satadrv_settings, and
satadrv_features_enabled for each SATA disk on the system.

The values for these variables are defined in:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/sata/impl/sata.h

this is the relevant snippet for interpreting these values:

/*
 * Device feature_support (satadrv_features_support)
 */
#define SATA_DEV_F_DMA			0x01
#define SATA_DEV_F_LBA28		0x02
#define SATA_DEV_F_LBA48		0x04
#define SATA_DEV_F_NCQ			0x08
#define SATA_DEV_F_SATA1		0x10
#define SATA_DEV_F_SATA2		0x20
#define SATA_DEV_F_TCQ			0x40	/* Non NCQ tagged queuing */

/*
 * Device features enabled (satadrv_features_enabled)
 */
#define SATA_DEV_F_E_TAGGED_QING	0x01	/* Tagged queuing enabled */
#define SATA_DEV_F_E_UNTAGGED_QING  0x02/* Untagged queuing enabled */

/*
 * Drive settings flags (satdrv_settings)
 */
#define SATA_DEV_READ_AHEAD 0x0001  /* Read Ahead enabled */
#define SATA_DEV_WRITE_CACHE		0x0002	/* Write cache ON */
#define SATA_DEV_SERIAL_FEATURES	0x8000	/* Serial ATA feat. enabled */
#define SATA_DEV_ASYNCH_NOTIFY  0x2000  /* Asynch-event enabled */

This may give us more information if this is indeed a problem with
hardware/drivers supporting the right features.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-14 Thread johansen-osdev
This certainly isn't the case on my machine.

$ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real1.3
user0.0
sys 1.2

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   22.3
user0.0
sys 2.2

This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool.

My pool is configured into a 46 disk RAID-0 stripe.  I'm going to omit
the zpool status output for the sake of brevity.

 What I am seeing is that ZFS performance for sequential access is
 about 45% of raw disk access, while UFS (as well as ext3 on Linux) is
 around 70%. For workload consisting mostly of reading large files
 sequentially, it would seem then that ZFS is the wrong tool
 performance-wise. But, it could be just my setup, so I would
 appreciate more data points.

This isn't what we've observed in much of our performance testing.
It may be a problem with your config, although I'm not an expert on
storage configurations.  Would you mind providing more details about
your controller, disks, and machine setup?

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-14 Thread johansen-osdev
Marko,

I tried this experiment again using 1 disk and got nearly identical
times:

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   21.4
user0.0
sys 2.4

$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real   21.0
user0.0
sys 0.7


 [I]t is not possible for dd to meaningfully access multiple-disk
 configurations without going through the file system. I find it
 curious that there is such a large slowdown by going through file
 system (with single drive configuration), especially compared to UFS
 or ext3.

Comparing a filesystem to raw dd access isn't a completely fair
comparison either.  Few filesystems actually layout all of their data
and metadata so that every read is a completely sequential read.

 I simply have a small SOHO server and I am trying to evaluate which OS to
 use to keep a redundant disk array. With unreliable consumer-level hardware,
 ZFS and the checksum feature are very interesting and the primary selling
 point compared to a Linux setup, for as long as ZFS can generate enough
 bandwidth from the drive array to saturate single gigabit ethernet.

I would take Bart's recommendation and go with Solaris on something like a
dual-core box with 4 disks.
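
With four disks, a single raidz (or two mirrored pairs) gives you the
checksums plus redundancy you're after.  A sketch with made-up device
names:

    # one disk's worth of parity spread across the four drives
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

    # or two mirrors, which usually reads faster
    # zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0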

 My hardware at the moment is the wrong choice for Solaris/ZFS - PCI 3114
 SATA controller on a 32-bit AthlonXP, according to many posts I found.

Bill Moore lists some controller recommendations here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

 However, since dd over raw disk is capable of extracting 75+MB/s from this
 setup, I keep feeling that surely I must be able to get at least that much
 from reading a pair of striped or mirrored ZFS drives. But I can't - single
 drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
 ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)

Maybe this is a problem with your controller?  What happens when you
have two simultaneous dd's to different disks running?  This would
simulate the case where you're reading from the two disks at the same
time.
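
Something along these lines, substituting the same raw device paths you
used for the earlier single-stream runs (c0d0/c0d1 below are placeholders
taken from your iostat output):

    /usr/bin/time dd if=/dev/dsk/c0d0 of=/dev/null bs=128k count=10000 &
    /usr/bin/time dd if=/dev/dsk/c0d1 of=/dev/null bs=128k count=10000 &
    wait

If each stream drops well below the 75+MB/s you get from a single dd, the
controller is the likely bottleneck.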

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-03 Thread johansen-osdev
A couple more questions here.

[mpstat]

 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   00   0 3109  3616  316  1965   17   48   45   2450  85   0  15
   10   0 3127  3797  592  2174   17   63   46   1760  84   0  15
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   00   0 3051  3529  277  2012   14   25   48   2160  83   0  17
   10   0 3065  3739  606  1952   14   37   47   1530  82   0  17
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   00   0 3011  3538  316  2423   26   16   52   2020  81   0  19
   10   0 3019  3698  578  2694   25   23   56   3090  83   0  17
 
 # lockstat -kIW -D 20 sleep 30
 
 Profiling interrupt: 6080 events in 31.341 seconds (194 events/sec)
 
 Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller
 ---
  2068  34%  34% 0.00 1767 cpu[0] deflate_slow
  1506  25%  59% 0.00 1721 cpu[1] longest_match   
  1017  17%  76% 0.00 1833 cpu[1] mach_cpu_idle   
   454   7%  83% 0.00 1539 cpu[0] fill_window 
   215   4%  87% 0.00 1788 cpu[1] pqdownheap  
snip

What do you have zfs compression set to?  The gzip level is tunable,
according to zfs set, anyway:

PROPERTY   EDIT  INHERIT   VALUES
compression YES  YES   on | off | lzjb | gzip | gzip-[1-9]
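
e.g., to try the cheapest gzip level on a dataset (the dataset name here
is illustrative):

    zfs set compression=gzip-1 tank/test
    zfs get compression tank/test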

You still have idle time in this lockstat (and mpstat).

What do you get for a lockstat -A -D 20 sleep 30?

Do you see anyone with long lock hold times, long sleeps, or excessive
spinning?

The largest numbers from mpstat are for interrupts and cross calls.
What does intrstat(1M) show?

Have you run dtrace to determine the most frequent cross-callers?

#!/usr/sbin/dtrace -s

sysinfo:::xcalls
{
@a[stack(30)] = count();
}

END
{
trunc(@a, 30);
}

is an easy way to do this.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
This seems a bit strange.  What's the workload, and also, what's the
output for:

 ARC_mru::print size lsize
 ARC_mfu::print size lsize
and
 ARC_anon::print size

For obvious reasons, the ARC can't evict buffers that are in use.
Buffers that are available to be evicted should be on the mru or mfu
list, so this output should be instructive.

-j

On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
 
 FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
 
 
  arc::print -tad
 {
 . . .
c02e29e8 uint64_t size = 0t10527883264
c02e29f0 uint64_t p = 0t16381819904
c02e29f8 uint64_t c = 0t1070318720
c02e2a00 uint64_t c_min = 0t1070318720
c02e2a08 uint64_t c_max = 0t1070318720
 . . .
 
 Perhaps c_max does not do what I think it does?
 
 Thanks,
 /jim
 
 
 Jim Mauro wrote:
 Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
 (update 3). All file IO is mmap(file), read memory segment, unmap, close.
 
 Tweaked the arc size down via mdb to 1GB. I used that value because
 c_min was also 1GB, and I was not sure if c_max could be larger than
 c_minAnyway, I set c_max to 1GB.
 
 After a workload run:
  arc::print -tad
 {
 . . .
   c02e29e8 uint64_t size = 0t3099832832
   c02e29f0 uint64_t p = 0t16540761088
   c02e29f8 uint64_t c = 0t1070318720
   c02e2a00 uint64_t c_min = 0t1070318720
   c02e2a08 uint64_t c_max = 0t1070318720
 . . .
 
 size is at 3GB, with c_max at 1GB.
 
 What gives? I'm looking at the code now, but was under the impression
 c_max would limit ARC growth. Granted, it's not a factor of 10, and
 it's certainly much better than the out-of-the-box growth to 24GB
 (this is a 32GB x4500), so clearly ARC growth is being limited, but it
 still grew to 3X c_max.
 
 Thanks,
 /jim
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
Gar.  This isn't what I was hoping to see.  Buffers that aren't
available for eviction aren't listed in the lsize count.  It looks like
the MRU has grown to 10Gb and most of this could be successfully
evicted.

The calculation for determining if we evict from the MRU is in
arc_adjust() and looks something like:

top_sz = ARC_anon.size + ARC_mru.size

Then if top_sz > arc.p and ARC_mru.lsize > 0, we evict the smaller of
ARC_mru.lsize and top_sz - arc.p.

In your previous message it looks like arc.p is > (ARC_mru.size +
ARC_anon.size).  It might make sense to double-check these numbers
together, so when you check the size and lsize again, also check arc.p.
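
i.e., grab them all in one pass so the numbers come from roughly the same
instant (these are private kernel symbols, so the member names may vary by
build):

    # mdb -k
    > ARC_anon::print -d size
    > ARC_mru::print -d size lsize
    > ARC_mfu::print -d size lsize
    > arc::print -d p c c_max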

How/when did you configure arc_c_max?  arc.p is supposed to be
initialized to half of arc.c.  Also, I assume that there's a reliable
test case for reproducing this problem?

Thanks,

-j

On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
 
 
  ARC_mru::print -d size lsize
 size = 0t10224433152
 lsize = 0t10218960896
  ARC_mfu::print -d size lsize
 size = 0t303450112
 lsize = 0t289998848
  ARC_anon::print -d size
 size = 0
 
 
 So it looks like the MRU is running at 10GB...
 
 What does this tell us?
 
 Thanks,
 /jim
 
 
 
 [EMAIL PROTECTED] wrote:
 This seems a bit strange.  What's the workload, and also, what's the
 output for:
 
   
 ARC_mru::print size lsize
 ARC_mfu::print size lsize
 
 and
   
 ARC_anon::print size
 
 
 For obvious reasons, the ARC can't evict buffers that are in use.
 Buffers that are available to be evicted should be on the mru or mfu
 list, so this output should be instructive.
 
 -j
 
 On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
   
 FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
 
 
 
 arc::print -tad
   
 {
 . . .
c02e29e8 uint64_t size = 0t10527883264
c02e29f0 uint64_t p = 0t16381819904
c02e29f8 uint64_t c = 0t1070318720
c02e2a00 uint64_t c_min = 0t1070318720
c02e2a08 uint64_t c_max = 0t1070318720
 . . .
 
 Perhaps c_max does not do what I think it does?
 
 Thanks,
 /jim
 
 
 Jim Mauro wrote:
 
 Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
 (update 3). All file IO is mmap(file), read memory segment, unmap, close.
 
 Tweaked the arc size down via mdb to 1GB. I used that value because
 c_min was also 1GB, and I was not sure if c_max could be larger than
 c_minAnyway, I set c_max to 1GB.
 
 After a workload run:
   
 arc::print -tad
 
 {
 . . .
  c02e29e8 uint64_t size = 0t3099832832
  c02e29f0 uint64_t p = 0t16540761088
  c02e29f8 uint64_t c = 0t1070318720
  c02e2a00 uint64_t c_min = 0t1070318720
  c02e2a08 uint64_t c_max = 0t1070318720
 . . .
 
 size is at 3GB, with c_max at 1GB.
 
 What gives? I'm looking at the code now, but was under the impression
 c_max would limit ARC growth. Granted, it's not a factor of 10, and
 it's certainly much better than the out-of-the-box growth to 24GB
 (this is a 32GB x4500), so clearly ARC growth is being limited, but it
 still grew to 3X c_max.
 
 Thanks,
 /jim
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
Something else to consider, depending upon how you set arc_c_max, you
may just want to set arc_c and arc_p at the same time.  If you try
setting arc_c_max, and then setting arc_c to arc_c_max, and then set
arc_p to arc_c / 2, do you still get this problem?
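
Roughly, reusing the member addresses from your arc::print -tad output
(c_max, c, and p respectively) -- this is raw poking at private variables,
so treat it as a sketch and double-check the addresses on your system
first:

    # mdb -kw
    > c02e2a08/Z 0t1070318720
    > c02e29f8/Z 0t1070318720
    > c02e29f0/Z 0t535159360

where 0t535159360 is half of your 1GB c_max.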

-j

On Thu, Mar 15, 2007 at 05:18:12PM -0700, [EMAIL PROTECTED] wrote:
 Gar.  This isn't what I was hoping to see.  Buffers that aren't
 available for eviction aren't listed in the lsize count.  It looks like
 the MRU has grown to 10Gb and most of this could be successfully
 evicted.
 
 The calculation for determining if we evict from the MRU is in
 arc_adjust() and looks something like:
 
 top_sz = ARC_anon.size + ARC_mru.size
 
 Then if top_sz > arc.p and ARC_mru.lsize > 0, we evict the smaller of
 ARC_mru.lsize and top_sz - arc.p.
 
 In your previous message it looks like arc.p is > (ARC_mru.size +
 ARC_anon.size).  It might make sense to double-check these numbers
 together, so when you check the size and lsize again, also check arc.p.
 
 How/when did you configure arc_c_max?  arc.p is supposed to be
 initialized to half of arc.c.  Also, I assume that there's a reliable
 test case for reproducing this problem?
 
 Thanks,
 
 -j
 
 On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
  
  
   ARC_mru::print -d size lsize
  size = 0t10224433152
  lsize = 0t10218960896
   ARC_mfu::print -d size lsize
  size = 0t303450112
  lsize = 0t289998848
   ARC_anon::print -d size
  size = 0
  
  
  So it looks like the MRU is running at 10GB...
  
  What does this tell us?
  
  Thanks,
  /jim
  
  
  
  [EMAIL PROTECTED] wrote:
  This seems a bit strange.  What's the workload, and also, what's the
  output for:
  

  ARC_mru::print size lsize
  ARC_mfu::print size lsize
  
  and

  ARC_anon::print size
  
  
  For obvious reasons, the ARC can't evict buffers that are in use.
  Buffers that are available to be evicted should be on the mru or mfu
  list, so this output should be instructive.
  
  -j
  
  On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:

  FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
  
  
  
  arc::print -tad

  {
  . . .
 c02e29e8 uint64_t size = 0t10527883264
 c02e29f0 uint64_t p = 0t16381819904
 c02e29f8 uint64_t c = 0t1070318720
 c02e2a00 uint64_t c_min = 0t1070318720
 c02e2a08 uint64_t c_max = 0t1070318720
  . . .
  
  Perhaps c_max does not do what I think it does?
  
  Thanks,
  /jim
  
  
  Jim Mauro wrote:
  
  Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
  (update 3). All file IO is mmap(file), read memory segment, unmap, close.
  
  Tweaked the arc size down via mdb to 1GB. I used that value because
  c_min was also 1GB, and I was not sure if c_max could be larger than
  c_minAnyway, I set c_max to 1GB.
  
  After a workload run:

  arc::print -tad
  
  {
  . . .
   c02e29e8 uint64_t size = 0t3099832832
   c02e29f0 uint64_t p = 0t16540761088
   c02e29f8 uint64_t c = 0t1070318720
   c02e2a00 uint64_t c_min = 0t1070318720
   c02e2a08 uint64_t c_max = 0t1070318720
  . . .
  
  size is at 3GB, with c_max at 1GB.
  
  What gives? I'm looking at the code now, but was under the impression
  c_max would limit ARC growth. Granted, it's not a factor of 10, and
  it's certainly much better than the out-of-the-box growth to 24GB
  (this is a 32GB x4500), so clearly ARC growth is being limited, but it
  still grew to 3X c_max.
  
  Thanks,
  /jim
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
I suppose I should have been more forward about making my last point.
If the arc_c_max isn't set in /etc/system, I don't believe that the ARC
will initialize arc.p to the correct value.   I could be wrong about
this; however, next time you set c_max, set c to the same value as c_max
and set p to half of c.  Let me know if this addresses the problem or
not.

-j

 How/when did you configure arc_c_max?  
 Immediately following a reboot, I set arc.c_max using mdb,
 then verified reading the arc structure again.
 arc.p is supposed to be
 initialized to half of arc.c.  Also, I assume that there's a reliable
 test case for reproducing this problem?
   
 Yep. I'm using a x4500 in-house to sort out performance of a customer test
 case that uses mmap. We acquired the new DIMMs to bring the
 x4500 to 32GB, since the workload has a 64GB working set size,
 and we were clobbering a 16GB thumper. We wanted to see how doubling
 memory may help.
 
 I'm trying to clamp the ARC size because, for mmap-intensive workloads,
 it seems to hurt more than help (although, based on experiments up to this
 point, it's not hurting a lot).
 
 I'll do another reboot, and run it all down for you serially...
 
 /jim
 
 Thanks,
 
 -j
 
 On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
   
 
 ARC_mru::print -d size lsize
   
 size = 0t10224433152
 lsize = 0t10218960896
 
 ARC_mfu::print -d size lsize
   
 size = 0t303450112
 lsize = 0t289998848
 
 ARC_anon::print -d size
   
 size = 0
 
 So it looks like the MRU is running at 10GB...
 
 What does this tell us?
 
 Thanks,
 /jim
 
 
 
 [EMAIL PROTECTED] wrote:
 
 This seems a bit strange.  What's the workload, and also, what's the
 output for:
 
  
   
 ARC_mru::print size lsize
 ARC_mfu::print size lsize

 
 and
  
   
 ARC_anon::print size

 
 For obvious reasons, the ARC can't evict buffers that are in use.
 Buffers that are available to be evicted should be on the mru or mfu
 list, so this output should be instructive.
 
 -j
 
 On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
  
   
 FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
 
 

 
 arc::print -tad
  
   
 {
 . . .
   c02e29e8 uint64_t size = 0t10527883264
   c02e29f0 uint64_t p = 0t16381819904
   c02e29f8 uint64_t c = 0t1070318720
   c02e2a00 uint64_t c_min = 0t1070318720
   c02e2a08 uint64_t c_max = 0t1070318720
 . . .
 
 Perhaps c_max does not do what I think it does?
 
 Thanks,
 /jim
 
 
 Jim Mauro wrote:

 
 Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
 (update 3). All file IO is mmap(file), read memory segment, unmap, 
 close.
 
 Tweaked the arc size down via mdb to 1GB. I used that value because
 c_min was also 1GB, and I was not sure if c_max could be larger than
 c_minAnyway, I set c_max to 1GB.
 
 After a workload run:
  
   
 arc::print -tad

 
 {
 . . .
 c02e29e8 uint64_t size = 0t3099832832
 c02e29f0 uint64_t p = 0t16540761088
 c02e29f8 uint64_t c = 0t1070318720
 c02e2a00 uint64_t c_min = 0t1070318720
 c02e2a08 uint64_t c_max = 0t1070318720
 . . .
 
 size is at 3GB, with c_max at 1GB.
 
 What gives? I'm looking at the code now, but was under the impression
 c_max would limit ARC growth. Granted, it's not a factor of 10, and
 it's certainly much better than the out-of-the-box growth to 24GB
 (this is a 32GB x4500), so clearly ARC growth is being limited, but it
 still grew to 3X c_max.
 
 Thanks,
 /jim
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  
   
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] understanding zfs/thunoer bottlenecks?

2007-02-27 Thread johansen-osdev
 it seems there isn't an algorithm in ZFS that detects sequential writes;
 in a traditional fs such as UFS, one would trigger directio.

There is no directio for ZFS.  Are you encountering a situation in which
you believe directio support would improve performance?  If so, please
explain.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS multi-threading

2007-02-08 Thread johansen-osdev
 Would the logic behind ZFS take full advantage of a heavily multicored
 system, such as on the Sun Niagara platform? Would it utilize all of the
 32 concurrent threads for generating its checksums? Has anyone
 compared ZFS on a Sun Tx000, to that of a 2-4 thread x64 machine?

Pete and I are working on resolving ZFS scalability issues with Niagara and
StarCat right now.  I'm not sure if any official numbers about ZFS
performance on Niagara have been published.

As far as concurrent threads generating checksums goes, the system
doesn't work quite the way you have postulated.  The checksum is
generated in the ZIO_STAGE_CHECKSUM_GENERATE pipeline state for writes,
and verified in the ZIO_STAGE_CHECKSUM_VERIFY pipeline stage for reads.
Whichever thread happens to advance the pipeline to the checksum generate
stage is the thread that will actually perform the work.  ZFS does not
break the work of the checksum into chunks and have multiple CPUs
perform the computation.  However, it is possible to have concurrent
writes simultaneously in the checksum_generate stage.

More details about this can be found in zfs/zio.c and zfs/sys/zio_impl.h

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread johansen-osdev
 And this feature is independent of whether or not the data is
 DMA'ed straight into the user buffer.

I suppose so, however, it seems like it would make more sense to
configure a dataset property that specifically describes the caching
policy that is desired.  When directio implies different semantics for
different filesystems, customers are going to get confused.

 The  other  feature,  is to  avoid a   bcopy by  DMAing full
 filesystem block reads straight into user buffer (and verify
 checksum after). The I/O is high latency, bcopy adds a small
 amount. The kernel memory can  be freed/reuse straight after
 the user read  completes. This is  where I ask, how much CPU
 is lost to the bcopy in workloads that benefit from DIO ?

Right, except that if we try to DMA into user buffers with ZFS there's a
bunch of other things we need the VM to do on our behalf to protect the
integrity of the kernel data that's living in user pages.  Assume you
have a high-latency I/O and you've locked some user pages for this I/O.
In a pathological case, when another thread tries to access the locked
pages and then also blocks,  it does so for the duration of the first
thread's I/O.  At that point, it seems like it might be easier to accept
the cost of the bcopy instead of blocking another thread.

I'm not even sure how to assess the impact of VM operations required to
change the permissions on the pages before we start the I/O.

 The quickest return on investment I see for the directio
 hint would be to tell ZFS to not grow the ARC when servicing
 such requests.

Perhaps if we had an option that specifies not to cache data from a
particular dataset, that would suffice.  I think you've filed a CR along
those lines already (6429855)?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
 Basically speaking - there needs to be some sort of strategy for
 bypassing the ARC or even parts of the ARC for applications that
 may need to advise the filesystem of either:
 1) the delicate nature of imposing additional buffering for their
 data flow
 2) already well optimized applications that need more adaptive
 cache in the application instead of the underlying filesystem or
 volume manager

This advice can't be sensibly delivered to ZFS via a Direct I/O
mechanism.  Anton's characterization of Direct I/O as "an optimization
which allows data to be transferred directly between user data buffers
and disk, without a memory-to-memory copy," is concise and accurate.
Trying to intuit advice from this is unlikely to be useful.  It would be
better to develop a separate mechanism for delivering advice about the
application to the filesystem.  (fadvise, perhaps?)

A DIO implementation for ZFS is more complicated than UFS and adversely
impacts well optimized applications.

I looked into this late last year when we had a customer who was
suffering from too much bcopy overhead.  Billm found another workaround
instead of bypassing the ARC.

The challenge for implementing DIO for ZFS is in dealing with access to
the pages mapped by the user application.  Since ZFS has to checksum all
of its data, the user's pages that are involved in the direct I/O cannot
be written to by another thread during the I/O.  If this policy isn't
enforced, it is possible for the data written to or read from disk to be
different from their checksums.

In order to protect the user pages while a DIO is in progress, we want
support from the VM that isn't presently implemented.  To prevent a page
from being accessed by another thread, we have to unmap the TLB/PTE
entries and lock the page.  There's a cost associated with this, as it
may be necessary to cross-call other CPUs.  Any thread that accesses the
locked pages will block.  While it's possible to lock pages in the VM
today, there isn't a neat set of interfaces the filesystem can use to
maintain the integrity of the user's buffers.  Without an experimental
prototype to verify the design, it's impossible to say whether overhead
of manipulating the page permissions is more than the cost of bypassing
the cache.

What do you see as potential use cases for ZFS Direct I/O?  I'm having a
hard time imagining a situation in which this would be useful to a
customer.  The application would probably have to be single-threaded,
and if not, it would have to be pretty careful about how its threads
access buffers involved in I/O.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
 Note also that for most applications, the size of their IO operations
 would often not match the current page size of the buffer, causing
 additional performance and scalability issues.

Thanks for mentioning this, I forgot about it.

Since ZFS's default block size is configured to be larger than a page,
the application would have to issue page-aligned block-sized I/Os.
Anyone adjusting the block size would presumably be responsible for
ensuring that the new size is a multiple of the page size.  (If they
would want Direct I/O to work...)
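
e.g., checking and matching things up by hand (the dataset name is made
up; recordsize must be a power of two):

    pagesize                        # 4096 on x86, 8192 on SPARC
    zfs set recordsize=8k tank/db
    zfs get recordsize tank/db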

I believe UFS also has a similar requirement, but I've been wrong
before.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen
ZFS uses a 128k block size.  If you change dd to use a bs=128k, do you observe 
any performance improvement?

 | # time dd if=zeros-10g of=/dev/null bs=8k
 count=102400
 | 102400+0 records in
 | 102400+0 records out

 | real1m8.763s
 | user0m0.104s
 | sys 0m1.759s

It's also worth noting that this dd used less system and user time than the 
read from the raw device, yet took a longer time in real time.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen-osdev
Harley:

I had tried other sizes with much the same results, but
 hadn't gone as large as 128K.  With bs=128K, it gets worse:
 
 | # time dd if=zeros-10g of=/dev/null bs=128k count=102400
 | 81920+0 records in
 | 81920+0 records out
 | 
 | real2m19.023s
 | user0m0.105s
 | sys 0m8.514s

I may have done my math wrong, but if we assume that the real
time is the actual amount of time we spent performing the I/O (which may
be incorrect) haven't you done better here?

In this case you pushed 81920 128k records in ~139 seconds -- approx
75437 k/sec.

Using ZFS with 8k bs, you pushed 102400 8k records in ~68 seconds --
approx 12047 k/sec.

Using the raw device you pushed 102400 8k records in ~23 seconds --
approx 35617 k/sec.

I may have missed something here, but isn't this newest number the
highest performance so far?

What does iostat(1M) say about your disk read performance?

Is there any other info I can provide which would help?

Are you just trying to measure ZFS's read performance here?

It might be interesting to change your outfile (of) argument and see if
we're actually running into some other performance problem.  If you
change of=/tmp/zeros does performance improve or degrade?  Likewise, if
you write the file out to another disk (UFS, ZFS, whatever), does this
improve performance?
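
i.e., keep the read side identical and only vary the target, something
like (paths and counts are illustrative; ~1GB keeps swap-backed /tmp from
filling up):

    /usr/bin/time dd if=zeros-10g of=/tmp/zeros bs=128k count=8192
    /usr/bin/time dd if=zeros-10g of=/otherfs/zeros bs=128k count=8192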

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen-osdev
Harley:

 Old 36GB drives:
 
 | # time mkfile -v 1g zeros-1g
 | zeros-1g 1073741824 bytes
 | 
 | real2m31.991s
 | user0m0.007s
 | sys 0m0.923s
 
 Newer 300GB drives:
 
 | # time mkfile -v 1g zeros-1g
 | zeros-1g 1073741824 bytes
 | 
 | real0m8.425s
 | user0m0.010s
 | sys 0m1.809s

This is a pretty dramatic difference.  What type of drives were your old
36g drives?

I am wondering if there is something other than capacity
 and seek time which has changed between the drives.  Would a
 different scsi command set or features have this dramatic a
 difference?

I'm hardly the authority on hardware, but there are a couple of
possibilities.  Your newer drives may have a write cache.  It's also
quite likely that the newer drives have a faster speed of rotation and
seek time.

If you subtract the usr + sys time from the real time in these
measurements, I suspect the result is the amount of time you were
actually waiting for the I/O to finish.  In the first case, you spent
99% of your total time waiting for stuff to happen, whereas in the
second case it was only ~86% of your overall time.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Memory Usage

2006-09-12 Thread johansen
 1) You should be able to limit your cache max size by
 setting arc.c_max.  It's currently initialized to be
 phys-mem-size - 1GB.

Mark's assertion that this is not a best practice is something of an 
understatement.  ZFS was designed so that users/administrators wouldn't have to 
configure tunables to achieve optimal system performance.  ZFS performance is 
still a work in progress.

The problem with adjusting arc.c_max is that its definition may change from one 
release to another.  It's an internal kernel variable, its existence isn't 
guaranteed.  There are also no guarantees about the semantics of what a future 
arc.c_max might mean.  It's possible that future implementations may change the 
definition such that reducing c_max has other unintended consequences.

Unfortunately, at the present time this is probably the only way to limit the 
cache size.  Mark and I are working on strategies to make sure that ZFS is a 
better citizen when it comes to memory usage and performance.  Mark has 
recently made a number of changes which should help ZFS reduce its memory 
footprint.  However, until these changes and others make it into a production 
build we're going to have to live with this inadvisable approach for adjusting 
the cache size.

-j
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss