Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Willard Korfhage
Looks like it was RAM. I ran memtest+ 4.00, and it found no problems. I removed 
2 of the 3 sticks of RAM, ran a backup, and had no errors. I'm running more 
extensive tests, but it looks like that was it. A new motherboard, CPU and ECC 
RAM are on the way to me now.


[zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
Hello,

For desktop use, and presumably rapidly changing non-desktop uses, I
find the ARC cache pretty annoying in its behavior. For example this
morning I had to hit my launch-terminal key perhaps 50 times (roughly)
before it would start completing without disk I/O. There are plenty of
other examples as well, such as /var/db/pkg not being pulled
aggressively into cache such that pkg_* operations (this is on
FreeBSD) are slower than they should (I have to run pkg_info some
number of times before *it* will complete without disk I/O too).

I would be perfectly happy with pure LRU caching behavior or an
approximation thereof, and would therefore like to essentially
completely turn off all MFU-like weighting.

I have not investigated in great depth so it's possible this
represents an implementation problem rather than the actual intended
policy of the ARC. If the former, can someone confirm/deny? If the
latter, is there some way to tweak it? I have not found one (other
than changing the code). Is there any particular reason why such knobs
are not exposed? Am I missing something?
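(For reference, the FreeBSD port does expose the raw ARC counters via sysctl,
which at least makes it possible to watch the MRU/MFU balance while
reproducing this - a rough sketch, and the exact statistic names may vary
between versions:

# target ARC size (c), target MRU size (p), and current size, in bytes
sysctl kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.p kstat.zfs.misc.arcstats.size
# per-list hit counters, to see whether the MRU or MFU side is doing the work
sysctl kstat.zfs.misc.arcstats.mru_hits kstat.zfs.misc.arcstats.mfu_hits
)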

-- 
/ Peter Schuller


Re: [zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA

2010-04-05 Thread Ragnar Sundblad

On 5 apr 2010, at 04.35, Edward Ned Harvey wrote:

 When running the card in copyback write cache mode, I got horrible
 performance (with zfs), much worse than with copyback disabled
 (which I believe should mean it does write-through), when tested
 with filebench.
 
 When I benchmark my disks, I also find that the system is slower with
 WriteBack enabled.  I would not call it much worse, I'd estimate about 10%
 worse.

Yes, I oversimplified - I have been benchmarking with filebench,
just running the tests shipped with the OS trimmed a little
according to http://www.solarisinternals.com/wiki/index.php/FileBench.
For most tests, I typically get a little worse performance with
writeback enabled (or copyback, as they called it on this card),
maybe about 10 % in average could be about right for these tests too.

The interesting part is that with these tests and writeback disabled,
on a 4-way stripe of Sun stock 2.5" 146 GB 10k RPM drives, the test
takes 2 hours and 18 minutes (138 minutes) to complete, but with
writeback enabled it takes 16 hours 57 minutes (1017 minutes), or
over 7.3 times as long!

I can't (yet) explain the large difference in test time and the
small diff in test results.

Maybe a hardware - or driver - problem has its part in this.

I have made a few simple tests with these cards before and was
not really impressed; even with all the bells and whistles turned off
they merely seemed to be an IOPS and maybe bandwidth bottleneck, but the above
just seems not right.

  This, naturally, is counterintuitive.  I do have an explanation,
 however, which is partly conjecture:  With the WriteBack enabled, when the
 OS tells the HBA to write something, it seems to complete instantly.  So the
 OS will issue another, and another, and another.  The HBA has no knowledge
 of the underlying pool data structure, so it cannot consolidate the smaller
 writes into larger sequential ones.  It will brainlessly (or
 less-brainfully) do as it was told, and write the blocks to precisely the
 addresses that it was instructed to write.  Even if those are many small
 writes, scattered throughout the platters.  ZFS is smarter than that.  It's
 able to consolidate a zillion tiny writes, as well as some larger writes,
 all into a larger sequential transaction.  ZFS has flexibility, in choosing
 precisely how large a transaction it will create, before sending it to disk.
 One of the variables used to decide how large the transaction should be is
 ... Is the disk busy writing, right now?  If the disks are still busy, I
 might as well wait a little longer and continue building up my next
 sequential block of data to write.  If it appears to have completed the
 previous transaction already, no need to wait any longer.  Don't let the
 disks sit idle.  Just send another small write to the disk.
 
 Long story short, I think ZFS simply does a better job of write buffering
 than the HBA could possibly do.  So you benefit by disabling the WriteBack,
 in order to allow ZFS to handle that instead.
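(One way to see this difference directly, rather than infer it, is to watch the
per-device write pattern in both modes while the benchmark runs - a sketch,
with the pool name and interval only as examples:

# per-vdev IOPS and bandwidth, sampled every 5 seconds
zpool iostat -v tank 5
# per-device average write size: kw/s divided by w/s
iostat -xn 5

With the HBA cache in WriteBack mode you would expect many small writes per
device; with it disabled, fewer but larger writes as ZFS aggregates them.)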

You would think that ZIL transactions could get a speedup from the
writeback cache, meaning more sync operations per second - and in some
cases that seems to be true - and that the card should be designed to
handle the intermittent load of the txg completion bursts
(typically every 30 seconds), but something strange obviously happens,
at least on this setup.

(Actually I'd prefer to be able to conclude that there is no use for
writeback-caching HBAs - I'd like these machines to be as stable as
they possibly can be, and therefore as plain and simple as possible,
and I'd like us to be able to quickly move the disks if one machine should
break. With some data stuck in a silly writeback cache inside an HBA
that may or may not cooperate depending on its state of mind, mood and the
moon phase, that can't be done, and I'd need a much more complicated
(= error- and mistake-prone) setup. But my tests so far just seem not
right and probably can't be used to conclude anything.
I'd rather use slogs, and I have a few Intel X25-Es to test with, but
then I just recently read on this list that X25-Es aren't supported for
slog anymore! Maybe because they always have their writeback cache
turned on by default and ignore cache flush commands (and that is not a
bug - is the design from outer space?); I don't know yet.
(Don't know why I am stubbornly fooling around with this Intel junk - right
now they manage to annoy me with a crappy (or broken) PCI-PCI bridge,
a crappy HBA and crappy SSD drives...))

/ragge



[zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Andreas Höschler

Hi all,

while setting up our X4140 I have - following suggestions - added two 
SSDs as log devices as follows:


zpool add tank log c1t6d0 c1t7d0

I currently have

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME          STATE     READ WRITE CKSUM
rpool         ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t0d0s0  ONLINE       0     0     0
    c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
logs
  c1t6d0    ONLINE       0     0     0
  c1t7d0    ONLINE       0     0     0

errors: No known data errors

We have performance problems especially with FrontBase (relational 
database) running on this ZFS configuration and need to look for 
optimizations.


• I would like to remove the two SSDs as log devices from the pool and 
instead add them as a separate pool for sole use by the database to 
see how this enhances performance. I could certainly do


zpool detach tank c1t7d0

to remove one disk from the log mirror. But how can I get back the 
second SSD?


Any experiences with running a database on ZFS pools? What can I do to 
tune the performance? A smaller block size, maybe?
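(By a smaller block size I mean, for example, matching the dataset's recordsize
to the database page size before the database files are created - a sketch,
with an assumed 8 KB page size and an example dataset name:

zfs set recordsize=8k tank/frontbase

recordsize only affects files written after the change.)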


Thanks a lot,

 Andreas




Re: [zfs-discuss] ZFS getting slower over time

2010-04-05 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Marcus Wilhelmsson
 
 I have a problem with my zfs system, it's getting slower and slower
 over time. When the OpenSolaris machine is rebooted and just started I
 get about 30-35MB/s in read and write but after 4-8 hours I'm down to
 maybe 10MB/s and it varies between 4-18MB/s. Now, if i reboot the
 machine it's all gone and I have perfect speed again.
 
 Does it have something to do with the cache? I use a separate SSD as a
 cache disk.

If it is somehow related to the cache, fortunately that's easy to test for.
Just remove your log device, or cache device, and see if that helps.  But I
doubt it.


 Anyways, here's my setup:
 OpenSolaris 1.34 dev
 C2D with 4GB ram
 4x 1,5TB WD SATA drives and 1x Corsair 32GB SSD as cache

Not knowing much, I'm going to suspect your RAM.  4G doesn't sound like much
to me.  How large is your filesystem?  I think the number of files is
probably more relevant than the total number of Gb.


 Doesn't seem to matter if I copy files locally on the computer or if I
 use CIFS, still getting the same degradation in
 left my workstation copying files to/from the server for about 8 hours
 and you could see the performance dropping from about 28MB/s down to
 under 10MB/s after a couple of hours.

What are you using to measure the performance?

Is it read, or write?  Do you have compression or dedupe enabled?

Please send your /etc/release file.  At least the relevant parts.
Also, please send your zpool status



Re: [zfs-discuss] ZFS getting slower over time

2010-04-05 Thread Marcus Wilhelmsson
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marcus Wilhelmsson
  
  I have a problem with my zfs system, it's getting slower and slower
  over time. When the OpenSolaris machine is rebooted and just started I
  get about 30-35MB/s in read and write but after 4-8 hours I'm down to
  maybe 10MB/s and it varies between 4-18MB/s. Now, if i reboot the
  machine it's all gone and I have perfect speed again.
  
  Does it have something to do with the cache? I use a separate SSD as a
  cache disk.
 
 If it is somehow related to the cache, fortunately that's easy to test for.
 Just remove your log device, or cache device, and see if that helps.  But I
 doubt it.

I doubt it as well, but it's worth a try.

 
 
  Anyways, here's my setup:
  OpenSolaris 1.34 dev
  C2D with 4GB ram
  4x 1,5TB WD SATA drives and 1x Corsair 32GB SSD as cache
 
 Not knowing much, I'm going to suspect your RAM.  4G doesn't sound like much
 to me.  How large is your filesystem?  I think the number of files is
 probably more relevant than the total number of Gb.
 
 
  Doesn't seem to matter if I copy files locally on the computer or if I
  use CIFS, still getting the same degradation in speed. Last night I
  left my workstation copying files to/from the server for about 8 hours
  and you could see the performance dropping from about 28MB/s down to
  under 10MB/s after a couple of hours.
 
 What are you using to measure the performance?
Well, I'm comparing the transfer speed from my WinXP computer and my Mac and 
seeing how much it changes over five hours of constant copying (copying large 
ISO files of about 20GB). Since the system is used as a home NAS there won't be 
lots of random I/O, but rather me copying stuff over the network. Any 
suggestions on how to do a proper performance test are welcome.

 
 Is it read, or write?  Do you have compression or
 dedupe enabled?

Both read and write, but mostly read.

No compression or dedup.

 Please send your /etc/release file.  At least the
 relevant parts.
   OpenSolaris Development snv_134 X86
 Assembled 01 March 2010

 Also, please send your zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  c3d0s0    ONLINE       0     0     0

errors: No known data errors

  pool: s1
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
s1          ONLINE       0     0     0
  raidz1-0  ONLINE       0     0     0
    c4t0d0  ONLINE       0     0     0
    c4t1d0  ONLINE       0     0     0
    c4t2d0  ONLINE       0     0     0
    c4t3d0  ONLINE       0     0     0
cache
  c4t4d0    ONLINE       0     0     0

errors: No known data errors
 


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Andreas Höschler

 • I would like to remove the two SSDs as log devices from the pool and
 instead add them as a separate pool for sole use by the database to
 see how this enhances performance. I could certainly do
 
 zpool detach tank c1t7d0
 
 to remove one disk from the log mirror. But how can I get back the
 second SSD?

If you're running solaris, sorry, you can't remove the log device.  You
better keep your log mirrored until you can plan for destroying and
recreating the pool.  Actually, in your example, you don't have a mirror of
logs.  You have two separate logs.  This is fine for opensolaris (zpool
>= 19), but not solaris (presently up to zpool 15).  If this is solaris, and
*either* one of those SSD's fails, then you lose your pool.

If you're running opensolaris, man zpool and look for zpool remove

Is the database running locally on the machine?  Or at the other end of
something like nfs?  You should have better performance using your present
config than just about any other config ... By enabling the log devices,
such as you've done, you're dedicating the SSD's for sync writes.  And
that's what the database is probably doing.  This config should be *better*
than dedicating the SSD's as their own pool.  Because with the dedicated log
device on a stripe of mirrors, you're allowing the spindle disks to do what
they're good at (sequential blocks) and allowing the SSD's to do what
they're good at (low latency IOPS).

If you're running zpool 19 or greater (you can check with zpool upgrade) it
should be safe to run with the log device un-mirrored.  In which case, you
might think about using one SSD as log, and one SSD as cache.  That might
help.
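For what it's worth, on a pool version >= 19 system that split would look
roughly like this (a sketch, using the device names from the original mail):

zpool remove tank c1t7d0        # remove one SSD from the log
zpool add tank cache c1t7d0     # re-add it as an L2ARC cache device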

You can verify the behavior of your database, if you run zpool iostat and
then do some database stuff, you should see the writes increasing on the log
SSD's.  If you don't see that, then your DB is not doing sync writes (I bet
it is).  
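For example (the -v flag gives the per-vdev view, so the log devices show up as
their own rows; the interval is arbitrary):

zpool iostat -v tank 5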

You don't benefit from dedicated log device unless you're doing sync writes.



Re: [zfs-discuss] ZFS getting slower over time

2010-04-05 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Marcus Wilhelmsson
   pool: s1
  state: ONLINE
  scrub: none requested
 config:
 
 NAME        STATE     READ WRITE CKSUM
 s1          ONLINE       0     0     0
   raidz1-0  ONLINE       0     0     0
     c4t0d0  ONLINE       0     0     0
     c4t1d0  ONLINE       0     0     0
     c4t2d0  ONLINE       0     0     0
     c4t3d0  ONLINE       0     0     0
 cache
   c4t4d0    ONLINE       0     0     0

With this configuration, you should be pretty good at reading large files,
or repeat-reading random files that you've recently read.  Your write
performance could be lower.  And you would have really poor sync write
performance.

If reading a large sequential file, you should be able to max out Gb
Ethernet.  But the system you're receiving the file to can only go as fast
as a single disk, unless you've got something like a hardware raid
controller.  Still, you should be able to get approx 60 Mbytes/sec across
CIFS, where the bottleneck is your laptop hard drive, where you're receiving
the file.

The test I would recommend would be: "time cp
/Volumes/somemount/somefile.iso ." on a Mac, and, in Windows running cygwin,
"time cp /cygdrive/someletter/somefile.iso ."

That should be an apples-to-apples test, which would really give you some
number you know is accurate.



[zfs-discuss] no hot spare activation?

2010-04-05 Thread Garrett D'Amore
While testing a zpool with a different storage adapter using my blkdev 
device, I did a test which made a disk unavailable -- all attempts to 
read from it report EIO.


I expected my configuration (which is a 3 disk test, with 2 disks in a 
RAIDZ and a hot spare) to work where the hot spare would automatically 
be activated.  But I'm finding that ZFS does not behave this way -- if 
only some I/Os are failed, then the hot spare is failed, but if ZFS 
decides that the label is gone, it makes no attempt to recruit a hot spare.


I had added FMA notification to my blkdev driver - it will post 
device.no_response or device.invalid_state ereports (per the 
ddi_fm_ereport_post() man page) in certain failure scenarios.


I *suspect* the problem is in the FMA notification for zfs-retire, where 
the event is not being interpreted in a way that ZFS retire can figure 
out that the drive is toasted.


Of course, this is just an educated guess on my part.  I'm no ZFS nor 
FMA expert here.


Am I missing something here?  Under what conditions can I expect hot 
spares to be recruited?
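(For what it's worth, I can of course recruit the spare by hand with something
like the following, using the device names from the status output below:

zpool replace testpool c2t3d1 c2t3d2

but the point of the question is the automatic activation.)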


My zpool status showing the results is below.

- Garrett


 pfexec zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
testpool    DEGRADED     0     0     0
  raidz1-0  DEGRADED     0     0     0
    c2t3d0  ONLINE       0     0     0
    c2t3d1  UNAVAIL      9   132     0  experienced I/O failures
spares
  c2t3d2    AVAIL

errors: No known data errors



Re: [zfs-discuss] ZFS getting slower over time

2010-04-05 Thread Marcus Wilhelmsson
Alright, I've made the benchmarks and there isn't a difference worth mentioning, 
except that I only get about 30MB/s (to my Mac, which has an SSD as its system 
disk). I've also tried copying to a RAM disk, with slightly better results.

Well, now that I've restarted the server I probably won't see the 5 sec dips 
until later tonight.

Thanks for the help, even though it doesn't seem to make a difference. I'll try 
to get a screen capture of the dips in performance next time they occur.


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-05 Thread Kyle McDonald
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.

 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.
 
 Actually, if there is a fdisk partition and/or disklabel on a drive when it
 arrives, I'm pretty sure that's irrelevant.  Because when I first connect a
 new drive to the HBA, of course the HBA has to sign and initialize the drive
 at a lower level than what the OS normally sees.  So unless I do some sort
 of special operation to tell the HBA to preserve/import a foreign disk, the
 HBA will make the disk blank before the OS sees it anyway.

   
That may be true. Though these days they may be spec'ing the drives to
the manufacturers at an even lower level.

So does your HBA have newer firmware now than it did when the first disk
was connected?
Maybe it's the HBA that is handling the new disks differently now, than
it did when the first one was plugged in?

Can you down-rev the HBA firmware? Do you have another HBA that might still
have the older rev you could test it on?

  -Kyle




Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Andreas Höschler

Hi Edward,

thanks a lot for your detailed response!


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Andreas Höschler

• I would like to remove the two SSDs as log devices from the pool and
instead add them as a separate pool for sole use by the database to
see how this enhances performance. I could certainly do

zpool detach tank c1t7d0

to remove one disk from the log mirror. But how can I get back the
second SSD?


If you're running solaris, sorry, you can't remove the log device.  You
better keep your log mirrored until you can plan for destroying and
recreating the pool.  Actually, in your example, you don't have a mirror of
logs.  You have two separate logs.  This is fine for opensolaris (zpool
>= 19), but not solaris (presently up to zpool 15).  If this is solaris, and
*either* one of those SSD's fails, then you lose your pool.


I run Solaris 10 (not Open Solaris)!

You say the log mirror

  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
...
logs
  c1t6d0    ONLINE       0     0     0
  c1t7d0    ONLINE       0     0     0

does not do me any good (redundancy-wise)!? Shouldn't I detach 
the second drive then and try to use it for something else, maybe 
another machine?


I understand it is very dangerous to use SSDs for logs then (no 
redundancy)!?



If you're running opensolaris, man zpool and look for zpool remove

Is the database running locally on the machine?


Yes!


 Or at the other end of
something like nfs?  You should have better performance using your present
config than just about any other config ... By enabling the log devices,
such as you've done, you're dedicating the SSD's for sync writes.  And
that's what the database is probably doing.  This config should be *better*
than dedicating the SSD's as their own pool.  Because with the dedicated log
device on a stripe of mirrors, you're allowing the spindle disks to do what
they're good at (sequential blocks) and allowing the SSD's to do what
they're good at (low latency IOPS).


OK!

I actually have two machines here: one production machine (an X4240 with 
16 disks, no SSDs) with performance issues, and a development 
machine (an X4140 with 6 disks and two SSDs) configured as shown in my 
previous mail. The question for me is how to improve the performance of 
the production machine and whether buying SSDs for this machine is 
worth the investment.


zpool iostat on the development machine with the SSDs gives me

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        114G   164G      0      4  13.5K  36.0K
tank         164G   392G      3    131   444K  10.8M
----------  -----  -----  -----  -----  -----  -----

When I do that on the production machine without SSDs I get

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       98.3G  37.7G      0      7  32.5K  36.9K
tank         480G   336G     16     53  1.69M  2.05M
----------  -----  -----  -----  -----  -----  -----

It is interesting to note that the write bandwidth on the SSD machine 
is 5 times higher. I take this as an indicator that the SSDs have some 
effect.


I am still wondering what your "if one SSD fails you lose your pool" 
means for me. Would you recommend detaching one of the SSDs in the 
development machine and adding it to the production machine with


zpool add tank log c1t15d0

?? And how safe (reliable) is it to use SSDs for this? I mean, when do I 
have to expect the SSD to fail and thus ruin the pool!?


Thanks a lot,

 Andreas



Re: [zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-05 Thread Torrey McMahon
Not true. There are different ways that a storage array, and its 
controllers, connect to the host-visible front-end ports, which might be 
confusing the author, but I/O isn't duplicated as he suggests.


On 4/4/2010 9:55 PM, Brad wrote:

I had always thought that with mpxio, it load-balances IO request across your 
storage ports but this article 
http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ has 
got me thinking its not true.

The available bandwidth is 2 or 4Gb/s (200 or 400MB/s – FC frames are 10 bytes long 
-) per port. As load balancing software (Powerpath, MPXIO, DMP, etc.) are most of the 
times used both for redundancy and load balancing, I/Os coming from a host can take 
advantage of an aggregated bandwidth of two ports. However, reads can use only one path, 
but writes are duplicated, i.e. a host write ends up as one write on each host port.

Is this true?



[zfs-discuss] Why does ARC grow above hard limit?

2010-04-05 Thread Mike Z
I would appreciate it if somebody could clarify a few points.

I am doing some random WRITE (100% writes, 100% random) testing and observe 
that the ARC grows way beyond the hard limit during the test. The hard limit is 
set to 512 MB via /etc/system, yet I see the size going up to 1 GB - how is that 
happening?

mdb's ::memstat reports 1.5 GB used - does this include the ARC as well or is it 
separate?

I see on the backend only reads (205 MB/s) and almost no writes (1.1 MB/s) - any 
idea what is being read?
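(For reference, the same numbers can be watched live during the test with,
e.g.:

kstat -p zfs:0:arcstats:size zfs:0:arcstats:c 10
echo ::memstat | mdb -k

the first prints the current ARC size and target every 10 seconds, the second
the kernel page breakdown shown below.)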

--- BEFORE TEST 
# ~/bin/arc_summary.pl

System Memory:
 Physical RAM:  12270 MB
 Free Memory :  7108 MB
 LotsFree:  191 MB

ZFS Tunables (/etc/system):
 set zfs:zfs_prefetch_disable = 1
 set zfs:zfs_arc_max = 0x20000000
 set zfs:zfs_arc_min = 0x10000000

ARC Size:
 Current Size: 136 MB (arcsize)
 Target Size (Adaptive):   512 MB (c)
 Min Size (Hard Limit):256 MB (zfs_arc_min)
 Max Size (Hard Limit):512 MB (zfs_arc_max)
...


> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     800895              3128   25%
ZFS File Data              394450              1540   13%
Anon                       106813               417    3%
Exec and libs                4178                16    0%
Page cache                  14333                55    0%
Free (cachelist)            22996                89    1%
Free (freelist)           1797511              7021   57%

Total                     3141176             12270
Physical                  3141175             12270


--- DURING THE TEST
# ~/bin/arc_summary.pl 
System Memory:
 Physical RAM:  12270 MB
 Free Memory :  6687 MB
 LotsFree:  191 MB

ZFS Tunables (/etc/system):
 set zfs:zfs_prefetch_disable = 1
 set zfs:zfs_arc_max = 0x20000000
 set zfs:zfs_arc_min = 0x10000000

ARC Size:
 Current Size: 1336 MB (arcsize)
 Target Size (Adaptive):   512 MB (c)
 Min Size (Hard Limit):256 MB (zfs_arc_min)
 Max Size (Hard Limit):512 MB (zfs_arc_max)

ARC Size Breakdown:
 Most Recently Used Cache Size:          87%    446 MB (p)
 Most Frequently Used Cache Size:        12%    65 MB (c-p)

ARC Efficency:
 Cache Access Total:             51681761
 Cache Hit Ratio:      52%       27056475       [Defined State for buffer]
 Cache Miss Ratio:     47%       24625286       [Undefined State for Buffer]
 REAL Hit Ratio:       52%       27056475       [MRU/MFU Hits Only]

 Data Demand   Efficiency:    35%
 Data Prefetch Efficiency:    DISABLED (zfs_prefetch_disable)

CACHE HITS BY CACHE LIST:
  Anon:                       --%        Counter Rolled.
  Most Recently Used:         13%        3627289 (mru)        [ Return Customer ]
  Most Frequently Used:       86%        23429186 (mfu)       [ Frequent Customer ]
  Most Recently Used Ghost:   17%        4657584 (mru_ghost)  [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost: 32%        8712009 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
  Demand Data:                30%        8308866
  Prefetch Data:               0%        0
  Demand Metadata:            69%        18747609
  Prefetch Metadata:           0%        0
CACHE MISSES BY DATA TYPE:
  Demand Data:                61%        15113029
  Prefetch Data:               0%        0
  Demand Metadata:            38%        9511898
  Prefetch Metadata:           0%        359


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Khyron
Response below...

2010/4/5 Andreas Höschler ahoe...@smartsoft.de

 Hi Edward,

 thanks a lot for your detailed response!


  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Andreas Höschler

 • I would like to remove the two SSDs as log devices from the pool and
  instead add them as a separate pool for sole use by the database to
  see how this enhances performance. I could certainly do

zpool detach tank c1t7d0

 to remove one disk from the log mirror. But how can I get back the
 second SSD?


  If you're running solaris, sorry, you can't remove the log device.  You
  better keep your log mirrored until you can plan for destroying and
  recreating the pool.  Actually, in your example, you don't have a mirror of
  logs.  You have two separate logs.  This is fine for opensolaris (zpool
  >= 19), but not solaris (presently up to zpool 15).  If this is solaris, and
  *either* one of those SSD's fails, then you lose your pool.


 I run Solaris 10 (not Open Solaris)!

 You say the log mirror


  pool: tank
  state: ONLINE
  scrub: none requested
 config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
...
logs
  c1t6d0    ONLINE       0     0     0
  c1t7d0    ONLINE       0     0     0

 does not do me anything good (redundancy-wise)!? Shouldn't I dettach the
 second drive then and try to use it for something else, may be another
 machine?


No, he did *not* say that a mirrored SLOG has no benefit, redundancy-wise.
He said that YOU do *not* have a mirrored SLOG.  You have 2 SLOG devices
which are striped.  And if this machine is running Solaris 10, then you cannot
remove a log device, because those updates have not made their way into
Solaris 10 yet.  You need pool version >= 19 to remove log devices, and S10
does not currently have patches to ZFS to get to a pool version >= 19.

If your SLOG above were mirrored, you'd have "mirror" under "logs".  And you
probably would have "log", not "logs" - notice the "s" at the end meaning plural,
meaning multiple independent log devices, not a mirrored pair of logs which
would effectively look like 1 device.

-- 
You can choose your friends, you can choose the deals. - Equity Private

If Linux is faster, it's a Solaris bug. - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva


Re: [zfs-discuss] no hot spare activation?

2010-04-05 Thread Eric Schrock

On Apr 5, 2010, at 11:43 AM, Garrett D'Amore wrote:  
 
 I see ereport.fs.zfs.io_failure, and ereport.fs.zfs.probe_failure.  Also, 
 ereport.io.service.lost and ereport.io.device.inval_state.  There is indeed a 
 fault.fs.zfs.device in the list as well.

The ereports are not interesting, only the fault.  In FMA, ereports contribute 
to diagnosis, but faults are the only thing that are presented to the user and 
retire agents.

 Everything seems to be correct *except* that ZFS isn't automatically doing 
 the replace operation with the hot spare.
 
 It feels to me like this is possibly a ZFS bug --- perhaps ZFS is expecting a 
 specific set of FMA faults that only sd delivers?  (Recall this is with a 
 different target device.)

Yes, it may be a bug.  You will have to step through the zfs retire agent to 
see what goes wrong when it receives the list.suspect event.  This code path is 
tested many, many times every day, so it's not as obvious as "this doesn't 
work."

The ZFS retire agent subscribes only to ZFS faults.  The underlying driver or 
other telemetry has no bearing on the diagnosis or associated action.
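A quick way to see what the diagnosis actually emitted for the failed vdev is
to compare the raw telemetry with the diagnosed faults, e.g.:

fmdump -e     # ereports fed into the diagnosis engines
fmdump -V     # list.suspect events / faults they produced
fmadm faulty  # faults currently considered active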

- Eric


Re: [zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-05 Thread Bob Friesenhahn

On Sun, 4 Apr 2010, Brad wrote:

I had always thought that with mpxio, it load-balances IO request 
across your storage ports but this article 
http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ 
has got me thinking its not true.


The available bandwidth is 2 or 4Gb/s (200 or 400MB/s – FC frames 
are 10 bytes long -) per port. As load balancing software 
(Powerpath, MPXIO, DMP, etc.) are most of the times used both for 
redundancy and load balancing, I/Os coming from a host can take 
advantage of an aggregated bandwidth of two ports. However, reads 
can use only one path, but writes are duplicated, i.e. a host write 
ends up as one write on each host port. 


Is this true?


This text seems strange and wrong since duplicating writes would 
result in duplicate writes to disks, which could cause corruption if 
the ordering was not perfectly preserved.  Depending on the storage 
array capabilities, MPXIO could use different strategies.  A common 
strategy is active/standby on a per-LUN level.  Even with 
active/standby, effective load sharing is possible if the storage 
array can be told to assign preference between a LUN and a port. 
That is what I have done with my own setup.  1/2 the LUNs have a 
preference for each port so that with all paths functional, the FC 
traffic is similar for each FC link.
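On Solaris, the per-LUN path states and the configured load-balance policy can
be inspected with mpathadm, for example (the logical-unit name below is just a
placeholder):

mpathadm list lu
mpathadm show lu /dev/rdsk/cXtWWNd0s2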


--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Andreas Höschler

Hi Khyron,

No, he did *not* say that a mirrored SLOG has no benefit, redundancy-wise.
He said that YOU do *not* have a mirrored SLOG.  You have 2 SLOG devices
which are striped.  And if this machine is running Solaris 10, then you cannot
remove a log device, because those updates have not made their way into
Solaris 10 yet.  You need pool version >= 19 to remove log devices, and S10
does not currently have patches to ZFS to get to a pool version >= 19.

If your SLOG above were mirrored, you'd have "mirror" under "logs".  And you
probably would have "log", not "logs" - notice the "s" at the end meaning plural,
meaning multiple independent log devices, not a mirrored pair of logs which
would effectively look like 1 device.


Thanks for the clarification! This is very annoying. My intention was to 
create a log mirror. I used


zpool add tank log c1t6d0 c1t7d0

and this was obviously false. Would

zpool add tank mirror log c1t6d0 c1t7d0

have done what I intended to do? If so, it seems I have to tear down the 
tank pool and recreate it from scratch!? Can I simply use


zpool destroy -f tank

to do so?
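(In other words, something like the following rebuild, assuming the same disk
layout as shown earlier in the thread - a sketch, please correct me if the
syntax is off:

zpool destroy -f tank
zpool create tank mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0 \
    log mirror c1t6d0 c1t7d0
)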

Thanks,

 Andreas



Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Bob Friesenhahn

On Mon, 5 Apr 2010, Peter Schuller wrote:


For desktop use, and presumably rapidly changing non-desktop uses, I
find the ARC cache pretty annoying in its behavior. For example this
morning I had to hit my launch-terminal key perhaps 50 times (roughly)
before it would start completing without disk I/O. There are plenty of
other examples as well, such as /var/db/pkg not being pulled
aggressively into cache such that pkg_* operations (this is on
FreeBSD) are slower than they should (I have to run pkg_info some
number of times before *it* will complete without disk I/O too).


It sounds like you are complaining about how FreeBSD has implemented 
zfs in the system rather than about zfs in general.  These problems 
don't occur under Solaris.  Zfs and the kernel need to agree on how to 
allocate/free memory, and it seems that Solaris is more advanced than 
FreeBSD in this area.  It is my understanding that FreeBSD offers 
special zfs tunables to adjust zfs memory usage.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
 It sounds like you are complaining about how FreeBSD has implemented zfs in
 the system rather than about zfs in general.  These problems don't occur
 under Solaris.  Zfs and the kernel need to agree on how to allocate/free
 memory, and it seems that Solaris is more advanced than FreeBSD in this
 area.  It is my understanding that FreeBSD offers special zfs tunables to
 adjust zfs memory usage.

It may be FreeBSD specific, but note that I am not talking about the
amount of memory dedicated to the ARC and how it balances with free
memory on the system. I am talking about eviction policy. I could be
wrong, but I didn't think the ZFS port made significant changes there.

And note that part of the *point* of the ARC (at least according to
the original paper, though it was a while since I read it), as opposed
to a pure LRU, is to do some weighting on frequency of access, which
is exactly consistent with what I'm observing (very quick eviction
and/or lack of insertion of data, particularly in the face of unrelated
long-term I/O having happened in the background). It would likely also
be the desired behavior for longer-running homogeneous disk access
patterns where optimal use of the cache over a long period may be more
important than immediately reacting to a changing access pattern. So
it's not like there is no reason to believe this can be about ARC
policy.

Why would this *not* occur on Solaris? It seems to me that it would
imply the ARC was broken on Solaris, since it is not *supposed* to be
a pure LRU by design. Again, there may very well be a FreeBSD specific
issue here that is altering the behavior, and maybe the extremity of
it that I am reporting is not supposed to be happening, but I believe
the issue is more involved than what you're implying in your response.

-- 
/ Peter Schuller


[zfs-discuss] EON ZFS Storage 0.60.0 based on snv 130, Sun-set release!

2010-04-05 Thread Andre Lue
Embedded Operating system/Networking (EON), the RAM-based live ZFS NAS appliance, is 
released on Genunix! This release marks the end of SXCE releases and Sun 
Microsystems as we know it! It is dubbed the Sun-set release! Many thanks to Al 
at Genunix.org for download hosting and serving the OpenSolaris community.

EON Deduplication ZFS storage is available in 32 and 64-bit, CIFS and Samba 
versions:
EON 64-bit x86 CIFS ISO image version 0.60.0 based on snv_130
* eon-0.600-130-64-cifs.iso
* MD5: 55c5837985f282f9272f5275163f7d7b
* Size: ~93Mb
* Released: Monday 05-April-2010

EON 64-bit x86 Samba ISO image version 0.60.0 based on snv_130
* eon-0.600-130-64-smb.iso
* MD5: bf095f2187c29fb543285b72266c0295
* Size: ~106Mb
* Released: Monday 05-April-2010

EON 32-bit x86 CIFS ISO image version 0.60.0 based on snv_130
* eon-0.600-130-32-cifs.iso
* MD5: e2b312feefbfb14792c0d190e7ff69cf
* Size: ~59Mb
* Released: Monday 05-April-2010

EON 32-bit x86 Samba ISO image version 0.60.0 based on snv_130
* eon-0.600-130-32-smb.iso
* MD5: bcf6dc76bc9a22cff1431da20a5c56e2
* Size: ~73Mb
* Released: Monday 05-April-2010

EON 64-bit x86 CIFS ISO image version 0.60.0 based on snv_130 (NO HTTPD)
* eon-0.600-130-64-cifs-min.iso
* MD5: 78b0bb116c0e32a48c473ce1b94e604f
* Size: ~87Mb
* Released: Monday 05-April-2010

EON 64-bit x86 Samba ISO image version 0.60.0 based on snv_130 (NO HTTPD)
* eon-0.600-130-64-smb-min.iso
* MD5: e74732c41e4b3a9a06f52779bc9f8352
* Size: ~101Mb
* Released: Monday 05-April-2010

New/Changes/Fixes:
- Active Directory integration problem resolved
- Hotplug errors at boot are being worked on and are safe to ignore.
- Updated /mnt/eon0/.exec with new service configuration additions (light, 
nginx, afpd, and more ...).
- Updated ZFS, NFS v3 performance tuning in /etc/system
- Added megasys driver.
- EON rebooting at grub (since snv_122) in ESXi, Fusion and various versions of 
VMware Workstation. This is related to bug 6820576. Workaround: at grub, press 'e' 
and add '-B disable-pcieb=true' to the end of the kernel line.

http://eonstorage.blogspot.com/
http://sites.google.com/site/eonstorage/


[zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?

2010-04-05 Thread Kyle McDonald
I've seen the Nexenta and EON webpages, but I'm not looking to build my own.

Is there anything out there I can just buy?

 -Kyle



Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?

2010-04-05 Thread Ahmed Kamal
Install Nexenta on a Dell PowerEdge?
Or one of these: http://www.pogolinux.com/products/storage_director

On Mon, Apr 5, 2010 at 9:48 PM, Kyle McDonald kmcdon...@egenera.com wrote:

 I've seen the Nexenta and EON webpages, but I'm not looking to build my
 own.

 Is there anything out there I can just buy?

  -Kyle



Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?

2010-04-05 Thread Volker A. Brandt
Kyle McDonald writes:
 I've seen the Nexenta and EON webpages, but I'm not looking to build my own.

 Is there anything out there I can just buy?

In Germany, someone sells preconfigured hardware based on Nexenta:

http://www.thomas-krenn.com/de/storage-loesungen/storage-systeme/nexentastor/nexentastor-sc846-unified-storage.html

I have no experience with them but I wish them success. :-)


Regards -- Volker
-- 

Volker A. Brandt  Consulting and Support for Sun Solaris
Brandt  Brandt Computer GmbH   WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim Email: v...@bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513  Schuhgröße: 45
Geschäftsführer: Rainer J. H. Brandt und Volker A. Brandt


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Bob Friesenhahn

On Mon, 5 Apr 2010, Peter Schuller wrote:


It may be FreeBSD specific, but note that I a not talking about the
amount of memory dedicated to the ARC and how it balances with free
memory on the system. I am talking about eviction policy. I could be
wrong but I didn't think ZFS port made significant changes there.


The ARC is designed to use as much memory as is available up to a 
limit.  If the kernel allocator needs memory and there is none 
available, then the allocator requests memory back from the zfs ARC. 
Note that some systems have multiple memory allocators.  For example, 
there may be a memory allocator for the network stack, and/or for 
a filesystem.


The FreeBSD kernel is not the same as Solaris.  While Solaris uses a 
common allocator between most of the kernel and zfs, FreeBSD may use 
different allocators, which are not able to share memory.  The space 
available for zfs might be pre-allocated.  I assume that you have 
already read the FreeBSD ZFS tuning guide 
(http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem 
section in the handbook 
(http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made 
sure that your system is tuned appropriately.



Why would this *not* occurr on Solaris? It seems to me that it would
imply the ARC was broken on Solaris, since it is not *supposed* to be
a pure LRU by design. Again, there may very well be a FreeBSD specific
issue here that is altering the behavior, and maybe the extremity of
it that I am reporting is not supposed to be happening, but I believe
the issue is more involved than what you're implying in your response.


There have been a lot of eyeballs looking at how zfs does its caching, 
and a ton of benchmarks (mostly focusing on server throughput) to 
verify the design.  While there can certainly be zfs shortcomings (I 
have found several), these are few and far between.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?

2010-04-05 Thread Roy Sigurd Karlsbakk
- Kyle McDonald kmcdon...@egenera.com  skrev:

 I've seen the Nexenta and EON webpages, but I'm not looking to build
 my own.

 Is there anything out there I can just buy?

I've set up a few systems with Supermicro hardware - works well and doesn't cost 
a whole lot.

roy

--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum is presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and 
relevant synonyms exist in Norwegian.


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
 The ARC is designed to use as much memory as is available up to a limit.  If
 the kernel allocator needs memory and there is none available, then the
 allocator requests memory back from the zfs ARC. Note that some systems have
 multiple memory allocators.  For example, there may be a memory allocator
 for the network stack, and/or for a filesystem.

Yes, but again I am concerned with what the ARC chooses to cache and
for how long, not how the ARC balances memory with other parts of the
kernel. At least, none of my observations lead me to believe the
latter is the problem here.

 might be pre-allocated.  I assume that you have already read the FreeBSD ZFS
 tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem
 section in the handbook
 (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made sure
 that your system is tuned appropriately.

Yes, I have been tweaking and fiddling and reading off and on since
ZFS was originally added to CURRENT.

This is not about tuning in that sense. The fact that the little data
necessary to start an 'urxvt' instance does not get cached for at
least 1-2 seconds on an otherwise mostly idle system is either the
result of cache policy, an implementation bug (freebsd or otherwise),
or a matter of an *extremely* small cache size. I have observed this
behavior for a very long time across versions of both ZFS and FreeBSD,
and with different forms of arc sizing tweaks.

It's entirely possible there are FreeBSD issues preventing the ARC from
sizing itself appropriately. What I am saying, though, is that all
indications are that data is not being selected for caching at all, or
else is evicted extremely quickly, unless sufficient access frequency has
been accumulated to, presumably, make the ARC decide to cache the
data.

This is entirely what I would expect from a caching policy that tries
to adapt to long-term access patterns and avoid pre-mature cache
eviction by looking at frequency of access. I don't see what it is
that is so outlandish about my query. These are fundamental ways in
which caches of different types behave, and there is a legitimate
reason to not use the same cache eviction policy under all possible
workloads. The behavior I am seeing is consistent with a caching
policy that tries too hard (for my particular use case) to avoid
eviction in the face of short-term changes in access pattern.

 There have been a lot of eyeballs looking at how zfs does its caching, and a
 ton of benchmarks (mostly focusing on server throughput) to verify the
 design.  While there can certainly be zfs shortcomings (I have found
 several) these are few and far between.

That's a very general statement. I am talking about specifics here.
For example, you can have mountains of evidence that shows that a
plain LRU is optimal (under some conditions). That doesn't change
the fact that if I want to avoid a sequential scan of a huge data set
to completely evict everything in the cache, I cannot use a plain LRU.

In this case I'm looking for the reverse; i.e., increasing the
importance of recency, because my workload is such that it would
be more optimal than the behavior I am observing. Benchmarks are
irrelevant except insofar as they show that my problem is not with the
caching policy, since I am trying to address an empirically observed
behavior.

I *will* try to look at how the ARC sizes itself, as I'm unclear on
several things in the way memory is being reported by FreeBSD, but as
far as I can tell these are different issues. Sure, a bigger ARC might
hide the behavior I happen to see; but I want the cache to behave in a
way where I do not need gigabytes of extra ARC size to lure it into
caching the data necessary for 'urxvt' without having to start it 50
times in a row to accumulate statistics.

-- 
/ Peter Schuller


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Richard Elling
On Apr 5, 2010, at 2:23 PM, Peter Schuller wrote:
 That's a very general statement. I am talking about specifics here.
 For example, you can have mountains of evidence that shows that a
 plain LRU is optimal (under some conditions). That doesn't change
 the fact that if I want to avoid a sequential scan of a huge data set
 to completely evict everything in the cache, I cannot use a plain LRU.

In simple terms, the ARC is divided into a MRU and MFU side.
target size (c) = target MRU size (p) + target MFU size (c-p)

On Solaris, to get from the MRU to the MFU side, the block must be 
read at least once in 62.5 milliseconds.  For pure read-once workloads, 
the data won't move to the MFU side and the ARC will behave exactly like an 
(adaptable) MRU cache.
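The current split can be read straight from the arcstats kstat, e.g.:

kstat -p zfs:0:arcstats:p zfs:0:arcstats:c zfs:0:arcstats:size

where p is the target MRU size and c-p the target MFU size, in bytes.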

 In this case I'm looking for the reverse; i.e., increasing the
 importance of 'recenticity' because my workload is such that it would
 be more optimal than the behavior I am observing. Benchmarks are
 irrelevant except insofar as they show that my problem is not with the
 caching policy, since I am trying to address an empirically observed
 behavior.
 
 I *will* try to look at how the ARC sizes itself, as I'm unclear on
 several things in the way memory is being reported by FreeBSD, but as
 far as I can tell these are different issues. Sure, a bigger ARC might
 hide the behavior I happen to see; but I want the cache to behave in a
 way where I do not need gigabytes of extra ARC size to lure it into
 caching the data necessary for 'urxvt' without having to start it 50
 times in a row to accumulate statistics.

I'm not convinced you have attributed the observation to the ARC
behaviour.  Do you have dtrace (or other) data to explain what process
is causing the physical I/Os?
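For example, something along these lines, run while hitting the launch key,
would show which processes and files the physical I/O is attributed to (a
sketch using the standard io provider):

dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = count(); }'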
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 







Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Peter Schuller
 In simple terms, the ARC is divided into a MRU and MFU side.
        target size (c) = target MRU size (p) + target MFU size (c-p)

 On Solaris, to get from the MRU to the MFU side, the block must be
 read at least once in 62.5 milliseconds.  For pure read-once workloads,
 the data won't to the MFU side and the ARC will behave exactly like an
 (adaptable) MRU cache.

Ok. That differs significantly from my understanding, though in
retrospect I should have realized it, given that the arc stats contain only
references to mru and mfu... I was previously under the impression
that the ZFS ARC had an LRU-ish side to complement the MFU side.
MRU+MFU changes things.

I will have to look into it in better detail to understand the
consequences. Is there a paper that describes the ARC as it is
implemented in ZFS (since it clearly diverges from the IBM ARC)?

 I *will* try to look at how the ARC sizes itself, as I'm unclear on
 several things in the way memory is being reported by FreeBSD, but as

For what it's worth, I confirmed that the ARC was too small and that
there are clearly remaining issues with the interaction between the
ARC and the rest of the FreeBSD kernel.  (I wasn't sure before, but I
confirmed I was looking at the right number.) I'll try to monitor more
carefully and see if I can figure out when the ARC shrinks and why it
doesn't grow back. Informally, my observations have always been that
things behave great for a while after boot, but degenerate over time.

In this case it was sitting at its minimum size, which was 214M. I
realize this is far below what is recommended or even designed for,
but it is clearly caching *something*, and I clearly *could* make it
cache urxvt+deps by re-running it several tens of times in rapid
succession.
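(For reference, the sizing knobs I have been using on FreeBSD are the loader
tunables in /boot/loader.conf, e.g. - the values here are purely illustrative:

vfs.zfs.arc_min="512M"
vfs.zfs.arc_max="4096M"

raising arc_min at least puts a floor under the shrinking, though it doesn't
explain it.)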

 I'm not convinced you have attributed the observation to the ARC
 behaviour.  Do you have dtrace (or other) data to explain what process
 is causing the physical I/Os?

In the urxvt case, I am basing my claim on informal observations.
I.e., hit terminal launch key, wait for disks to rattle, get my
terminal. Repeat. Only by repeating it very many times in very rapid
succession am I able to coerce it to be cached such that I can
immediately get my terminal. And what I mean by that is that it keeps
necessitating disk I/O for a long time, even on rapid successive
invocations. But once I have repeated it enough times it seems to
finally enter the cache.

(No dtrace unfortunately. I confess to not having learned dtrace yet,
in spite of thinking it's massively cool.)

However, I will of course accept that given the minimal ARC size at
the time I am moving completely away from the designed-for use-case.
And if that is responsible, it is of course my own fault. Given
MRU+MFU I'll have to back off with my claims. Under the (incorrect)
assumption of LRU+MFU I felt the behavior was unexpected, even with a
small cache size. Given MRU+MFU and without knowing further details
right now, I accept that the ARC may fundamentally need a bigger cache
size in relation to the working set in order to be effective in the
way I am using it here. I was basing my expectations on LRU-style
behavior.

Thanks!

-- 
/ Peter Schuller


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Bill Sommerfeld

On 04/05/10 15:24, Peter Schuller wrote:

In the urxvt case, I am basing my claim on informal observations.
I.e., hit terminal launch key, wait for disks to rattle, get my
terminal. Repeat. Only by repeating it very many times in very rapid
succession am I able to coerce it to be cached such that I can
immediately get my terminal. And what I mean by that is that it keeps
necessitating disk I/O for a long time, even on rapid successive
invocations. But once I have repeated it enough times it seems to
finally enter the cache.


Are you sure you're not seeing unrelated disk update activity like atime 
updates, mtime updates on pseudo-terminals, etc., ?
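(Those are easy to rule out on the ZFS side, e.g.:

zfs get atime tank/home     # dataset name is just an example
zfs set atime=off tank/home

for the filesystem holding the binaries in question.)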


I'd want to start looking more closely at I/O traces (dtrace can be very 
helpful here) before blaming any specific system component for the 
unexpected I/O.


- Bill


Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Richard Elling
On Apr 5, 2010, at 3:24 PM, Peter Schuller wrote:
 I will have to look into it in better detail to understand the
 consequences. Is there a paper that describes the ARC as it is
 implemented in ZFS (since it clearly diverges from the IBM ARC)?

There are various blogs, but perhaps the best documentation is in 
the comments starting at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#25
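The counters that code maintains can also be watched live, which is often
quicker than reading the source; on OpenSolaris something like the following
works (FreeBSD exposes roughly the same set under sysctl
kstat.zfs.misc.arcstats):

  # dump all ARC statistics once
  kstat -n arcstats

  # or sample selected fields every 5 seconds
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits 5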
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Andreas Höschler

 Thanks for the clarification! This is very annoying. My intent was to
 create a log mirror. I used
 
   zpool add tank log c1t6d0 c1t7d0
 
 and this was obviously false. Would
 
   zpool add tank mirror log c1t6d0 c1t7d0
 
 have done what I intended to do? If so it seems I have to tear down the
 tank pool and recreate it from scratch!? Can I simply use
 
   zpool destroy -f tank
 
 to do so?

Yes.  You're unfortunately in a bad place right now, due to a simple command
line error.  If you have the ability to destroy and recreate the pool,
that's what you should do.  If you can't afford the downtime, then you
better buy a couple more SSD's, and attach them to the first ones, to mirror
them.
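If you do end up buying the extra SSDs, turning each existing log device into
a two-way mirror is a single attach per device; something like this, where
c1t8d0 and c1t9d0 stand in for the new drives:

  # attach a new SSD to each existing log device to form mirrors
  zpool attach tank c1t6d0 c1t8d0
  zpool attach tank c1t7d0 c1t9d0

  # zpool status should then show each log as a mirror under "logs"
  zpool status tank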

On a related note, this creates additional work, but I think I'm going to
start recommending this now ... If you create a slice on each drive which is
slightly smaller than the drive, and use those slices instead of the full
device, you might be happy you did some day in the future.  I had a mirrored
SSD, and one drive failed.  The replacement disk is exactly the same, yet for
no apparent reason it appears 0.001 GB smaller than the original, and hence I
cannot un-degrade the mirror.  Since you can't zpool remove a log device in
Solaris 10, that's a fatal problem.  The only
solution is to destroy and recreate the pool.  Or buy a new SSD which is
definitely larger...  Say, 64G instead of the 32G I already have.  But
that's another thousand bucks, or more.

Please see the thread
http://opensolaris.org/jive/thread.jspa?threadID=127162&tstart=0

One more note.  I heard, but I don't remember where, that the OS will refuse
to use more than half of the system RAM on ZIL log device anyway.  So if
you've got 32G of ram, the maximum useful ZIL log device would be 16G.  Not
sure if that makes any difference ... but suppose you've got 32G RAM and a
32G SSD, and you create a 16G or 17G slice to use for the log device: for
performance reasons, you should leave the rest of that SSD unused anyway.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-05 Thread Edward Ned Harvey
 From: Kyle McDonald [mailto:kmcdon...@egenera.com]

 So does your HBA have newer firmware now than it did when the first
 disk
 was connected?
 Maybe it's the HBA that is handling the new disks differently now, than
 it did when the first one was plugged in?
 
 Can you down-rev the HBA FW? Do you have another HBA that might still
 have the older rev you could test it on?

I'm planning to get the support guys more involved tomorrow ... things
have been pretty stagnant for several days now, and I think it's time to
start putting more effort into this.

Long story short, I don't know yet.  But there is one glaring clue:  Prior
to OS installation, I don't know how to configure the HBA.  This means the
HBA must have been preconfigured with the factory installed disks, and I
followed a different process with my new disks, because I was using the GUI
within the OS.  My best hope right now is to find some other way to
configure the HBA, possibly through the ILOM, but I already searched there
and looked at everything.  Maybe I have to shut down (power cycle) the system
and attach a keyboard & monitor.  I don't know yet...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] no hot spare activation?

2010-04-05 Thread Eric Schrock

On Apr 5, 2010, at 3:38 AM, Garrett D'Amore wrote:
 
 Am I missing something here?  Under what conditions can I expect hot spares 
 to be recruited?

Hot spares are activated by the zfs-retire agent in response to a list.suspect 
event containing one of the following faults:

fault.fs.zfs.vdev.io
fault.fs.zfs.vdev.checksum
fault.fs.zfs.device

The last of these (fault.fs.zfs.device) is what is diagnosed when a label is 
corrupted.  What software are you running?  Have you confirmed that you are 
getting one of these faults?  What does 'fmdump -V' show?  Does doing a 'zpool 
replace c2t3d1 c2t3d2' by hand succeed?
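For concreteness, the checks in question look roughly like this (pool and
device names as used elsewhere in this thread):

  # fault diagnoses - this is what the zfs-retire agent reacts to
  fmdump -V

  # the underlying error reports, with full detail
  fmdump -eV

  # manual replacement with the intended spare
  zpool replace testpool c2t3d1 c2t3d2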

- Eric 

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] no hot spare activation?

2010-04-05 Thread Garrett D'Amore

On 04/ 5/10 05:28 AM, Eric Schrock wrote:

On Apr 5, 2010, at 3:38 AM, Garrett D'Amore wrote:
   

Am I missing something here?  Under what conditions can I expect hot spares to 
be recruited?
 

Hot spares are activated by the zfs-retire agent in response to a list.suspect 
event containing one of the following faults:

fault.fs.zfs.vdev.io
fault.fs.zfs.vdev.checksum
fault.fs.zfs.device

The last of these (fault.fs.zfs.device) is what is diagnosed when a label is 
corrupted.  What software are you running?  Have you confirmed that you are 
getting one of these faults?  What does 'fmdump -V' show?  Does doing a 'zpool 
replace c2t3d1 c2t3d2' by hand succeed?
   


I see ereport.fs.zfs.io_failure, and ereport.fs.zfs.probe_failure.  
Also, ereport.io.service.lost and ereport.io.device.inval_state.  There 
is indeed a fault.fs.zfs.device in the list as well.


Clearly ZFS thinks the device is unavailable (which is accurate).

And pfexec zpool replace testpool c2t3d1 c2t3d2 works fine, as shown here:

gdam...@tabasco{33} pfexec zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  c1t0d0s0  ONLINE   0 0 0

errors: No known data errors

  pool: testpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Mon Apr  5 08:39:57 2010

config:

NAME  STATE READ WRITE CKSUM
testpool  DEGRADED 0 0 0
  raidz1-0DEGRADED 0 0 0
c2t3d0ONLINE   0 0 0
spare-1   DEGRADED 0 0 0
  c2t3d1  UNAVAIL  9   132 0  cannot open
  c2t3d2  ONLINE   0 0 0  20.8M resilvered
spares
  c2t3d2  INUSE currently in use

errors: No known data errors
gdam...@tabasco{34}


Everything seems to be correct *except* that ZFS isn't automatically 
doing the replace operation with the hot spare.


It feels to me like this is possibly a ZFS bug --- perhaps ZFS is 
expecting a specific set of FMA faults that only sd delivers?  (Recall 
this is with a different target device.)
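One quick sanity check along those lines is to confirm the ZFS FMA modules
are actually loaded, and to ask fmdump specifically for the fault class the
retire agent keys on (class name per the earlier mail):

  # zfs-diagnosis and zfs-retire should both show up as active modules
  fmadm config | grep zfs

  # faults (as opposed to raw ereports) of the class that triggers sparing
  fmdump -V -c fault.fs.zfs.device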


- Garrett


- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock

   


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Neil Perrin

On 04/05/10 11:43, Andreas Höschler wrote:

Hi Khyron,

No, he did *not* say that a mirrored SLOG has no benefit, 
redundancy-wise.

He said that YOU do *not* have a mirrored SLOG.  You have 2 SLOG devices
which are striped.  And if this machine is running Solaris 10, then 
you cannot

remove a log device because those updates have not made their way into
Solaris 10 yet.  You need pool version >= 19 to remove log devices, and S10
does not currently have patches to ZFS to get to a pool version >= 19.

If your SLOG above were mirrored, you'd have "mirror" under "logs".  And you
probably would have "log" not "logs" - notice the "s" at the end meaning plural,
meaning multiple independent log devices, not a mirrored pair of logs which
would effectively look like 1 device.


Thanks for the clarification! This is very annoying. My intent was to 
create a log mirror. I used


zpool add tank log c1t6d0 c1t7d0

and this was obviously false. Would

zpool add tank mirror log c1t6d0 c1t7d0


zpool add tank log mirror c1t6d0 c1t7d0

You can also do it on the create:

zpool create tank <pool devs> log mirror c1t6d0 c1t7d0
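With either of those forms, zpool status should then show the two devices as
a single mirrored log, something along these lines (exact vdev labels vary a
bit between releases):

        logs
          mirror    ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0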



have done what I intended to do? If so it seems I have to tear down 
the tank pool and recreate it from scratch!? Can I simply use


zpool destroy -f tank

to do so?


Shouldn't need the -f




Thanks,

 Andreas

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Sun, Apr 04, 2010 at 11:46:16PM -0700, Willard Korfhage wrote:
 Looks like it was RAM. I ran memtest+ 4.00, and it found no problems.

Then why do you suspect the ram?

Especially with 12 disks, another likely candidate could be an
overloaded power supply.  While there may be problems showing up in
RAM, it may only be happening under the combined load of disks, cpu
and memory activity that brings the system into marginal power
conditions.  Sometimes it may be just one rail that is out of bounds,
and other devices are unaffected.

If memtest didn't find any problems without the disk and cpu load,
that tends to support this hypothesis.

So, the memory may not be bad per se, though it's still not ECC and
therefore not good either :-)   Perhaps you can still find a good
use for it elsewhere.

 I removed 2 of the 3 sticks of RAM, ran a backup, and had no
 errors. I'm running more extensive tests, but it looks like that was
 it. A new motherboard, CPU and ECC RAM are on the way to me now. 

Switching to ECC is a good thing.. but be prepared for possible
continued issues (with different detection thanks to ECC) if the root
cause is the psu.  In fact, ECC memory may draw marginally more power
and maybe make the problem worse (the new cpu and motherboard could go
either way, depending on your choices). 

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-05 Thread Brad
I'm wondering if the author is talking about cache mirroring where the cache 
is mirrored between both controllers.  If that is the case, is he saying that 
for every write to the active controller, a second write is issued on the passive 
controller to keep the cache mirrored?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: Raid and dedup

2010-04-05 Thread Learner Study
Hi Folks:

I'm wondering what is the correct flow when both raid5 and de-dup are
enabled on a storage volume

I think we should do de-dup first and then raid5 ... is that
understanding correct?

Thanks!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 07:43:26AM -0400, Edward Ned Harvey wrote:
 Is the database running locally on the machine?  Or at the other end of
 something like nfs?  You should have better performance using your present
 config than just about any other config ... By enabling the log devices,
 such as you've done, you're dedicating the SSD's for sync writes.  And
 that's what the database is probably doing.  This config should be *better*
 than dedicating the SSD's as their own pool.  Because with the dedicated log
 device on a stripe of mirrors, you're allowing the spindle disks to do what
 they're good at (sequential blocks) and allowing the SSD's to do what
 they're good at (low latency IOPS).

Others have addressed the rest of the issues well enough, but I
thought I'd respond on this point. 

What you say is fair, if the db is bound by sync write latency.  If it
is bound by read latency, you will still suffer.  You could add more
ssd's as l2arc (and incur more memory overhead), or you could put the
whole pool on ssd (and lose its benefit for other pool uses). There
are many factors here that will determine the best config, but the
current one may well not be the optimal. 
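For the read-latency case in particular, a cache (L2ARC) device is the easy
experiment: unlike a slog it is non-critical and can be pulled back out again.
Roughly (the device name is just a placeholder):

  # add an SSD as L2ARC
  zpool add tank cache c1t8d0

  # cache devices, unlike log devices on S10, can be removed again
  zpool remove tank c1t8d0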

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: Raid and dedup

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 06:32:13PM -0700, Learner Study wrote:
 I'm wondering what is the correct flow when both raid5 and de-dup are
 enabled on a storage volume
 
 I think we should do de-dup first and then raid5 ... is that
 understanding correct?

Not really.  Strictly speaking, ZFS doesn't do raid5 - assuming you
mean one of the raidz levels, pools can be created by assembling disks
into one or more raidz groups.  Dedup is then performed within the
pool, and enabled at a dataset (filesystem) granularity. 

If you have raid5 in a san or hw controller, you might build a pool on
top of the LUNs it presents, and again apply dedup within that pool.
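Concretely, the admin sequence looks roughly like this (disk and dataset
names are only placeholders):

  # build the pool from a single raidz group
  zpool create tank raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0

  # dedup is a dataset property, not a vdev one
  zfs create tank/data
  zfs set dedup=on tank/data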

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: Raid and dedup

2010-04-05 Thread Learner Study
Hi Jeff:

I'm a bit confused... did you say "Correct" to my orig email or the
reply from Daniel? Is there a doc that may explain it better?

Thanks!


On Mon, Apr 5, 2010 at 6:54 PM, jeff.bonw...@oracle.com wrote:
 Correct.

 Jeff

 Sent from my iPhone

 On Apr 5, 2010, at 6:32 PM, Learner Study learner.st...@gmail.com wrote:

 Hi Folks:

 I'm wondering what is the correct flow when both raid5 and de-dup are
 enabled on a storage volume

 I think we should do de-dup first and then raid5 ... is that
 understanding correct?

 Thanks!
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-05 Thread Torrey McMahon
 The author mentions multipathing software in the blog entry. Kind of 
hard to mix that up with cache mirroring if you ask me.


On 4/5/2010 9:16 PM, Brad wrote:

I'm wondering if the author is talking about cache mirroring where the cache 
is mirrored between both controllers.  If that is the case, is he saying that for every 
write to the active controller, a second write is issued on the passive controller to keep 
the cache mirrored?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-05 Thread Tim Cook
On Mon, Apr 5, 2010 at 8:16 PM, Brad bene...@yahoo.com wrote:

 I'm wondering if the author is talking about cache mirroring where the
 cache is mirrored between both controllers.  If that is the case, is he
 saying that for every write to the active controller, a second write is issued
 on the passive controller to keep the cache mirrored?


He's talking about multipathing; he just has no clue what
he's talking about.  He specifically calls out applications that are
used for multipathing.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: Raid and dedup

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 06:58:57PM -0700, Learner Study wrote:
 Hi Jeff:
 
 I'm a bit confused... did you say "Correct" to my orig email or the
 reply from Daniel...

Jeff is replying to your mail, not mine.

It looks like he's read your question a little differently.  By that
reading, you are correct, because for a given write to a pool, the
data will first be checked for dedup, and then the writes will be sent
to the pool devices where raidz (or hw raid5) will be applied. 

I read the question as more about how to initally set up your pool
(ie, as about a sequence of admin commands).

Now you have answers for both, whichever you originally intended to
ask. :)

 Is there a doc that may explain it better?

Several..  start with the ZFS FAQ and Best Practices Guide.

--
Dan.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Willard Korfhage
It certainly has symptoms that match a marginal power supply, but I measured 
the power consumption some time ago and found it comfortably within the power 
supply's capacity. I've also wondered if the RAM is fine, but there is just 
some kind of flaky interaction of the ram configuration I had with the 
motherboard.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: Raid and dedup

2010-04-05 Thread Richard Elling
On Apr 5, 2010, at 6:32 PM, Learner Study wrote:
 Hi Folks:
 
 I'm wondering what is the correct flow when both raid5 and de-dup are
 enabled on a storage volume
 
 I think we should do de-dup first and then raid5 ... is that
 understanding correct?

Yes.  If you look at the (somewhat outdated) ZFS Source Tour, you will see
that the ZIO layer feeds I/Os to the VDEV layer which is where raidz is 
implemented.  In ZIO, deduplication occurs after compression and checksumming
but before space allocation. The checksum and physical size are used for the 
deduplication table key.
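The result of that ordering can be inspected after the fact: the dedup ratio
is tracked per pool and the dedup table itself can be dumped, e.g. (the pool
name is a placeholder):

  # overall dedup ratio for the pool
  zpool get dedupratio tank

  # histogram of the deduplication table (DDT)
  zdb -DD tank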
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Tim Cook
On Mon, Apr 5, 2010 at 9:39 PM, Willard Korfhage opensola...@familyk.org wrote:

 It certainly has symptoms that match a marginal power supply, but I
 measured the power consumption some time ago and found it comfortably within
 the power supply's capacity. I've also wondered if the RAM is fine, but
 there is just some kind of flaky interaction of the ram configuration I had
 with the motherboard.
 --
 This message posted from opensolaris.org


I think the confusion is that you said you ran memtest86+ and the memory
tested just fine.  Did you remove some memory before running memtest86+ and
narrow it down to a certain stick being bad or something?  Your post makes
it sound as though you found that all of the RAM is working perfectly fine,
i.e., that it's not the problem.

Also, a low power draw doesn't mean much of anything.  The power supply
could just be dying.  Load wouldn't really matter in that scenario (although
a high load will generally help it out the door a bit quicker due to higher
heat/etc.).

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 09:46:58PM -0500, Tim Cook wrote:
 On Mon, Apr 5, 2010 at 9:39 PM, Willard Korfhage 
 opensola...@familyk.org wrote:
 
  It certainly has symptoms that match a marginal power supply, but I
  measured the power consumption some time ago and found it comfortably within
  the power supply's capacity. I've also wondered if the RAM is fine, but
  there is just some kind of flaky interaction of the ram configuration I had
  with the motherboard.
 
 I think the confusion is that you said you ran memtest86+ and the memory
 tested just fine.  Did you remove some memory before running memtest86+ and
 narrow it down to a certain stick being bad or something?  Your post makes
 it sound as though you found that all of the ram is working perfectly fine.

Exactly.

 Also, a low power draw doesn't mean much of anything.  The power supply
 could just be dying.

Or just one part of it could be overloaded (like a particular 5v or
12v rail that happens to be shared between too many drives and the
m/b), even if the overall draw at the wall is less than the total
rating. Sometimes, just moving plugs around can help - or at least
show that a better psu is warranted.

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Willard Korfhage
Memtest didn't show any errors, but between Frank, early in the thread, saying 
that he had found memory errors that memtest didn't catch, and removal of DIMMs 
apparently fixing the problem, I jumped too quickly to the conclusion that it was 
the memory. Certainly there are other explanations. 

I see that I have a spare Corsair 620W power supply that I could try; the supply 
in the machine now is also a Corsair of some wattage. If I recall properly, the 
steady-state power draw is between 150 and 200 watts.

By the way, I see that now one of the disks is listed as degraded - too many 
errors. Is there a good way to identify exactly which of the disks it is?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
 By the way, I see that now one of the disks is listed as degraded - too many 
 errors. Is there a good way to identify exactly which of the disks it is?

It's hidden in iostat -E, of all places.
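For example (-n adds the familiar cXtYdZ names; exact fields vary a little by
release):

  # per-device soft/hard/transport error counters plus vendor, product
  # and serial number
  iostat -En

  # or just the interesting lines
  iostat -En | egrep 'Errors|Serial'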

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Tim Cook
On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote:

 On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
  By the way, I see that now one of the disks is listed as degraded - too
 many errors. Is there a good way to identify exactly which of the disks it
 is?

 It's hidden in iostat -E, of all places.

 --
 Dan.


I think he wants to know how to identify which physical drive maps to the
dev ID in Solaris.  The only way I can think of is to run something like dd
against the drive to light up the activity LED.
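A minimal version of that, with the device name being whatever cXtYdZ the
pool reports as degraded (any slice on that disk will do, since this only
reads):

  # keep the suspect disk busy so its activity LED stays lit
  # (the device name below is a placeholder - substitute the degraded disk)
  dd if=/dev/rdsk/c1t5d0s0 of=/dev/null bs=1024k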

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Tue, Apr 06, 2010 at 12:29:35AM -0500, Tim Cook wrote:
 On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote:
 
  On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
   By the way, I see that now one of the disks is listed as degraded - too
  many errors. Is there a good way to identify exactly which of the disks it
  is?
 
  It's hidden in iostat -E, of all places.
 
  --
  Dan.
 
 
 I think he wants to know how to identify which physical drive maps to the
 dev ID in solaris.  The only way I can think of is to run something like DD
 against the drive to light up the activity LED.

or look at the serial numbers printed in iostat -E

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss