Re: [zfs-discuss] Dedup performance hit

2010-06-14 Thread Richard Elling
Erik is right, more below...

On Jun 13, 2010, at 10:17 PM, Erik Trimble wrote:

 Hernan F wrote:
 Hello, I tried enabling dedup on a filesystem, and moved files into it to 
 take advantage of it. I had about 700GB of files and left it for some hours. 
 When I returned, only 70GB were moved.
 
 I checked zpool iostat, and it showed about 8MB/s R/W performance (the old 
 and new zfs filesystems are in the same pool). So I disabled dedup for a few 
 seconds and instantly the performance jumped to 80MB/s
 
 It's an Athlon64 X2 machine with 4GB RAM; it's only a fileserver (4x1TB SATA 
 for ZFS). arcstat.pl shows 2G for arcsz, and top shows 13% CPU during the 8MB/s 
 transfers. 
 Is this normal behavior? Should I always expect such low performance, or is 
 there anything wrong with my setup? 
 Thanks in advance,
 Hernan
  
 You are severely RAM limited.  In order to do dedup, ZFS has to maintain a 
 catalog of every single block it writes and the checksum for that block. This 
 is called the Dedup Table (DDT for short).  
 So, during the copy, ZFS has to (a) read a block from the old filesystem, (b) 
 check the current DDT to see if that block exists and (c) either write the 
 block to the new filesystem (and add an appropriate DDT entry for it), or 
 write a metadata update with a reference to the existing deduplicated block.
 
 Likely, you have two problems:
 
 (1) I suspect your source filesystem has lots of blocks (that is, it's likely 
 made up of smaller-sized files).  Lots of blocks means lots of seeking back and 
 forth to read all those blocks.
 
 (2) Lots of blocks also means lots of entries in the DDT.  It's trivial to 
 overwhelm a 4GB system with a large DDT.  If the DDT can't fit in RAM, then 
 it has to get partially refreshed from disk.
 
 Thus, here's what's likely going on:
 
 (1)  ZFS reads a block and its checksum from the old filesystem
 (2)  it checks the DDT to see if that checksum exists
 (3)  finding that the entire DDT isn't resident in RAM, it starts a cycle to 
 read the rest of the (potential) entries from the new filesystem's metadata.  
 That is, it tries to reconstruct the DDT from disk, which involves a HUGE 
 amount of random seek reads on the new filesystem.
 
 In essence, since you likely can't fit the DDT in RAM, each block read from 
 the old filesystem forces a flurry of reads from the new filesystem. Which 
 eats up the IOPS that your single pool can provide.  It thrashes the disks.  
 Your solution is to either buy more RAM, or find something you can use as an 
 L2ARC cache device for your pool.  Ideally, it would be an SSD.  However, in 
 this case, a plain hard drive would do OK (NOT one already in a pool).  To 
 add such a device, you would do:  'zpool add tank mycachedevice'


A typical next question is how large will the DDT become?
Without measuring, I use 3% as a SWAG.  So if you have 700GB
of space then, for a SWAG, you need about 21GB for the DDT.
Since you also want data in the ARC and the ARC can use up to
7/8th of RAM, then a 32GB machine should work fine.  Or perhaps
invest in a 32+GB SSD for a cache. Note that every entry in the
cache SSD also consumes space in the ARC, so your small machine
will find its limits.

If you want to know more precisely how big the DDT is, use
the zdb -D command to get a summary of how many objects
are in the DDT and their size.  Simple arithmetic solves the
equation.
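For example, something along these lines (the pool name and entry count are
purely illustrative; ~250 bytes per entry is a commonly quoted ballpark):

  # zdb -D tank     # DDT object counts plus on-disk/in-core entry sizes
  # zdb -DD tank    # the same, plus a reference-count histogram

  DDT RAM footprint ~= total entries x in-core bytes per entry
  e.g. 10,000,000 entries x ~250 bytes ~= 2.5 GB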
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots, txgs and performance

2010-06-14 Thread Arne Jansen
Marcelo Leal wrote:
 Hello there,
  I think you should share it with the list, if you can; it seems like 
 interesting work. ZFS has some issues with snapshot and spa_sync performance 
 for snapshot deletion.

I'm a bit reluctant to post it to the list where it can still be found
years from now. Because the module is not compiled directly into ZFS
but is a separate module that makes heavy use of internal structures
of ZFS, it is designed for a specific version of ZFS (Solaris U8). It
might still load without problems for years, but already in the next
Solaris version it might wreak havoc because of a changed kernel structure.
A much better way would be to have a similar operation integrated into
the official source tree. I could try to build a patch if it has a
chance of getting accepted.

Until then, I have no problem with sharing it off-list.

--Arne

  
  Thanks
 
  Leal
 [ http://www.eall.com.br/blog ]

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs/zpool] hang at boot

2010-06-14 Thread schatten
Just FYI.
The error was that I created the ZFS filesystem at the wrong point in the pool.

rpool/a/b/c
rpool/new

I mounted 'new' in a directory of rpool/a/b/c. Seems like this hierarchical 
mounting does not work the way I thought. ;)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What happens when unmirrored ZIL log devi ce is removed ungracefully

2010-06-14 Thread R . Eulenberg
Hello
I have this problem on my system too. I lost my backup server when the
system HD and the ZIL device died. After setting up a new system (osol 2009.06,
updated to the latest osol/dev version with zpool dedup) I tried to import my
backup pool, but I can't. The system tells me there isn't any zpool tank1 when I
try to replace / detach / attach / add any kind of device, or it answers this:
zpool import -f tank1
cannot import 'tank1': one or more devices is currently unavailable
Destroy and re-create the pool from
a backup source.
Using the options -F, -X, -V, -C, -D, or any combination of them, produces the
same response.
There are solutions for cases where the old cachefile is still available or the
ZIL device isn't destroyed, but none of them applies to mine.
I need a way to import the zpool while ignoring the ZIL device.
I spent a week searching the net but didn't find anything.
I would be very glad of any help.

regards 
Ronny
P.S. hoping you excuse my lousy English.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] size of slog device

2010-06-14 Thread Arne Jansen
Hi,

I know it's been discussed here more than once, and I read the
Evil Tuning Guide, but I didn't find a definitive statement:

There is absolutely no sense in having slog devices larger than
the main memory, because the excess will never be used, right?
ZFS will rather flush the txg to disk than read back from the
zil?
So there is a guideline to have enough slog to hold about 10
seconds of zil, but the absolute maximum useful size is the size of
main memory. Is this correct?

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Thomas Burgess
On Mon, Jun 14, 2010 at 4:41 AM, Arne Jansen sensi...@gmx.net wrote:

 Hi,

 I known it's been discussed here more than once, and I read the
 Evil tuning guide, but I didn't find a definitive statement:

 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?
 ZFS will rather flush the txg to disk than reading back from
 zil?
 So there is a guideline to have enough slog to hold about 10
 seconds of zil, but the absolute maximum value is the size of
 main memory. Is this correct?




I thought it was half the size of memory.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Roy Sigurd Karlsbakk
 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?
 ZFS will rather flush the txg to disk than reading back from
 zil? So there is a guideline to have enough slog to hold about 10
 seconds of zil, but the absolute maximum value is the size of
 main memory. Is this correct?

ZFS uses at most RAM/2 for ZIL

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup performance hit

2010-06-14 Thread Dennis Clarke


 You are severely RAM limited.  In order to do dedup, ZFS has to maintain
 a catalog of every single block it writes and the checksum for that
 block. This is called the Dedup Table (DDT for short).

 So, during the copy, ZFS has to (a) read a block from the old
 filesystem, (b) check the current DDT to see if that block exists and
 (c) either write the block to the new filesytem (and add an appropriate
 DDT entry for it), or write a metadata update with the dedup reference
 block reference.

 Likely, you have two problems:

 (1) I suspect your source filesystem has lots of blocks (that is, it's
 likely made up smaller-sized files).  Lots of blocks means lots of
 seeking back and forth to read all those blocks.

 (2) Lots of blocks also means lots of entries in the DDT.  It's trivial
 to overwhelm a 4GB system with a large DDT.  If the DDT can't fit in
 RAM, then it has to get partially refreshed from disk.

 Thus, here's what's likely going on:

 (1)  ZFS reads a block and it's checksum from the old filesystem
 (2)  it checks the DDT to see if that checksum exists
 (3) finding that the entire DDT isn't resident in RAM, it starts a cycle
 to read the rest of the (potential) entries from the new filesystems'
 metadata.  That is, it tries to reconstruct the DDT from disk.  Which
 involves a HUGE amount of random seek reads on the new filesystem.

 In essence, since you likely can't fit the DDT in RAM, each block read
 from the old filesystem forces a flurry of reads from the new
 filesystem. Which eats up the IOPS that your single pool can provide.
 It thrashes the disks.  Your solution is to either buy more RAM, or find
 something you can use as an L2ARC cache device for your pool.  Ideally,
 it would be an SSD.  However, in this case, a plain hard drive would do
 OK (NOT one already in a pool).To add such a device, you would do:
 'zpool add tank mycachedevice'



That was an awesome response!  Thank you for that :-)
I tend to config my servers with 16G of ram minimum these days and now I
know why.


-- 
Dennis Clarke
dcla...@opensolaris.ca  - Email related to the open source Solaris
dcla...@blastwave.org   - Email related to open source for Solaris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup performance hit

2010-06-14 Thread remi.urbillac
 

 To add such a device, you would do:
 'zpool add tank mycachedevice'



Hi

Correct me if I'm wrong, but I believe the correct command should be: 
'zpool add tank cache mycachedevice'

If you don't use the cache keyword, the device will be added as a regular 
top-level vdev.
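For example, with an illustrative device name:

  # zpool add tank cache c5t0d0    # attach c5t0d0 as an L2ARC (cache) device
  # zpool status tank              # the device now shows up under a 'cache' heading
  # zpool remove tank c5t0d0       # cache devices can be removed again if needed

whereas 'zpool add tank c5t0d0' would add it permanently as a top-level data 
vdev.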

Remi


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Arne Jansen
 
 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?

Also:  A TXG is guaranteed to flush within 30 sec.  Let's suppose you have a
super fast device, which is able to log 8Gbit/sec (which is unrealistic).
That's 1Gbyte/sec, unrealistically theoretically possible, at best.  You do
the math.  ;-)
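To spell that out under the same generous assumptions: 8 Gbit/s is roughly 
1 GByte/s, and a txg flushes within 30 sec, so even that device could never 
accumulate more than about 1 GB/s x 30 s = 30 GB of uncommitted log data -- and 
with the RAM/2 cap mentioned elsewhere in this thread, the practical ceiling on 
a typical box is far smaller.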

That being said, it's difficult to buy an SSD smaller than 32G.  So what are
you going to do?  Slice it and use the remaining space for cache?  Some
people do.  Some people may even get a performance benefit by doing so.  But
if you do, now you've got a cache and a log both competing for IO on the
same device.  The performance benefit degrades for sure.

My advice is to simply acknowledge wasted space in your log device, forget
about it and move on.  Same thing you did with all the wasted space on your
mirrored OS boot device, which can't (or shouldn't) be used by your data
pool.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Arne Jansen
Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Arne Jansen

 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?
 
 Also:  A TXG is guaranteed to flush within 30 sec.  Let's suppose you have a
 super fast device, which is able to log 8Gbit/sec (which is unrealistic).
 That's 1Gbyte/sec, unrealistically theoretically possible, at best.  You do
 the math.  ;-)
 
 That being said, it's difficult to buy an SSD smaller than 32G.  So what are
 you going to do?

I'm still building my driver for eliminating rotational write delay and am trying
to figure out how much space I can waste on the underlying device without ever
running into problems. I need half the physical memory, or, on the assumption
that the limit might be tunable, at most the full physical memory. It's good to know
a hard upper limit. The more space I can waste, the faster the device will be.

Also, to stay in your line of argumentation, this super-fast slog is most
probably a DRAM-based, battery backed solution. In this case it will make
a difference if you buy 8 or 32GB ;)

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Arne Jansen
Roy Sigurd Karlsbakk wrote:
 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?
 ZFS will rather flush the txg to disk than reading back from
 zil? So there is a guideline to have enough slog to hold about 10
 seconds of zil, but the absolute maximum value is the size of
 main memory. Is this correct?
 
 ZFS uses at most RAM/2 for ZIL

Thanks!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving ba

2010-06-14 Thread Ross Walker
On Jun 13, 2010, at 2:14 PM, Jan Hellevik  
opensola...@janhellevik.com wrote:


Well, for me it was a cure. Nothing else I tried got the pool back.  
As far as I can tell, the way to get it back should be to use  
symlinks to the fdisk partitions on my SSD, but that did not work  
for me. Using -V got the pool back. What is wrong with that?


If you have a better suggestion as to how I should have recovered my  
pool I am certainly interested in hearing it.


I would take this time to offline one disk at a time, wipe all its  
tables/labels and re-attach it as an EFI whole disk to avoid hitting  
this same problem again in the future.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Bob Friesenhahn

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


There is absolutely no sense in having slog devices larger than
then main memory, because it will never be used, right?
ZFS will rather flush the txg to disk than reading back from
zil? So there is a guideline to have enough slog to hold about 10
seconds of zil, but the absolute maximum value is the size of
main memory. Is this correct?


ZFS uses at most RAM/2 for ZIL


It is good to keep in mind that only small writes go to the dedicated 
slog.  Large writes go to the main store.  A succession of enough small 
writes to fill RAM/2 is highly unlikely.  Note also that the zil is never 
read back unless the system is improperly shut down.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unable to Install 2009.06 on BigAdmin Approved MOBO - FILE SYSTEM FULL

2010-06-14 Thread Cindy Swearingen

Hi Giovanni,

My Monday morning guess is that the disk/partition/slices are not
optimal for the installation.

Can you provide the partition table of the disk that you are attempting 
to install on? Use format -> disk -> partition -> print.


You want to put all the disk space in c*t*d*s0. See this section of the 
ZFS troubleshooting guide for an example of fixing the disk/partition/slice 
issues:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Replacing/Relabeling the Root Pool Disk
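Roughly, the sequence looks like this (the disk name is just an example and 
the prompts are abbreviated):

  # format
      - select the disk, e.g. c7t0d0
      - fdisk        (x86: one Solaris2 partition spanning the whole disk)
      - partition
      - print        (show the current slice table -- this is what I'd like to see)
      - 0            (give slice 0 all available cylinders, tag it 'root')
      - label
  # prtvtoc /dev/rdsk/c7t0d0s2   (verify that slice 0 now covers the disk)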

Thanks,

Cindy


On 06/13/10 17:42, Giovanni wrote:

Hi Guys

I am having trouble installing OpenSolaris 2009.06 onto my Biostar TPower I45 motherboard, which is approved on the BigAdmin HCL here: 


http://www.sun.com/bigadmin/hcl/data/systems/details/26409.html -- why is it 
not working?

My setup:
3x 1TB hard-drives SATA 
1x 500GB hard-drive (I have only left this hdd connected to try to isolate the issue, still happens)

4GB DDR2 PC2-6400 Ram (tested GOOD!)
ATI Radeon 4650 512MB DDR2 PCI-E 16x
Motherboard default settings/CMOS cleared

Here's what happens: the OpenSolaris boot options come up and I choose the first, 
default OpenSolaris 2009.06 entry -- I have also tried the VESA driver and 
command-line options; all of these fail.
-

After Select desktop language, 

configuring devices. 
Mounting cdroms 
Reading ZFS Config: done.


opensolaris console login: (cd rom is still being accessed at this time).. few 
seconds later:

then opensolaris ufs: NOTICE: alloc: /: file system full
opensolaris last message repeated 1 time
opensolaris syslogd: /var/adm/messages: No space left on device
opensolaris in.routed[537]: route 0.0.0.0/8 - 0.0.0.0 nexthop is not directly 
connected

---

I logged in as jack / jack on the console and did a df -h

/devices/ramdisk:a = size 164M 100% used mount /
swap 3.3GB used 860K 1%
/mnt/misc/opt 210MB used 210M 100% /mnt/misc

/usr/lib/libc/libc_hwcap1.so.1 2.3G used 2.3G 100% /lib/libc.so.1

/dev/dsk/c7t0d0s2 677M used 677M 100% /media/OpenSolaris

Thanks for any help!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Roy Sigurd Karlsbakk
- Original Message -
 On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:
 
  There is absolutely no sense in having slog devices larger than
  then main memory, because it will never be used, right?
  ZFS will rather flush the txg to disk than reading back from
  zil? So there is a guideline to have enough slog to hold about 10
  seconds of zil, but the absolute maximum value is the size of
  main memory. Is this correct?
 
  ZFS uses at most RAM/2 for ZIL
 
 It is good to keep in mind that only small writes go to the dedicated
 slog. Large writes to to main store. A succession of that many small
 writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
 read back unless the system is improperly shut down.

I thought all sync writes, meaning everything NFS and iSCSI, went into the slog 
- IIRC the docs say so.
 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] COMSTAR dropouts with dedup enabled

2010-06-14 Thread Matthew Anderson
Hi All,

I currently use b134 and COMSTAR to deploy SRP targets for virtual machine 
storage (VMware ESXi4) and have run into some unusual behaviour when dedup is 
enabled for a particular LUN. The target seems to lock up (ESX reports it as 
unavailable) when writing large amounts of data or overwriting data; reads are 
unaffected. The easiest way for me to replicate the problem was to restore a 
2GB SQL database inside a VM. The dropouts lasted anywhere from 3 seconds to a 
few minutes, and when connectivity is restored the other LUNs (without dedup) 
drop out for a few seconds.

The problem didn't seem to occur with only a small amount of data on the LUN 
(50GB) and happened more frequently as the LUN filled up. I've since moved all 
data to non-dedup LUNs and I haven't seen a dropout for over a month.  Does 
anyone know why this is happening? I've also seen the behaviour when exporting 
iSCSI targets with COMSTAR. I haven't had a chance to install the SSDs for 
L2ARC and SLOG yet, so I'm unsure whether that will help the issue.

System specs are-
Single Xeon 5620
24GB DDR3
24x 1.5TB 7200rpm
LSI RAID card

Thanks
-Matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Bob Friesenhahn

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


It is good to keep in mind that only small writes go to the dedicated
slog. Large writes to to main store. A succession of that many small
writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
read back unless the system is improperly shut down.


I thought all sync writes, meaning everything NFS and iSCSI, went 
into the slog - IIRC the docs says so.


Check a month or two back in the archives for a post by Matt Ahrens. 
It seems that larger writes (32k?) are written directly to main 
store.  This is probably a change from the original zfs design.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Permament errors in files 0x0

2010-06-14 Thread Jan Ploski
I've been referred here from the zfs-fuse newsgroup. I have a 
(non-redundant) pool which is reporting errors that I don't quite understand:

# zpool status -v
  pool: green
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 1h12m, 2.96% done, 39h44m to go
config:

NAMESTATE READ WRITE CKSUM
green   ONLINE   0 0 2
  disk/by-id/dm-name-green  ONLINE   0 0 4

errors: Permanent errors have been detected in the following files:

metadata:0x0
green:0x0

I read the explanations at 
http://dlc.sun.com/osol/docs/content/ZFSADMIN/gbbwl.html#gbcuz that the 0x0 is 
output when a file path is not available, but I'm still unsure how to proceed 
(of course, I'd also like to know why these errors occurred in the first place 
- after just a couple of days of using zfs-fuse, but that's another story).

It has been suggested to me to copy out all data from the pool and/or recreate 
it from backup, but do I really have to (hours of recovery), or is there a 
faster way to correct the problem? Apart from these alarming messages, the pool 
seems to be in working order, e.g. all files that I tried could be read. I 
guess I'd just like to know [i]what[/i] the corrupted data is and the 
implications.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-06-14 Thread Roy Sigurd Karlsbakk





On 04/10/10 09:28, Edward Ned Harvey wrote: 

- If synchronous writes are large (32K) and block aligned then the blocks are 
written directly to the pool and a small record 
written to the log. Later when the txg commits then the blocks are just linked 
into the txg. However, this processing is not 
done if there are any slogs because I found it didn't perform as well. Probably 
ought to be re-evaluated. 
Won't this affect NFS/iSCSI performance pretty badly where the ZIL is crucial? 

Vennlige hilsener / Best regards 

roy 
-- 
Roy Sigurd Karlsbakk 
(+47) 97542685 
r...@karlsbakk.net 
http://blogg.karlsbakk.net/ 
-- 
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Neil Perrin

On 06/14/10 12:29, Bob Friesenhahn wrote:

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


It is good to keep in mind that only small writes go to the dedicated
slog. Large writes to to main store. A succession of that many small
writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
read back unless the system is improperly shut down.


I thought all sync writes, meaning everything NFS and iSCSI, went 
into the slog - IIRC the docs says so.


Check a month or two back in the archives for a post by Matt Ahrens. 
It seems that larger writes (32k?) are written directly to main 
store.  This is probably a change from the original zfs design.


Bob


If there's a slog then the data, regardless of size, gets written to the 
slog.


If there's no slog and if the data size is greater than 
zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) then the 
data is written as a block into the pool and the block pointer written into 
the log record. This is the WR_INDIRECT write type.

So Matt and Roy are both correct.

But wait, there's more complexity!:

If logbias=throughput is set we always use WR_INDIRECT.

If we just wrote more than 1MB for a single zil commit and there's more than 
2MB waiting, then we start using the main pool.

Clear as mud?  This is likely to change again...
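For reference, logbias is an ordinary per-dataset property; the dataset name 
below is just an example:

  # zfs get logbias tank/fs               # defaults to 'latency'
  # zfs set logbias=throughput tank/fs    # ZIL allocations go to the main pool (WR_INDIRECT)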

Neil.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] COMSTAR dropouts with dedup enabled

2010-06-14 Thread Brandon High
On Sun, Jun 13, 2010 at 6:58 PM, Matthew Anderson
matth...@ihostsolutions.com.au wrote:
 The problem didn’t seem to occur with only a small amount of data on the LUN
 (50GB) and happened more frequently as the LUN filled up. I’ve since moved
 all data to non-dedup LUN’s and I haven’t seen a dropout for over a month.

How much memory do you have, and how big is the DDT? You can get the
DDT size with 'zdb -DD'. The total count is the sum of the duplicate and
unique entries. Each entry uses ~250 bytes, so the count divided by 4 is
a (very rough) estimate of the memory size of the DDT in kilobytes.
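Spelled out with a made-up count: 4,000,000 total entries x ~250 bytes is about 
1,000,000 KB, i.e. roughly 1 GB of ARC just for the DDT -- the same answer as 
4,000,000 / 4 in kilobytes.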

The most likely case is that you don't have enough memory to hold the
entire dedup table in the ARC. When this happens, the host has to read
entries from the main pool, which is slow.

If you want to continue running with dedup, adding an L2ARC may help
since the DDT can be held in the faster cache. Disabling dedup for the
dataset will give you good write performance too.

Be aware that destroying snapshots from this dataset (or destroying
the dataset itself) is likely to create dropouts as well, since the
DDT needs to be scanned to see if a block can be dereferenced. Again,
adding L2ARC may help.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] COMSTAR dropouts with dedup enabled

2010-06-14 Thread Brandon High
On Mon, Jun 14, 2010 at 1:35 PM, Brandon High bh...@freaks.com wrote:
 How much memory do you have, and how big is the DDT? You can get the
 DDT size with 'zdb -DD'. The total count is the sum of duplicate and
 unique entries. Each entry uses ~ 250 bytes per entry, so the count
 divided by 4 is a (very rough) estimate of the memory size of the DDT
 in kilobytes.

One more thing: The default block size is 8k for zvols, which means
that the DDT will grow much faster than for filesystem datasets.
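The block size can only be chosen when the zvol is created, e.g. (names and 
sizes are illustrative):

  # zfs create -V 100G -o volblocksize=64K tank/vol1   # fewer, larger blocks => smaller DDT
  # zfs get volblocksize tank/vol1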

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Scrub issues

2010-06-14 Thread Roy Sigurd Karlsbakk
Hi all

It seems zfs scrub takes a big bite out of I/O when running. During a scrub, 
sync I/O, such as NFS and iSCSI, is mostly useless. Attaching an SLOG and some 
L2ARC helps, but still the problem remains that the scrub is given 
full priority.

Is this problem known to the developers? Will it be addressed?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub issues

2010-06-14 Thread Robert Milkowski

On 14/06/2010 22:12, Roy Sigurd Karlsbakk wrote:

Hi all

It seems zfs scrub is taking a big bit out of I/O when running. During a scrub, 
sync I/O, such as NFS and iSCSI is mostly useless. Attaching an SLOG and some 
L2ARC helps this, but still, the problem remains in that the scrub is given 
full priority.

Is this problem known to the developers? Will it be addressed?

   


http://sparcv9.blogspot.com/2010/06/slower-zfs-scrubsresilver-on-way.html
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub issues

2010-06-14 Thread Richard Elling
On Jun 14, 2010, at 2:12 PM, Roy Sigurd Karlsbakk wrote:
 Hi all
 
 It seems zfs scrub is taking a big bit out of I/O when running. During a 
 scrub, sync I/O, such as NFS and iSCSI is mostly useless. Attaching an SLOG 
 and some L2ARC helps this, but still, the problem remains in that the scrub 
 is given full priority.

Scrub always runs at the lowest priority. However, priority scheduling only
works before the I/Os enter the disk queue. If you are running Solaris 10 or
older releases with HDD JBODs, then the default zfs_vdev_max_pending 
is 35. This means that your slow disk will have 35 I/Os queued to it before
priority scheduling makes any difference.  Since it is a slow disk, that could
mean 250 to 1500 ms before the high priority I/O reaches the disk.

 Is this problem known to the developers? Will it be addressed?

In later OpenSolaris releases, the zfs_vdev_max_pending defaults to 10
which helps.  You can tune it lower as described in the Evil Tuning Guide.
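For example (the Evil Tuning Guide has the authoritative description; the value 
here is only an illustration):

  # echo zfs_vdev_max_pending/W0t4 | mdb -kw    # change on the running system

or persistently, in /etc/system:

  set zfs:zfs_vdev_max_pending = 4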

Also, as Robert pointed out, CR 6494473 offers a more resource management
friendly way to limit scrub traffic (b143).  Everyone can buy George a beer for
implementing this change :-)

Of course, this could mean that on a busy system a scrub that formerly took
a week might now take a month.  And the fix does not directly address the 
tuning of the queue depth issue with HDDs.  TANSTAAFL.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-14 Thread Peter Jeremy
On 2010-Jun-11 17:41:38 +0800, Joerg Schilling 
joerg.schill...@fokus.fraunhofer.de wrote:
PP.S.: Did you know that FreeBSD _includes_ the GPLd Reiserfs in the FreeBSD 
kernel since a while and that nobody did complain about this, see e.g.:

http://svn.freebsd.org/base/stable/8/sys/gnu/fs/reiserfs/

That is completely irrelevant and somewhat misleading.  FreeBSD has
never prohibited non-BSD-licensed code in its kernel or userland;
however, it has always been optional and, AFAIR, the GENERIC kernel has
always defaulted to containing only BSD code.  Non-BSD code (whether GPL
or CDDL) is carefully segregated (note the 'gnu' in the above URI).

-- 
Peter Jeremy


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Erik Trimble

On 6/14/2010 12:10 PM, Neil Perrin wrote:

On 06/14/10 12:29, Bob Friesenhahn wrote:

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


It is good to keep in mind that only small writes go to the dedicated
slog. Large writes to to main store. A succession of that many small
writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
read back unless the system is improperly shut down.


I thought all sync writes, meaning everything NFS and iSCSI, went 
into the slog - IIRC the docs says so.


Check a month or two back in the archives for a post by Matt Ahrens. 
It seems that larger writes (32k?) are written directly to main 
store.  This is probably a change from the original zfs design.


Bob


If there's a slog then the data, regardless of size, gets written to 
the slog.


If there's no slog and if the data size is greater than 
zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) then the 
data is written as a block into the pool and the block pointer written into 
the log record. This is the WR_INDIRECT write type.

So Matt and Roy are both correct.

But wait, there's more complexity!:

If logbias=throughput is set we always use WR_INDIRECT.

If we just wrote more than 1MB for a single zil commit and there's more than 
2MB waiting, then we start using the main pool.

Clear as mud?  This is likely to change again...

Neil.



How do I monitor the amount of live (i.e. non-committed) data in the 
slog?  I'd like to spend some time with my setup, seeing exactly how 
much I tend to use.


I'd suspect that very few use cases call for more than a couple (2-4) GB 
of slog...


I'm trying to get hard numbers as I'm working on building a 
DRAM/battery/flash slog device in one of my friend's electronics 
prototyping shops.  It would be really nice if I could solve 99% of the 
need with 1 or 2 2GB SODIMMs and the chips from a cheap 4GB USB thumb 
drive...


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub issues

2010-06-14 Thread George Wilson

Richard Elling wrote:

On Jun 14, 2010, at 2:12 PM, Roy Sigurd Karlsbakk wrote:

Hi all

It seems zfs scrub is taking a big bit out of I/O when running. During a scrub, 
sync I/O, such as NFS and iSCSI is mostly useless. Attaching an SLOG and some 
L2ARC helps this, but still, the problem remains in that the scrub is given 
full priority.


Scrub always runs at the lowest priority. However, priority scheduling only
works before the I/Os enter the disk queue. If you are running Solaris 10 or
older releases with HDD JBODs, then the default zfs_vdev_max_pending 
is 35. This means that your slow disk will have 35 I/Os queued to it before
priority scheduling makes any difference.  Since it is a slow disk, that could
mean 250 to 1500 ms before the high priority I/O reaches the disk.


Is this problem known to the developers? Will it be addressed?


In later OpenSolaris releases, the zfs_vdev_max_pending defaults to 10
which helps.  You can tune it lower as described in the Evil Tuning Guide.

Also, as Robert pointed out, CR 6494473 offers a more resource management
friendly way to limit scrub traffic (b143).  Everyone can buy George a beer for
implementing this change :-)



I'll gladly accept any beer donations, and others on the ZFS team are happy 
to help consume them. :-)


I look forward to hearing people's experience with the new changes.

- George


Of course, this could mean that on a busy system a scrub that formerly took
a week might now take a month.  And the fix does not directly address the 
tuning of the queue depth issue with HDDs.  TANSTAAFL.

 -- richard




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Richard Elling
On Jun 14, 2010, at 6:35 PM, Erik Trimble wrote:
 On 6/14/2010 12:10 PM, Neil Perrin wrote:
 On 06/14/10 12:29, Bob Friesenhahn wrote:
 On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:
 
 It is good to keep in mind that only small writes go to the dedicated
 slog. Large writes to to main store. A succession of that many small
 writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
 read back unless the system is improperly shut down.
 
 I thought all sync writes, meaning everything NFS and iSCSI, went into the 
 slog - IIRC the docs says so.
 
 Check a month or two back in the archives for a post by Matt Ahrens. It 
 seems that larger writes (32k?) are written directly to main store.  This 
 is probably a change from the original zfs design.
 
 Bob
 
 If there's a slog then the data, regardless of size, gets written to the 
 slog.
 
 If there's no slog and if the data size is greater than 
 zfs_immediate_write_sz/zvol_immediate_write_sz
 (both default to 32K) then the data is written as a block into the pool and 
 the block pointer
 written into the log record. This is the WR_INDIRECT write type.
 
 So Matt and Roy are both correct.
 
 But wait, there's more complexity!:
 
 If logbias=throughput is set we always use WR_INDIRECT.
 
 If we just wrote more than 1MB for a single zil commit and there's more than 
 2MB waiting
 then we start using the main pool.
 
 Clear as mud?  This is likely to change again...
 
 Neil.
 
 
 How do I monitor the amount of live (i.e. non-committed) data in the slog?  
 I'd like to spend some time with my setup, seeing exactly how much I tend to 
 use.

zilstat
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
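zilstat takes iostat-style interval/count arguments; something like the 
following shows how much data actually hits the ZIL each second (see the page 
above for the exact options):

  # ./zilstat.ksh 1 60    # one-second samples for a minute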

 I'd suspect that very few use cases call for more than a couple (2-4) GB of 
 slog...

I'd suspect few real cases need more than 1GB.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Neil Perrin

On 06/14/10 19:35, Erik Trimble wrote:

On 6/14/2010 12:10 PM, Neil Perrin wrote:

On 06/14/10 12:29, Bob Friesenhahn wrote:

On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:


It is good to keep in mind that only small writes go to the dedicated
slog. Large writes to to main store. A succession of that many small
writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
read back unless the system is improperly shut down.


I thought all sync writes, meaning everything NFS and iSCSI, went 
into the slog - IIRC the docs says so.


Check a month or two back in the archives for a post by Matt Ahrens. 
It seems that larger writes (32k?) are written directly to main 
store.  This is probably a change from the original zfs design.


Bob


If there's a slog then the data, regardless of size, gets written to 
the slog.


If there's no slog and if the data size is greater than 
zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) then the 
data is written as a block into the pool and the block pointer written into 
the log record. This is the WR_INDIRECT write type.

So Matt and Roy are both correct.

But wait, there's more complexity!:

If logbias=throughput is set we always use WR_INDIRECT.

If we just wrote more than 1MB for a single zil commit and there's more than 
2MB waiting, then we start using the main pool.

Clear as mud?  This is likely to change again...

Neil.



How do I monitor the amount of live (i.e. non-committed) data in the 
slog?  I'd like to spend some time with my setup, seeing exactly how 
much I tend to use.


I think monitoring the capacity while running zpool iostat -v pool 1 
should be fairly accurate.
A simple DTrace script can be written to determine how often the ZIL code 
fails to get a slog block and has to resort to allocating from the main pool.
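For example, the iostat form of that monitoring (pool name illustrative):

  # zpool iostat -v tank 1    # per-vdev stats every second; watch the 'alloc'
                              # column of the log device for outstanding slog data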

One recent change reduced the amount of data written and possibly the 
slog block fragmentation.

This is zpool version 23: Slim ZIL. So be sure to experiment with that.




I'd suspect that very few use cases call for more than a couple (2-4) 
GB of slog...


I agree this is typically true. Of course it depends on your workload. 
The amount of slog data will reflect the uncommitted synchronous txg data, 
and the size of each txg will depend on memory size.

This area is also undergoing tuning.


I'm trying to get hard numbers as I'm working on building a 
DRAM/battery/flash slog device in one of my friend's electronics 
prototyping shops.  It would be really nice if I could solve 99% of 
the need with 1 or 2 2GB SODIMMs and the chips from a cheap 4GB USB 
thumb drive...




Sounds like fun. Good luck.

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss