Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Adam Nowacki

On 2013-01-23 21:22, Wojciech Puchar wrote:

While RAID-Z is already a king of bad performance,


I don't believe RAID-Z is any worse than RAID5.  Do you have any actual
measurements to back up your claim?


It is clearly described even in the ZFS papers: on both reads and writes it
gives the random I/O performance of a single drive.


With ZFS and RAID-Z the situation is a bit more complex.

Let's assume a 5-disk raidz1 vdev with ashift=9 (512-byte sectors).

A worst-case scenario could happen if your random I/O workload were reading
random files of 2048 bytes each. Each file read would require data from 4
disks (the 5th holds parity and won't be read unless there are errors).
However, if files were 512 bytes or less, only one disk would be used;
1024 bytes - two disks, and so on.
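
A rough sketch of that fan-out, purely illustrative (assumes ashift=9 and the
5-disk raidz1 above; nothing here is measured):

  # How many data disks a single small-file read touches on a 5-disk raidz1
  # (4 data + 1 parity) with 512-byte sectors.
  for filesize in 512 1024 2048 4096; do
    sectors=$(( (filesize + 511) / 512 ))   # data sectors needed
    disks=$sectors
    if [ $disks -gt 4 ]; then disks=4; fi   # at most 4 data disks per row
    echo "$filesize-byte file -> $sectors data sectors -> read touches $disks disk(s)"
  done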


So ZFS is probably not the best choice to store millions of small files 
if random access to whole files is the primary concern.


But let's look at a different scenario - a PostgreSQL database. Here table
data is split and stored in 1GB files. ZFS splits each file into 128KiB
records (the recordsize property). Each record is then split again into 4
columns of 32768 bytes each, and a 5th column is generated containing parity.
Each column is then stored on a different disk. You could think of it as a
regular RAID-5 with a stripe size of 32768 bytes.


PostgreSQL uses 8192-byte pages that fit evenly into both the ZFS record size
and the column size. Each page access requires only a single disk read.
Random I/O performance here should be 5 times that of a single disk.
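
The arithmetic behind that, as a quick sketch:

  # 128 KiB record striped across the 4 data disks of a 5-disk raidz1:
  echo $(( 131072 / 4 ))     # 32768 bytes per disk ("column")
  echo $(( 32768 / 8192 ))   # 4 PostgreSQL pages per column, so one page = one disk read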


For me the reliability ZFS offers is far more important than pure 
performance.



Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Wojciech Puchar
then stored on a different disk. You could think of it as a regular RAID-5 
with stripe size of 32768 bytes.


PostgreSQL uses 8192 byte pages that fit evenly both into ZFS record size and 
column size. Each page access requires only a single disk read. Random i/o 
performance here should be 5 times that of a single disk.


Think about writing 8192-byte pages randomly, and then doing a linear scan
over the table.




For me the reliability ZFS offers is far more important than pure 
performance.

Except it is on paper reliability.


Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Zaphod Beeblebrox
Wow! OK. It sounds like you (or someone like you) can answer some of my
burning questions about ZFS.

On Thu, Jan 24, 2013 at 8:12 AM, Adam Nowacki nowa...@platinum.linux.pl wrote:


 Lets assume 5 disk raidz1 vdev with ashift=9 (512 byte sectors).

 A worst case scenario could happen if your random i/o workload was reading
 random files each of 2048 bytes. Each file read would require data from 4
 disks (5th is parity and won't be read unless there are errors). However if
 files were 512 bytes or less then only one disk would be used. 1024 bytes -
 two disks, etc.

 So ZFS is probably not the best choice to store millions of small files if
 random access to whole files is the primary concern.

 But lets look at a different scenario - a PostgreSQL database. Here table
 data is split and stored in 1GB files. ZFS splits the file into 128KiB
 records (recordsize property). This record is then again split into 4
 columns each 32768 bytes. 5th column is generated containing parity. Each
 column is then stored on a different disk. You could think of it as a
 regular RAID-5 with stripe size of 32768 bytes.


Ok... so my question then would be... what of the small files? If I write
several small files at once, does the transaction use a single record, or does
each file need its own record?  Additionally, if small files use
sub-records, when you delete such a file, does the sub-record get moved or
just wasted (until the record is completely free)?

I'm considering the difference, say, between cyrus imap (one file per
message on ZFS, database files on a different ZFS filesystem) and dbmail imap
(postgresql on ZFS).

... now I realize that PostgreSQL on ZFS has some special issues (but I
don't have a choice here between ZFS and non-ZFS ... ZFS has already been
chosen), but I'm also figuring that PostgreSQL on ZFS has some waste
compared to cyrus IMAP on ZFS.

So far in my research, Cyrus makes some compelling arguments that the
common use case for most IMAP database files is a full scan --- for which its
database files are optimized and SQL-based files are not.  I agree that
some operations can be more efficient in a good SQL database, but full scan
(as the most frequently used query) is not one of them.

Cyrus also makes sense to me as a collection of small files ... for which I
expect ZFS to excel... including the ability to snapshot with impunity...
but I am terribly curious how the files are handled in transactions.

I'm actually (right now) running some filesize statistics (and I'll get
back to the list, if asked), but I'd like to know how ZFS is going to store
the arriving mail... :).


Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Wojciech Puchar

several small files at once, does the transaction use a record, or does
each file need to use a record?  Additionally, if small files use
sub-records, when you delete that file, does the sub-record get moved or
just wasted (until the record is completely free)?


writes of small files are always good with ZFS.



Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Adam Nowacki

On 2013-01-24 15:24, Wojciech Puchar wrote:

For me the reliability ZFS offers is far more important than pure
performance.

Except it is on paper reliability.


This "on paper" reliability in practice saved a 20TB pool. See one of my 
previous emails. Any other filesystem, or a hardware/software RAID without 
per-disk checksums, would have failed. Silent corruption of unimportant 
files would be the best case; complete filesystem death from corruption of 
important metadata the worst.
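
Catching that kind of damage early is what the scrubbing in the subject line
is for; a minimal example, with 'tank' as a placeholder pool name:

  zpool scrub tank          # read and verify every block against its checksum
  zpool status -v tank      # per-device READ/WRITE/CKSUM counters and any damaged files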


I've been using ZFS for 3 years in many systems. The biggest one has 44 
disks and 4 ZFS pools - this one survived SAS expander disconnects, a 
few kernel panics and countless power failures (the UPS only holds for a few 
hours).


So far I've not lost a single ZFS pool or any data stored.



Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Adam Nowacki

On 2013-01-24 15:45, Zaphod Beeblebrox wrote:

Ok... so my question then would be... what of the small files.  If I write
several small files at once, does the transaction use a record, or does
each file need to use a record?  Additionally, if small files use
sub-records, when you delete that file, does the sub-record get moved or
just wasted (until the record is completely free)?


Each file is a fully self-contained object (together with full parity) 
all the way down to physical storage. A 1-byte file on a RAID-Z2 pool will 
always use 3 disks - 3 sectors in total - for its data alone; metadata is 
extra. You can use du to verify: it reports physical size together with 
parity. Metadata such as directory entries or file attributes is stored 
separately and shared with other files. For small files there may be a lot 
of wasted space.
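
For example, something along these lines (the dataset path is hypothetical):

  printf 'x' > /tank/test/one-byte   # 1-byte file on a raidz2 dataset
  sync                               # give the transaction group a chance to hit disk
  du -Ah /tank/test/one-byte         # apparent size: 1 byte
  du -h  /tank/test/one-byte         # allocated size: a few sectors, parity included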




Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Zaphod Beeblebrox
Ok... here's the existing data:

There are 3,236,316 files summing to 97,500,008,691 bytes.  That puts the
average file at 30,127 bytes.  But for the full breakdown:

512 : 7758
1024 : 139046
2048 : 1468904
4096 : 325375
8192 : 492399
16384 : 324728
32768 : 263210
65536 : 102407
131072 : 43046
262144 : 22259
524288 : 17136
1048576 : 13788
2097152 : 8279
4194304 : 4501
8388608 : 2317
16777216 : 1045
33554432 : 119
67108864 : 2

I produced that list with the output of ls -R's byte counts, sorted and
then processed with:

(size=512; count=0; while read num; do count=$[count+1]; if [ $num -gt $size ]; then echo
$size : $count; size=$[size*2]; count=0; fi; done) < imapfilesizelist
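
For what it's worth, the sorted size list can also be generated without
parsing ls -R output; a sketch, with a hypothetical spool path:

  # Per-file sizes in bytes, sorted numerically (FreeBSD stat):
  find /var/spool/cyrus -type f -exec stat -f %z {} + | sort -n > imapfilesizelist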

... now the new machine has two 2T disks in a ZFS mirror --- so I suppose
it won't waste as much space as a RAID-Z ZFS --- in that files less than
512 bytes will take 512 bytes?  By far the most common case is 2048 bytes
... so that would indicate that a RAID-Z larger than 5 disks would waste
much space.

Does that go to your recommendations on vdev size, then?   To have an 8 or 9
disk vdev, you should be storing files of at least 4k?
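
My rough back-of-the-envelope for the waste, assuming I understand the
allocator correctly (one parity sector per row, and allocations padded up to
a multiple of nparity+1) -- corrections welcome:

  # Approximate raidz1 allocation for a small file, ashift=9 (512-byte sectors).
  filesize=2048
  data=$(( (filesize + 511) / 512 ))     # 4 data sectors
  parity=1                               # one row on a 5-disk vdev -> 1 parity sector
  alloc=$(( data + parity ))
  pad=$(( alloc % 2 ))                   # pad to a multiple of nparity+1 = 2
  if [ $pad -ne 0 ]; then alloc=$(( alloc + 2 - pad )); fi
  echo "$filesize bytes of data -> $alloc sectors ($(( alloc * 512 )) bytes) on disk"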


Re: Why DTrace sensor is listed but not called?

2013-01-24 Thread Yuri

On 01/22/2013 16:03, Ryan Stone wrote:

Offhand, I can't think of why this isn't working.  However there is already a way
to add new DTrace probes to the kernel, and it's quite simple, so you could
try it:


Thank you for this information, this works.
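
A rough way to check that a probe added this way is both registered and
firing; the probe name below is just a placeholder:

  dtrace -l -n 'sdt:::' | grep my_probe     # is the probe registered at all?
  dtrace -n 'sdt:::my_probe { printf("fired in %s\n", execname); }'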

As for my previous approach, there is a bug in gcc where static empty 
functions with the 'noinline' attribute get eliminated by the optimizer:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56099

Yuri


Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Wojciech Puchar

So far I've not lost a single ZFS pool or any data stored.

So far my house hasn't been robbed.


Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Wojciech Puchar

There are 3,236,316 files summing to 97,500,008,691 bytes.  That puts the
average file at 30,127 bytes.  But for the full breakdown:


Quite low. What do you store?

Here is my real-world production example of user mail as well as 
documents:



/dev/mirror/home1.eli   2788  1545  1243   55%  1941057  20981181   8%  /home
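
(That line is the output of roughly the following, with sizes in 1G blocks
plus inode counts; the exact flags are a guess:)

  df -g -i /home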




Re: NMI watchdog functionality on Freebsd

2013-01-24 Thread John Baldwin
On Wednesday, January 23, 2013 11:57:33 am Ian Lepore wrote:
 On Wed, 2013-01-23 at 08:47 -0800, Matthew Jacob wrote:
  On 1/23/2013 7:25 AM, John Baldwin wrote:
   On Tuesday, January 22, 2013 5:40:55 pm Sushanth Rai wrote:
   Hi,

   Does FreeBSD have some functionality similar to Linux's NMI watchdog? I'm
   aware of the ichwd driver, but that depends on a WDT being available in the
   hardware. Even when it is available, the BIOS needs to support a mechanism to
   trigger an OS-level recovery to get any useful information when the system is
   really wedged (with interrupts disabled).
  The principal purpose of a watchdog is to keep the system from hanging. 
  Information is secondary. The ichwd driver can use the LPC part of ICH 
  hardware that's been there since ICH version 4. I implemented this more 
  fully at Panasas. The first priority is to keep the system from being 
  hung. The next piece of information is to detect, on reboot, that a 
  watchdog event occurred. Finally, trying to isolate why is good.
  
  This is equivalent to the tco_WDT stuff on Linux. It's not interrupt 
  driven (it drives the reset line on the processor).
  
 
 I think there's value in the NMI watchdog idea, but unless you back it
 up with a real hardware watchdog you don't really have full watchdog
 functionality.  If the NMI can get the OS to produce some extra info,
 that's great, and using an NMI gives you a good chance of doing that
 even if it is normal interrupt processing that has wedged the machine.
 But calling panic() invokes plenty of processing that can get wedged in
 other ways, so even an NMI-based watchdog isn't guaranteed to get the
 machine running again.
 
 But adding a real hardware watchdog that fires on a slightly longer
 timeout than the NMI watchdog gives you the best of everything: you get
 information if it's possible to produce it, and you get a real hardware
 reset shortly thereafter if producing the info fails.

The IPMI watchdog facility has support for a pre-interrupt that fires before 
the real watchdog.  I have coded up support for it in a branch but haven't 
found any hardware that supports it that I could use to test it.  However, 
you could use an NMI pre-timer via the local APIC timer as a generic pre-timer 
for other hardware watchdogs.
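
In the meantime the plain hardware watchdog path already works; a minimal
sketch of enabling it, where ichwd(4) and the 30-second timeout are only
examples for boards where that driver attaches:

  kldload ichwd
  echo 'watchdogd_enable="YES"'  >> /etc/rc.conf
  echo 'watchdogd_flags="-t 30"' >> /etc/rc.conf   # hardware reset if userland wedges for ~30s
  service watchdogd start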

-- 
John Baldwin


Re: libprocstat(3): retrieve process command line args and environment

2013-01-24 Thread John Baldwin
On Wednesday, January 23, 2013 4:49:50 pm Mikolaj Golub wrote:
 On Wed, Jan 23, 2013 at 11:31:43AM -0500, John Baldwin wrote:
  On Wednesday, January 23, 2013 2:25:00 am Mikolaj Golub wrote:
   IMHO, after adding procstat_getargv and procstat_getenvv, the usage of
   kvm_getargv() and kvm_getenvv() (at least in new code) may be
   deprecated. As stated in the BUGS section of the man page, these
   routines do not belong in the kvm interface. I suppose they are part
   of libkvm because there was no better place for them. procstat(1)
   prefers direct sysctl over them (so, again, code duplication, which I am
   going to remove by adding procstat_getargv/envv).
  
  Hmm, are you going to rewrite ps(1) to use libprocstat?  Or rather, is that a
  goal someday?  That is one current consumer of kvm_getargv/envv.  That might
  be fine if we want to make more tools use libprocstat instead of using libkvm
  directly.
 
 I didn't have any plans for ps(1) :-) That is why I wrote about new
 code. But if you think it is good to do I might look at it one day...

I'm mostly hoping Robert chimes in to see if that was his intention for
libprocstat. :)  If we can ultimately replace all uses of kvm_get*v() with
calls to procstat_get*v*() then I'm fine with some code duplication in the
interim.
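
For reference, the sysctl-backed path is what procstat(1) already exposes
from userland:

  procstat -c $$    # command-line arguments of a process (here, the current shell)
  procstat -e $$    # its environment variables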

-- 
John Baldwin


Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Zaphod Beeblebrox
On Thu, Jan 24, 2013 at 2:26 PM, Wojciech Puchar 
woj...@wojtek.tensor.gdynia.pl wrote:

 There are 3,236,316 files summing to 97,500,008,691 bytes.  That puts the
 average file at 30,127 bytes.  But for the full breakdown:


 quite low. what do you store.


Apparently you're not really following this thread... just trolling?  I had
said that it was cyrus IMAP data (which, for reference, is one file per
email message).


 here is my real world production example of users mail as well as
 documents.


 /dev/mirror/home1.eli   2788  1545  1243   55%  1941057  20981181   8%  /home


Not the same data, I imagine.  I was dealing with the actual byte counts
... that figure is going to be in whole blocks.


Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Nikolay Denev

On Jan 24, 2013, at 4:24 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl 
wrote:
 
 Except it is on paper reliability.

This "on paper" reliability has saved my ass numerous times.
For example, I had one home NAS server with a flaky SATA controller that
would fail to detect one of the four drives from time to time on reboot.
This degraded my pool several times, and even rebooting from a state where,
say, disk4 had failed into one where disk3 was failed did not corrupt any data.
I don't think this is possible with any other open source FS, let alone a
hardware RAID that would drop the whole array because of this.
I have never personally lost any data on ZFS. Yes, performance is another
topic, and you must know what you are doing and what your usage pattern is,
but from a reliability standpoint ZFS looks more durable to me than anything
else.

P.S.: My home NAS has been running FreeBSD-CURRENT with ZFS since the first
version available. Several drives died, and twice the pool was expanded by
replacing all drives one by one and resilvering; not a single byte was lost.
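
The drive-by-drive expansion is just the usual dance, roughly (pool and
device names are placeholders):

  zpool set autoexpand=on tank
  zpool replace tank ada1 ada5      # old disk, new larger disk
  zpool status tank                 # wait for the resilver to complete before the next swap
  # repeat for each remaining disk; the pool grows once every member is larger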

