Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-01-23 21:22, Wojciech Puchar wrote:
>>> While RAID-Z is already a king of bad performance,
>> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
>> measurements to back up your claim?
> it is clearly described even in ZFS papers. Both on reads and writes it
> gives single drive random I/O performance.

With ZFS and RAID-Z the situation is a bit more complex.

Let's assume a 5-disk raidz1 vdev with ashift=9 (512-byte sectors). A worst case scenario could happen if your random I/O workload was reading random files of 2048 bytes each. Each file read would require data from 4 disks (the 5th is parity and won't be read unless there are errors). However, if files were 512 bytes or less then only one disk would be used; 1024 bytes - two disks, etc. So ZFS is probably not the best choice to store millions of small files if random access to whole files is the primary concern.

But let's look at a different scenario - a PostgreSQL database. Here table data is split and stored in 1GB files. ZFS splits each file into 128KiB records (the recordsize property). Each record is then again split into 4 columns of 32768 bytes each; a 5th column containing parity is generated. Each column is then stored on a different disk. You could think of it as a regular RAID-5 with a stripe size of 32768 bytes. PostgreSQL uses 8192-byte pages that fit evenly into both the ZFS record size and the column size. Each page access requires only a single disk read. Random I/O performance here should be 5 times that of a single disk.

For me the reliability ZFS offers is far more important than pure performance.
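
To make the arithmetic above concrete, here is a rough back-of-the-envelope model of the layout described (an illustration only, not ZFS source code; it assumes ashift=9 and ignores compression, metadata and allocation padding):

/*
 * Model of the 5-disk raidz1 case discussed above: how many data disks a
 * read touches, and how large each data column is, for a given block size.
 */
#include <stdio.h>

#define NDISKS  5
#define NPARITY 1
#define NDATA   (NDISKS - NPARITY)
#define SECTOR  512                     /* ashift=9 */

/* Data disks touched when reading a block of the given size. */
static int
data_disks_touched(size_t bytes)
{
        size_t sectors = (bytes + SECTOR - 1) / SECTOR;

        return (sectors < NDATA ? (int)sectors : NDATA);
}

int
main(void)
{
        size_t sizes[] = { 512, 1024, 2048, 131072 };

        for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                printf("%6zu-byte block -> %d data disk(s), %zu bytes per column\n",
                    sizes[i], data_disks_touched(sizes[i]),
                    sizes[i] / (size_t)data_disks_touched(sizes[i]));
        return (0);
}

For the 128KiB record this prints 4 data disks and 32768 bytes per column, matching the RAID-5-with-32K-stripes picture above; for a 2048-byte block it prints 4 disks, the worst case for small random reads.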
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> then stored on a different disk. You could think of it as a regular RAID-5
> with a stripe size of 32768 bytes. PostgreSQL uses 8192-byte pages that fit
> evenly into both the ZFS record size and the column size. Each page access
> requires only a single disk read. Random I/O performance here should be 5
> times that of a single disk.

Think about writing 8192-byte pages randomly, and then doing a linear search over the table.

> For me the reliability ZFS offers is far more important than pure
> performance.

Except it is on-paper reliability.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
Wow! OK. It sounds like you (or someone like you) can answer some of my burning questions about ZFS.

On Thu, Jan 24, 2013 at 8:12 AM, Adam Nowacki nowa...@platinum.linux.pl wrote:
> Let's assume a 5-disk raidz1 vdev with ashift=9 (512-byte sectors). A worst
> case scenario could happen if your random I/O workload was reading random
> files of 2048 bytes each. Each file read would require data from 4 disks
> (the 5th is parity and won't be read unless there are errors). However, if
> files were 512 bytes or less then only one disk would be used; 1024 bytes -
> two disks, etc. So ZFS is probably not the best choice to store millions of
> small files if random access to whole files is the primary concern.
>
> But let's look at a different scenario - a PostgreSQL database. Here table
> data is split and stored in 1GB files. ZFS splits each file into 128KiB
> records (the recordsize property). Each record is then again split into 4
> columns of 32768 bytes each; a 5th column containing parity is generated.
> Each column is then stored on a different disk. You could think of it as a
> regular RAID-5 with a stripe size of 32768 bytes.

Ok... so my question then would be... what of the small files? If I write several small files at once, does the transaction use a record, or does each file need to use a record? Additionally, if small files use sub-records, when you delete that file, does the sub-record get moved or just wasted (until the record is completely free)?

I'm considering the difference, say, between Cyrus IMAP (one file per message on ZFS, database files on a different ZFS filesystem) and DBMail IMAP (PostgreSQL on ZFS). ... now I realize that PostgreSQL on ZFS has some special issues (but I don't have a choice here between ZFS and non-ZFS ... ZFS has already been chosen), but I'm also figuring that PostgreSQL on ZFS has some waste compared to Cyrus IMAP on ZFS.

So far in my research, Cyrus makes some compelling arguments that the common use case for most IMAP database files is a full scan --- for which its database files are optimized and SQL-based files are not. I agree that some operations can be more efficient in a good SQL database, but full scan (as the most often used query) is not. Cyrus also makes sense to me as a collection of small files ... for which I expect ZFS to excel... including the ability to snapshot with impunity... but I am terribly curious how the files are handled in transactions.

I'm actually (right now) running some filesize statistics (and I'll get back to the list, if asked), but I'd like to know how ZFS is going to store the arriving mail... :).
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> several small files at once, does the transaction use a record, or does
> each file need to use a record? Additionally, if small files use
> sub-records, when you delete that file, does the sub-record get moved or
> just wasted (until the record is completely free)?

Writes of small files are always good with ZFS.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-01-24 15:24, Wojciech Puchar wrote:
>> For me the reliability ZFS offers is far more important than pure
>> performance.
> Except it is on paper reliability.

This on paper reliability in practice saved a 20TB pool. See one of my previous emails. Any other filesystem or hardware/software raid without per-disk checksums would have failed. Silent corruption of non-important files would be the best case, complete filesystem death by important metadata corruption as the worst case.

I've been using ZFS for 3 years in many systems. Biggest one has 44 disks and 4 ZFS pools - this one survived SAS expander disconnects, a few kernel panics and countless power failures (UPS only holds for a few hours). So far I've not lost a single ZFS pool or any data stored.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-01-24 15:45, Zaphod Beeblebrox wrote:
> Ok... so my question then would be... what of the small files? If I write
> several small files at once, does the transaction use a record, or does
> each file need to use a record? Additionally, if small files use
> sub-records, when you delete that file, does the sub-record get moved or
> just wasted (until the record is completely free)?

Each file is a fully self-contained object (together with full parity) all the way down to the physical storage. A 1-byte file on a RAID-Z2 pool will always use 3 disks - 3 sectors in total - for the data alone. You can use du to verify: it reports the physical size including parity. Metadata like the directory entry or file attributes is stored separately and shared with other files. For small files there may be a lot of wasted space.
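
As a worked example of that 1-byte case (my own sketch of the arithmetic, not ZFS code; it ignores metadata and any further allocation rounding):

/*
 * 1-byte file on raidz2 with ashift=9: the data rounds up to one 512-byte
 * sector, plus one parity sector per parity level, so du reports roughly
 * 3 sectors (1536 bytes) of physical space.
 */
#include <stdio.h>

#define SECTOR  512     /* ashift=9 */
#define NPARITY 2       /* raidz2 */

int
main(void)
{
        size_t filesize = 1;
        size_t data = (filesize + SECTOR - 1) / SECTOR;         /* 1 sector  */
        size_t total = data + NPARITY;                          /* 3 sectors */

        printf("%zu byte(s) -> %zu sectors = %zu bytes on disk\n",
            filesize, total, total * SECTOR);
        return (0);
}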
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
Ok... here's the existing data: There are 3,236,316 files summing to 97,500,008,691 bytes. That puts the average file at 30,127 bytes. But for the full breakdown:

512 : 7758
1024 : 139046
2048 : 1468904
4096 : 325375
8192 : 492399
16384 : 324728
32768 : 263210
65536 : 102407
131072 : 43046
262144 : 22259
524288 : 17136
1048576 : 13788
2097152 : 8279
4194304 : 4501
8388608 : 2317
16777216 : 1045
33554432 : 119
67108864 : 2

I produced that list from the output of ls -R's byte counts, sorted and then processed with:

(size=512; count=0; while read num; do count=$[count+1]; if [ $num -gt $size ]; then echo $size : $count; size=$[size*2]; count=0; fi; done) < imapfilesizelist

... now the new machine has two 2T disks in a ZFS mirror --- so I suppose it won't waste as much space as a RAID-Z ZFS --- in that files of less than 512 bytes will take 512 bytes?

By far the most common case is 2048 bytes ... so that would indicate that a RAID-Z larger than 5 disks would waste much space. Does that go to your recommendations on vdev size, then? To have an 8- or 9-disk vdev, you should be storing files of at least 4k?
Re: Why DTrace sensor is listed but not called?
On 01/22/2013 16:03, Ryan Stone wrote:
> Offhand, I can't think of why this isn't working. However, there is
> already a way to add new DTrace probes to the kernel, and it's quite
> simple, so you could try it:

Thank you for this information - this works.

As for my previous approach, there is a bug in gcc whereby static empty functions with the 'noinline' attribute get eliminated by the optimizer: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56099

Yuri
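
For readers hitting the same problem, the pattern in question looks roughly like the sketch below (my own illustration, not Yuri's actual code). One possible workaround - an assumption on my part, not something suggested in this thread - is a compiler barrier in the body, so the optimizer can no longer prove the call is free of side effects:

/*
 * An empty static function intended as a probe/hook point.  Despite the
 * noinline attribute, gcc may decide calls to it have no side effects and
 * discard them, so the probe never fires (see the bug report above).
 */
static void __attribute__((noinline))
trace_hook_example(int arg)
{
        /*
         * Hypothetical workaround: an empty volatile asm that consumes the
         * argument and clobbers memory keeps the call (and its argument)
         * from being optimized away.
         */
        __asm__ __volatile__("" : : "r" (arg) : "memory");
}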
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> So far I've not lost a single ZFS pool or any data stored.

so far my house wasn't robbed.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> There are 3,236,316 files summing to 97,500,008,691 bytes. That puts the
> average file at 30,127 bytes. But for the full breakdown:

Quite low - what do you store? Here is my real-world production example of users' mail as well as documents:

/dev/mirror/home1.eli 2788 1545 124355% 1941057 209811818% /home
Re: NMI watchdog functionality on Freebsd
On Wednesday, January 23, 2013 11:57:33 am Ian Lepore wrote:
> On Wed, 2013-01-23 at 08:47 -0800, Matthew Jacob wrote:
>> On 1/23/2013 7:25 AM, John Baldwin wrote:
>>> On Tuesday, January 22, 2013 5:40:55 pm Sushanth Rai wrote:
>>>> Hi, does FreeBSD have some functionality similar to Linux's NMI
>>>> watchdog? I'm aware of the ichwd driver, but that depends on a WDT
>>>> being available in the hardware. Even when it is available, the BIOS
>>>> needs to support a mechanism to trigger an OS-level recovery to get
>>>> any useful information when the system is really wedged (with
>>>> interrupts disabled).
>>
>> The principal purpose of a watchdog is to keep the system from hanging;
>> information is secondary. The ichwd driver can use the LPC part of ICH
>> hardware that's been there since ICH version 4. I implemented this more
>> fully at Panasas. The first priority is to keep the system from being
>> hung. The next is to detect, on reboot, that a watchdog event occurred.
>> Finally, trying to isolate why is good. This is equivalent to the
>> tco_WDT stuff on Linux. It's not interrupt driven (it drives the reset
>> line on the processor).
>
> I think there's value in the NMI watchdog idea, but unless you back it up
> with a real hardware watchdog you don't really have full watchdog
> functionality. If the NMI can get the OS to produce some extra info,
> that's great, and using an NMI gives you a good chance of doing that even
> if it is normal interrupt processing that has wedged the machine. But
> calling panic() invokes plenty of processing that can get wedged in other
> ways, so even an NMI-based watchdog isn't guaranteed to get the machine
> running again. Adding a real hardware watchdog that fires on a slightly
> longer timeout than the NMI watchdog gives you the best of everything:
> you get information if it's possible to produce it, and you get a real
> hardware reset shortly thereafter if producing the info fails.

The IPMI watchdog facility has support for a pre-interrupt that fires before the real watchdog. I have coded up support for it in a branch but haven't found any hardware that supports it that I could use for testing. However, you could use an NMI pre-timer via the local APIC timer as a generic pre-timer for other hardware watchdogs.

-- 
John Baldwin
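
For concreteness, here is a minimal sketch of how a userland process arms and pats FreeBSD's hardware watchdog through the watchdog(4) interface mentioned above (my own illustration, assuming a driver such as ichwd(4) is attached; watchdogd(8) is the real consumer of this API):

/*
 * Arm the hardware watchdog with a nominal 16-second timeout (2^34 ns)
 * and pat it every 5 seconds.  If this process - or the machine - wedges
 * and the pats stop, the hardware resets the box.
 */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/watchdog.h>

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
        u_int timeout = WD_ACTIVE | WD_TO_16SEC;
        int fd;

        if ((fd = open("/dev/" _PATH_WATCHDOG, O_RDWR)) == -1)
                err(1, "open /dev/%s", _PATH_WATCHDOG);
        for (;;) {
                if (ioctl(fd, WDIOCPATPAT, &timeout) == -1)
                        err(1, "WDIOCPATPAT");
                sleep(5);
        }
}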
Re: libprocstat(3): retrieve process command line args and environment
On Wednesday, January 23, 2013 4:49:50 pm Mikolaj Golub wrote:
> On Wed, Jan 23, 2013 at 11:31:43AM -0500, John Baldwin wrote:
>> On Wednesday, January 23, 2013 2:25:00 am Mikolaj Golub wrote:
>>> IMHO, after adding procstat_getargv and procstat_getenvv, the usage of
>>> kvm_getargv() and kvm_getenvv() (at least in the new code) may be
>>> deprecated. As is stated in the man page's BUGS section, these routines
>>> do not belong in the kvm interface. I suppose they are part of libkvm
>>> because there was no better place for them. procstat(1) prefers direct
>>> sysctl to them (so, again, code duplication, which I am going to remove
>>> by adding procstat_getargv/envv).
>>
>> Hmm, are you going to rewrite ps(1) to use libprocstat? Or rather, is
>> that a goal someday? That is one current consumer of kvm_getargv/envv.
>> That might be fine if we want to make more tools use libprocstat instead
>> of using libkvm directly.
>
> I didn't have any plans for ps(1) :-) That is why I wrote about new code.
> But if you think it is good to do I might look at it one day...

I'm mostly hoping Robert chimes in to see if that was his intention for libprocstat. :) If we can ultimately replace all uses of kvm_get*v() with calls to procstat_get*v*() then I'm fine with some code duplication in the interim.

-- 
John Baldwin
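
For readers unfamiliar with the interface being discussed, a minimal sketch of fetching a process's argument vector through libprocstat(3) instead of kvm_getargv(3) might look like this (my own example, not part of the patch under review; build with something like "cc pargs.c -lprocstat"):

/*
 * Print the argument vector of the process given by pid, using
 * procstat_getargv() via the sysctl backend.
 */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <sys/user.h>

#include <err.h>
#include <libprocstat.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
        struct procstat *ps;
        struct kinfo_proc *kp;
        unsigned int cnt;
        char **args;

        if (argc != 2)
                errx(1, "usage: pargs pid");
        if ((ps = procstat_open_sysctl()) == NULL)
                errx(1, "procstat_open_sysctl");
        kp = procstat_getprocs(ps, KERN_PROC_PID, atoi(argv[1]), &cnt);
        if (kp == NULL || cnt != 1)
                errx(1, "procstat_getprocs");
        if ((args = procstat_getargv(ps, kp, 0)) != NULL)
                for (int i = 0; args[i] != NULL; i++)
                        printf("%s\n", args[i]);
        procstat_freeargv(ps);
        procstat_freeprocs(ps, kp);
        procstat_close(ps);
        return (0);
}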
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Thu, Jan 24, 2013 at 2:26 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote:
>> There are 3,236,316 files summing to 97,500,008,691 bytes. That puts the
>> average file at 30,127 bytes. But for the full breakdown:
>
> quite low. what do you store.

Apparently you're not really following this thread... just trolling? I had said that it was cyrus IMAP data (which, for reference, is one file per email message).

> here is my real world production example of users mail as well as
> documents.
>
> /dev/mirror/home1.eli 2788 1545 124355% 1941057 209811818% /home

Not the same data, I imagine. I was dealing with the actual byte counts ... that figure is going to be in whole blocks.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Jan 24, 2013, at 4:24 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote:
> Except it is on paper reliability.

This "on paper" reliability has saved my ass numerous times. For example, I had one home NAS server with a flaky SATA controller that from time to time would not detect one of the four drives on reboot. This made my pool degraded several times, and even rebooting from a state where, say, disk4 had failed into one where disk3 was failed did not corrupt any data. I don't think this is possible with any other open-source FS, let alone hardware RAID, which would drop the whole array because of this.

I have never ever personally lost any data on ZFS. Yes, performance is another topic, and you must know what you are doing and what your usage pattern is, but from a reliability standpoint ZFS looks more durable to me than anything else.

P.S.: My home NAS is running FreeBSD-CURRENT with ZFS from the first version available. Several drives died, and twice the pool was expanded by replacing all drives one by one and resilvering - not a single byte lost.