Apologies for the garbled table and the incorrect units given for the speed.
Hopefully this will get through.

Sample results (reposted):

ALGORITHM            FILE NAME    FILE SIZE  COMPRESSED SIZE  BLOCK SIZE  MB/s    DIFF
HAX_lzjb_decompress  enwik8       100000000  68721036         1048576     443.8   133.71%
ZFS_lzjb_decompress  enwik8       100000000  78636337         1024        331.9
HAX_lzjb_decompress  silesia.zip  68182744   76529235         1024        2635    579.50%
ZFS_lzjb_decompress  silesia.zip  68182744   76486571         4194304     454.7
HAX_lzjb_decompress  mozilla      51220480   29853404         4096        616.9   150.68%
ZFS_lzjb_decompress  mozilla      51220480   28868591         4194304     409.4
HAX_lzjb_decompress  webster      41458703   26566596         4096        466.6   138.37%
ZFS_lzjb_decompress  webster      41458703   30135465         1024        337.2
HAX_lzjb_decompress  enwik8.zip   36445475   40985240         1048576     2792.3  614.64%
ZFS_lzjb_decompress  enwik8.zip   36445475   40985489         65536      454.3
HAX_lzjb_decompress  nci          33553445   11088497         1024        736.7   120.91%
BSD_lzjb_decompress  nci          33553445   8714892          4194304     609.3

Richard is correct about MMX/SSE. GCC is using movdqu; it is the only extended instruction I see in an objdump. I am putting together a version which hopefully corrects that, so that "in-kernel" performance can properly be measured (-O2 and no MMX/SSE).

Steven

On Fri, Oct 25, 2013 at 3:21 AM, Richard Yao <[email protected]> wrote:
> Each context switch from ring 3 to ring 0 must copy the registers to the
> stack, and they must be read off the stack when returning. Most (all?)
> production kernels avoid the use of x87, MMX and SSE registers to save
> time on the copies.
>
> On 10/24/2013 02:39 PM, WebDawg wrote:
> > Why are the MMX, SSE/SSE2 instructions not permissible?
> >
> > On Thu, Oct 24, 2013 at 8:25 AM, Richard Yao <[email protected]> wrote:
> >
> >> GCC can generate code using MMX, SSE and/or SSE2 instructions on x86_64.
> >> That could explain the discrepancy between the benchmark results for the
> >> original ZFS lzjb implementation and the BSD version, and the claims by
> >> the FreeBSD developers.
> >> To my knowledge, those instructions are not permissible on any current
> >> Open ZFS platform. That could mean that the real in-kernel performance
> >> of Strontium/Justin's version is lower than the benchmark numbers
> >> suggest.
> >>
> >> I tried invoking `make all CFLAGS='-mno-mmx -mno-sse -mno-sse2'` to see
> >> what the difference is, but doing that triggered the following GCC bug:
> >>
> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55185
> >>
> >> My curiosity has led me to start a GCC 4.4.7 build, under the assumption
> >> that this is a recent regression in GCC, so that I can try again with
> >> that.
> >>
> >> With that said, I like the approach Steven took to benchmarking various
> >> implementations and I think it is a step in the right direction. It
> >> should become ideal when it can be done with MMX, SSE and SSE2
> >> instructions disabled.
> >>
> >> On 10/24/2013 11:15 AM, Strontium wrote:
> >>> Hi all,
> >>>
> >>> After a conversation on IRC with Ryao about lzjb performance and the
> >>> proposed BSD version of the LZJB decompressor, I decided to modify the
> >>> lz4 benchmark code and wedge in lzjb from ZFS to compare them.
> >>>
> >>> I have published the code and the results here:
> >>> https://github.com/stevenj/lzjbbench
> >>>
> >>> In the process I hacked up an experimental lzjb decompression
> >>> implementation. It is not based on the existing code; it is a
> >>> from-scratch decoding of the bit stream.
> >>>
> >>> In the results my decoder is identified as "HAX_lzjb_decompress".
> >>>
> >>> Sample results:
> >>>
> >>> ALGORITHM            FILE NAME    FILE SIZE  COMPRESSED SIZE  BLOCK SIZE  MB/s    DIFF
> >>> HAX_lzjb_decompress  enwik8       100000000  68721036         1048576     443.8   133.71%
> >>> ZFS_lzjb_decompress  enwik8       100000000  78636337         1024        331.9
> >>> HAX_lzjb_decompress  silesia.zip  68182744   76529235         1024        2635    579.50%
> >>> ZFS_lzjb_decompress  silesia.zip  68182744   76486571         4194304     454.7
> >>> HAX_lzjb_decompress  mozilla      51220480   29853404         4096        616.9   150.68%
> >>> ZFS_lzjb_decompress  mozilla      51220480   28868591         4194304     409.4
> >>> HAX_lzjb_decompress  webster      41458703   26566596         4096        466.6   138.37%
> >>> ZFS_lzjb_decompress  webster      41458703   30135465         1024        337.2
> >>> HAX_lzjb_decompress  enwik8.zip   36445475   40985240         1048576     2792.3  614.64%
> >>> ZFS_lzjb_decompress  enwik8.zip   36445475   40985489         65536      454.3
> >>> HAX_lzjb_decompress  nci          33553445   11088497         1024        736.7   120.91%
> >>> BSD_lzjb_decompress  nci          33553445   8714892          4194304     609.3
> >>>
> >>> Each of these is my algorithm's WORST result vs the alternative's BEST.
> >>> This is built with -O3, run on an AMD FX 8150, and is pure C.
> >>>
> >>> My github has the full spreadsheet with all the data if anyone is
> >>> interested.
> >>>
> >>> Things I would like to qualify: my algorithm has had no substantial
> >>> speed tweaking; it is just a first attempt at a faster method. It
> >>> primarily works by over-copying and using 8-byte transfers wherever
> >>> possible. Basically, the theory is that it is just as expensive to
> >>> write one byte to memory as it is to write 8 (at least on a 64-bit
> >>> machine), so I write 8 and then adjust the pointers (which are cheap
> >>> register operations). But it also picks up some easy-to-optimize
> >>> corner cases, which is why it performs so well on decompressing
> >>> incompressible data. I know there is still room for improvement.
> >>> It's hacky and I haven't cleaned it up; it's a single day's coding,
> >>> so I am sure it can be a lot nicer.
> >>>
> >>> The LZ4 test suite is good: it tries, as much as it can, to test ONLY
> >>> the speed of decompression or compression and to eliminate IO. This is
> >>> good, because IO is a variable but the efficiency of the algorithm is
> >>> not. An inefficient algorithm may look much better than it really is
> >>> if slow IO is allowed to cloud the result.
> >>>
> >>> I adapted the benchmark code to make it more useful for me when
> >>> testing new algorithms.
> >>>
> >>> I also tested the new changes to lzjb decompression that BSD made.
> >>> Except in very few cases, in this test, classical lzjb beats it. nci
> >>> above is one case where the BSD one wins. My experimental decoder
> >>> beats them both by a long margin.
> >>>
> >>> I also believe LZJB compression should be able to be made
> >>> significantly faster. Experiments in that regard are on my "todo"
> >>> list.
> >>>
> >>> Ideally, when this is clean, I would propose it or an improved
> >>> successor as a replacement for, or supplement to, the existing
> >>> implementation of lzjb decompression.
> >>>
> >>> Steven (Strontium)
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
