Apologies for the garbled table and for the incorrect wording about the
speed figures.

Hopefully this will get through:

Sample results (reposted):
ALGORITHM            FILE NAME    FILE SIZE  COMPRESSED SIZE  BLOCK SIZE  MB/s    DIFF
HAX_lzjb_decompress  enwik8       100000000  68721036         1048576     443.8   133.71%
ZFS_lzjb_decompress  enwik8       100000000  78636337         1024        331.9
HAX_lzjb_decompress  silesia.zip  68182744   76529235         1024        2635    579.50%
ZFS_lzjb_decompress  silesia.zip  68182744   76486571         4194304     454.7
HAX_lzjb_decompress  mozilla      51220480   29853404         4096        616.9   150.68%
ZFS_lzjb_decompress  mozilla      51220480   28868591         4194304     409.4
HAX_lzjb_decompress  webster      41458703   26566596         4096        466.6   138.37%
ZFS_lzjb_decompress  webster      41458703   30135465         1024        337.2
HAX_lzjb_decompress  enwik8.zip   36445475   40985240         1048576     2792.3  614.64%
ZFS_lzjb_decompress  enwik8.zip   36445475   40985489         65536       454.3
HAX_lzjb_decompress  nci          33553445   11088497         1024        736.7   120.91%
BSD_lzjb_decompress  nci          33553445   8714892          4194304     609.3

Richard is correct about MMX/SSE. GCC is using movdqu; it is the only
extended instruction I see in an objdump. I am putting together a version
which hopefully corrects that, so that "in-kernel" performance can properly
be measured (-O2 and no MMX/SSE).
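One way to sanity-check such a build is to compile with the extended
instruction sets disabled and grep the disassembly. This is only a sketch:
the /tmp paths and toy source file are made up for illustration, and it
assumes a GCC not affected by the -mno-sse bug Richard mentions below.

```shell
# Sketch: compile a toy file with MMX/SSE disabled, then confirm that
# objdump shows no extended instructions in the object file.
cat > /tmp/copytest.c <<'EOF'
void copy8(char *d, const char *s)
{
    for (int i = 0; i < 8; i++)
        d[i] = s[i];
}
EOF
cc -O2 -mno-mmx -mno-sse -mno-sse2 -c /tmp/copytest.c -o /tmp/copytest.o
objdump -d /tmp/copytest.o | grep -E 'xmm|movdq' \
    || echo "no extended instructions found"
```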

Steven


On Fri, Oct 25, 2013 at 3:21 AM, Richard Yao <[email protected]> wrote:

> Each context switch from ring 3 to ring 0 must copy the registers to the
> stack, and they must be read off the stack when returning. Most (all?)
> production kernels avoid the use of x87, MMX and SSE registers to save
> time on the copies.
>
> On 10/24/2013 02:39 PM, WebDawg wrote:
> > Why are the MMX, SSE/SSE2 instructions not permissible?
> >
> >
> > On Thu, Oct 24, 2013 at 8:25 AM, Richard Yao <[email protected]> wrote:
> >
> >> GCC can generate code using MMX, SSE and/or SSE2 instructions on x86_64.
> >> That could explain the discrepancy between the benchmark results for the
> >> original ZFS lzjb implementation and the BSD version, and the claims by
> >> the FreeBSD developers. To my knowledge, those instructions are not
> >> permissible on any current Open ZFS platform, which could mean that the
> >> real in-kernel performance of Strontium/Justin's version is lower than
> >> the benchmark numbers suggest.
> >>
> >> I tried invoking `make all CFLAGS='-mno-mmx -mno-sse -mno-sse2'` to see
> >> what the difference is, but doing that triggered the following GCC bug:
> >>
> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55185
> >>
> >> My curiosity has led me to start a GCC 4.4.7 build, on the assumption
> >> that this is a recent regression in GCC, so that I can try again with
> >> that compiler.
> >>
> >> With that said, I like the approach Steven took to benchmarking various
> >> implementations and I think it is a step in the right direction. It
> >> should become ideal when it can be done with MMX, SSE and SSE2
> >> instructions disabled.
> >>
> >> On 10/24/2013 11:15 AM, Strontium wrote:
> >>> Hi all,
> >>>
> >>> After a conversation on IRC with Ryao about lzjb performance and the
> >>> proposed BSD version of the LZJB decompressor, I decided to modify the
> >>> lz4 benchmark code and wedge in lzjb from ZFS to compare them.
> >>>
> >>> I have published the code and the results here:
> >>> https://github.com/stevenj/lzjbbench
> >>>
> >>> In the process I hacked up an experimental lzjb decompression
> >>> implementation. It is not based on the existing code; it is a
> >>> from-scratch decoding of the bit stream.
> >>>
> >>> In the results my decoder is identified as "HAX_lzjb_decompress"
> >>>
> >>> Sample results:
> >>> ALGORITHM            FILE NAME    FILE SIZE  COMPRESSED SIZE  BLOCK SIZE  MB/s    DIFF
> >>> HAX_lzjb_decompress  enwik8       100000000  68721036         1048576     443.8   133.71%
> >>> ZFS_lzjb_decompress  enwik8       100000000  78636337         1024        331.9
> >>> HAX_lzjb_decompress  silesia.zip  68182744   76529235         1024        2635    579.50%
> >>> ZFS_lzjb_decompress  silesia.zip  68182744   76486571         4194304     454.7
> >>> HAX_lzjb_decompress  mozilla      51220480   29853404         4096        616.9   150.68%
> >>> ZFS_lzjb_decompress  mozilla      51220480   28868591         4194304     409.4
> >>> HAX_lzjb_decompress  webster      41458703   26566596         4096        466.6   138.37%
> >>> ZFS_lzjb_decompress  webster      41458703   30135465         1024        337.2
> >>> HAX_lzjb_decompress  enwik8.zip   36445475   40985240         1048576     2792.3  614.64%
> >>> ZFS_lzjb_decompress  enwik8.zip   36445475   40985489         65536       454.3
> >>> HAX_lzjb_decompress  nci          33553445   11088497         1024        736.7   120.91%
> >>> BSD_lzjb_decompress  nci          33553445   8714892          4194304     609.3
> >>>
> >>> Each of these is my algorithm's WORST result vs. the alternative's BEST.
> >>> This is built with -O3, run on an AMD FX 8150, and is pure C.
> >>>
> >>> My github has the full spreadsheet with all the data if anyone is
> >>> interested.
> >>>
> >>> Things I would like to qualify: my algorithm has had no substantial
> >>> speed tweaking; it is just a first attempt at a faster method.
> >>> It primarily works by overcopying and using 8-byte transfers wherever
> >>> possible. Basically, the theory is that it is just as expensive to
> >>> write one byte to memory as it is to write 8 (at least on a 64-bit
> >>> machine), so I write 8 and then adjust the pointers (which are cheap
> >>> register operations). But it also picks up some easy-to-optimize
> >>> corner cases, which is why it performs so well on decompressing
> >>> incompressible data. I know there is still room for improvement.
> >>> It is hacky and I haven't cleaned it up; it is a single day's coding,
> >>> so I am sure it can be a lot nicer.
> >>>
> >>> The LZ4 test suite is good: it tries, as much as it can, to test ONLY
> >>> the speed of decompression or compression, and to eliminate IO. This
> >>> is good, because IO is a variable but the efficiency of the algorithm
> >>> is not. An inefficient algorithm may look much better than it really
> >>> is if slow IO is allowed to cloud the result.
> >>>
> >>> I adapted the benchmark code to make it more useful for me when
> >>> testing new algorithms.
> >>>
> >>> I also tested the new changes BSD made to lzjb decompression. Except
> >>> in very few cases, classical lzjb beats it in this test; nci above is
> >>> one case where the BSD version wins. My experimental decoder beats
> >>> them both by a wide margin.
> >>>
> >>> I also believe LZJB compression could be made significantly faster.
> >>> Experiments in that regard are on my "todo" list.
> >>>
> >>> Ideally, when this is clean, I would propose it (or an improved
> >>> successor) as a replacement for, or supplement to, the existing
> >>> implementation of lzjb decompression.
> >>>
> >>> Steven (Strontium)
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> developer mailing list
> >>> [email protected]
> >>> http://lists.open-zfs.org/mailman/listinfo/developer
> >>>
> >>
> >
>
>
>
