Why are the MMX, SSE/SSE2 instructions not permissible?

On Thu, Oct 24, 2013 at 8:25 AM, Richard Yao <[email protected]> wrote:

> GCC can generate code using MMX, SSE and/or SSE2 instructions on x86_64.
> That could explain the discrepancy between the benchmark results for the
> original ZFS lzjb implementation, the BSD version, and the claims made by
> the FreeBSD developers. To my knowledge, those instructions are not
> permissible on any current Open ZFS platform, which could mean that the
> real-world numbers for Strontium/Justin's version are lower than these
> benchmarks suggest.
>
> I tried invoking `make all CFLAGS='-mno-mmx -mno-sse -mno-sse2'` to see
> what the difference is, but doing that triggered the following GCC bug:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55185
>
> My curiosity has led me to start a GCC 4.4.7 build under the assumption
> that this is a recent regression in GCC so that I could try again with
> that.
>
> With that said, I like the approach Steven took to benchmarking various
> implementations and I think it is a step in the right direction. It
> should become ideal when it can be done with MMX, SSE and SSE2
> instructions disabled.
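Richard's concern can be checked directly. The sketch below is illustrative only (probe.c is a made-up file, not part of the benchmark repo): it compiles a trivial copy loop with and without the vector instruction sets and counts MMX/XMM register operands in the disassembly.

```shell
# Illustrative probe: does GCC auto-vectorize a plain byte-copy loop?
cat > probe.c <<'EOF'
void copy(char *d, const char *s, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = s[i];
}
EOF

# Default build: on x86_64 this will typically emit SSE2 moves.
gcc -O3 -c probe.c -o probe.o
objdump -d probe.o | grep -cE '%(xmm|mm)[0-9]' || true

# With the instruction sets disabled, no MMX/XMM operands should appear.
gcc -O3 -mno-mmx -mno-sse -mno-sse2 -c probe.c -o probe_nosse.o
objdump -d probe_nosse.o | grep -cE '%(xmm|mm)[0-9]' || true
```

Running the same check against the benchmark's object files would show whether the speed difference comes from the algorithm or from auto-vectorization.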
>
> On 10/24/2013 11:15 AM, Strontium wrote:
> > Hi all,
> >
> > After a conversation on IRC with Ryao about lzjb performance and the
> > proposed BSD version of the LZJB decompressor, I decided to modify the
> > lz4 benchmark code and wedge in lzjb from ZFS to compare them.
> >
> > I have published code and the result here:
> > https://github.com/stevenj/lzjbbench
> >
> > In the process I hacked up an experimental lzjb decompression
> > implementation. It is not based on the existing code; it is a
> > from-scratch decoding of the bit stream.
> >
> > In the results my decoder is identified as "HAX_lzjb_decompress"
> >
> > Sample results:
> > ALGORITHM            FILE NAME    FILE SIZE  COMPRESSED  BLOCK SIZE    MB/s     DIFF
> > HAX_lzjb_decompress  enwik8       100000000    68721036     1048576   443.8  133.71%
> > ZFS_lzjb_decompress  enwik8       100000000    78636337        1024   331.9
> > HAX_lzjb_decompress  silesia.zip   68182744    76529235        1024  2635.0  579.50%
> > ZFS_lzjb_decompress  silesia.zip   68182744    76486571     4194304   454.7
> > HAX_lzjb_decompress  mozilla       51220480    29853404        4096   616.9  150.68%
> > ZFS_lzjb_decompress  mozilla       51220480    28868591     4194304   409.4
> > HAX_lzjb_decompress  webster       41458703    26566596        4096   466.6  138.37%
> > ZFS_lzjb_decompress  webster       41458703    30135465        1024   337.2
> > HAX_lzjb_decompress  enwik8.zip    36445475    40985240     1048576  2792.3  614.64%
> > ZFS_lzjb_decompress  enwik8.zip    36445475    40985489       65536   454.3
> > HAX_lzjb_decompress  nci           33553445    11088497        1024   736.7  120.91%
> > BSD_lzjb_decompress  nci           33553445     8714892     4194304   609.3
> >
> > Each of these is my algorithm's WORST result vs the alternative's BEST.
> > This was built with -O3, run on an AMD FX 8150, and is pure C.
> >
> > My github has the full spreadsheet with all the data if anyone is
> > interested.
> >
> > Things I would like to qualify: my algorithm has had no substantial
> > speed tweaking; it is just a first attempt at a faster method.
> > It primarily works by overcopying and using 8-byte transfers wherever
> > possible. Basically, the theory is that it is just as expensive to
> > write one byte to memory as it is to write 8 (at least on a 64-bit
> > machine), so I write 8 and then adjust the pointers (which are cheap
> > register operations). It also picks up some easy-to-optimize corner
> > cases, which is why it performs so well on decompressing incompressible
> > data. I know there is still room for improvement.
> > It is hacky and I haven't cleaned it up; it is a single day's coding,
> > so I am sure it can be a lot nicer.
> >
> > The LZ4 test suite is good: it tries, as much as it can, to test ONLY
> > the speed of decompression or compression and to eliminate IO. This is
> > good, because IO is variable but the efficiency of the algorithm is
> > not. An inefficient algorithm may look much better than it really is if
> > slow IO is allowed to cloud the result.
> >
> > I adapted the benchmark code to make it more useful for me when testing
> new
> > algorithms.
> >
> > I also tested the new changes BSD made to lzjb decompression. Except
> > in a very few cases in this test, classical lzjb beats it; nci above
> > is one case where the BSD version wins. My experimental decoder beats
> > them both by a long margin.
> >
> > I also believe LZJB compression could be made significantly faster.
> > Experiments in that regard are on my "todo" list.
> >
> > Ideally, when this is cleaned up, I would propose it (or an improved
> > successor) as a replacement for, or supplement to, the existing
> > implementation of lzjb decompression.
> >
> > Steven (Strontium)
> >
> >
> >
> >
> > _______________________________________________
> > developer mailing list
> > [email protected]
> > http://lists.open-zfs.org/mailman/listinfo/developer
> >