Why are the MMX, SSE/SSE2 instructions not permissible?
On Thu, Oct 24, 2013 at 8:25 AM, Richard Yao <[email protected]> wrote:
> GCC can generate code using MMX, SSE and/or SSE2 instructions on x86_64.
> That could explain the discrepancy between the benchmark results for the
> original ZFS lzjb implementation and the BSD version, and the claims by
> the FreeBSD developers. To my knowledge, those instructions are not
> permissible on any current Open ZFS platform. That could mean that the
> real-world numbers for Strontium/Justin's version would be lower than the
> benchmarks suggest.
>
> I tried invoking `make all CFLAGS='-mno-mmx -mno-sse -mno-sse2'` to see
> what the difference is, but doing that triggered the following GCC bug:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55185
>
> Out of curiosity I have started a GCC 4.4.7 build, on the assumption that
> this is a recent regression in GCC, so that I can try again with that
> version.
>
> With that said, I like the approach Steven took to benchmarking various
> implementations, and I think it is a step in the right direction. It
> would be ideal if the benchmarks could also be run with the MMX, SSE and
> SSE2 instructions disabled.
>
> On 10/24/2013 11:15 AM, Strontium wrote:
> > Hi all,
> >
> > After a conversation on IRC with Ryao about lzjb performance and the
> > proposed BSD version of the LZJB decompressor, I decided to modify the
> > lz4 benchmark code and wedge in lzjb from ZFS to compare them.
> >
> > I have published the code and the results here:
> > https://github.com/stevenj/lzjbbench
> >
> > In the process I hacked up an experimental lzjb decompression
> > implementation. It is not based on the existing code; it decodes the
> > bit stream from scratch.
> > In the results my decoder is identified as "HAX_lzjb_decompress".
> >
> > Sample results:
> >
> > ALGORITHM            FILE NAME    FILE SIZE  COMPRESSED SIZE  BLOCK SIZE  MB/s    DIFF
> > HAX_lzjb_decompress  enwik8       100000000  68721036         1048576     443.8   133.71%
> > ZFS_lzjb_decompress  enwik8       100000000  78636337         1024        331.9
> > HAX_lzjb_decompress  silesia.zip  68182744   76529235         1024        2635    579.50%
> > ZFS_lzjb_decompress  silesia.zip  68182744   76486571         4194304     454.7
> > HAX_lzjb_decompress  mozilla      51220480   29853404         4096        616.9   150.68%
> > ZFS_lzjb_decompress  mozilla      51220480   28868591         4194304     409.4
> > HAX_lzjb_decompress  webster      41458703   26566596         4096        466.6   138.37%
> > ZFS_lzjb_decompress  webster      41458703   30135465         1024        337.2
> > HAX_lzjb_decompress  enwik8.zip   36445475   40985240         1048576     2792.3  614.64%
> > ZFS_lzjb_decompress  enwik8.zip   36445475   40985489         65536       454.3
> > HAX_lzjb_decompress  nci          33553445   11088497         1024        736.7   120.91%
> > BSD_lzjb_decompress  nci          33553445   8714892          4194304     609.3
> >
> > Each of these is my algorithm's WORST result vs the alternative's BEST.
> > This is built with -O3, run on an AMD FX 8150, and is pure C.
> >
> > My github has the full spreadsheet with all the data if anyone is
> > interested.
> >
> > Things I would like to qualify: my algorithm has had no substantial
> > speed tweaking; it is just a first attempt at a faster method. It
> > primarily works by overcopying and using 8-byte transfers wherever
> > possible. Basically, the theory is that it is just as expensive to
> > write one byte to memory as it is to write 8 (at least on a 64-bit
> > machine), so I write 8 and then adjust the pointers (which are cheap
> > register operations). It also picks up some easy-to-optimize corner
> > cases, which is why it performs so well when decompressing
> > incompressible data. I know there is still room for improvement.
> > It is hacky and I haven't cleaned it up; it is a single day's coding,
> > so I am sure it can be a lot nicer.
> > The LZ4 test suite is good: it tries, as much as it can, to test ONLY
> > the speed of compression or decompression and to eliminate IO. This is
> > good because IO is a variable but the efficiency of the algorithm is
> > not; an inefficient algorithm may look much better than it really is
> > if slow IO is allowed to cloud the result.
> >
> > I adapted the benchmark code to make it more useful for me when
> > testing new algorithms.
> >
> > I also tested the new changes BSD made to lzjb decompression. Except
> > in very few cases in this test, classical lzjb beats it; nci above is
> > one case where the BSD version wins. My experimental decoder beats
> > them both by a long margin.
> >
> > I also believe LZJB compression can be made significantly faster.
> > Experiments in that regard are on my "todo" list.
> >
> > Ideally, when this is clean, I would propose it or an improved
> > successor as a replacement for, or supplement to, the existing
> > implementation of lzjb decompression.
> >
> > Steven (Strontium)
> >
> > _______________________________________________
> > developer mailing list
> > [email protected]
> > http://lists.open-zfs.org/mailman/listinfo/developer
