On Sat, 14 Apr 2012 21:05:36 +0200, "ReneSac" <renedu...@yahoo.com.br> wrote:
> I have this simple binary arithmetic coder in C++ by Mahoney,
> translated to D by Maffi. I added "nothrow", "final" and "pure"
> and "GC.disable" where possible, but that didn't make much
> difference. Adding "const" to Predictor.p() (as in the C++
> version) gave 3% higher performance. Here are the two versions:
>
> http://mattmahoney.net/dc/ <-- original zip
>
> http://pastebin.com/55x9dT9C <-- Original C++ version.
> http://pastebin.com/TYT7XdwX <-- Modified D translation.
>
> The problem is that the D version is 50% slower:
>
> test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)
>
> Lang | Comp  | Binary size | Time (lower is better)
> C++  | (g++) |   13kb      | 2.42s (100%)  -O3 -s
> D    | (DMD) |  230kb      | 4.46s (184%)  -O -release -inline
> D    | (GDC) | 1322kb      | 3.69s (152%)  -O3 -frelease -s
>
> The only difference I could see between the C++ and D versions is
> that C++ has hints to the compiler about which functions to
> inline, and I couldn't find anything similar in D. So I manually
> inlined the encode and decode functions:
>
> http://pastebin.com/N4nuyVMh - Manual inline
>
> D    | (DMD) |  228kb      | 3.70s (153%)  -O -release -inline
> D    | (GDC) | 1318kb      | 3.50s (144%)  -O3 -frelease -s
>
> Still, the D version is slower. What makes this speed difference?
> Is there any way to side-step it?
>
> Note that this simple C++ version can be made more than 2 times
> faster with algorithmic and I/O optimizations, (ab)using
> templates, etc. So I'm not asking for generic speed
> optimizations, but only for things that may make the D code "more
> equal" to the C++ code.

I noticed the thread just now. I ported fast paq8 (fp8) to D, and with some careful D-ification and optimization it runs a bit faster than the original C program when compiled with GCC on Linux x86_64, Core 2 Duo.

As others said, the files are cached in RAM anyway if there is enough available, so you should not be bound by your hard drive speed.
I don't know about the version of paq you ported the coder from, but I'll try to give you some hints on what I did to optimize the code.

- Time portions of your main(): is the time actually spent at start-up or in the compression?

- Use structs where classes don't make your code cleaner.

- Wherever you have large arrays that you don't need initialized to .init, write:

    int[<large number>] arr = void;
    double[<large number>] arr = void;

  This disables default initialization, which may help you in inner loops. Remember that C++ doesn't default-initialize at all, so this is an obvious way to lose performance against that language. Also keep in mind that the .init for floating point types is NaN:

    struct Foo { double[999999] bar; }

  is not a block of binary zeroes and hence cannot be stored in a .bss section of the executable, where it would not take up any space at all.

    struct Foo { double[999999] bar = void; }

  on the other hand will not bloat your executable by 7.6 MB! Be cautious with:

    class Foo { double[999999] bar = void; }

  Classes' .init doesn't go into .bss either way; another reason to use a struct where appropriate. (WARNING: Use of .bss on Linux/MacOS is currently broken in the compiler front-end. You'll only see the effect on Windows.)

- Mahoney used an Array class in my version of paq, which allocates via calloc. Do this as well; you can't win otherwise. Read up a bit on calloc if you want. It generally 'allocates' a special zeroed-out memory page multiple times. No matter how much memory you ask for, it won't really allocate anything until you *write* to it, at which point new memory is allocated for you and the zero page is copied into it. The D GC, on the other hand, allocates that memory and writes zeroes to it immediately. The effect is twofold: first, the calloc version will use much less RAM if the 'allocated' buffers aren't fully used (e.g. you compressed a small file); second, the D GC version is slowed down by writing zeroes to all that memory.
At high compression levels, paq8 uses ~2 GB of memory that is calloc'ed. You should _not_ try to use GC memory for that.

- If there are data structures that are designed to fit into a CPU cache line (I had one of those in paq8), make sure they still have the correct size in your D version. "static assert(Foo.sizeof == 64);" helped me find a bug there that resulted from switching from C bitfields to the D version (which is a library solution in Phobos).

I hope that gives you some ideas what to look for. Good luck!

--
Marco