Re: benchmark on binary trees

2015-12-11 Thread visitor via Digitalmars-d-learn
ok, i have a working version (memory usage is nice, and it's twice 
the speed of the non-parallel version) ;

http://dpaste.dzfl.pl/504a652c6c47

real    0m14.427s
user    1m19.347s
sys     0m0.124s

i've got similar performance, without Allocators, using malloc and 
free directly


i had to recursively deallocate ...

though, it's still far from competing with the fastest, any advice ?
i guess the FreeList allocator is not the best tool here ?
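
for reference, the malloc/free variant boils down to something like 
this (a minimal sketch; the struct layout and function names are made 
up, not the exact ones from the paste) :

import core.stdc.stdlib : malloc, free;

struct Tree_node
{
    Tree_node* left, right;
    int item;
}

Tree_node* makeTree(int item, int depth)
{
    auto node = cast(Tree_node*) malloc(Tree_node.sizeof);
    node.left = node.right = null;   // malloc does not zero the block
    node.item = item;
    if (depth > 0)
    {
        node.left  = makeTree(2 * item - 1, depth - 1);
        node.right = makeTree(2 * item, depth - 1);
    }
    return node;
}

// no GC here, so every subtree has to be freed by hand, recursively
void deleteTree(Tree_node* node)
{
    if (node is null) return;
    deleteTree(node.left);
    deleteTree(node.right);
    free(node);
}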


Re: benchmark on binary trees

2015-12-09 Thread visitor via Digitalmars-d-learn

version with apr (like in c version)
http://dpaste.dzfl.pl/68c0157225e7
compiled with ldc it's indeed a bit faster on average :
real    0m1.999s
user    0m9.810s
sys     0m0.148s

btw the Rust version is even faster than my slightly outdated gcc 
(4.9)


latest try with allocators :
http://dpaste.dzfl.pl/86b9b3c4ad71
it swallows huge amounts of memory and is slow !

what am i doing wrong ? (concurrency problem ?)


Re: benchmark on binary trees

2015-12-08 Thread visitor via Digitalmars-d-learn


using the Apache Portable Runtime (APR) like in the C version :
http://dpaste.dzfl.pl/6ca8b5ffd6dc

works like a charm, 2.061s on my machine !

if the file name is binarytrees.d
dmd -w -inline -O -release -I/usr/include/apr-1.0 
-L/usr/lib/x86_64-linux-gnu/libapr-1.so -of"binarytrees" 
"binarytrees.d"


measured with "time ./binarytrees 20" on command line:
real    0m2.061s
user    0m8.156s
sys     0m0.181s

C version gives :
real    0m1.892s
user    0m9.087s
sys     0m0.188s

Not bad !!! :-D

Still i would like to know what i'm doing wrong with the 
allocator version, besides the fact that the apr tools are doing 
more sophisticated work than my naive approach :-D ?


Re: benchmark on binary trees

2015-12-08 Thread visitor via Digitalmars-d-learn

C++ version :
real    0m3.587s
user    0m9.211s
sys     0m7.341s




Re: benchmark on binary trees

2015-12-07 Thread Alex via Digitalmars-d-learn

On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:

Hello, interesting exercise for me to learn about allocators :-)

Nice to know, a novice can inspire someone :)

i managed to parallelize the code, reaching performance similar, 
in terms of speed, to the non-parallel version :

http://dpaste.dzfl.pl/6c3e6edcff59

Cool! This is what I looked for!

BUT it consumes insane amounts of memory, don't try it with an 
argument greater than 17 !!!


so either i'm doing something stupid (in parallel code, i 
guess) or the FreeList allocator is totally not the right tool 
... (both ?)
I assume the allocator itself is something that is not really 
needed in this case. Maybe there is a more straightforward 
approach to the problem, perhaps even a simpler one than in all 
the versions on the benchmarks game site, but I don't see it right 
now.
And with the allocator attempt I had a chance to experiment with 
the experimental module and to write a very quick version of a 
program which I want to have...


But your solution is really impressive, it reduces the factor 
from 5 to 3 immediately :) And I'm going to read your code 
carefully...



some light-shedding would be really appreciated, Thanks


Re: benchmark on binary trees

2015-12-07 Thread Alex via Digitalmars-d-learn
On Sunday, 6 December 2015 at 08:45:10 UTC, Rikki Cattermole 
wrote:

Why is TreeNode not final?
This is an interesting hint! Just adding final makes the program 
take two seconds less... that is roughly 5%. Do you have other 
hints of this kind? ;)
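
For the record, the whole change is just marking the class final, 
e.g. (class layout sketched from memory, not the exact one from the 
paste):

// nothing will ever derive from this, so the compiler can call its
// methods non-virtually
final class TreeNode
{
    TreeNode left, right;
    int item;

    this(int item, TreeNode left = null, TreeNode right = null)
    {
        this.item  = item;
        this.left  = left;
        this.right = right;
    }
}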



Also yours does not use threads in any way.

If you cannot add multithreading on D, remove it from c++ 
before comparing. They are not equal.
Yes... I'm aware of this discrepancy, and I checked how the time 
statistics differ with and without the #pragma statement. With 
the statement it is half a second slower than without. This seems 
strange to me, and I don't have a clue why.
But as this doesn't change much for the D/C++ comparison, and I 
think the proper goal is to have a parallel version in D, I left 
the #pragma statement in.


Re: benchmark on binary trees

2015-12-07 Thread visitor via Digitalmars-d-learn

On Monday, 7 December 2015 at 10:55:25 UTC, Alex wrote:

On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:
Hello, interesting exercise for me to learn about allocators 
:-)

Nice to know, a novice can inspire someone :)

i managed to parallelize the code, reaching performance similar, 
in terms of speed, to the non-parallel version :

http://dpaste.dzfl.pl/6c3e6edcff59

Cool! This is what I looked for!

BUT it consumes insane amounts of memory, don't try it with an 
argument greater than 17 !!!


I assume the allocator itself is something that is not really 
needed in this case. Maybe there is a more straightforward 
approach to the problem, perhaps even a simpler one than in all 
the versions on the benchmarks game site, but I don't see it 
right now.
And with the allocator attempt I had a chance to experiment 
with the experimental module and to write a very quick version 
of a program which I want to have...


i've got a further speed improvement with "taskPool.parallel(depthind, 
2)" in the parallel foreach loop : the second argument is the work 
unit size (2 gave the best results for me, on a quad core).
Also, using "FreeList!(Mallocator, Tree_node.sizeof)" directly, 
without wrapping it in an allocatorObject, gives a speed improvement 
(with changes to the makeTree method).
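
roughly, the two changes look like this (only a sketch under my 
assumptions : a plain Tree_node struct, and the real tree-checking 
work left out) :

import std.experimental.allocator : make, dispose;
import std.experimental.allocator.building_blocks.free_list : FreeList;
import std.experimental.allocator.mallocator : Mallocator;
import std.parallelism : taskPool;
import std.range : iota;

struct Tree_node { Tree_node* left, right; int item; }

// module-level variables are thread-local in D, so each worker in the
// taskPool gets its own free list (FreeList itself is not thread-safe)
FreeList!(Mallocator, Tree_node.sizeof) nodePool;

Tree_node* makeTree(int item, int depth)
{
    auto node = nodePool.make!Tree_node;   // reuses a recycled block if any
    node.item = item;
    if (depth > 0)
    {
        node.left  = makeTree(2 * item - 1, depth - 1);
        node.right = makeTree(2 * item, depth - 1);
    }
    return node;
}

void deleteTree(Tree_node* node)
{
    if (node is null) return;
    deleteTree(node.left);
    deleteTree(node.right);
    nodePool.dispose(node);                // pushes the block back on the list
}

void main()
{
    enum minDepth = 4;
    enum maxDepth = 20;
    auto depthind = iota(minDepth, maxDepth + 1, 2);

    // the second argument is the work unit size; 2 gave the best
    // results here on a quad core
    foreach (d; taskPool.parallel(depthind, 2))
    {
        // build, check and delete the trees of depth d with nodePool
        auto t = makeTree(0, d);
        deleteTree(t);
    }
}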


i took inspiration from the C code : it uses memory pool management, 
as anonymous already pointed out for the c++ version, which i think 
could (must ?) be achieved with allocators. To gain speed i think 
that's a key point : no GC !! The FreeList allocator appears (to me) 
to be a good candidate for this.


but as i'm new to this, i'm surely not doing it the right way !

i tried the exact same FreeList allocator but backed by the 
GCAllocator (instead of the Mallocator used in my code) ; then memory 
consumption is very good but of course it's slow !


i tried a lot of other allocators, variations on the presented 
code, but memory management is awful :(




Re: benchmark on binary trees

2015-12-06 Thread Alex via Digitalmars-d-learn
Ok, let's conclude this post for now. I did some comparison runs 
with the original C++ code. I missed this at the beginning.

The results so far are as follows:

Here is the original code in C++.
http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6

With the modifications needed to run it on my Mac OS machine, this 
results in the code available here:

http://pastebin.com/G5cYESdX
compilation done with
g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp 
main.cpp -o main.o && g++ main.o -o main.run -fopenmp 
-lboost_system


Here you can find the last version of my D code:
http://dpaste.dzfl.pl/8c8ab00699b5
compilation done with
dmd -release -O -boundscheck=off -inline main.d

time comparison, just with
time ./main
yields

for C++
real 0m8.255s
user 0m7.342s
sys 0m0.905s

for D
real 0m35.356s
user 0m35.149s
sys 0m0.154s

so the overall factor is roughly 5.

Thanks for commenting to everyone! If anybody has further ideas - 
all of them would be appreciated :)
The original site is not interested in any further languages to 
be tested, so my experiment ends here for now...


Re: benchmark on binary trees

2015-12-06 Thread Rikki Cattermole via Digitalmars-d-learn

On 06/12/15 9:35 PM, Alex wrote:

Ok, lets conclude this post for now. Did some comparison runs with the
original C++ code. Missed this at the beginning.
The results so far are as following:

Here is the original code in C++.
http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6


With modifications to be able to run it on my mac os machine this
results in the code available here:
http://pastebin.com/G5cYESdX
compilation done with
g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp main.cpp -o
main.o && g++ main.o -o main.run -fopenmp -lboost_system

Here you can find the last version of my D code:
http://dpaste.dzfl.pl/8c8ab00699b5
compilation done with
dmd -release -O -boundscheck=off -inline main.d

time comparison, just with
time ./main
yields

for C++
real 0m8.255s
user 0m7.342s
sys 0m0.905s

for D
real 0m35.356s
user 0m35.149s
sys 0m0.154s

so the overall factor is round about 5.

Thanks for commenting to everyone! If anybody has further ideas - all of
them would be appreciated :)
The original site is not interested in any further languages to be
tested, so my experiment ends here for now...


Why is TreeNode not final?
Also yours does not use threads in any way.

If you cannot add multithreading on D, remove it from c++ before 
comparing. They are not equal.


Re: benchmark on binary trees

2015-12-06 Thread visitor via Digitalmars-d-learn

On Sunday, 6 December 2015 at 08:35:33 UTC, Alex wrote:
Thanks for commenting to everyone! If anybody has further ideas 
- all of them would be appreciated :)
The original site is not interested in any further languages to 
be tested, so my experiment ends here for now...


Hello, interesting exercise for me to learn about allocators :-)
i managed to parallelize the code, reaching performance similar, 
in terms of speed, to the non-parallel version :

http://dpaste.dzfl.pl/6c3e6edcff59

BUT it consumes insane amounts of memory, don't try it with an 
argument greater than 17 !!!


so either i'm doing something stupid (in parallel code, i guess) 
or the FreeList allocator is totally not the right tool ... (both 
?)


some light-shedding would be really appreciated, Thanks


Re: benchmark on binary trees

2015-12-05 Thread JR via Digitalmars-d-learn

On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
[...]

hoping it would be faster. This was not the case. Why?

[...]

Is there anything else to improve performance significantly?


Profile. Profile profile profile. Callgrind. Find bottlenecks 
instead of guessing them.
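
e.g. something along these lines (assuming valgrind/callgrind are 
installed; the exact invocations are from memory, double-check them):

dmd -g -O -release -inline binarytrees.d      # keep symbols so the output is readable
valgrind --tool=callgrind ./binarytrees 20
callgrind_annotate callgrind.out.<pid>        # or open the file in kcachegrind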





Re: benchmark on binary trees

2015-12-05 Thread anonymous via Digitalmars-d-learn

On 05.12.2015 01:40, Alex wrote:

found and tried out the -vgc option...
Is there a way to deactivate the GC, if it stands in the way?


You can call core.memory.GC.disable to disable automatic collections, 
and GC.enable to turn them on again.


http://dlang.org/phobos/core_memory.html#.GC
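
In code, that is roughly (a minimal sketch; note the runtime may 
still collect if it really runs out of memory):

import core.memory : GC;

void main()
{
    GC.disable();            // no automatic collections from here on
    scope(exit) GC.enable();

    // ... allocate and drop the trees ...

    // an explicit collection can still be triggered while disabled:
    // GC.collect();
}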


Yes, I thought in the same direction. That's why I tried to reimplement
the c++ version. The idea was: as I can't compete with the GC of C#, I
could try to compete by applying another approach. I'm not trying to write
something which competes with c++ either (I would have to use c++, then?
;) ), but something which clearly outperforms the languages with a
virtual machine...


Your C++-inspired version still allocates via the GC, though. If that 
eats performance, then you'd have to mirror more closely what the C++ 
version actually does. It most probably doesn't use a GC.


I presume this is the C++ version you took as inspiration:
http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6

That uses a boost::object_pool for the nodes. Assuming that that's being 
used for a reason, you could probably get a performance boost by doing 
something similar. I'm not really familiar with 
std.experimental.allocator, maybe there's something like object_pool in 
there. Otherwise, you'd have to implement it yourself.
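
From a quick look, one building block that at least gives the "one 
cheap pool per tree, freed in one go" behaviour is Region. A rough 
sketch of what I mean (the names and the 64 MB size are made up, and 
unlike object_pool it recycles whole blocks rather than single nodes):

import std.experimental.allocator : make;
import std.experimental.allocator.building_blocks.region : Region;
import std.experimental.allocator.mallocator : Mallocator;

struct Tree_node { Tree_node* left, right; int item; }

Tree_node* build(Alloc)(ref Alloc pool, int item, int depth)
{
    auto n = pool.make!Tree_node;    // just bumps a pointer inside the block
    n.item = item;
    if (depth > 0)
    {
        n.left  = build(pool, 2 * item - 1, depth - 1);
        n.right = build(pool, 2 * item, depth - 1);
    }
    return n;
}

int oneTree(int depth)
{
    // one big block per short-lived tree, released in one go
    auto pool = Region!Mallocator(64 * 1024 * 1024); // must fit one whole tree
    auto root = build(pool, 0, depth);
    immutable check = root.item;     // stand-in for a real checkTree(root)
    pool.deallocateAll();            // the whole tree is gone at once
    return check;
}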


Generally, I think most of the time you can write a program in D that's 
as fast as one written in C++. But you may have to give up on some 
convenience, and the libraries may not be there to support you.


benchmark on binary trees

2015-12-04 Thread Alex via Digitalmars-d-learn

Hi everybody,
this is going to be a learning-by-doing benchmark test post.

Found this "game" on the web
http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=binarytrees
and wanted to experiment myself, so I tried to reimplement some of 
the code in D.


This:
http://dpaste.dzfl.pl/4d8e0ceb867c
I did by reimplementing the C# version
and this:
http://dpaste.dzfl.pl/2fd3841d85ef
by reimplementing the C++ version.

Besides the questions of improving both of them, which are 
important but only because I would... maybe... at some point... 
upload my code, for some DLang advertising effect ;)

The learning-effect questions are the following:

1. I wrote the C++-inspired version after the C#-inspired one, hoping 
it would be faster. This was not the case. Why?
This is a big question, as the answer could range from silly 
mistakes of the kind
"do not use this call, use that one, and your program will be twice 
as fast."

to mistakes of a more general kind like
"why were you so foolish and didn't write the whole code in just 
one line."


2. I failed trying to implement some kind of parallelism in the 
second version. I tried something like


auto depthind = iota(min_depth, max_depth+1, 2);
foreach(dummy_i, d; taskPool.parallel(depthind))

for the for-loop in the main function, but then the program 
never ends. Do you have a hint about what went wrong?


3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?

4. This is a somewhat ambiguous question. I come originally from the 
C# corner, so I naturally think in terms of the first approach. Can 
one make the general statement that, for D, one of the approaches 
above will always be faster than the other?


Re: benchmark on binary trees

2015-12-04 Thread anonymous via Digitalmars-d-learn

On 04.12.2015 15:06, Alex wrote:

3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?


You forgot -inline.

By the way, I'm not a fan of using -boundscheck=off as a general 
optimization flag. It undermines @safe, and as the documentation says, 
it should be used with caution.


http://dlang.org/dmd-windows.html#switch-boundscheck


Re: benchmark on binary trees

2015-12-04 Thread anonymous via Digitalmars-d-learn

On 04.12.2015 21:30, Alex wrote:

Yes, I missed this, sorry. The main part of the question was probably
about the class and struct difference. I thought handling with structs
and pointers would be faster than with classes.


When you use a struct directly, without going through a pointer, that 
can be significantly faster than using a class. But structs through 
pointers are pretty much the same as classes, performance-wise.
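
Roughly, the two layouts being compared are these (a sketch, not the 
code from the pastes):

// a class: always used through a reference, and the object carries a
// vtable pointer and a monitor field in addition to the payload
class NodeC
{
    NodeC left, right;
    int item;
}

// a struct used through a pointer: the same indirection on every
// access, so tree traversal costs about the same as with the class
struct NodeS
{
    NodeS* left, right;
    int item;
}

void example()
{
    auto c = new NodeC;   // GC allocation, reference semantics
    auto s = new NodeS;   // also a GC allocation, yields a NodeS*
    s.item = c.item;      // field access looks and behaves the same
}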


[...]

Why is the parallel version slower than the sequential one?
If you set
int n = 14 in the main function
the parallel version is MUCH slower than the sequential one. On my 
machine 7x slower. Shouldn't it be the other way round?


I don't know what's going on here. You're allocating a lot of 
`TreeNode`s, though. That's probably not very parallelizable. The GC 
could also play a role.


[...]

As ldc doesn't have the experimental part of the library, I compared 
using the first version. The result: the program compiled with ldc2, 
same params, was 5% slower... nothing crucial, I think...


"The experimental part" is std.experimental.allocator, right? I may be 
wrong here, but I think the default allocator is essentially just `new`. 
So that wouldn't give you any performance boost.
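
As far as I understand the defaults, theAllocator is just a wrapper 
around the GC unless you set it to something else, i.e. something like:

import std.experimental.allocator : theAllocator, make, dispose;

void main()
{
    // with the default theAllocator this is essentially `new int(42)`
    // plus a level of indirection, so no speed win over the GC
    auto p = theAllocator.make!int(42);
    theAllocator.dispose(p);
}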


[...]

Yeah... so the answer here for me, is that I can stay with my way of
thinking in c# style. :)


In this case, I think you're fine. Generally, be aware that D doesn't 
shine when you create lots of throw-away objects. The GC can't compete 
with those of C# or Java, so when you translate code from those 
languages too closely, performance may be worse.


Re: benchmark on binary trees

2015-12-04 Thread Alex via Digitalmars-d-learn

On Friday, 4 December 2015 at 19:31:22 UTC, anonymous wrote:
Why did you expect the C++ inspired version to be faster? Just 
because the original was written in C++?


From a quick skim the two versions seem to be pretty much 
identical, aside from names and struct vs class.


Names don't make a difference of course. It would be easier to 
compare the two versions if you used the same names, though.


The differences between structs on the heap and classes are 
subtle. It's not surprising that they don't have an impact here.


If there are substantial differences between the two versions, 
please point them out.




Yes, I missed this, sorry. The main part of the question was 
probably about the class and struct difference. I thought 
handling with structs and pointers would be faster than with 
classes.



2. auto depthind = iota(min_depth, max_depth+1, 2);
foreach(dummy_i, d; taskPool.parallel(depthind))


Works for me. Maybe show the exact full program you tried, and 
tell what compiler version you're using.




Ok, this was strange, but I found the crux. The correct question 
is:

Why is the parallel version slower than the sequential one?
If you set
int n = 14 in the main function
the parallel version is MUCH slower than the sequential one. On my 
machine 7x slower. Shouldn't it be the other way round?



3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?


The other compilers, ldc and gdc, usually produce faster code 
than dmd.


Thanks for the hint!
As ldc doesn't have the experimental part of the library, I 
compared using the first version. The result: the program compiled 
with ldc2, same params, was 5% slower... nothing crucial, I 
think...




Just reiterating what I said re the first question: I don't 
really see a difference. If you think there is, please point it 
out. Or if you're not sure, feel free to ask about specific 
parts.


Yeah... so the answer here for me, is that I can stay with my way 
of thinking in c# style. :)


Re: benchmark on binary trees

2015-12-04 Thread Alex via Digitalmars-d-learn

On Friday, 4 December 2015 at 23:23:37 UTC, anonymous wrote:

Why is the parallel version slower than the sequential one?
If you set
int n = 14 in the main function
the parallel version is MUCH slower than the sequential one. On my 
machine

7x slower. Shouldn't it be the other way round?


I don't know what's going on here. You're allocating a lot of 
`TreeNode`s, though. That's probably not very parallelizable. 
The GC could also play a role.




found and tried out the -vgc option...
Is there a way to deactivate the GC, if it stands in the way?

In this case, I think you're fine. Generally, be aware that D 
doesn't shine when you create lots of throw-away objects. The 
GC can't compete with those of C# or Java, so when you 
translate code from those languages too closely, performance 
may be worse.


Yes, I thought in the same direction. That's why I tried to 
reimplement the c++ version. The idea was: as I can't compete 
with the GC of C#, I could try to compete by applying another 
approach. I'm not trying to write something which competes with c++ 
either (I would have to use c++, then? ;) ), but something which 
clearly outperforms the languages with a virtual machine...


tried the -inline option... no time win...


Re: benchmark on binary trees

2015-12-04 Thread CraigDillabaugh via Digitalmars-d-learn

On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:

Hi everybody,
this is going to be a learning-by-doing benchmark test post.


clip


3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?



If you have access to LDC or GDC they typically produce 
significantly faster code.





Re: benchmark on binary trees

2015-12-04 Thread anonymous via Digitalmars-d-learn

On 04.12.2015 15:06, Alex wrote:

1. I wrote the C++ inspired version after the C# inspired, hoping it
would be faster. This was not the case. Why?

[...]

Why did you expect the C++ inspired version to be faster? Just because 
the original was written in C++?


From a quick skim the two versions seem to be pretty much identical, 
aside from names and struct vs class.


Names don't make a difference of course. It would be easier to compare 
the two versions if you used the same names, though.


The differences between structs on the heap and classes are subtle. It's 
not surprising that they don't have an impact here.


If there are substantial differences between the two versions, please 
point them out.



2. I failed trying to implement some kind of parallelism in the second
version. I tried something like

auto depthind = iota(min_depth, max_depth+1, 2);
foreach(dummy_i, d; taskPool.parallel(depthind))

for the for-loop in the main function, but then, the program never ends.
Do you have a hint what was wrong?


Works for me. Maybe show the exact full program you tried, and tell what 
compiler version you're using.
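
For reference, a minimal complete program along those lines works here 
(a sketch; the depths and the loop body are just placeholders):

import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    enum min_depth = 4;
    enum max_depth = 16;

    auto depthind = iota(min_depth, max_depth + 1, 2);
    foreach (dummy_i, d; taskPool.parallel(depthind))
    {
        // the real work for depth d goes here; writes to shared data
        // from inside the body need synchronization
        writefln("depth %s done", d);
    }

    // note: the global taskPool uses daemon worker threads, so the
    // program exits normally; if you create your own `new TaskPool()`,
    // its threads are not daemons and you must call finish() or stop(),
    // otherwise the program appears to never end
}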



3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?


The other compilers, ldc and gdc, usually produce faster code than dmd.
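
For example (flag names from the respective manuals; double-check 
against your compiler versions):

ldc2 -O3 -release -boundscheck=off main.d
gdc -O3 -frelease -o main main.d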


4. This is a somewhat ambiguous question. I come originally from the C#
corner, so I naturally think in terms of the first approach. Can one
make the general statement that, for D, one of the approaches above will
always be faster than the other?


Just reiterating what I said re the first question: I don't really see a 
difference. If you think there is, please point it out. Or if you're not 
sure, feel free to ask about specific parts.