Re: Array fill performance differences between for, foreach, slice

2020-04-08 Thread data pulverizer via Digitalmars-d-learn
In all honesty, till now I haven't really thought deeply about 
memory allocation. I just assumed that malloc, free, and so on 
were low-level operating system functions and that was that. 
I've heard people in the D community talk about garbage 
collection and memory allocation, but I didn't think it was 
something I had to worry about.


A lot of the work I do is statistics and data science based, and 
speed is often important. Calculations on arrays involve a lot of 
array instantiation, moving, and copying of elements, so for that 
kind of work the speed of memory operations matters. The fact 
that I could get those kinds of operations to run faster using 
D's GC.malloc and different types of iteration is significant. 
Today it hit me that I actually need to study memory allocation, 
and possibly garbage collection, as topics - not so that I can 
implement a garbage collector, but so that I can write faster 
code.


I know to everyone here that's just basic stuff, but it's a bit 
of a revelation for me. :-)




Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread Steven Schveighoffer via Digitalmars-d-learn

On 4/1/20 11:23 AM, data pulverizer wrote:
Thanks for all the suggestions made so far. I am still interested in 
looking at the implementation details of the slice assign `arr[] = x` 
which I can't seem to find. Before I made my initial post, I tried doing 
a `memcpy` and `memmove` under a `for` loop but it did not change the 
performance or get the same kind of performance as the initial slice 
performance so I didn't bother to mention them, I haven't tried it with 
the suggested compiler flags though.


Using disassembly on run.dlang.io, it shows that it's using __memsetDouble.



@StevenSchveighoffer also suggested using `memset` (as well as `memcpy`); 
please correct me if I am wrong, but it looks as if `memset` can only 
write from an `int`-sized source, and I need the source size to be any 
potential size (T).


Again, the compiler uses whatever tools are available. It might be 
memset, it might be something else.



In the case of your code, it's using __memsetDouble, which I have no 
idea where it's defined (probably libc).



--

On a related aside I noticed that the timing was reduced across the 
board so much so that the initial slice time halved when initialising with:


```
auto arr = (cast(T*)GC.malloc(T.sizeof * n,
        GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0 .. n];
```

Instead of:

```
auto arr = new T[n];
```


What this means is, don't scan the block for pointers during a GC 
collect cycle. If you have pointers in your T, this is a very bad idea. 
Not only that, but this does not initialize the appendable data at the 
end of the block.


In addition, GC.malloc just zero-initializes the data. If you do new 
T[n], and T has an initializer, it's going to be a lot more expensive.


If you are going to use this, remove the GC.BlkAttr.APPENDABLE.

In the case of double, it is initialized to NaN.

This could explain the difference in timing.



I noticed that `GC.malloc()` is based on `gc_malloc()`, which gives the 
bit mask option that makes it faster than `core.stdc.stdlib: malloc`. 
Is `gc_malloc` OS-dependent? I can't find it in the standard C library; 
the only reference I found for it is 
[here](https://linux.die.net/man/3/gc), and it is named slightly 
differently but appears to be the same function. In `core.memory`, it is 
specified by the `extern (C)` declaration 
(https://github.com/dlang/druntime/blob/master/src/core/memory.d), so I 
guess it must be somewhere on my system?


It's in the D garbage collector, here: 
https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166


extern(C) functions can be implemented in D. The major difference 
between standard D functions and extern(C) ones is that the latter do 
not use D name mangling (they get plain C symbol names).


-Steve


Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread data pulverizer via Digitalmars-d-learn
Thanks for all the suggestions made so far. I am still interested 
in looking at the implementation details of the slice assign 
`arr[] = x` which I can't seem to find. Before I made my initial 
post, I tried doing a `memcpy` and `memmove` under a `for` loop 
but it did not change the performance or get the same kind of 
performance as the initial slice performance so I didn't bother 
to mention them, I haven't tried it with the suggested compiler 
flags though.


@StevenSchveighoffer also suggested using `memset` (as well as 
`memcpy`); please correct me if I am wrong, but it looks as if 
`memset` can only write from an `int`-sized source, and I need the 
source size to be any potential size (T).



--


On a related aside I noticed that the timing was reduced across 
the board so much so that the initial slice time halved when 
initialising with:


```
auto arr = (cast(T*)GC.malloc(T.sizeof * n,
        GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0 .. n];
```

Instead of:

```
auto arr = new T[n];
```

If I am filling the array with something anyway, for example 
`arr[] = x`, is there anything wrong with this approach? Should I 
care if the memory is not "scanned"? I have no idea what that 
means, but I assume it's some kind of check that takes time and is 
switched off by `GC.BlkAttr.NO_SCAN`. Are there any downsides to 
this that I should be aware of?


I noticed that `GC.malloc()` is based on `gc_malloc()`, which 
gives the bit mask option that makes it faster than 
`core.stdc.stdlib: malloc`. Is `gc_malloc` OS-dependent? I can't 
find it in the standard C library; the only reference I found for 
it is [here](https://linux.die.net/man/3/gc), and it is named 
slightly differently but appears to be the same function. In 
`core.memory`, it is specified by the `extern (C)` declaration 
(https://github.com/dlang/druntime/blob/master/src/core/memory.d), 
so I guess it must be somewhere on my system?


Thanks



Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread Adam D. Ruppe via Digitalmars-d-learn

On Wednesday, 1 April 2020 at 15:04:44 UTC, Jesse Phillips wrote:
It is nice that bounds checks remain in place when using 
release and the code is @safe.


Yeah, it used to be even worse than it is now, but it is still a 
terrible switch that should NEVER be used. There are always 
better ways and we should stop telling people to use it.


The newer `-check` and `-boundscheck` switches give more 
fine-grained control if you need it (and again, you never should, 
since you can control it on a per-expression basis with @trusted 
and the .ptr attribute).


Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread Jesse Phillips via Digitalmars-d-learn

On Wednesday, 1 April 2020 at 12:22:48 UTC, Adam D. Ruppe wrote:
On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg 
wrote:
You have not enabled optimizations. You should compile with 
`-O -release -inline` to enable all optimizations.


-release should *never* be used. You're trading memory safety 
and security holes for a few microseconds of execution time. 
Not worth the risk.




It is nice that bounds checks remain in place when using release 
and the code is @safe.


Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread lithium iodate via Digitalmars-d-learn

On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
If you care about performance you should really compile using 
LDC (with `-O5 -release -flto=full 
-defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually 
produces much better code:


Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
For: Mean time(usecs): 39.89, Standard Deviation: 7.80041

Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
For: Mean time(usecs): 40.02, Standard Deviation: 7.67388


The benchmark is highly flawed; the results for LDC are 
completely meaningless. LDC produces identical code for every 
case, and the filling procedures are all elided.


Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread Adam D. Ruppe via Digitalmars-d-learn

On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
You have not enabled optimizations. You should compile with `-O 
-release -inline` to enable all optimizations.


-release should *never* be used. You're trading memory safety and 
security holes for a few microseconds of execution time. Not 
worth the risk.


For realistic code, try to make it fast without using that 
switch; there are plenty of other ways, like isolated @trusted 
sections as needed. If you have significant cost from asserts, 
you might turn them off with `-check=assert=off` to leave them 
out of the compile without affecting the other safety features.


If you care about performance you should really compile using 
LDC (with `-O5 -release -flto=full


Again, forget the -release switch. The rest is cool, though.


Re: Array fill performance differences between for, foreach, slice

2020-04-01 Thread Jacob Carlborg via Digitalmars-d-learn

On 2020-03-31 23:30, data pulverizer wrote:


$ dmd fill.d && ./fill


You have not enabled optimizations. You should compile with `-O -release 
-inline` to enable all optimizations.


Without optimizations I get numbers like these:

Slice: Mean time(usecs): 92.91, Standard Deviation: 49.8002
Foreach: Mean time(usecs): 183.95, Standard Deviation: 30.749
For: Mean time(usecs): 239.74, Standard Deviation: 30.753

With optimizations turned on I get quite varying results:

Slice: Mean time(usecs): 96.84, Standard Deviation: 52.5676
Foreach: Mean time(usecs): 95.84, Standard Deviation: 21.9244
For: Mean time(usecs): 106.73, Standard Deviation: 31.2545

Slice: Mean time(usecs): 192.03, Standard Deviation: 115.971
Foreach: Mean time(usecs): 177.06, Standard Deviation: 128.012
For: Mean time(usecs): 89.14, Standard Deviation: 25.681

Slice: Mean time(usecs): 97.74, Standard Deviation: 53.2058
Foreach: Mean time(usecs): 79.72, Standard Deviation: 11.0088
For: Mean time(usecs): 80.54, Standard Deviation: 12.06

If you care about performance you should really compile using LDC (with 
`-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), 
which usually produces much better code:


Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
For: Mean time(usecs): 39.89, Standard Deviation: 7.80041

Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
For: Mean time(usecs): 40.02, Standard Deviation: 7.67388

--
/Jacob Carlborg


Re: Array fill performance differences between for, foreach, slice

2020-03-31 Thread Steven Schveighoffer via Digitalmars-d-learn

On 3/31/20 5:30 PM, data pulverizer wrote:
I've observed large differences in timing performance while filling 
arrays using different methods (for vs foreach vs arr[] = x) and don't 
know why. I've looked at array.d 
(https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) 
but I'm still none the wiser.


for: you are indexing an array on each step, incurring a bounds check, 
and also doing a pointer calculation on each access.


foreach: Better, iterates a pointer over the whole array, so it's going 
to be faster.


slice assign: Best, this uses whatever tricks are possible in the given 
architecture. Most likely memset, or memcpy. These are insanely 
optimized functions, which is why they perform so well.


-Steve