Re: gdc or ldc for faster programs?

2022-03-10 Thread Chris Piker via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 20:04:04 UTC, Adam D Ruppe wrote:
Not surprising at all: gdc is excellent and underrated in the 
community.


The performance metrics are just a bonus.  Gdc is the main reason 
I can get my worksite to take D seriously since we're a 
traditional unix shop (solaris -> linux).  The gdc crew are doing 
a *huge* service for the community.





Re: gdc or ldc for faster programs?

2022-03-09 Thread Iain Buclaw via Digitalmars-d-learn
On Monday, 31 January 2022 at 10:33:49 UTC, Siarhei Siamashka 
wrote:

I wonder if GDC can do the same?


GDC as a front-end doesn't dictate what the optimization passes 
are doing, nor does it have any real control over what each 
level means.  It only ensures that semantics don't break because 
of an optimization pass.


Re: gdc or ldc for faster programs?

2022-01-31 Thread Siarhei Siamashka via Digitalmars-d-learn
On Monday, 31 January 2022 at 08:54:16 UTC, Patrick Schluter 
wrote:
-O3 often chooses longer code and unrolls more aggressively, 
inducing higher miss rates in the instruction caches.

-O2 can beat -O3 in some cases when code size is important.


One of the historical reasons for favoring -O2 optimization level 
over -O3 was the necessity for Linux distributions to fit on a CD 
or DVD. Also if everyone is using -O2 optimizations, then -O3 
optimizations get a lot less testing coverage and are more likely 
to have compiler bugs. This makes -O2 even more attractive for 
those who prefer safety and stability...


I think that it's a good thing that LDC is breaking out of this 
-O2 vs. -O3 dilemma by just mapping "-O" option to -O3 
("aggressive optimizations"):


Setting the optimization level:
  -O    - Equivalent to -O3
  --O0  - No optimizations (default)
  --O1  - Simple optimizations
  --O2  - Good optimizations
  --O3  - Aggressive optimizations
  --O4  - Equivalent to -O3
  --O5  - Equivalent to -O3
  --Os  - Like -O2 with extra optimizations for size
  --Oz  - Like -Os but reduces code size further


I wonder if GDC can do the same?


Re: gdc or ldc for faster programs?

2022-01-31 Thread Elronnd via Digitalmars-d-learn
On Monday, 31 January 2022 at 08:54:16 UTC, Patrick Schluter 
wrote:
-O3 often chooses longer code and unrolls more aggressively, 
inducing higher miss rates in the instruction caches.

-O2 can beat -O3 in some cases when code size is important.


That is generally true.  My point is that GCC and Clang make 
different tradeoffs when told '-O2'; Clang is more aggressive 
than GCC at -O2.  I don't know if that still holds at -O3 (I 
expect probably not).


Re: gdc or ldc for faster programs?

2022-01-31 Thread Patrick Schluter via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 22:41:35 UTC, Elronnd wrote:

On Tuesday, 25 January 2022 at 22:33:37 UTC, H. S. Teoh wrote:
interesting because idivl is known to be one of the slower 
instructions, but gdc nevertheless considered it not 
worthwhile to replace it, whereas ldc seems obsessed about 
avoiding idivl at all costs.


Interesting indeed.  Two remarks:

1. Actual performance cost of div depends a lot on hardware.  
IIRC on my old intel laptop it's like 40-60 cycles; on my newer 
amd chip it's more like 20; on my mac it's ~10.  GCC may be 
assuming newer hardware than llvm.  Could be worth popping on a 
-march=native -mtune=native.  Also could depend on how many 
ports can do divs; i.e. how many of them you can have running 
at a time.


2. LLVM is more aggressive wrt certain optimizations than gcc, 
by default.  Though I don't know how relevant that is at -O3.


-O3 often chooses longer code and unrolls more aggressively, 
inducing higher miss rates in the instruction caches.

-O2 can beat -O3 in some cases when code size is important.


Re: gdc or ldc for faster programs?

2022-01-30 Thread Salih Dincer via Digitalmars-d-learn

On Saturday, 29 January 2022 at 18:28:06 UTC, Ali Çehreli wrote:

On 1/29/22 10:04, Salih Dincer wrote:

> Could you also try the following
> code with the same configurations?

The program you posted with 2 million random values:

ldc 1.9 seconds
gdc 2.3 seconds
dmd 2.8 seconds

I understand such short tests are not definitive but to have a 
rough idea between two programs, the last version of my program 
that used sprintf with 2 million numbers takes less time...




sprintf() might be really fast, but your algorithm is definitely 
2.5x faster than mine (with LDC)! I couldn't compile it with GDC. 
Theoretically, I might have lost the challenge :)


With love and respect...


Re: gdc or ldc for faster programs?

2022-01-29 Thread Siarhei Siamashka via Digitalmars-d-learn

On Saturday, 29 January 2022 at 18:28:06 UTC, Ali Çehreli wrote:
(And now we know gdc can go about 7% faster with additional 
command line switches.)


No, we don't know this yet ;-) That's just what I said and I may 
be bullshitting. Or the configuration of my computer is 
significantly different from yours and the exact speedup/slowdown 
number may be different. So please verify it yourself. You can 
edit your `dub.json` file to add the following line to it:


"dflags-gdc": ["-fno-weak-templates"],

Then rebuild your spellout test program with gdc (just like you 
did before), run benchmarks and report results. The 
'-fno-weak-templates' option should show up in the gdc invocation 
command line.
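
For reference, a minimal sketch of what the whole `dub.json` could 
look like with that line in place (the package name is taken from 
the thread; everything else is illustrative):

```json
{
    "name": "spellout",
    "dflags-gdc": ["-fno-weak-templates"]
}
```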


Re: gdc or ldc for faster programs?

2022-01-29 Thread max haughton via Digitalmars-d-learn

On Saturday, 29 January 2022 at 18:28:06 UTC, Ali Çehreli wrote:

On 1/29/22 10:04, Salih Dincer wrote:

> Could you also try the following code with the same
configurations?

The program you posted with 2 million random values:

ldc 1.9 seconds
gdc 2.3 seconds
dmd 2.8 seconds

I understand such short tests are not definitive but to have a 
rough idea between two programs, the last version of my program 
that used sprintf with 2 million numbers takes less time:


ldc 0.4 seconds
gdc 0.5 seconds
dmd 0.5 seconds

(And now we know gdc can go about 7% faster with additional 
command line switches.)


Ali


You need to be compiling with PGO to test the compiler's 
optimizer to the maximum. Without PGO they have to assume a 
fairly conservative flow through the code, which means things 
like inlining and register allocation are effectively flying 
blind.
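
For anyone who wants to try it, a rough sketch of a PGO workflow, 
assuming the flag names that GDC 11 and LDC 1.28 ship with (file 
names are placeholders):

```
# gdc: instrument, run a representative workload, rebuild with the profile
gdc -O3 -fprofile-generate app.d -o app && ./app
gdc -O3 -fprofile-use app.d -o app

# ldc: instrument, run, merge the raw profile, rebuild with it
ldc2 -O3 -fprofile-instr-generate=app.profraw app.d -of=app && ./app
ldc-profdata merge -output=app.profdata app.profraw
ldc2 -O3 -fprofile-instr-use=app.profdata app.d -of=app
```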




Re: gdc or ldc for faster programs?

2022-01-29 Thread Ali Çehreli via Digitalmars-d-learn

On 1/29/22 10:04, Salih Dincer wrote:

> Could you also try the following code with the same configurations?

The program you posted with 2 million random values:

ldc 1.9 seconds
gdc 2.3 seconds
dmd 2.8 seconds

I understand such short tests are not definitive but to have a rough 
idea between two programs, the last version of my program that used 
sprintf with 2 million numbers takes less time:


ldc 0.4 seconds
gdc 0.5 seconds
dmd 0.5 seconds

(And now we know gdc can go about 7% faster with additional command line 
switches.)


Ali



Re: gdc or ldc for faster programs?

2022-01-29 Thread Salih Dincer via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 18:00:41 UTC, Ali Çehreli wrote:



For completeness (and noise :/) here is the final version of 
the program:




Could you also try the following code with the same 
configurations?


```d
struct LongScale {
  struct ShortStack {
    short[] stack;
    size_t index;

    @property back() {
      return this.stack[0];
    }

    @property push(short data) {
      this.stack ~= data;
      this.index++;
    }

    @property pop() {
      return this.stack[--this.index];
    }
  }

  ShortStack stack;

  this(long i) {
    long s, t = i;
    for (long e = 3; e <= 18; e += 3) {
      s = 10^^e;
      stack.push = cast(short)((t % s) / (s / 1000L));
      t -= t % s;
    }
    stack.push = cast(short)(t / s);
  }

  string toString() {
    string[] scale = [" zero", "thousand", "million",
                      "billion", "trillion", "quadrillion", "quintillion"];
    string r;
    for (long e = 6; e > 0; e--) {
      auto t = stack.pop;
      r ~= t > 1 ? " " ~ to!string(t) : t ? " one" : "";
      r ~= t ? " " ~ scale[e] : "";
    }
    r ~= stack.back ? " " ~ to!string(stack.back) : "";
    return r.length ? r : scale[0];
  }
}

import std.conv, std.stdio;

void main()
{
  long[] inputs = [ 741, 1_500, 2_001,
                    5_005, 1_250_000, 3_000_042, 10_000_000,
                    1_000_000, 2_000_000, 100_000, 200_000,
                    10_000, 20_000, 1_000, 2_000, 74, 7, 0,
                    1_999_999_999_999];

  foreach (long i; inputs) {
    auto OUT = LongScale(i);
    auto STR = OUT.toString[1 .. $];
    writefln!"%s"(STR);
  }
}
```


Re: gdc or ldc for faster programs?

2022-01-28 Thread Siarhei Siamashka via Digitalmars-d-learn

On Friday, 28 January 2022 at 18:02:27 UTC, Iain Buclaw wrote:

For example, druntime depends on this behaviour.

Template: 
https://github.com/dlang/druntime/blob/a0ad8c42c15942faeeafb016e81a360113ae1b6b/src/rt/config.d#L46-L58


Ouch. From where I stand, this looks like a really ugly hack 
abusing both the template keyword and the mangle pragma. Presumably 
intended to implement this part of the spec: 
https://dlang.org/library/rt/config.html


Moreover, these are even global variables rather than functions. 
Wouldn't it make more sense to use a special "weak" attribute for 
this particular use case? I see that there was a related 
discussion here: 
https://forum.dlang.org/post/rgmp5d$198g$1...@digitalmars.com


Regular symbol: 
https://github.com/dlang/druntime/blob/a17bb23b418405e1ce8e4a317651039758013f39/test/config/src/test19433.d#L1


If we can rely on instantiated symbols to not violate ODR, then 
you would be able to put symbols in the .link-once section.  
However all duplicates must also be in the .link-once section, 
else you'll get duplicate definition errors.


Duplicate definition errors are surely better than something 
fishy silently happening under the hood. They can be solved 
when/if we encounter them. That said, I can confirm that GDC 10 
indeed fails with `multiple definition of 'rt_cmdline_enabled'` 
linker error when trying to compile:


```D
extern(C) __gshared bool rt_cmdline_enabled = false;
void main() { }
```

But can't GDC just use something like this in `rt/config.d` to 
solve the problem?

```D
version(GNU) {
    import gcc.attribute;
    pragma(mangle, "rt_envvars_enabled") @attribute("weak")
    __gshared bool rt_envvars_enabled_ = false;
    pragma(mangle, "rt_cmdline_enabled") @attribute("weak")
    __gshared bool rt_cmdline_enabled_ = true;
    pragma(mangle, "rt_options") @attribute("weak")
    __gshared string[] rt_options_ = [];

    bool rt_envvars_enabled()() { return rt_envvars_enabled_; }
    bool rt_cmdline_enabled()() { return rt_cmdline_enabled_; }
    string[] rt_options()() { return rt_options_; }
} else {
    // put each variable in its own COMDAT by making them template instances
    template rt_envvars_enabled()
    {
        pragma(mangle, "rt_envvars_enabled") __gshared bool
            rt_envvars_enabled = false;
    }
    template rt_cmdline_enabled()
    {
        pragma(mangle, "rt_cmdline_enabled") __gshared bool
            rt_cmdline_enabled = true;
    }
    template rt_options()
    {
        pragma(mangle, "rt_options") __gshared string[]
            rt_options = [];
    }
}
```


Re: gdc or ldc for faster programs?

2022-01-28 Thread Iain Buclaw via Digitalmars-d-learn
On Thursday, 27 January 2022 at 20:28:40 UTC, Siarhei Siamashka 
wrote:
On Thursday, 27 January 2022 at 18:12:18 UTC, Johan Engelen 
wrote:
But the language requires ODR, so we can emit templates as 
weak_odr, telling the optimizer and linker that the symbols 
should be merged _and_ that ODR can be assumed to hold (i.e. 
inlining is OK).


Thanks! This was also my impression. But the problem is that 
Iain Buclaw seems to disagree with us. He claims that template 
functions must be overridable by global functions and this is 
supposed to inhibit template function inlining. Is there any 
independent source to back up your or Iain's claim?




For example, druntime depends on this behaviour.

Template: 
https://github.com/dlang/druntime/blob/a0ad8c42c15942faeeafb016e81a360113ae1b6b/src/rt/config.d#L46-L58


Regular symbol: 
https://github.com/dlang/druntime/blob/a17bb23b418405e1ce8e4a317651039758013f39/test/config/src/test19433.d#L1


If we can rely on instantiated symbols to not violate ODR, then 
you would be able to put symbols in the .link-once section.  
However all duplicates must also be in the .link-once section, 
else you'll get duplicate definition errors.


Re: gdc or ldc for faster programs?

2022-01-27 Thread Siarhei Siamashka via Digitalmars-d-learn

On Thursday, 27 January 2022 at 18:12:18 UTC, Johan Engelen wrote:
But the language requires ODR, so we can emit templates as 
weak_odr, telling the optimizer and linker that the symbols 
should be merged _and_ that ODR can be assumed to hold (i.e. 
inlining is OK).


Thanks! This was also my impression. But the problem is that Iain 
Buclaw seems to disagree with us. He claims that template 
functions must be overridable by global functions and this is 
supposed to inhibit template function inlining. Is there any 
independent source to back up your or Iain's claim?


The onus of honouring ODR is on the user - not the compiler - 
because we allow the user to do separate compilation.


My own limited experiments with various code snippets convinced 
me that D compilers actually try their best to prevent ODR 
violation, so it isn't like users can easily hurt themselves: 
https://forum.dlang.org/thread/cstjhjvmmibonbajw...@forum.dlang.org


Also, module names are included in function name mangling, so 
an accidental clash of symbol names shouldn't be very likely in 
a valid D project. Though I'm not absolutely sure whether this 
provides a sufficient safety net.
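
As a quick illustration of the module name showing up in the 
mangling, a tiny sketch (module and function names are arbitrary):

```d
module app;
import std.stdio;

void foo() {}

void main() {
    // Prints something like _D3app3fooFZv; the "3app" part
    // is the module name embedded in the symbol.
    writeln(foo.mangleof);
}
```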


Re: gdc or ldc for faster programs?

2022-01-27 Thread Johan Engelen via Digitalmars-d-learn

On Thursday, 27 January 2022 at 16:46:59 UTC, Ali Çehreli wrote:


What I know is that weak symbols can be overridden by strong 
symbols during linking. Which means, if a function body is 
inlined which also has a weak symbol, some part of the program 
may be using the inlined definition and some other parts may be 
using the overridden definition. Thanks to separate 
compilation, they need not match hence the violation of the 
one-definition rule (ODR).


But the language requires ODR, so we can emit templates as 
weak_odr, telling the optimizer and linker that the symbols 
should be merged _and_ that ODR can be assumed to hold (i.e. 
inlining is OK).
The onus of honouring ODR is on the user - not the compiler - 
because we allow the user to do separate compilation. Some more 
detailed explanation and example:

https://stackoverflow.com/questions/44335046/how-does-the-linker-handle-identical-template-instantiations-across-translation/44346057

-Johan



Re: gdc or ldc for faster programs?

2022-01-27 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 27, 2022 at 08:46:59AM -0800, Ali Çehreli via Digitalmars-d-learn 
wrote:
[...]
> I see that template instantiations are linked through weak symbols:
> 
> $ nm deneme | grep foo
> [...]
> 00021380 W _D6deneme__T3fooTiZQhFNaNbNiNfZv
> 
> What I know is that weak symbols can be overridden by strong symbols
> during linking.
[...]

Yes, and it also means that only one copy of the symbol will make it
into the executable. This is one of the ways we leverage the linker to
eliminate (merge) duplicate template instantiations.


T

-- 
Claiming that your operating system is the best in the world because more 
people use it is like saying McDonalds makes the best food in the world. -- 
Carl B. Constantine


Re: gdc or ldc for faster programs?

2022-01-27 Thread Ali Çehreli via Digitalmars-d-learn

On 1/26/22 11:07, Siarhei Siamashka wrote:
> On Wednesday, 26 January 2022 at 18:41:51 UTC, Iain Buclaw wrote:
>> The D language shot itself in the foot by requiring templates to have
>> weak semantics.
>>
>> If DMD and LDC inline weak functions, that's their bug.
>
> As I already mentioned in the bugzilla, it would be really useful to see
> a practical example of DMD and LDC running into troubles because of
> mishandling weak templates.

I am not experienced enough to answer but the way I understand weak 
symbols, it is possible to run into trouble but it will probably never 
happen. When it happens, I suspect people can find workarounds like 
disabling inlining.


> I was never able to find anything about
> "requiring templates to have weak semantics" anywhere in the Dlang
> documentation or on the Internet.

The truth is some part of D's spec is the implementation. When I compile 
the following program (with dmd)


void foo(T)() {}

void main() {
  foo!int();
}

I see that template instantiations are linked through weak symbols:

$ nm deneme | grep foo
[...]
00021380 W _D6deneme__T3fooTiZQhFNaNbNiNfZv

What I know is that weak symbols can be overridden by strong symbols 
during linking. Which means, if a function body is inlined which also 
has a weak symbol, some part of the program may be using the inlined 
definition and some other parts may be using the overridden definition. 
Thanks to separate compilation, they need not match hence the violation 
of the one-definition rule (ODR).


Ali



Re: gdc or ldc for faster programs?

2022-01-26 Thread Siarhei Siamashka via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 18:41:51 UTC, Iain Buclaw wrote:
The D language shot itself in the foot by requiring templates 
to have weak semantics.


If DMD and LDC inline weak functions, that's their bug.


As I already mentioned in the bugzilla, it would be really useful 
to see a practical example of DMD and LDC running into troubles 
because of mishandling weak templates. I was never able to find 
anything about "requiring templates to have weak semantics" 
anywhere in the Dlang documentation or on the Internet. Asking 
for clarification in this forum yielded no results either. Maybe 
I'm missing something obvious when reading the 
https://dlang.org/spec/template.html page?


I have no doubt that you have your own opinion about how this 
stuff is supposed to work, but I have no crystal ball and don't 
know what's happening in your head.


Re: gdc or ldc for faster programs?

2022-01-26 Thread Iain Buclaw via Digitalmars-d-learn
On Wednesday, 26 January 2022 at 18:39:07 UTC, Siarhei Siamashka 
wrote:


It's not DMD doing a good job here, but GDC11 shooting itself 
in the foot by requiring additional esoteric command line 
options if you really want to produce optimized binaries.


The D language shot itself in the foot by requiring templates to 
have weak semantics.


If DMD and LDC inline weak functions, that's their bug.


Re: gdc or ldc for faster programs?

2022-01-26 Thread Siarhei Siamashka via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 18:00:41 UTC, Ali Çehreli wrote:
ldc shines with sprintf. And dmd surprises by being a little bit 
faster than gdc! (?)


ldc (2.098.0): ~6.2 seconds
dmd (2.098.1): ~7.4 seconds
gdc (2.076.?): ~7.5 seconds

Again, here are the versions of the compilers that are readily 
available on my system:


> ldc: LDC - the LLVM D compiler (1.28.0):
>based on DMD v2.098.0 and LLVM 13.0.0
>
> gdc: gdc (GCC) 11.1.0 (Uses dmd 2.076 front end)


It's not DMD doing a good job here, but GDC11 shooting itself in 
the foot by requiring additional esoteric command line options 
if you really want to produce optimized binaries. See 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765 for more 
details.


You can try to re-run your benchmark after adding '-flto' or 
'-fno-weak-templates' to the GDC command line. I see a ~7% speedup 
for your code on my computer.
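
Outside of dub, the equivalent direct invocation would be along 
these lines (file name hypothetical):

```
gdc -O3 -frelease -fno-weak-templates spellout.d -o spellout
```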


Re: gdc or ldc for faster programs?

2022-01-26 Thread Iain Buclaw via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 11:43:39 UTC, forkit wrote:
On Wednesday, 26 January 2022 at 11:25:47 UTC, Iain Buclaw 
wrote:


Whenever I've watched talks/demos where benchmarks were the 
central topic, GDC has always blown LDC out of the water when it 
comes to matters of math.

..


https://dlang.org/blog/2020/05/14/lomutos-comeback/


Andrei forgot to do a follow-up where one weird trick makes the 
gdc-compiled Lomuto the same speed as C++ (and faster than ldc).


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96429


Re: gdc or ldc for faster programs?

2022-01-26 Thread Ali Çehreli via Digitalmars-d-learn
ldc shines with sprintf. And dmd surprises by being a little bit faster 
than gdc! (?)


ldc (2.098.0): ~6.2 seconds
dmd (2.098.1): ~7.4 seconds
gdc (2.076.?): ~7.5 seconds

Again, here are the versions of the compilers that are readily available 
on my system:


> ldc: LDC - the LLVM D compiler (1.28.0):
>based on DMD v2.098.0 and LLVM 13.0.0
>
> gdc: gdc (GCC) 11.1.0 (Uses dmd 2.076 front end)
>
> dmd: DMD64 D Compiler v2.098.1

They were compiled with

  dub run --compiler= --build=release-nobounds --verbose

where  was ldc, dmd, or gdc.

I replaced formattedWrite in the code with sprintf. For example, the 
inner loop became


  foreach (divider; dividers!T.retro) {
const quotient = number / divider.value;

if (quotient) {
  output += sprintf(output, fmt!T.ptr, quotient, divider.word.ptr);
}

number %= divider.value;
  }
}

For completeness (and noise :/) here is the final version of the program:

module spellout.spellout;

// This program was written as a programming kata to spell out
// certain parts of integers as in "1 million 2 thousand
// 42". Note that this way of spelling-out numbers is not
// grammatically correct in English.

// Returns a string that contains the partly spelled-out version
// of the parameter.
//
// You must copy the returned string when needed as this function
// uses the same internal buffer for all invocations of the same
// template instance.
auto spellOut(T)(in T number_) {
  import std.string : strip;
  import std.traits : Unqual;
  import std.meta : AliasSeq;
  import core.stdc.stdio : sprintf;

  enum longestString =
"negative 9 quintillion 223 quadrillion 372 trillion" ~
" 36 billion 854 million 775 thousand 808";

  static char[longestString.length + 1] buffer;
  auto output = buffer.ptr;

  // We treat these specially because the algorithm below does
  // 'number = -number' and calls the same implementation
  // function. The trouble is, for example, -int.min is still a
  // negative number.
  alias problematics = AliasSeq!(
byte, "negative 128",
short, "negative 32 thousand 768",
int, "negative 2 billion 147 million 483 thousand 648",
long, longestString);

  static assert((problematics.length % 2) == 0);

  static foreach (i, P; problematics) {
static if (i % 2) {
  // This is a string; skip

} else {
  // This is a problematic type
  static if (is (T == P)) {
// Our T happens to be this problematic type
if (number_ == T.min) {
  // and we are dealing with a problematic value
  output += sprintf(output, problematics[i + 1].ptr);
  return buffer[0 .. (output - buffer.ptr)];
}
  }
}
  }

  auto number = cast(Unqual!T)number_; // Thanks 'in'! :p

  if (number == 0) {
output += sprintf(output, "zero");

  } else {
if (number < 0) {
  output += sprintf(output, "negative");
  static if (T.sizeof < int.sizeof) {
// Being careful with implicit conversions. (See the dmd
// command line switch -preview=intpromote)
number = cast(T)(-cast(int)number);

  } else {
number = -number;
  }
}

spellOutImpl(number, output);
  }

  return buffer[0 .. (output - buffer.ptr)].strip;
}

unittest {
  assert(1_001_500.spellOut == "1 million 1 thousand 500");
  assert((-1_001_500).spellOut ==
 "negative 1 million 1 thousand 500");
  assert(1_002_500.spellOut == "1 million 2 thousand 500");
}

template fmt(T) {
  static if (is (T == long)||
 is (T == ulong)) {
static fmt = " %lld %s";

  } else {
static fmt = " %u %s";
  }
}

import std.format : format;

void spellOutImpl(T)(T number, ref char * output)
in (number > 0, format!"Invalid number: %s"(number)) {
  import std.range : retro;
  import core.stdc.stdio : sprintf;

  foreach (divider; dividers!T.retro) {
const quotient = number / divider.value;

if (quotient) {
  output += sprintf(output, fmt!T.ptr, quotient, divider.word.ptr);
}

number %= divider.value;
  }
}

struct Divider(T) {
  T value;// 1_000, 1_000_000, etc.
  string word;// "thousand", etc
}

// Returns the words related with the provided size of an
// integral type. The parameter is number of bytes
// e.g. int.sizeof
auto words(size_t typeSize) {
  // This need not be recursive at all but it was fun using
  // recursion.
  final switch (typeSize) {
  case 1: return [ "" ];
  case 2: return words(1) ~ [ "thousand" ];
  case 4: return words(2) ~ [ "million", "billion" ];
  case 8: return words(4) ~ [ "trillion", "quadrillion", "quintillion" ];
  }
}

unittest {
  // These are relevant words for 'int' and 'uint' values:
  assert(words(4) == [ "", "thousand", "million", "billion" ]);
}

// Returns a Divider!T array associated with T
auto dividers(T)() {
  import std.range : array, enumerate;
  import std.algorithm : map;

  static const(Divider!T[]) result =
words(T.sizeof)
.enumerate!T
    .map!(t => Divider!T(cast(T)(10^^(t.index * 3)), t.value))
    .array;

  return result;
}

Re: gdc or ldc for faster programs?

2022-01-26 Thread Steven Schveighoffer via Digitalmars-d-learn

On 1/26/22 7:06 AM, Johan wrote:

Couldn't test with LDC 1.6 (dlang2.076), because it is too old and not 
running on M1/Monterey (?).


There was a range of macOS dmd binaries that did not work after a 
certain macOS version. I think it had to do with the hack for TLS 
that Apple changed, so it no longer worked.


-Steve


Re: gdc or ldc for faster programs?

2022-01-26 Thread Ali Çehreli via Digitalmars-d-learn

On 1/26/22 04:06, Johan wrote:

> The stdlib makes a huge difference in performance.
> Ali's program uses string manipulation,

Yes, on the surface, I thought my inner loop had just / and % but of 
course there is that formattedWrite. I will change the code to use 
sprintf with a static buffer (instead of the current Appender).


> GC

That shouldn't affect it because there are just about 8 allocations to 
be shared in the Appender.


> , ... much more than to()

Not in the 2 million loop.

> and
> map().

Only in the initialization.

> Quick test on my M1 macbook:
> LDC1.27, arm64 binary (native): ~0.83s
> LDC1.21, x86_64 binary (rosetta, not native to CPU instruction set): 
~0.75s


I think std.format gained abilities over the years. I will report back.

Ali



Re: gdc or ldc for faster programs?

2022-01-26 Thread Johan via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 11:25:47 UTC, Iain Buclaw wrote:
On Wednesday, 26 January 2022 at 04:28:25 UTC, Ali Çehreli 
wrote:

On 1/25/22 16:15, Johan wrote:
> On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli
wrote:
>>
>> I am using compilers installed by Manjaro Linux's package
system:
>>
>> ldc: LDC - the LLVM D compiler (1.28.0):
>>   based on DMD v2.098.0 and LLVM 13.0.0
>>
>> gdc: gdc (GCC) 11.1.0
>>
>> dmd: DMD64 D Compiler v2.098.1
>
> What phobos version is gdc using?

Oh! Good question. Unfortunately, I don't think Phobos modules 
contain that information. The following line outputs 2076L:


pragma(msg, __VERSION__);

So, I guess I've been comparing apples to oranges but in this 
case an older gdc is doing pretty well.




Doubt it.  Functions such as to(), map(), etc. have pretty much 
remained unchanged for the last 6-7 years.


The stdlib makes a huge difference in performance.
Ali's program uses string manipulation, GC, ... much more than 
to() and map().


Quick test on my M1 macbook:
LDC1.27, arm64 binary (native): ~0.83s
LDC1.21, x86_64 binary (rosetta, not native to CPU instruction 
set): ~0.75s
Couldn't test with LDC 1.6 (dlang2.076), because it is too old 
and not running on M1/Monterey (?).


-Johan



Re: gdc or ldc for faster programs?

2022-01-26 Thread forkit via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 11:25:47 UTC, Iain Buclaw wrote:


Whenever I've watched talks/demos where benchmarks were the 
central topic, GDC has always blown LDC out of the water when it 
comes to matters of math.

..


https://dlang.org/blog/2020/05/14/lomutos-comeback/



Re: gdc or ldc for faster programs?

2022-01-26 Thread Iain Buclaw via Digitalmars-d-learn

On Wednesday, 26 January 2022 at 04:28:25 UTC, Ali Çehreli wrote:

On 1/25/22 16:15, Johan wrote:
> On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli
wrote:
>>
>> I am using compilers installed by Manjaro Linux's package
system:
>>
>> ldc: LDC - the LLVM D compiler (1.28.0):
>>   based on DMD v2.098.0 and LLVM 13.0.0
>>
>> gdc: gdc (GCC) 11.1.0
>>
>> dmd: DMD64 D Compiler v2.098.1
>
> What phobos version is gdc using?

Oh! Good question. Unfortunately, I don't think Phobos modules 
contain that information. The following line outputs 2076L:


pragma(msg, __VERSION__);

So, I guess I've been comparing apples to oranges but in this 
case an older gdc is doing pretty well.




Doubt it.  Functions such as to(), map(), etc. have pretty much 
remained unchanged for the last 6-7 years.


Whenever I've watched talks/demos where benchmarks were the 
central topic, GDC has always blown LDC out of the water when it 
comes to matters of math.


Even in more recent examples where I've been pushing for native 
complex to be replaced with std.complex, LDC was found to be 
slower with std.complex, but GDC was either equal to or faster 
than native (and GDC std.complex was faster than LDC).


Re: gdc or ldc for faster programs?

2022-01-25 Thread Ali Çehreli via Digitalmars-d-learn

On 1/25/22 16:15, Johan wrote:
> On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
>>
>> I am using compilers installed by Manjaro Linux's package system:
>>
>> ldc: LDC - the LLVM D compiler (1.28.0):
>>   based on DMD v2.098.0 and LLVM 13.0.0
>>
>> gdc: gdc (GCC) 11.1.0
>>
>> dmd: DMD64 D Compiler v2.098.1
>
> What phobos version is gdc using?

Oh! Good question. Unfortunately, I don't think Phobos modules contain 
that information. The following line outputs 2076L:


pragma(msg, __VERSION__);

So, I guess I've been comparing apples to oranges but in this case an 
older gdc is doing pretty well.


Ali



Re: gdc or ldc for faster programs?

2022-01-25 Thread Johan via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:


I am using compilers installed by Manjaro Linux's package 
system:


ldc: LDC - the LLVM D compiler (1.28.0):
  based on DMD v2.098.0 and LLVM 13.0.0

gdc: gdc (GCC) 11.1.0

dmd: DMD64 D Compiler v2.098.1


What phobos version is gdc using?

-Johan



Re: gdc or ldc for faster programs?

2022-01-25 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 25, 2022 at 11:01:57PM +, forkit via Digitalmars-d-learn wrote:
> On Tuesday, 25 January 2022 at 20:01:18 UTC, Johan wrote:
> > 
> > Tough to say. Of course DMD is not a serious contender, but I
> > believe the difference between GDC and LDC is very small and really
> > in the details, i.e. you'll have to look at assembly to find out the
> > delta.  Have you tried `--enable-cross-module-inlining` with LDC?
[...]
> dmd is the best though, in terms of compilation speed without
> optimisation.
> 
> As I write/test A LOT of code, that time saved is very much
> appreciated ;-)
[...]

My general approach is: use dmd for iterating the code - compile - test
cycle, and use LDC for release/production builds.
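
In dub terms that workflow could look something like this (works for
any project):

```
dub build --compiler=dmd                            # fast edit-compile-test cycle
dub build --compiler=ldc2 --build=release-nobounds  # optimized production build
```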


T

-- 
Chance favours the prepared mind. -- Louis Pasteur


Re: gdc or ldc for faster programs?

2022-01-25 Thread forkit via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 20:01:18 UTC, Johan wrote:


Tough to say. Of course DMD is not a serious contender, but I 
believe the difference between GDC and LDC is very small and 
really in the details, i.e. you'll have to look at assembly to 
find out the delta.

Have you tried `--enable-cross-module-inlining` with LDC?

-Johan


dmd is the best though, in terms of compilation speed without 
optimisation.


As I write/test A LOT of code, that time saved is very much 
appreciated ;-)


I hope it remains that way.


Re: gdc or ldc for faster programs?

2022-01-25 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 25, 2022 at 10:41:35PM +, Elronnd via Digitalmars-d-learn wrote:
> On Tuesday, 25 January 2022 at 22:33:37 UTC, H. S. Teoh wrote:
> > interesting because idivl is known to be one of the slower
> > instructions, but gdc nevertheless considered it not worthwhile to
> > replace it, whereas ldc seems obsessed about avoiding idivl at all
> > costs.
> 
> Interesting indeed.  Two remarks:
> 
> 1. Actual performance cost of div depends a lot on hardware.  IIRC on
> my old intel laptop it's like 40-60 cycles; on my newer amd chip it's
> more like 20; on my mac it's ~10.  GCC may be assuming newer hardware
> than llvm.  Could be worth popping on a -march=native -mtune=native.
> Also could depend on how many ports can do divs; i.e. how many of them
> you can have running at a time.

I tried `ldc2 -mcpu=native` but that did not significantly change the
performance.


> 2. LLVM is more aggressive wrt certain optimizations than gcc, by
> default.  Though I don't know how relevant that is at -O3.

Yeah, I've noted in the past that LDC seems to be pretty aggressive with
inlining / loop unrolling, whereas GDC has a thing for vectorization and
SIMD/XMM usage.  The exact outcomes are a toss-up, though. Sometimes LDC
wins, sometimes GDC wins.  Depends on what exactly the code is doing.


T

-- 
"Outlook not so good." That magic 8-ball knows everything! I'll ask about 
Exchange Server next. -- (Stolen from the net)


Re: gdc or ldc for faster programs?

2022-01-25 Thread Ali Çehreli via Digitalmars-d-learn

On 1/25/22 14:33, H. S. Teoh wrote:

> This is very interesting

Fascinating code generation and investigation! :)

Ali



Re: gdc or ldc for faster programs?

2022-01-25 Thread Elronnd via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 22:33:37 UTC, H. S. Teoh wrote:
interesting because idivl is known to be one of the slower 
instructions, but gdc nevertheless considered it not worthwhile 
to replace it, whereas ldc seems obsessed about avoiding idivl at 
all costs.


Interesting indeed.  Two remarks:

1. Actual performance cost of div depends a lot on hardware.  
IIRC on my old intel laptop it's like 40-60 cycles; on my newer 
amd chip it's more like 20; on my mac it's ~10.  GCC may be 
assuming newer hardware than llvm.  Could be worth popping on a 
-march=native -mtune=native.  Also could depend on how many ports 
can do divs; i.e. how many of them you can have running at a time.


2. LLVM is more aggressive wrt certain optimizations than gcc, by 
default.  Though I don't know how relevant that is at -O3.


Re: gdc or ldc for faster programs?

2022-01-25 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 25, 2022 at 01:30:59PM -0800, Ali Çehreli via Digitalmars-d-learn 
wrote:
[...]
> I posted the program to have more eyes on the assembly. ;)
[...]

I tested the code locally, and observed, just like Ali did, that the LDC
version is unambiguously slower than the gdc version by a small margin.

So I decided to compare the disassembly.  Due to the large number of
templates in the main spellOut/spellOutImpl functions, I didn't have the
time to look at all of them; I just arbitrarily picked the !(int)
instantiation. And I'm seeing something truly fascinating:

- The GDC version has at its core a single idivl instruction for the /
  and %= operators (I surmise that the optimizer realized that both
  could share the same instruction because it yields both results).  The
  function is short and compact.

- The LDC version, however, seems to go out of its way to avoid the
  idivl instruction, having instead a whole bunch of shr instructions
  and imul instructions involving magic constants -- the kind of stuff
  you see in bit-twiddling hacks when people try to ultra-optimize their
  code.  There also appears to be some loop unrolling, and the function
  is markedly longer than the GDC version because of this.

This is very interesting because idivl is known to be one of the slower
instructions, but gdc nevertheless considered it not worthwhile to
replace it, whereas ldc seems obsessed about avoiding idivl at all costs.

I didn't check the other instantiations, but it would appear that in
this case the simpler route of just using idivl won over the complexity
of trying to replace it with shr+mul.
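
For reference, the shr+imul sequences are the classic fixed-point
reciprocal replacement for division by a constant. A hand-written D
sketch of what unsigned division by 1000 turns into (the magic
constant here is the textbook one; compilers derive such constants
automatically):

```d
import std.stdio;

uint divBy1000(uint x) {
    // x / 1000 == (x * ceil(2^^38 / 1000)) >> 38 for every 32-bit x;
    // one multiply and one shift instead of a div instruction.
    return cast(uint)((cast(ulong)x * 274_877_907) >> 38);
}

void main() {
    assert(divBy1000(999) == 0);
    assert(divBy1000(123_456) == 123);
    assert(divBy1000(uint.max) == uint.max / 1000);
    writeln(divBy1000(2_000_000)); // 2000
}
```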


T

-- 
Guns don't kill people. Bullets do.


Re: gdc or ldc for faster programs?

2022-01-25 Thread Ali Çehreli via Digitalmars-d-learn

On 1/25/22 12:42, H. S. Teoh wrote:

>> For a test run for 2 million numbers:
>>
>> ldc: ~0.95 seconds
>> gdc: ~0.79 seconds
>> dmd: ~1.77 seconds
>
> For measurements under 1 second, I'm skeptical of the accuracy, because
> there could be all kinds of background noise, CPU interrupts and stuff
> that could be skewing the numbers.  What about do a best-of-3-runs with
> 20 million numbers (expected <20 seconds per run) and see how the
> numbers look?

Makes sense. The results are similar to the 2 million run.

> But these sorts of statements are just generalizations. The best way to
> find out for sure is to disassemble the executable and see for yourself
> what the assembly looks like. :-)

I posted the program to have more eyes on the assembly. ;)

Ali



Re: gdc or ldc for faster programs?

2022-01-25 Thread Ali Çehreli via Digitalmars-d-learn

On 1/25/22 12:59, Daniel N wrote:


Maybe you can try --ffast-math on ldc.


Did not make a difference.

Ali



Re: gdc or ldc for faster programs?

2022-01-25 Thread Ali Çehreli via Digitalmars-d-learn

On 1/25/22 12:01, Johan wrote:


Have you tried `--enable-cross-module-inlining` with LDC?


Tried now. Makes no difference that I can sense, likely because there is 
only one module anyway. :) (But I guess it works over Phobos modules too.)


Ali


Re: gdc or ldc for faster programs?

2022-01-25 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 25, 2022 at 08:04:04PM +, Adam D Ruppe via Digitalmars-d-learn 
wrote:
> On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
> > ldc: ~0.95 seconds
> > gdc: ~0.79 seconds
> > dmd: ~1.77 seconds
> 
> Not surprising at all: gdc is excellent and underrated in the
> community.

The GCC optimizer is actually pretty darned good, comparable to LDC's. I
only prefer LDC because of easier cross-compilation and more up-to-date
language version (due to GDC being tied to GCC's release cycle). But I
wouldn't hesitate to use gdc if I didn't need to cross-compile or use
features from the latest language version.

DMD's optimizer is miles behind LDC/GDC, sad to say. About the only
thing that keeps me using dmd is its lightning-fast compilation times,
ideal for iterative development. For anything performance related, DMD
isn't even on my radar.


T

-- 
Doubtless it is a good thing to have an open mind, but a truly open mind should 
be open at both ends, like the food-pipe, with the capacity for excretion as 
well as absorption. -- Northrop Frye


Re: gdc or ldc for faster programs?

2022-01-25 Thread Ali Çehreli via Digitalmars-d-learn

On 1/25/22 11:52, Ali Çehreli wrote:

> a program I wrote about spelling-out parts of a number

Here is the program as a single module:

module spellout.spellout;

// This program was written as a code kata to spell out
// certain parts of integers as in "1 million 2 thousand
// 42". Note that this way of spelling-out numbers is not
// grammatically correct in English.

// Returns a string that contains the partly spelled-out version
// of the parameter.
//
// You must copy the returned string when needed as this function
// uses the same internal buffer for all invocations of the same
// template instance.
auto spellOut(T)(in T number_) {
  import std.array : Appender;
  import std.string : strip;
  import std.traits : Unqual;
  import std.meta : AliasSeq;

  static Appender!(char[]) result;
  result.clear;

  // We treat these specially because the algorithm below does
  // 'number = -number' and calls the same implementation
  // function. The trouble is, for example, -int.min is still a
  // negative number.
  alias problematics = AliasSeq!(
byte, "negative 128",
short, "negative 32 thousand 768",
int, "negative 2 billion 147 million 483 thousand 648",
long, "negative 9 quintillion 223 quadrillion 372 trillion" ~
  " 36 billion 854 million 775 thousand 808");

  static assert((problematics.length % 2) == 0);

  static foreach (i, P; problematics) {
static if (i % 2) {
  // This is a string; skip

} else {
  // This is a problematic type
  static if (is (T == P)) {
// Our T happens to be this problematic type
if (number_ == T.min) {
  // and we are dealing with a problematic value
  result ~= problematics[i + 1];
  return result.data;
}
  }
}
  }

  auto number = cast(Unqual!T)number_; // Thanks 'in'! :p

  if (number == 0) {
result ~= "zero";

  } else {
if (number < 0) {
  result ~= "negative";
  static if (T.sizeof < int.sizeof) {
// Being careful with implicit conversions. (See the dmd
// command line switch -preview=intpromote)
number = cast(T)(-cast(int)number);

  } else {
number = -number;
  }
}

spellOutImpl(number, result);
  }

  return result.data.strip;
}

unittest {
  assert(1_001_500.spellOut == "1 million 1 thousand 500");
  assert((-1_001_500).spellOut ==
 "negative 1 million 1 thousand 500");
  assert(1_002_500.spellOut == "1 million 2 thousand 500");
}

import std.format : format;
import std.range : isOutputRange;

void spellOutImpl(T, O)(T number, ref O output)
if (isOutputRange!(O, char))
in (number > 0, format!"Invalid number: %s"(number)) {
  import std.range : retro;
  import std.format : formattedWrite;

  foreach (divider; dividers!T.retro) {
const quotient = number / divider.value;

if (quotient) {
  output.formattedWrite!" %s %s"(quotient, divider.word);
}

number %= divider.value;
  }
}

struct Divider(T) {
  T value;// 1_000, 1_000_000, etc.
  string word;// "thousand", etc
}

// Returns the words related to the provided size of an
// integral type. The parameter is the number of bytes
// e.g. int.sizeof
auto words(size_t typeSize) {
  // This need not be recursive at all but it was fun using
  // recursion.
  final switch (typeSize) {
  case 1: return [ "" ];
  case 2: return words(1) ~ [ "thousand" ];
  case 4: return words(2) ~ [ "million", "billion" ];
  case 8: return words(4) ~ [ "trillion", "quadrillion", "quintillion" ];
  }
}

unittest {
  // These are relevant words for 'int' and 'uint' values:
  assert(words(4) == [ "", "thousand", "million", "billion" ]);
}

// Returns a Divider!T array associated with T
auto dividers(T)() {
  import std.range : array, enumerate;
  import std.algorithm : map;

  static const(Divider!T[]) result =
words(T.sizeof)
.enumerate!T
.map!(t => Divider!T(cast(T)(10^^(t.index * 3)), t.value))
.array;

  return result;
}

unittest {
  // Test a few entries
  assert(dividers!int[1] == Divider!int(1_000, "thousand"));
  assert(dividers!ulong[3] == Divider!ulong(1_000_000_000, "billion"));
}

void main() {
  version (test) {
return;
  }

  import std.meta : AliasSeq;
  import std.stdio : writefln;
  import std.random : Random, uniform;
  import std.conv : to;

  static foreach (T; AliasSeq!(byte, ubyte, short, ushort,
   int, uint, long, ulong)) {{
  // A few numbers for each type
  report(T.min);
  report((T.max / 4).to!T);  // Overcome int promotion for
 // shorter types because I want
 // to test with the exact type
 // e.g. for byte.
  report(T.max);
}}

  enum count = 2_000_000;
  writefln!"Testing with %,s random numbers"(spellOut(count));

  // Use the same seed to be fair between compilations
  enum seed = 0;
  auto rnd = Random(seed);

  ulong totalLength;
  foreach (i; 0 

Re: gdc or ldc for faster programs?

2022-01-25 Thread Daniel N via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 20:04:04 UTC, Adam D Ruppe wrote:

On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:

ldc: ~0.95 seconds
gdc: ~0.79 seconds
dmd: ~1.77 seconds




Maybe you can try --ffast-math on ldc.


Re: gdc or ldc for faster programs?

2022-01-25 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 25, 2022 at 11:52:17AM -0800, Ali Çehreli via Digitalmars-d-learn 
wrote:
> Sorry for being vague and not giving the code here but a program I
> wrote about spelling-out parts of a number (in Turkish) as in "1
> milyon 42" runs much faster with gdc.
> 
> The program integer-divides the number in a loop to find quotients and
> adds the word next to it. One obvious optimization might be to use
> POSIX div() and friends to get the quotient and the remainder at one
> shot but I made myself believe that the compilers already do that.
> (But still not sure. :o))

Don't guess at what the compilers are doing; disassemble the binary and
see for yourself exactly what the difference is. Use run.dlang.io for a
convenient interface that shows you exactly how the compilers translated
your code. Or if you're macho, use `objdump -d` and search for _Dmain
(or the specific function if you know how it's mangled).
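
For example, something like this (binary name hypothetical):

```
objdump -d ./spellout | grep -A 30 '_Dmain>:'
```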


> I am not experienced with dub but I used --build=release-nobounds and
> verified that -O3 is used for both compilers. (I also tried building
> manually with GNU 'make' with e.g. -O5 and the results were similar.)
> 
> For a test run for 2 million numbers:
> 
> ldc: ~0.95 seconds
> gdc: ~0.79 seconds
> dmd: ~1.77 seconds

For measurements under 1 second, I'm skeptical of the accuracy, because
there could be all kinds of background noise, CPU interrupts and stuff
that could be skewing the numbers.  What about doing best-of-3 runs with
20 million numbers (expected <20 seconds per run) and see how the
numbers look?

Though having said all that, I can say at least that dmd's relatively
poor performance seems in line with my previous observations. :-P The
difference between ldc and gdc is harder to pinpoint; they each have
different optimizers that could work better or worse than the other
depending on the specifics of what the program is doing.


[...]
> I've been mainly a dmd person for various reasons and was under the
> impression that ldc was the clear winner among the three. What is your
> experience? Does gdc compile faster programs in general? Would ldc win
> if I took advantage of e.g. link-time optimizations?
[...]

I'm not sure LDC is the clear winner.  I only prefer LDC because LDC's
architecture makes it easier for cross-compilation (with GCC/GDC you
need to jump through a lot more hoops to get a working cross compiler).
GDC is also tied to the GCC release cycle, and tends to be several
language versions behind LDC.  Both compilers have excellent
optimizers, but they are definitely different, so for some things GDC
will beat LDC and for other things LDC will beat GDC. It may depend on the
specific optimization flags you use as well.

But these sorts of statements are just generalizations. The best way to
find out for sure is to disassemble the executable and see for yourself
what the assembly looks like. :-)


T

-- 
Public parking: euphemism for paid parking. -- Flora


Re: gdc or ldc for faster programs?

2022-01-25 Thread Adam D Ruppe via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:

ldc: ~0.95 seconds
gdc: ~0.79 seconds
dmd: ~1.77 seconds


Not surprising at all: gdc is excellent and underrated in the 
community.


Re: gdc or ldc for faster programs?

2022-01-25 Thread Johan via Digitalmars-d-learn

On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:


I am not experienced with dub but I used 
--build=release-nobounds and verified that -O3 is used for both 
compilers. (I also tried building manually with GNU 'make' with 
e.g. -O5 and the results were similar.)


`-O5` does not do anything different than `-O3` for LDC.


For a test run for 2 million numbers:

ldc: ~0.95 seconds
gdc: ~0.79 seconds
dmd: ~1.77 seconds

I am using compilers installed by Manjaro Linux's package 
system:


ldc: LDC - the LLVM D compiler (1.28.0):
  based on DMD v2.098.0 and LLVM 13.0.0

gdc: gdc (GCC) 11.1.0

dmd: DMD64 D Compiler v2.098.1

I've been mainly a dmd person for various reasons and was under 
the impression that ldc was the clear winner among the three. 
What is your experience? Does gdc compile faster programs in 
general? Would ldc win if I took advantage of e.g. link-time 
optimizations?


Tough to say. Of course DMD is not a serious contender, but I 
believe the difference between GDC and LDC is very small and 
really in the details, i.e. you'll have to look at assembly to 
find out the delta.

Have you tried `--enable-cross-module-inlining` with LDC?
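
(That is, something along the lines of the following; file name 
hypothetical:)

```
ldc2 -O3 -enable-cross-module-inlining app.d
```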

-Johan