Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread H. S. Teoh via Digitalmars-d-learn
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn 
wrote:
[...]
>> Try addressing the points I wrote above and see if it makes a
>> difference.
> 
>I have tried it (all of it) even before you wrote it here, because
>I have completely the same ideas, but to be fair it has almost zero
>effect on speed.
>There is my version (It still use OOP, but I have try it wit
>Printer and Counter to be structs and it has no effect at
>all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
>The only difference in speed in the end is caused by hash
>implementation of dlang associative arrays and rust HashMap,
>actually if you modify rust to not used ahash it has almost same
>speed as D
[...]

I'm confused by the chained hashing of the digits. Why is that
necessary?  I would have thought it'd be faster to hash the entire key
instead of computing the hash of each digit and chaining them together.

I looked up Rust's ahash algorithm. Apparently they leverage the CPU's
hardware AES instruction to compute a collision-resistant hash very
quickly.

Somebody should file a bug on druntime to implement this where the
hardware supports it, instead of the current hashOf. For relatively
small keys this would be a big performance boost.


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal 
lovers in dire need of being reminded to profess their hypothetical love for 
their long-forgotten.


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread Daniel Kozak via Digitalmars-d-learn
On Fri, Jan 19, 2024 at 4:44 PM H. S. Teoh via Digitalmars-d-learn <
digitalmars-d-learn@puremagic.com> wrote:

> Taking a look at this code:
> ...


> Try addressing the points I wrote above and see if it makes a
> difference.
>
>
I have tried it (all of it) even before you wrote it here, because I have
completely the same ideas, but to be fair it has almost zero effect on
speed.
There is my version (It still use OOP, but I have try it wit Printer and
Counter to be structs and it has no effect at all)
https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY

The only difference in speed in the end is caused by hash implementation of
dlang associative arrays and rust HashMap, actually if you modify rust to
not used ahash it has almost same speed as D


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread ryuukk_ via Digitalmars-d-learn

On Friday, 19 January 2024 at 17:18:36 UTC, evilrat wrote:

On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:


You do hash map lookup for every character in D, it's slow, 
whereas in Rust you do it via pattern matching, java does the 
same, pattern matching



Yet another reason to advocate for pattern matching in D and 
switch as expression


There is another important difference, i quickly looked up D 
associative array implementation and the search looks like 
nlog(n) complexity with plain loop iteration of hashes, whereas 
rust's implementation is using "swisstable" algorithm 
implementation that has packed SIMD optimized lookups, this is 
likely where the 3x speed difference comes from.


Tried to look up rust implementation and it is SOOO generic 
that I was unable to decipher it to find the actual key search 
and stores.


Anyway here is an interesting article about rust implementation
https://faultlore.com/blah/hashbrown-tldr/


I'm not talking about the difference between the hashmap 
implementation, but the difference between the algorithm used to 
lookup the characters


https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/9b1d7f026943841638a2729922cf000b1b3ce655/src/java/Main2.java#L106-L134

vs

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/d/src/dencoder.d#L10-L49


If D had pattern matching and switch as expression, the faster 
method would be:


1. the most obvious choice

2. the fastest by default

3. the most clean

To save from having to write a old-school verbose `switch`, i 
suspect he went with a hashmap, wich is slower in that case, 
hence why i keep advocate for that feature, or i should say, that 
upgrade to `switch`, wich java has adopted, as well as rust:


https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/rust/phone_encoder/src/main.rs#L146-L168




Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread Jordan Wilson via Digitalmars-d-learn

On Friday, 19 January 2024 at 08:57:40 UTC, Renato wrote:
Do you know why the whole thread seems to have disappeared?? 
There's a lot of good stuff in the thread, it would be a huge 
shame to lose all that!


I agree! Thanks for posting your benchmarks, I thought your whole 
benching setup was pretty good, and learnt alot from your code 
and the resulting contributions in the thread and others.


Jordan



Re: User defined type and foreach

2024-01-19 Thread Jonathan M Davis via Digitalmars-d-learn
On Friday, January 19, 2024 3:49:29 AM MST Jim Balter via Digitalmars-d-learn 
wrote:
> On Friday, 17 November 2017 at 17:55:30 UTC, Jonathan M Davis
>
> wrote:
> > When you have
> >
> > foreach(e; range)
> >
> > it gets lowered to something like
> >
> > for(auto r = range; !r.empty; r.popFront())
> > {
> >
> > auto e = r.front;
> >
> > }
> >
> > So, the range is copied when you use it in a foreach.
>
> Indeed, and the language spec says so, but this is quite wrong as
> it violates the specification and design of ranges ... only
> forward ranges are copyable and only via their `save` function. I
> have an input range that can only be iterated once; if you try to
> do so again it's empty ... but the foreach implementation breaks
> that. You should be able to break out of the foreach statement,
> then run it again (or another foreach) and it should continue
> from where it left off, but copying breaks that. I need to know
> how many elements of my range were consumed; copying breaks that.
> I got around this by having a pointer to my state so only the
> pointer gets copied. I would also note that tutorials such as Ali
> Çehreli's "Programming in D – Tutorial and Reference" are unaware
> of this breakage:
>
> "
> Those three member functions must be named as empty, popFront,
> and front, respectively. The code that is generated by the
> compiler calls those functions:
>
>  for ( ; !myObject.empty(); myObject.popFront()) {
>
>  auto element = myObject.front();
>
>  // ... expressions ...
>  }
> "

No spec is being violated, and the behavior is completely expected. The core
problem is that the range API doesn't actually specify the semantics of
copying a range - and actually can't do so without making breaking changes.

D types in general fall into one of three categories with regards to their
copy semantics:

1. value types
2. reference types
3. pseudo-reference types

When you copy a value type, you get two fully independent copies of the
object (e.g. integers are a prime example of this; mutating a copy of an
integer has no effect whatsoever on the original).

When you copy a reference type, you get two fully dependent copies. The type
is either a pointer or a reference (or a struct that holds just a pointer or
a reference), and copying it results in another pointer or reference to the
same object. So, mutating the object affects all pointers or references to
that object.

When you copy a pseudo-reference type, you get a partial copy. Typically,
you're dealing with a struct which has both members which are value types
and members which are reference types. The result is that some operations
will affect only the object being mutated, whereas other operations will
affect other copies of that object. Dynamic arrays are one example of this.
They container a pointer and a size_t which is the length of the array, and
reducing the size of the array by mutating the pointer and the length has no
effect on other dynamic arrays which point to the same data, but mutating
the actual elements affects all dynamic arrays which point to the same data.

What this means for ranges is that depending on how they're implemented, you
get one of four different behaviors when they're copied:

1. If the range is a value type, then copying the range results in two
independent copies, so mutating the copy has no effect on the original. Code
can iterate through both ranges independently.

2. If the range is a reference type, then copying the range results in two
dependent copies, so mutating the copy mutates the original. Any code that
iterates one of the two ranges then affects any code which would try to
iterate the original, but the state is consistent across both ranges,
because rather than containing their data, the data is elsewhere, and they
both point to the same place.

3. If a range is a pseudo-reference type, and its iteration state is copied
by value, then copying the range gives you the same behavior as a value type
as far as iteration goes. Both the copy and the original can be iterated
independently (though depending on the implementation, mutating the elements
themselves could affect both ranges). Dynamic arrays fall in this category.

4. If a range is a pseudo-reference type, and its iteration state is not
fully copied by value, then you end up with the copy and the original being
partially dependent. This means that if you mutate one of them, it will only
partially mutate the other, which tends to mean that the other ends up in an
invalid state. A common situation where this can occur is if the range
stores its front as a member variable, but the rest of its state is stored
in another member variable which is a reference type. If you then call
popFront on the copy, you'd end up with the copy's front changing, but the
original's front wouldn't, and yet, the state they share for the rest of the
range would be mutated, so if you then called popFront on the original, you
wouldn't get the 

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread evilrat via Digitalmars-d-learn

On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:


You do hash map lookup for every character in D, it's slow, 
whereas in Rust you do it via pattern matching, java does the 
same, pattern matching



Yet another reason to advocate for pattern matching in D and 
switch as expression


There is another important difference, i quickly looked up D 
associative array implementation and the search looks like 
nlog(n) complexity with plain loop iteration of hashes, whereas 
rust's implementation is using "swisstable" algorithm 
implementation that has packed SIMD optimized lookups, this is 
likely where the 3x speed difference comes from.


Tried to look up rust implementation and it is SOOO generic that 
I was unable to decipher it to find the actual key search and 
stores.


Anyway here is an interesting article about rust implementation
https://faultlore.com/blah/hashbrown-tldr/


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread ryuukk_ via Digitalmars-d-learn

On Friday, 19 January 2024 at 13:40:39 UTC, Renato wrote:

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:


I forgot to mention: the Java version is using a Trie... and 
it consistently beats the Rust numeric algorithm (which means 
it's still faster than your D solution), but the Java version 
that's equivalent to Rust's implementation is around 3x 
slower... i.e. it runs at about the same speed as my current 
fastest numeric algorithm in D as well.




Additionally if you comparing D by measuring DMD performance - 
don't.
It is valuable in developing for fast iterations, but it lacks 
many modern optimization techniques, for that we have LDC and 
GDC.


I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).


The only "significant" difference is that in Rust, an enum 
`WordOrDigit` is used to represent currently known "words"... I 
[did try using that in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.


If you see anything in D that's not as efficient as it should 
be, or somehow "inferior" to what the Rust version is doing , 
please let me know.


Notice that almost all of the time is spent in the for-loop 
inside `printTranslations` (which is misnamed as it doesn't 
necessarily "print" anything, like it did earlier) - the rest 
of the code almost doesn't matter.


Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So , it's not really 3x slower anymore, here's the "D overhead" 
considering Rust as the baseline:


```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```


You do hash map lookup for every character in D, it's slow, 
whereas in Rust you do it via pattern matching, java does the 
same, pattern matching



Yet another reason to advocate for pattern matching in D and 
switch as expression




Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Jan 19, 2024 at 01:40:39PM +, Renato via Digitalmars-d-learn wrote:
> On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
[...]
> > Additionally if you comparing D by measuring DMD performance -
> > don't.  It is valuable in developing for fast iterations, but it
> > lacks many modern optimization techniques, for that we have LDC and
> > GDC.
> 
> I tried with DMD again, and yeah, it's much slower.

For anything where performance is even remotely important, I wouldn't
even consider DMD.  It's a well-known fact that it produces suboptimal
executables.  Its only redeeming factor is really only its fast
turnaround time.  If fast turnaround is not important, I would always
use LDC or GDC instead.


> Here's the [current implementation in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d),
> and the roughly [equivalent Rust
> implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).

Taking a look at this code:

One of the most important thing I found is that every call to
printTranslations allocates a new array (`auto keyValue = new
ubyte[...];`).  Given the large number of recursions involved in this
algorithm, this will add up to quite a lot.  If I were optimizing this
code, I'd look into ways of reducing, if not eliminating, this
allocation.  Observe that this allocation is needed each time
printTranslations recurses, so instead of making separate allocations,
you could put it on a stack. Either with alloca, or with my appendPath()
trick in my version of the code: preallocate a reasonably large buffer
and take slices of it each time you need a new keyValue array.

Secondly, your hash function looks suspicious. Why are you chaining your
hash per digit?  That's a lot of hash computations.  Shouldn't you just
hash the entire key each time?  That would eliminate the need to store a
custom hash in your key, you could just lookup the entire key at once.

Next, what stood out is ISolutionHandler.  If I were to write this, I
wouldn't use OO for this at all, and especially not interfaces, because
they involve a double indirection.  I'd just return a delegate instead
(single indirection, no object lookup).  This is a relatively small
matter, but when it's being used inside a hot inner loop, it could be
important.

Then a minor point: I wouldn't use Array in printTranslations. It's
overkill for what you need to do; a built-in array would work just fine.
Take a look at the implementation of Array and you'll see lots of
function calls and locks and GC root-adding and all that stuff. Most of
it doesn't apply here, of course, and is compiled out. Nevertheless, it
uses wrapped integer operations and range checks, etc.. Again, these are
all minor issues, but in a hot inner loop they do add up. Built-in
arrays let you literally just bump the pointer when adding an element.
Just a couple of instructions as opposed to several function calls.
Important difference when you're on the hot path.  Now, as I mentioned
earlier w.r.t. my own code, appending to built-in arrays comes with a
cost. So here's where you'd optimize by creating your own buffer and
custom push/pop operations. Something like appendPath() in my version of
the code would do the job.

Finally, a very a small point: in loadDictionary, you do an AA lookup
with `n in result`, and then if that returns null, you create a new
entry. This does two AA lookups, once unsuccessfully, and the second
time to insert the missing key.  You could use the .require operation
with a delegate instead of `in` followed by `if (... is null)`, which
only requires a single lookup.  Probably not an important point, but for
a large dictionary this might make a difference.


> The only "significant" difference is that in Rust, an enum
> `WordOrDigit` is used to represent currently known "words"... I [did
> try using that in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d),
> but it made the algorithm slower.
> 
> If you see anything in D that's not as efficient as it should be, or
> somehow "inferior" to what the Rust version is doing , please let me
> know.

Couldn't tell you, I don't know Rust. :-D


> Notice that almost all of the time is spent in the for-loop inside
> `printTranslations` (which is misnamed as it doesn't necessarily
> "print" anything, like it did earlier) - the rest of the code almost
> doesn't matter.
[...]

Of course, that's where your hot path is.  And that loop makes recursive
calls to printTranslations, so the entire body of the function could use
some optimization. ;-)

Try addressing the points I wrote above and see if it makes a
difference.


T

-- 
The two rules of success: 1. Don't tell everything you know. -- YHL


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread Renato via Digitalmars-d-learn

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:


I forgot to mention: the Java version is using a Trie... and 
it consistently beats the Rust numeric algorithm (which means 
it's still faster than your D solution), but the Java version 
that's equivalent to Rust's implementation is around 3x 
slower... i.e. it runs at about the same speed as my current 
fastest numeric algorithm in D as well.




Additionally if you comparing D by measuring DMD performance - 
don't.
It is valuable in developing for fast iterations, but it lacks 
many modern optimization techniques, for that we have LDC and 
GDC.


I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).


The only "significant" difference is that in Rust, an enum 
`WordOrDigit` is used to represent currently known "words"... I 
[did try using that in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.


If you see anything in D that's not as efficient as it should be, 
or somehow "inferior" to what the Rust version is doing , please 
let me know.


Notice that almost all of the time is spent in the for-loop 
inside `printTranslations` (which is misnamed as it doesn't 
necessarily "print" anything, like it did earlier) - the rest of 
the code almost doesn't matter.


Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So , it's not really 3x slower anymore, here's the "D overhead" 
considering Rust as the baseline:


```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```



Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread Renato via Digitalmars-d-learn

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:


I forgot to mention: the Java version is using a Trie... and 
it consistently beats the Rust numeric algorithm (which means 
it's still faster than your D solution), but the Java version 
that's equivalent to Rust's implementation is around 3x 
slower... i.e. it runs at about the same speed as my current 
fastest numeric algorithm in D as well.


This is what I would like to be discussing in this thread: why 
is D running at Java speeds and not at D speeds when using the 
same algorithm? I know there's small differences in the 
implementations, they are different languages after all, but 
not enough IMO to justify anything like 3x difference from 
Rust.




My guess is that's because int128 is not that much optimized 
due to being not so popular type, though to answer what's wrong 
would require to look at assembly code produced for both D and 
Rust.


Additionally if you comparing D by measuring DMD performance - 
don't.
It is valuable in developing for fast iterations, but it lacks 
many modern optimization techniques, for that we have LDC and 
GDC.


I am not using int128 anymore. I explained why a few posts back. 
I am using a byte array and computing the hash incrementally when 
trying different input, so that partially computed hashes are 
re-used on each try (this is a bit cheating, as Rust is not doing 
that, but I consider that to be acceptable as it's still 
computing hashes and looking up entries in the associative array).


I used all D compilers and picked the fastest one (GDC in the 
case of int128, but LDC2 in the current case).


Re: User defined type and foreach

2024-01-19 Thread Jim Balter via Digitalmars-d-learn
On Friday, 17 November 2017 at 17:55:30 UTC, Jonathan M Davis 
wrote:



When you have

foreach(e; range)

it gets lowered to something like

for(auto r = range; !r.empty; r.popFront())
{
auto e = r.front;
}

So, the range is copied when you use it in a foreach.


Indeed, and the language spec says so, but this is quite wrong as 
it violates the specification and design of ranges ... only 
forward ranges are copyable and only via their `save` function. I 
have an input range that can only be iterated once; if you try to 
do so again it's empty ... but the foreach implementation breaks 
that. You should be able to break out of the foreach statement, 
then run it again (or another foreach) and it should continue 
from where it left off, but copying breaks that. I need to know 
how many elements of my range were consumed; copying breaks that. 
I got around this by having a pointer to my state so only the 
pointer gets copied. I would also note that tutorials such as Ali 
Çehreli's "Programming in D – Tutorial and Reference" are unaware 
of this breakage:


"
Those three member functions must be named as empty, popFront, 
and front, respectively. The code that is generated by the 
compiler calls those functions:


for ( ; !myObject.empty(); myObject.popFront()) {

auto element = myObject.front();

// ... expressions ...
}
"




Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-19 Thread Renato via Digitalmars-d-learn

On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote:
On Thu, Jan 18, 2024 at 04:23:16PM +, Renato via 
Digitalmars-d-learn wrote: [...]

Ok, last time I'm running this for someone else :D

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,30
./rust,24018944,147
./rust,24068096,592
./rust,24150016,1187
./rust,7766016,4972
./rust,8011776,46101
===> src/d/dencoder
src/d/dencoder,44154880,42
src/d/dencoder,51347456,87
src/d/dencoder,51380224,273
src/d/dencoder,51462144,441
src/d/dencoder,18644992,4414
src/d/dencoder,18710528,43548
```


OK, this piqued my interest enough that I decided to install 
rust using rustup instead of my distro's package manager.  Here 
are the numbers I got for my machine:


===> ./rust
./rust,22896640,35
./rust,22896640,137
./rust,22384640,542
./rust,22896640,1034
./rust,8785920,2489
./rust,8785920,12157
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,72
src/d/dencoder,1066799104,198
src/d/dencoder,1066799104,344
src/d/dencoder,1035292672,2372
src/d/dencoder,1035292672,13867

Looks like we lost out to Rust for larger inputs. :-D  Probably 
due to environmental factors (and the fact that std.stdio is 
slow).  I re-ran it again and got this:


===> ./rust
./rust,22896640,30
./rust,22896640,131
./rust,22896640,511
./rust,22896640,983
./rust,8785920,3102
./rust,8785920,9912
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,71
src/d/dencoder,1066799104,197
src/d/dencoder,1066799104,355
src/d/dencoder,1035292672,3441
src/d/dencoder,1035292672,9471

Notice the significant discrepancy between the two runs; this 
seems to show that the benchmark is only accurate up to about 
±1.5 seconds.



That's not correct. The discrepancy is due to the benchmark 
always generating different input on each run - and the 
characteristics of the input affects runs significantly. This 
affects the last two runs much more due to them using the more 
challenging dictionary. The important is that the relative 
performance between languages is reliable.


If you stop re-generating the phone numbers and just run the 
benchmark multiple times using the same input, you'll see it's 
very close between runs.




Anyway, oddly enough, Java seems to beat Rust on larger inputs. 
 Maybe my Java compiler has a better JIT implementation? :-P


Again, in case I haven't made it clear by repeating this multiple 
times: The Java code is using a Trie, the Rust code is using a 
numeric solution. They are completely different algorithms. Much 
more important than which language is being used is the 
algorithm, as has been shown again and again.




Congratulations on beating Rust :D but remember: you're using 
a much more efficient algorithm! I must conclude that the Rust 
translation of the Trie algorithm would be much faster still, 
unfortunately (you may have noticed that I am on D's side 
here!).


At this point, it's not really about the difference between 
languages anymore; it's about the programmer's skill at 
optimizing his code.


Traditionally Java is thought to be the slowest, because it 
runs in a VM and generally tends to use more heap allocations.  
In recent times, however, JIT and advanced GC implementations 
have significantly levelled that out, so you're probably not 
going to see the difference unless you hand-tweak your code 
down to the bare metal.


Java has been very fast for over a decade now, this is not new at 
all.




Surprisingly, at least on my machine, Lisp actually performed 
the worst. I'd have thought it would at least beat Java, but I 
was quite wrong. :-D Perhaps the Lisp implementation I'm using 
is suboptimal, I don't know. Or perhaps modern JVMs have really 
overtaken Lisp.


Please don't compare different algorithms in different languages 
and make conclusions about each language's speed.




Now I'm really curious how a Rust version of the trie algorithm 
would perform.  Unfortunately I don't know Rust so I wouldn't 
be able to write it myself. (Hint, hint, nudge, nudge ;-)).


As far as the performance of my D version is concerned, I still 
haven't squeezed out all the performance I could yet.  Going 
into this, my intention was to take the lazy way of optimizing 
only what the profiler points out to me, with the slight 
ulterior motive of proving that a relatively small amount of 
targeted optimizations can go a long way at making the GC a 
non-problem in your typical D code. ;-)  I haven't pulled out 
all the optimization guns at my disposal yet.



You don't really have to... @ssvb's solution is incredibly fast 
at the "count" problem and I really don't think anyone can beat 
that implementation. The only problem is that the implementation 
is very C-like and nothing like D I would write.




If I were to go the next step, I'd split up the impl() function 
so that I get a better profile of where it's spending most of 
its time, and then optimize that.  My current suspicion is that 
the traversal of the