Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-21 Thread Daniel Kozak via Digitalmars-d-learn
On Sat, 20 Jan 2024 at 21:21, Renato via Digitalmars-d-learn <
digitalmars-d-learn@puremagic.com> wrote:

> On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
> > On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn
> > < digitalmars-d-learn@puremagic.com> wrote:
> >
> >> Wow, fantastic feedback from lots of people, thanks so much!
> >> ...
> >>
> >> > evilrat:
> >> > There is another important difference, I quickly looked up D
> >> > associative array implementation and the search looks like
> >> > nlog(n) complexity with plain loop iteration of hashes, whereas
> >> > Rust's implementation is using "swisstable" algorithm
> >> > implementation that has packed SIMD optimized lookups, this is
> >> > likely where the 3x speed difference comes from.
> >>
> >> I am not using the default hash implementation in Rust (which
> >> is hashbrown as you've found out). That's too slow (because
> >> it's DDoS-secure - though Java's hash also is, and it's still
> >> faster). I had to replace that with a library called `ahash`.
> >>
> >> If you're interested in knowing more about that, please [read
> >> my blog
> >> post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code)
> about optimising the Rust code.
> >>
> >> So, no, that's not where the difference is coming from... in
> >> fact, I am using a faster hashing function in D.
> >>
> >> You can [see my custom hashing function
> >> here](
> >>
> https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3
> ).
> >> This function is not good for the general purpose case, but
> >> it's probably
> >> as good as it gets for this problem (see my int128 trick in a
> >> previous post
> >> on this thread to understand why). I did try a few variations
> >> of hashing
> >> before settling on this one... Rust, if anything, is at a
> >> disadvantage here.
> >>
> >
> > And here you are wrong, it is the hashing where the slowdown
> > comes from. It is not only about the speed of the hashing
> > function. It is about the quality of the hashing function for
> > this particular problem, and about how the hash table
> > (associative array) is implemented.
>
> I've explained why this function is almost a perfect hash
> function for this problem (there will be almost no collisions
> except for very large inputs where the shift-left will
> "overflow", and even then the probability of collisions should be
> very low). If you're going to claim I'm wrong, you must show
> exactly where I'm wrong, and preferably you should provide a
> faster implementation. I will even accept a theoretical
> explanation if you can give one. What I will not accept is for
> you to call me "wrong" just because you say so. That's childish
> behaviour and uncalled for in a technical discussion.
>


If you provided info that the hash is perfect, then I am sorry; I probably
missed that in this thread, my fault. Then I will need to look into this
more carefully and do much more than just make assumptions.



Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-20 Thread Renato via Digitalmars-d-learn

On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn 
< digitalmars-d-learn@puremagic.com> wrote:


Wow, fantastic feedback from lots of people, thanks so much! 
...


> evilrat:
> There is another important difference, I quickly looked up D
> associative array implementation and the search looks like
> nlog(n) complexity with plain loop iteration of hashes, whereas
> Rust's implementation is using "swisstable" algorithm
> implementation that has packed SIMD optimized lookups, this is
> likely where the 3x speed difference comes from.

I am not using the default hash implementation in Rust (which 
is hashbrown as you've found out). That's too slow (because 
it's DDoS-secure - though Java's hash also is, and it's still 
faster). I had to replace that with a library called `ahash`.


If you're interested in knowing more about that, please [read 
my blog 
post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code.


So, no, that's not where the difference is coming from... in 
fact, I am using a faster hashing function in D.


You can [see my custom hashing function
here](
https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
This function is not good for the general purpose case, but 
it's probably
as good as it gets for this problem (see my int128 trick in a 
previous post
on this thread to understand why). I did try a few variations 
of hashing
before settling on this one... Rust, if anything, is at a 
disadvantage here.




And here you are wrong, it is the hashing where the slowdown 
comes from. It is not only about the speed of the hashing 
function. It is about the quality of the hashing function for 
this particular problem, and about how the hash table 
(associative array) is implemented.


I've explained why this function is almost a perfect hash 
function for this problem (there will be almost no collisions 
except for very large inputs where the shift-left will 
"overflow", and even then the probability of collisions should be 
very low). If you're going to claim I'm wrong, you must show 
exactly where I'm wrong, and preferably you should provide a 
faster implementation. I will even accept a theoretical 
explanation if you can give one. What I will not accept is for 
you to call me "wrong" just because you say so. That's childish 
behaviour and uncalled for in a technical discussion.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-20 Thread Daniel Kozak via Digitalmars-d-learn
On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn <
digitalmars-d-learn@puremagic.com> wrote:

> Wow, fantastic feedback from lots of people, thanks so much!
> ...
>
> > evilrat:
> > There is another important difference, I quickly looked up D
> > associative array implementation and the search looks like
> > nlog(n) complexity with plain loop iteration of hashes, whereas
> > Rust's implementation is using "swisstable" algorithm
> > implementation that has packed SIMD optimized lookups, this is
> > likely where the 3x speed difference comes from.
>
> I am not using the default hash implementation in Rust (which is
> hashbrown as you've found out). That's too slow (because it's
> DDoS-secure - though Java's hash also is, and it's still faster). I
> had to replace that with a library called `ahash`.
>
> If you're interested in knowing more about that, please [read my
> blog
> post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code)
> about optimising the Rust code.
>
> So, no, that's not where the difference is coming from... in
> fact, I am using a faster hashing function in D.
>
> You can [see my custom hashing function
> here](
> https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
> This function is not good for the general purpose case, but it's probably
> as good as it gets for this problem (see my int128 trick in a previous post
> on this thread to understand why). I did try a few variations of hashing
> before settling on this one... Rust, if anything, is at a disadvantage here.
>

And here you are wrong, it is the hashing where the slowdown comes
from. It is not only about the speed of the hashing function. It is
about the quality of the hashing function for this particular problem,
and about how the hash table (associative array) is implemented.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-20 Thread Renato via Digitalmars-d-learn

On Saturday, 20 January 2024 at 13:07:39 UTC, Renato wrote:


D overhead with the fastest compiler, LDC, compared with Rust:

```
1.0
1.707627119
1.919527897
1.954595186
1.351342502
1.556217748
```



Oh sorry, I only posted the ratios for the Linux benchmark...

On Mac M1, for some reason, D was a bit closer to Rust in some of 
the runs:


```
0.86
1.81944
2.018932874
2.0078125
1.069837447
1.165078285
```



Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-20 Thread Renato via Digitalmars-d-learn

Wow, fantastic feedback from lots of people, thanks so much!


I will try to answer all points raised in the several answers 
I've got here since yesterday.


On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:


I'm confused by the chained hashing of the digits. Why is that 
necessary?  I would have thought it'd be faster to hash the 
entire key instead of computing the hash of each digit and 
chaining them together.


I looked up Rust's ahash algorithm. Apparently they leverage 
the CPU's hardware AES instruction to compute a 
collision-resistant hash very quickly.


Somebody should file a bug on druntime to implement this where 
the hardware supports it, instead of the current hashOf. For 
relatively small keys this would be a big performance boost.



T


[This is the 
commit](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/e757f34c3ddd16240e3515bc91564f2a134659ff) where I added "incremental hashing". What you're suggesting I also tried, but it was much slower.


The reason incremental hashing is much faster than computing the 
hash of the whole key on each iteration is that you're doing a 
lot less work. To compute the hash of an array fully, you need to 
do something like this (it's just how hashing works):


```d
size_t currentHash = 0;
foreach (item; array) {
    currentHash = hashOf(item, currentHash); // mix each item into the running hash
}
```

But by the structure of the problem, we know that we keep adding 
items to the `array` and re-computing the hash (in the 
`printTranslations` function, which is the hot path... so ignore 
the dictionary loading for now). What you're suggesting is to do the 
above for the full array, on each iteration... what I did was 
optimise that so that we only compute the hash of the newly added 
item on each iteration - which is the only work that is required.
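
A minimal sketch of the idea (hypothetical names, not the exact 
committed code): keep the running hash next to the key and extend 
it whenever an item is appended.

```d
// Hypothetical sketch of incremental hashing: extend the running
// hash when an item is appended, instead of rehashing the whole
// array on every iteration.
struct IncrementalKey
{
    ubyte[] items;
    size_t hash;                    // hash of `items` so far

    void push(ubyte item)
    {
        items ~= item;
        hash = hashOf(item, hash);  // O(1) update instead of O(n) rehash
    }
}
```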


As Siarhei Siamashka mentioned: you could keep optimising this 
using heuristics and knowledge about the problem, at which point 
you might converge to a Trie implementation! But for the purpose 
of this comparison, I'm happy to stop at a good enough, kind of 
general purpose algorithm which is comparable in multiple 
languages... I might compare the D Trie impl with a Rust Trie 
impl in the future, but for now, I've had enough of that.


Hope that makes sense now... try to undo it and I believe it will 
run around 30% slower.


One of the most important things I found is that every call to 
printTranslations allocates a new array (`auto keyValue = new 
ubyte[...];`).  Given the large number of recursions involved 
in this algorithm, this will add up to quite a lot.


Bingo! I found that before reading your suggestion :) you can see 
the commit where I changed that 
[here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/4e8508c5d16a2b623a4673d9c056a68e03b3bd87). This one-line change was the most effective of all, making the D impl not only consume a lot less memory but also become quite a bit faster (I will double-check later, but I believe it made the code run almost 50% faster in the longest runs)!


Secondly, your hash function looks suspicious. Why are you 
chaining your hash per digit?


I've answered that above... but I went further and, instead of 
using the built-in `hashOf` function, I decided to use [my own 
hashing 
function](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3) which takes advantage of the input characteristics... I mentioned before the trick I did to make the int128 solution faster (by the way, the int128 solution was not acceptable, as @ssvb pointed out, which is why I had to remove it). This is a very simple hashing function that works well enough for this particular problem, and it also boosted performance noticeably.
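
For illustration, here is a minimal sketch of that kind of 
input-specific hash (hypothetical, not the exact committed 
function): the digits fit in 4 bits each, so the shift only 
"overflows" for very long keys, which is the collision argument 
made earlier in the thread.

```d
// Hypothetical sketch: each digit is 0-9 and fits in 4 bits, so the
// hash is collision-free until the shift pushes bits off the top of
// size_t (i.e. only for very long inputs).
size_t digitHash(scope const(ubyte)[] digits)
{
    size_t h = 0;
    foreach (d; digits)
        h = (h << 4) | (d & 0xF);
    return h;
}
```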


Then a minor point: I wouldn't use Array in printTranslations. 
It's overkill for what you need to do; a built-in array would 
work just fine.


Hm, sorry to break it to you but [changing 
that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba457d8abc3bccbc17c7f1e) to an Array was the single greatest improvement I had. IIRC it made the code 2x faster.


Maybe I was using native arrays wrongly before... but it's 
unclear to me how I could've made that better (and the docs 
themselves suggest using Array if performance matters, which I 
seem to have confirmed is good advice).


Next, what stood out is ISolutionHandler.  If I were to write 
this, I wouldn't use OO for this at all, and especially not 
interfaces, because they involve a double indirection.


As mentioned by Daniel Kozak, this doesn't matter. I tried 
rewriting it so that there was no indirection to the extent 
possible (I used delegate fields within a final class, so the 
functions to use are selected at startup and there's no overhead 
choosing which functions to call at runtime) and it was actually 
slightly slower (but too close to call, really). My solution with 
interfaces had two final classes used, so 

Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-20 Thread Siarhei Siamashka via Digitalmars-d-learn

On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via 
Digitalmars-d-learn wrote: [...]
   > Try addressing the points I wrote above and see if it makes a
   > difference.

   I have tried it (all of it) even before you wrote it here, because
   I have completely the same ideas, but to be fair it has almost zero
   effect on speed.
   There is my version (it still uses OOP, but I have tried it with
   Printer and Counter as structs and it has no effect at all):
   https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
   The only difference in speed in the end is caused by the hash
   implementation of dlang associative arrays and Rust HashMap;
   actually, if you modify Rust to not use ahash it has almost the
   same speed as D.

[...]

I'm confused by the chained hashing of the digits. Why is that 
necessary?  I would have thought it'd be faster to hash the 
entire key instead of computing the hash of each digit and 
chaining them together.


Hash which key? The problem of the naive hashtable-based 
algorithm is that we don't know the length of the key. So all 
lengths are tried starting from 1 and up to the remaining part of 
the phone number. An additional optimization would be to limit 
this search and terminate it early. For example, stop after we 
exceed the length of the longest word from the dictionary. Or 
stop the search upon encountering certain "leaf" words, but this 
effectively morphs the algorithm into something that partially 
resembles a Trie. And evolving it further, we may end up with a 
full-fledged Trie, as sketched below.
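
For illustration, a minimal sketch of the Trie shape this evolves 
into (a hypothetical layout, not code from any of the 
implementations): one branch per phone digit, with dictionary 
words attached to the node where their encoding ends.

```d
// Hypothetical sketch of the Trie this algorithm evolves into:
// each node branches on one phone digit (0-9); words whose
// encoding ends at a node are stored on that node.
final class TrieNode
{
    TrieNode[10] children;
    string[] words;
}

void insert(TrieNode root, string word, scope int delegate(char) digitOf)
{
    auto node = root;
    foreach (c; word)
    {
        auto d = digitOf(c);                 // map letter -> phone digit
        if (node.children[d] is null)
            node.children[d] = new TrieNode;
        node = node.children[d];
    }
    node.words ~= word;                      // word's encoding ends here
}
```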


The only practical value of this algorithm is that it's easy to 
implement and has low LOC count. My initial non-optimized naive 
version was also similar 
https://gist.github.com/ssvb/abe821b3cdba7fcb7f3c3cecde864153 (66 
lines including blank lines and some comments). This is somewhat 
comparable to the Lisp results 
https://flownet.com/ron/papers/lisp-java/raw-results.html or to 
Peter Norvig's solution http://www.norvig.com/java-lisp.html


The purpose of the initial implementation "prototype" is just to 
have some sort of a starting point. And if the runtime 
performance doesn't matter, then we are already done. But if the 
runtime performance does matter, then a lot of things can be 
rapidly improved by spending a bit of time on it. Eventually one 
runs out of ideas and further optimization efforts yield only 
diminishing returns. That's when it's a good time to clean up the 
code, add comments, etc. and label it as the "final" version.


I think that a good study, that intends to compare different 
programming languages against each other, could take all these 
factors into consideration: time to implement the "prototype", 
runtime performance of the prototype, time to implement the 
"final" version, runtime performance of the final version, the 
number of lines of code and its readability/maintainability.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread H. S. Teoh via Digitalmars-d-learn
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn 
wrote:
[...]
>> Try addressing the points I wrote above and see if it makes a
>> difference.
> 
>I have tried it (all of it) even before you wrote it here, because
>I have completely the same ideas, but to be fair it has almost zero
>effect on speed.
>There is my version (it still uses OOP, but I have tried it with
>Printer and Counter as structs and it has no effect at all):
>https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
>The only difference in speed in the end is caused by the hash
>implementation of dlang associative arrays and Rust HashMap;
>actually, if you modify Rust to not use ahash it has almost the
>same speed as D
[...]

I'm confused by the chained hashing of the digits. Why is that
necessary?  I would have thought it'd be faster to hash the entire key
instead of computing the hash of each digit and chaining them together.

I looked up Rust's ahash algorithm. Apparently they leverage the CPU's
hardware AES instruction to compute a collision-resistant hash very
quickly.

Somebody should file a bug on druntime to implement this where the
hardware supports it, instead of the current hashOf. For relatively
small keys this would be a big performance boost.


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal 
lovers in dire need of being reminded to profess their hypothetical love for 
their long-forgotten.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread Daniel Kozak via Digitalmars-d-learn
On Fri, Jan 19, 2024 at 4:44 PM H. S. Teoh via Digitalmars-d-learn <
digitalmars-d-learn@puremagic.com> wrote:

> Taking a look at this code:
> ...


> Try addressing the points I wrote above and see if it makes a
> difference.
>
>
I have tried it (all of it) even before you wrote it here, because I have
completely the same ideas, but to be fair it has almost zero effect on
speed.
There is my version (it still uses OOP, but I have tried it with Printer and
Counter as structs and it has no effect at all):
https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY

The only difference in speed in the end is caused by the hash implementation of
dlang associative arrays and Rust HashMap; actually, if you modify Rust to
not use ahash it has almost the same speed as D.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread ryuukk_ via Digitalmars-d-learn

On Friday, 19 January 2024 at 17:18:36 UTC, evilrat wrote:

On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:


You do a hash map lookup for every character in D, which is slow, 
whereas in Rust you do it via pattern matching; Java does the 
same, pattern matching.



Yet another reason to advocate for pattern matching in D and 
switch as expression


There is another important difference, I quickly looked up D 
associative array implementation and the search looks like 
nlog(n) complexity with plain loop iteration of hashes, whereas 
Rust's implementation is using "swisstable" algorithm 
implementation that has packed SIMD optimized lookups, this is 
likely where the 3x speed difference comes from.


Tried to look up the Rust implementation and it is SOOO generic 
that I was unable to decipher it to find the actual key searches 
and stores.


Anyway here is an interesting article about rust implementation
https://faultlore.com/blah/hashbrown-tldr/


I'm not talking about the difference between the hashmap 
implementations, but about the difference between the algorithms 
used to look up the characters:


https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/9b1d7f026943841638a2729922cf000b1b3ce655/src/java/Main2.java#L106-L134

vs

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/d/src/dencoder.d#L10-L49


If D had pattern matching and switch as expression, the faster 
method would be:


1. the most obvious choice

2. the fastest by default

3. the cleanest

To avoid having to write an old-school verbose `switch`, I 
suspect he went with a hashmap, which is slower in that case; 
hence why I keep advocating for that feature, or I should say, 
that upgrade to `switch`, which Java has adopted, as well as Rust:


https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/rust/phone_encoder/src/main.rs#L146-L168
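
In D, the same table-driven approach would look roughly like this 
(a sketch using the Prechelt letter-to-digit table; lowercase 
letters only, for brevity):

```d
// Sketch: letter -> phone digit via a plain switch, no hash lookup.
// Table from the Prechelt problem statement; lowercase only here.
ubyte charToDigit(char c)
{
    switch (c)
    {
        case 'e':           return 0;
        case 'j', 'n', 'q': return 1;
        case 'r', 'w', 'x': return 2;
        case 'd', 's', 'y': return 3;
        case 'f', 't':      return 4;
        case 'a', 'm':      return 5;
        case 'c', 'i', 'v': return 6;
        case 'b', 'k', 'u': return 7;
        case 'l', 'o', 'p': return 8;
        case 'g', 'h', 'z': return 9;
        default: assert(0, "not a letter");
    }
}
```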




Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread Jordan Wilson via Digitalmars-d-learn

On Friday, 19 January 2024 at 08:57:40 UTC, Renato wrote:
Do you know why the whole thread seems to have disappeared?? 
There's a lot of good stuff in the thread, it would be a huge 
shame to lose all that!


I agree! Thanks for posting your benchmarks, I thought your whole 
benchmarking setup was pretty good, and I learnt a lot from your 
code and the resulting contributions in the thread and others.


Jordan



Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread evilrat via Digitalmars-d-learn

On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:


You do a hash map lookup for every character in D, which is slow, 
whereas in Rust you do it via pattern matching; Java does the 
same, pattern matching.



Yet another reason to advocate for pattern matching in D and 
switch as expression


There is another important difference, I quickly looked up D 
associative array implementation and the search looks like 
nlog(n) complexity with plain loop iteration of hashes, whereas 
Rust's implementation is using "swisstable" algorithm 
implementation that has packed SIMD optimized lookups, this is 
likely where the 3x speed difference comes from.


Tried to look up the Rust implementation and it is SOOO generic 
that I was unable to decipher it to find the actual key searches 
and stores.


Anyway here is an interesting article about rust implementation
https://faultlore.com/blah/hashbrown-tldr/


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread ryuukk_ via Digitalmars-d-learn

On Friday, 19 January 2024 at 13:40:39 UTC, Renato wrote:

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:


I forgot to mention: the Java version is using a Trie... and 
it consistently beats the Rust numeric algorithm (which means 
it's still faster than your D solution), but the Java version 
that's equivalent to Rust's implementation is around 3x 
slower... i.e. it runs at about the same speed as my current 
fastest numeric algorithm in D as well.




Additionally, if you are comparing D by measuring DMD 
performance - don't.
It is valuable in development for fast iterations, but it lacks 
many modern optimization techniques; for that we have LDC and 
GDC.


I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).


The only "significant" difference is that in Rust, an enum 
`WordOrDigit` is used to represent currently known "words"... I 
[did try using that in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.


If you see anything in D that's not as efficient as it should 
> be, or somehow "inferior" to what the Rust version is doing, 
please let me know.


Notice that almost all of the time is spent in the for-loop 
inside `printTranslations` (which is misnamed as it doesn't 
necessarily "print" anything, like it did earlier) - the rest 
of the code almost doesn't matter.


Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" 
considering Rust as the baseline:


```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```


You do a hash map lookup for every character in D, which is slow, 
whereas in Rust you do it via pattern matching; Java does the 
same, pattern matching.



Yet another reason to advocate for pattern matching in D and 
switch as expression




Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Jan 19, 2024 at 01:40:39PM +, Renato via Digitalmars-d-learn wrote:
> On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
[...]
> > Additionally, if you are comparing D by measuring DMD performance -
> > don't.  It is valuable in development for fast iterations, but it
> > lacks many modern optimization techniques; for that we have LDC and
> > GDC.
> 
> I tried with DMD again, and yeah, it's much slower.

For anything where performance is even remotely important, I wouldn't
even consider DMD.  It's a well-known fact that it produces suboptimal
executables.  Its only redeeming factor is its fast
turnaround time.  If fast turnaround is not important, I would always
use LDC or GDC instead.


> Here's the [current implementation in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d),
> and the roughly [equivalent Rust
> implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).

Taking a look at this code:

One of the most important things I found is that every call to
printTranslations allocates a new array (`auto keyValue = new
ubyte[...];`).  Given the large number of recursions involved in this
algorithm, this will add up to quite a lot.  If I were optimizing this
code, I'd look into ways of reducing, if not eliminating, this
allocation.  Observe that this allocation is needed each time
printTranslations recurses, so instead of making separate allocations,
you could put it on a stack. Either with alloca, or with my appendPath()
trick in my version of the code: preallocate a reasonably large buffer
and take slices of it each time you need a new keyValue array.
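
A minimal sketch of that (hypothetical names, in the spirit of the
appendPath() trick, not the actual code):

```d
// Hypothetical sketch: one preallocated buffer shared by all
// recursion levels; each level takes a slice instead of calling
// `new ubyte[...]`.
ubyte[4096] keyBuffer;   // sized generously for the longest key

void recurse(size_t depth, ubyte next)
{
    keyBuffer[depth] = next;
    auto key = keyBuffer[0 .. depth + 1];  // no GC allocation
    // ... use `key` for the AA lookup, then recurse with depth + 1 ...
}
```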

Secondly, your hash function looks suspicious. Why are you chaining your
hash per digit?  That's a lot of hash computations.  Shouldn't you just
hash the entire key each time?  That would eliminate the need to store a
custom hash in your key, you could just lookup the entire key at once.

Next, what stood out is ISolutionHandler.  If I were to write this, I
wouldn't use OO for this at all, and especially not interfaces, because
they involve a double indirection.  I'd just return a delegate instead
(single indirection, no object lookup).  This is a relatively small
matter, but when it's being used inside a hot inner loop, it could be
important.
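
A minimal sketch of the delegate alternative (hypothetical names;
it assumes the handler only ever needs a "found a match" callback):

```d
import std.stdio : writefln;

// Hypothetical sketch: one delegate instead of an ISolutionHandler
// interface - a single indirection on the hot path.
alias SolutionHandler = void delegate(const(char)[] phone, const(char)[] words);

SolutionHandler makeHandler(bool countOnly, size_t* counter)
{
    if (countOnly)
        return (phone, words) { ++*counter; };                   // just count
    return (phone, words) { writefln("%s: %s", phone, words); }; // print
}
```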

Then a minor point: I wouldn't use Array in printTranslations. It's
overkill for what you need to do; a built-in array would work just fine.
Take a look at the implementation of Array and you'll see lots of
function calls and locks and GC root-adding and all that stuff. Most of
it doesn't apply here, of course, and is compiled out. Nevertheless, it
uses wrapped integer operations and range checks, etc. Again, these are
all minor issues, but in a hot inner loop they do add up. Built-in
arrays let you literally just bump the pointer when adding an element.
Just a couple of instructions as opposed to several function calls.
Important difference when you're on the hot path.  Now, as I mentioned
earlier w.r.t. my own code, appending to built-in arrays comes with a
cost. So here's where you'd optimize by creating your own buffer and
custom push/pop operations. Something like appendPath() in my version of
the code would do the job.

Finally, a very a small point: in loadDictionary, you do an AA lookup
with `n in result`, and then if that returns null, you create a new
entry. This does two AA lookups, once unsuccessfully, and the second
time to insert the missing key.  You could use the .require operation
with a delegate instead of `in` followed by `if (... is null)`, which
only requires a single lookup.  Probably not an important point, but for
a large dictionary this might make a difference.
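
A sketch of the single-lookup version, assuming the AA maps the
numeric key to a list of words (names are mine):

```d
// Hypothetical sketch: insert-or-append with one AA lookup via
// require, versus the two lookups of `in` followed by insertion.
string[][size_t] result;

void add(size_t n, string word)
{
    result.require(n, null) ~= word;  // returns a ref slot; appends in place
}
```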


> The only "significant" difference is that in Rust, an enum
> `WordOrDigit` is used to represent currently known "words"... I [did
> try using that in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d),
> but it made the algorithm slower.
> 
> If you see anything in D that's not as efficient as it should be, or
> somehow "inferior" to what the Rust version is doing , please let me
> know.

Couldn't tell you, I don't know Rust. :-D


> Notice that almost all of the time is spent in the for-loop inside
> `printTranslations` (which is misnamed as it doesn't necessarily
> "print" anything, like it did earlier) - the rest of the code almost
> doesn't matter.
[...]

Of course, that's where your hot path is.  And that loop makes recursive
calls to printTranslations, so the entire body of the function could use
some optimization. ;-)

Try addressing the points I wrote above and see if it makes a
difference.


T

-- 
The two rules of success: 1. Don't tell everything you know. -- YHL


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread Renato via Digitalmars-d-learn

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:


I forgot to mention: the Java version is using a Trie... and 
it consistently beats the Rust numeric algorithm (which means 
it's still faster than your D solution), but the Java version 
that's equivalent to Rust's implementation is around 3x 
slower... i.e. it runs at about the same speed as my current 
fastest numeric algorithm in D as well.




Additionally, if you are comparing D by measuring DMD 
performance - don't.
It is valuable in development for fast iterations, but it lacks 
many modern optimization techniques; for that we have LDC and 
GDC.


I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).


The only "significant" difference is that in Rust, an enum 
`WordOrDigit` is used to represent currently known "words"... I 
[did try using that in 
D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.


If you see anything in D that's not as efficient as it should be, 
or somehow "inferior" to what the Rust version is doing , please 
let me know.


Notice that almost all of the time is spent in the for-loop 
inside `printTranslations` (which is misnamed as it doesn't 
necessarily "print" anything, like it did earlier) - the rest of 
the code almost doesn't matter.


Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" 
considering Rust as the baseline:


```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```



Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread Renato via Digitalmars-d-learn

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:


I forgot to mention: the Java version is using a Trie... and 
it consistently beats the Rust numeric algorithm (which means 
it's still faster than your D solution), but the Java version 
that's equivalent to Rust's implementation is around 3x 
slower... i.e. it runs at about the same speed as my current 
fastest numeric algorithm in D as well.


This is what I would like to be discussing in this thread: why 
is D running at Java speeds and not at D speeds when using the 
same algorithm? I know there are small differences in the 
implementations, they are different languages after all, but 
not enough IMO to justify anything like a 3x difference from 
Rust.




My guess is that's because int128 is not that much optimized, 
due to being a not-so-popular type, though answering what's 
wrong would require looking at the assembly code produced for 
both D and Rust.


Additionally, if you are comparing D by measuring DMD 
performance - don't.
It is valuable in development for fast iterations, but it lacks 
many modern optimization techniques; for that we have LDC and 
GDC.


I am not using int128 anymore. I explained why a few posts back. 
I am using a byte array and computing the hash incrementally when 
trying different input, so that partially computed hashes are 
re-used on each try (this is a bit cheating, as Rust is not doing 
that, but I consider that to be acceptable as it's still 
computing hashes and looking up entries in the associative array).


I used all D compilers and picked the fastest one (GDC in the 
case of int128, but LDC2 in the current case).


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-19 Thread Renato via Digitalmars-d-learn

On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote:
On Thu, Jan 18, 2024 at 04:23:16PM +, Renato via 
Digitalmars-d-learn wrote: [...]

Ok, last time I'm running this for someone else :D

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,30
./rust,24018944,147
./rust,24068096,592
./rust,24150016,1187
./rust,7766016,4972
./rust,8011776,46101
===> src/d/dencoder
src/d/dencoder,44154880,42
src/d/dencoder,51347456,87
src/d/dencoder,51380224,273
src/d/dencoder,51462144,441
src/d/dencoder,18644992,4414
src/d/dencoder,18710528,43548
```


OK, this piqued my interest enough that I decided to install 
rust using rustup instead of my distro's package manager.  Here 
are the numbers I got for my machine:


===> ./rust
./rust,22896640,35
./rust,22896640,137
./rust,22384640,542
./rust,22896640,1034
./rust,8785920,2489
./rust,8785920,12157
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,72
src/d/dencoder,1066799104,198
src/d/dencoder,1066799104,344
src/d/dencoder,1035292672,2372
src/d/dencoder,1035292672,13867

Looks like we lost out to Rust for larger inputs. :-D  Probably 
due to environmental factors (and the fact that std.stdio is 
slow).  I re-ran it again and got this:


===> ./rust
./rust,22896640,30
./rust,22896640,131
./rust,22896640,511
./rust,22896640,983
./rust,8785920,3102
./rust,8785920,9912
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,71
src/d/dencoder,1066799104,197
src/d/dencoder,1066799104,355
src/d/dencoder,1035292672,3441
src/d/dencoder,1035292672,9471

Notice the significant discrepancy between the two runs; this 
seems to show that the benchmark is only accurate up to about 
±1.5 seconds.



That's not correct. The discrepancy is due to the benchmark 
always generating different input on each run - and the 
characteristics of the input affect runs significantly. This 
affects the last two runs much more because they use the more 
challenging dictionary. The important thing is that the relative 
performance between languages is reliable.


If you stop re-generating the phone numbers and just run the 
benchmark multiple times using the same input, you'll see it's 
very close between runs.




Anyway, oddly enough, Java seems to beat Rust on larger inputs. 
 Maybe my Java compiler has a better JIT implementation? :-P


Again, in case I haven't made it clear by repeating this multiple 
times: The Java code is using a Trie, the Rust code is using a 
numeric solution. They are completely different algorithms. Much 
more important than which language is being used is the 
algorithm, as has been shown again and again.




Congratulations on beating Rust :D but remember: you're using 
a much more efficient algorithm! I must conclude that the Rust 
translation of the Trie algorithm would be much faster still, 
unfortunately (you may have noticed that I am on D's side 
here!).


At this point, it's not really about the difference between 
languages anymore; it's about the programmer's skill at 
optimizing his code.


Traditionally Java is thought to be the slowest, because it 
runs in a VM and generally tends to use more heap allocations.  
In recent times, however, JIT and advanced GC implementations 
have significantly levelled that out, so you're probably not 
going to see the difference unless you hand-tweak your code 
down to the bare metal.


Java has been very fast for over a decade now, this is not new at 
all.




Surprisingly, at least on my machine, Lisp actually performed 
the worst. I'd have thought it would at least beat Java, but I 
was quite wrong. :-D Perhaps the Lisp implementation I'm using 
is suboptimal, I don't know. Or perhaps modern JVMs have really 
overtaken Lisp.


Please don't compare different algorithms in different languages 
and make conclusions about each language's speed.




Now I'm really curious how a Rust version of the trie algorithm 
would perform.  Unfortunately I don't know Rust so I wouldn't 
be able to write it myself. (Hint, hint, nudge, nudge ;-)).


As far as the performance of my D version is concerned, I still 
haven't squeezed out all the performance I could yet.  Going 
into this, my intention was to take the lazy way of optimizing 
only what the profiler points out to me, with the slight 
ulterior motive of proving that a relatively small amount of 
targeted optimizations can go a long way at making the GC a 
non-problem in your typical D code. ;-)  I haven't pulled out 
all the optimization guns at my disposal yet.



You don't really have to... @ssvb's solution is incredibly fast 
at the "count" problem and I really don't think anyone can beat 
that implementation. The only problem is that the implementation 
is very C-like and nothing like the D I would write.




If I were to go the next step, I'd split up the impl() function 
so that I get a better profile of where it's spending most of 
its time, and then optimize that.  My current suspicion is that 
the traversal of the 

Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-18 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 18, 2024 at 04:23:16PM +, Renato via Digitalmars-d-learn wrote:
[...]
> Ok, last time I'm running this for someone else :D
> 
> ```
> Proc,Run,Memory(bytes),Time(ms)
> ===> ./rust
> ./rust,23920640,30
> ./rust,24018944,147
> ./rust,24068096,592
> ./rust,24150016,1187
> ./rust,7766016,4972
> ./rust,8011776,46101
> ===> src/d/dencoder
> src/d/dencoder,44154880,42
> src/d/dencoder,51347456,87
> src/d/dencoder,51380224,273
> src/d/dencoder,51462144,441
> src/d/dencoder,18644992,4414
> src/d/dencoder,18710528,43548
> ```

OK, this piqued my interest enough that I decided to install rust using
rustup instead of my distro's package manager.  Here are the numbers I
got for my machine:

===> ./rust
./rust,22896640,35
./rust,22896640,137
./rust,22384640,542
./rust,22896640,1034
./rust,8785920,2489
./rust,8785920,12157
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,72
src/d/dencoder,1066799104,198
src/d/dencoder,1066799104,344
src/d/dencoder,1035292672,2372
src/d/dencoder,1035292672,13867

Looks like we lost out to Rust for larger inputs. :-D  Probably due to
environmental factors (and the fact that std.stdio is slow).  I re-ran
it again and got this:

===> ./rust
./rust,22896640,30
./rust,22896640,131
./rust,22896640,511
./rust,22896640,983
./rust,8785920,3102
./rust,8785920,9912
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,71
src/d/dencoder,1066799104,197
src/d/dencoder,1066799104,355
src/d/dencoder,1035292672,3441
src/d/dencoder,1035292672,9471

Notice the significant discrepancy between the two runs; this seems to
show that the benchmark is only accurate up to about ±1.5 seconds.

Anyway, oddly enough, Java seems to beat Rust on larger inputs.  Maybe
my Java compiler has a better JIT implementation? :-P


> Congratulations on beating Rust :D but remember: you're using a much
> more efficient algorithm! I must conclude that the Rust translation of
> the Trie algorithm would be much faster still, unfortunately (you may
> have noticed that I am on D's side here!).

At this point, it's not really about the difference between languages
anymore; it's about the programmer's skill at optimizing his code.

Traditionally Java is thought to be the slowest, because it runs in a VM
and generally tends to use more heap allocations.  In recent times,
however, JIT and advanced GC implementations have significantly levelled
that out, so you're probably not going to see the difference unless you
hand-tweak your code down to the bare metal.

Surprisingly, at least on my machine, Lisp actually performed the worst.
I'd have thought it would at least beat Java, but I was quite wrong. :-D
Perhaps the Lisp implementation I'm using is suboptimal, I don't know.
Or perhaps modern JVMs have really overtaken Lisp.

Now I'm really curious how a Rust version of the trie algorithm would
perform.  Unfortunately I don't know Rust so I wouldn't be able to write
it myself. (Hint, hint, nudge, nudge ;-)).

As far as the performance of my D version is concerned, I still haven't
squeezed out all the performance I could yet.  Going into this, my
intention was to take the lazy way of optimizing only what the profiler
points out to me, with the slight ulterior motive of proving that a
relatively small amount of targeted optimizations can go a long way at
making the GC a non-problem in your typical D code. ;-)  I haven't
pulled out all the optimization guns at my disposal yet.

If I were to go the next step, I'd split up the impl() function so that
I get a better profile of where it's spending most of its time, and then
optimize that.  My current suspicion is that the traversal of the trie
could be improved by caching intermediate results to eliminate a good
proportion of recursive calls in impl().

Also, the `print` mode of operation is quite slow, probably because
writefln() allocates. (It allocates less than if I had used .format like
I did before, but it nevertheless still allocates.) To alleviate this
cost, I'd allocate an output buffer and write to that, flushing only
once it filled up.
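
A minimal sketch of that output buffering (the flush threshold and
the use of rawWrite are my assumptions):

```d
import std.array : appender;
import std.format : formattedWrite;
import std.stdio : stdout;

// Hypothetical sketch: accumulate output and flush in large chunks,
// instead of one allocating writefln() call per match.
void printAll(const(char[])[] phones, const(char[])[] matches)
{
    auto buf = appender!(char[]);
    foreach (i, phone; phones)
    {
        buf.formattedWrite("%s: %s\n", phone, matches[i]);
        if (buf.data.length >= 64 * 1024)  // flush threshold: my choice
        {
            stdout.rawWrite(buf.data);
            buf.clear();
        }
    }
    stdout.rawWrite(buf.data);             // flush whatever is left
}
```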

Another thing I could do is to use std.parallelism.parallel to run
searches on batches of phone numbers in parallel. This is kinda
cheating, though, since it's the same algorithm with the same cost,
we're just putting more CPU cores to work. :-P  But in D this is quite
easy to do, often as easy as simply adding .parallel to your outer
foreach loop. In this particular case it will need some additional
refactoring due to the fact that the input is being read line by line.
But it's relatively easy to load the input into a buffer by chunks
instead, and just run the searches on all the numbers found in the
buffer in parallel.
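
A minimal sketch of that chunked, parallel variant (names and the
up-front loading are my assumptions; the actual search is elided):

```d
import std.array : array;
import std.parallelism : parallel;
import std.stdio : File;
import core.atomic : atomicOp;

// Hypothetical sketch: load the numbers up front, then let
// std.parallelism distribute the searches across cores.
void countAll(string inputFile)
{
    auto phones = File(inputFile).byLineCopy.array;
    shared size_t total;
    foreach (phone; phones.parallel)
    {
        // ... run the dictionary search for `phone` here ...
        atomicOp!"+="(total, 1);  // bump the shared counter atomically
    }
}
```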


On Thu, Jan 18, 2024 at 04:25:45PM +, Renato via Digitalmars-d-learn wrote:
[...]
> BTW here's you main function so it can run on the benchmark:
[...]

Thanks, I've adapted my code accordingly and pushed to my github repo.


T

-- 
This is a tpyo.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-18 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 16:54:00 UTC, H. S. Teoh wrote:
On Wed, Jan 17, 2024 at 07:57:02AM -0800, H. S. Teoh via 
Digitalmars-d-learn wrote: [...]

I'll push the code to github.

[...]

Here: 
https://github.com/quickfur/prechelt/blob/master/encode_phone.d



T


BTW here's your main function so it can run on the benchmark:

```d
int main(string[] args)
{
    bool countOnly = args.length > 1 ? (() {
        final switch (args[1])
        {
            case "count":
                return true;
            case "print":
                return false;
        }
    })() : false;

    auto dictfile = args.length > 2 ? args[2] : "tests/words.txt";
    auto input = args.length > 3 ? args[3] : "tests/numbers.txt";

    Trie dict = loadDictionary(File(dictfile).byLine);

    if (countOnly)
    {
        size_t count;
        encodePhoneNumbers(File(input).byLine, dict,
            (phone, match) { count++; });
        writefln("%d", count);
    }
    else
    {
        encodePhoneNumbers(File(input).byLine, dict, (phone, match) {
            writefln("%s: %s", phone, match);
        });
    }

    return 0;
}
```


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-18 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 16:54:00 UTC, H. S. Teoh wrote:
On Wed, Jan 17, 2024 at 07:57:02AM -0800, H. S. Teoh via 
Digitalmars-d-learn wrote: [...]

I'll push the code to github.

[...]

Here: 
https://github.com/quickfur/prechelt/blob/master/encode_phone.d



T


Ok, last time I'm running this for someone else :D

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,30
./rust,24018944,147
./rust,24068096,592
./rust,24150016,1187
./rust,7766016,4972
./rust,8011776,46101
===> src/d/dencoder
src/d/dencoder,44154880,42
src/d/dencoder,51347456,87
src/d/dencoder,51380224,273
src/d/dencoder,51462144,441
src/d/dencoder,18644992,4414
src/d/dencoder,18710528,43548
```

Congratulations on beating Rust :D but remember: you're using a 
much more efficient algorithm! I must conclude that the Rust 
translation of the Trie algorithm would be much faster still, 
unfortunately (you may have noticed that I am on D's side here!).


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 16:30:08 UTC, H. S. Teoh wrote:
On Wed, Jan 17, 2024 at 07:19:39AM +, Renato via 
Digitalmars-d-learn wrote: [...]
But pls run the benchmarks yourself as I am not going to keep 
running it for you, and would be nice if you posted your 
solution on a Gist for example, pasting lots of code in the 
forum makes it difficult to follow.


I can't. I spent half an hour trying to get ./benchmark.sh to 
run, but no matter what it could not compile benchmark_runner. 
It complains that my rustc is too old and some dependencies do 
not support it. I tried running the suggested cargo update 
command to pin the versions but none of them worked.  Since I'm 
not a Rust user, I'm not feeling particularly motivated right 
now to spend any more time on this.  Upgrading my rustc isn't 
really an option because that's the version currently in my 
distro and I really don't feel like spending more time to 
install a custom version of rustc just for this benchmark.



T


I've just updated the Rust version so the benchmark monitor could 
work on Linux (it only worked on Mac before) :D that's probably 
why your rustc didn't work, though as the project is still using 
edition2018, I would've thought even a very old compiler would 
have worked... anyway, if you ever find yourself actually using 
Rust, you should use `rustup` (https://rustup.rs/), which makes 
it trivial to update Rust.


About the "count" option: I had assumed you didn't call format on 
the count option as it's never needed, there's nothing to print.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread H. S. Teoh via Digitalmars-d-learn
On Wed, Jan 17, 2024 at 07:57:02AM -0800, H. S. Teoh via Digitalmars-d-learn 
wrote:
[...]
> I'll push the code to github.
[...]

Here: https://github.com/quickfur/prechelt/blob/master/encode_phone.d


T

-- 
Why do conspiracy theories always come from the same people??


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread H. S. Teoh via Digitalmars-d-learn
On Wed, Jan 17, 2024 at 07:19:39AM +, Renato via Digitalmars-d-learn wrote:
[...]
> But pls run the benchmarks yourself as I am not going to keep running
> it for you, and would be nice if you posted your solution on a Gist
> for example, pasting lots of code in the forum makes it difficult to
> follow.

I can't. I spent half an hour trying to get ./benchmark.sh to run, but
no matter what it could not compile benchmark_runner. It complains that
my rustc is too old and some dependencies do not support it. I tried
running the suggested cargo update command to pin the versions but none
of them worked.  Since I'm not a Rust user, I'm not feeling particularly
motivated right now to spend any more time on this.  Upgrading my rustc
isn't really an option because that's the version currently in my distro
and I really don't feel like spending more time to install a custom
version of rustc just for this benchmark.


T

-- 
Today's society is one of specialization: as you grow, you learn more and more 
about less and less. Eventually, you know everything about nothing.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread H. S. Teoh via Digitalmars-d-learn
On Wed, Jan 17, 2024 at 07:19:39AM +, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 22:13:55 UTC, H. S. Teoh wrote:
> > used for the recursive calls. Getting rid of the .format ought to
> > speed it up a bit. Will try that now...
> > 
> 
> That will make no difference for the `count` option which is where
> your solution was very slow.

Of course it will. Passing the data directly to the callback that bumps
a counter is faster than allocating a new string, formatting the data,
and then passing it to the callback that bumps a counter.  It may not
look like much, but avoiding unnecessary GC allocations means the GC
will have less work to do later when a collection is run, thus you save
time over the long term.


> To run the slow test manually use the `words_quarter.txt` dictionary
> (the phone numbers file doesn't matter much - it's all in the
> dictionary).
> 
> But pls run the benchmarks yourself as I am not going to keep running
> it for you, and would be nice if you posted your solution on a Gist
> for example, pasting lots of code in the forum makes it difficult to
> follow.

I'll push the code to github.


T

-- 
"No, John.  I want formats that are actually useful, rather than over-featured 
megaliths that address all questions by piling on ridiculous internal links in 
forms which are hideously over-complex." -- Simon St. Laurent on xml-dev


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 11:56:19 UTC, evilrat wrote:

On Wednesday, 17 January 2024 at 11:20:14 UTC, Renato wrote:


That means the input file is still not ASCII (or UTF-8) as it 
should be. Java is reading files with the ASCII encoding, so it 
should've worked fine.


It seems that it only works with ASCII encoding, though.


Yes, that's according to the rules - only ASCII for everything.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread evilrat via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 11:20:14 UTC, Renato wrote:


That means the input file is still not ASCII (or UTF-8) as it 
should be. Java is reading files with the ASCII encoding, so it 
should've worked fine.


It seems that it only works with ASCII encoding, though.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 10:50:26 UTC, evilrat wrote:

On Wednesday, 17 January 2024 at 10:43:22 UTC, Renato wrote:

On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote:


It's not Java writing the file, it's the bash script 
[`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31):


```
java -cp "build/util" util.GeneratePhoneNumbers 1000 > 
phones_1000.txt

```



Perhaps using this option when running Java will help:

```
java -Dfile.encoding=UTF-8 ...
```


I've used a PowerShell env var to set the output to UTF-8; the D 
version now works but the Java one doesn't.


```
java -Xms20M -Xmx100M -cp build/java Main print words-quarter.txt phones_1000.txt
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 65485 out of bounds for length 10
	at Trie.completeSolution(Main.java:216)
	at Trie.forEachSolution(Main.java:192)
	at PhoneNumberEncoder.encode(Main.java:132)
	at Main.lambda$main$1(Main.java:38)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at Main.main(Main.java:38)
```


This is this line:

```
var digit = chars[ index ] - 48;
```

That means the input file is still not ASCII (or UTF-8) as it 
should be. Java is reading files with the ASCII encoding, so it 
should've worked fine.
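
Incidentally, the out-of-bounds index itself is telling: 65485 is 
exactly the Unicode replacement character U+FFFD (65533), which a 
decoder substitutes for undecodable bytes, minus the 48 subtracted 
above - assuming that is what happened here:

```d
// 65485 == 0xFFFD - '0': the digit math applied to U+FFFD, the
// replacement character a decoder inserts for bad input.
static assert(0xFFFD - 48 == 65485);
```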


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-17 Thread evilrat via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 10:43:22 UTC, Renato wrote:

On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote:


It's not Java writing the file, it's the bash script 
[`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31):


```
java -cp "build/util" util.GeneratePhoneNumbers 1000 > 
phones_1000.txt

```



Perhaps using this option when running Java will help:

```
java -Dfile.encoding=UTF-8 ...
```


I've used a PowerShell env var to set the output to UTF-8; the D 
version now works but the Java one doesn't.


```
java -Xms20M -Xmx100M -cp build/java Main print words-quarter.txt phones_1000.txt
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 65485 out of bounds for length 10
	at Trie.completeSolution(Main.java:216)
	at Trie.forEachSolution(Main.java:192)
	at PhoneNumberEncoder.encode(Main.java:132)
	at Main.lambda$main$1(Main.java:38)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at Main.main(Main.java:38)
```


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-17 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote:


It's not Java writing the file, it's the bash script 
[`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31):


```
java -cp "build/util" util.GeneratePhoneNumbers 1000 > 
phones_1000.txt

```



Perhaps using this option when running Java will help:

```
java -Dfile.encoding=UTF-8 ...
```



Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-17 Thread Renato via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 09:15:02 UTC, evilrat wrote:

On Wednesday, 17 January 2024 at 07:11:02 UTC, Renato wrote:


If you want to check your performance, you know you can run 
the `./benchmark.sh` yourself?


Out of curiosity I've tried to run this manually on Windows, and 
it seems the Java generator for these number files is "broken": 
count or print runs fine for both the Java and D versions 
provided in your D branch, but fails with the generated files.


The D version complains about bad UTF-8 encoding.
I've opened the generated file in a text editor and it is UTF-16 
(little-endian with BOM).


Tried with Adoptium JDK 17 and 21 (formerly AdoptOpenJDK), but I 
guess it doesn't matter, since UTF-16 is the default on Windows.


It's not Java writing the file, it's the bash script 
[`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31):


```
java -cp "build/util" util.GeneratePhoneNumbers 1000 > 
phones_1000.txt

```

Java is just printing to stdout; I wasn't aware that piping like 
this would use the OS default encoding. Unfortunately I don't 
have Windows here, but maybe try changing the encoding used by 
the bash script?


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-17 Thread evilrat via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 07:06:25 UTC, Renato wrote:
On Tuesday, 16 January 2024 at 22:15:04 UTC, Siarhei Siamashka 
wrote:

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

It's a GC allocations fest. Things like this make it slow:

```diff
 {
-string digit = [digits[0]];
+string digit = digits[0 .. 1];
 words.insertBack(digit);
```


I was under the impression that `[digits[0]]` would just use a 
stack allocation??


The profiler does not show any GC anymore, are you sure it's a 
"GC allocations fest"???




Nah, you are allocating a new array out of a single digit, while 
the latter just takes a slice.


There is the 'scope' storage class for when you know your 
variable won't escape the scope, to place it on the stack (works 
for classes too), but I'm not sure if it will work with an array.


`scope string digit = [digits[0]];`
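
For illustration, here is a minimal, self-contained sketch of the 
difference being pointed out (names invented for this example; 
whether `scope` lets the literal live on the stack is compiler- 
and flags-dependent):

```d
import std.stdio : writeln;

void main()
{
    string digits = "12345";

    // Array literal: allocates a fresh 1-element array on the GC heap
    // every time this line runs.
    string copied = [digits[0]];

    // Slice: just a pointer + length pair into the existing string;
    // no allocation at all.
    string sliced = digits[0 .. 1];

    writeln(copied, " ", sliced); // prints: 1 1
}
```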


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-17 Thread evilrat via Digitalmars-d-learn

On Wednesday, 17 January 2024 at 07:11:02 UTC, Renato wrote:


If you want to check your performance, you know you can run the 
`./benchmark.sh` yourself?


Out of curiosity I've tried to run this manually on Windows, and 
it seems the Java generator for these number files is "broken": 
count or print runs fine for both the Java and D versions 
provided in your D branch, but fails with the generated files.


The D version complains about bad UTF-8 encoding.
I've opened the generated file in a text editor and it is UTF-16 
(little-endian with BOM).


Tried with Adoptium JDK 17 and 21 (formerly AdoptOpenJDK), but I 
guess it doesn't matter, since UTF-16 is the default on Windows.


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Renato via Digitalmars-d-learn

On Tuesday, 16 January 2024 at 22:13:55 UTC, H. S. Teoh wrote:
[...] used for the recursive calls. Getting rid of the .format ought 
to speed it up a bit. Will try that now...




That will make no difference for the `count` option, which is 
where your solution was very slow. To run the slow test manually, 
use the `words-quarter.txt` dictionary (the phone numbers file 
doesn't matter much - it's all in the dictionary).


But please run the benchmarks yourself, as I am not going to keep 
running them for you. It would also be nice if you posted your 
solution on a Gist, for example; pasting lots of code in the 
forum makes it difficult to follow.


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Renato via Digitalmars-d-learn
On Tuesday, 16 January 2024 at 22:15:04 UTC, Siarhei Siamashka 
wrote:

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:
For the record (I already posted this on GitHub)... here's the 
time of [my current fastest 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), using the same algorithm as Rust ...


[...]

... what I am really curious about is what the code I wrote is 
doing wrong that causes it to run 4x slower than Rust despite 
doing "the same thing"...


It's a GC allocations fest. Things like this make it slow:

```diff
 {
-string digit = [digits[0]];
+string digit = digits[0 .. 1];
 words.insertBack(digit);
```


I was under the impression that `[digits[0]]` would just use a 
stack allocation??


The profiler does not show any GC anymore, are you sure it's a 
"GC allocations fest"???



And at the top is the associative array lookup (when profiling 
the handling of the "phones_1_000_000_with_empty.txt" input 
file):


```
36.85%  dencoder  dencoder  [.] _aaInX
12.38%  dencoder  dencoder  [.] void



Well, I know, but I did everything to make the hash "cheap" to 
compute, so I still don't see a way to improve it.
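
For reference, this is roughly how a custom key type controls 
hashing in a D associative array - a hedged sketch with invented 
names, not the actual `Key` from the linked commit:

```d
// A struct used as an AA key supplies its own toHash/opEquals.
struct Key
{
    ulong packed; // e.g. the phone digits packed into an integer

    size_t toHash() const @safe pure nothrow @nogc
    {
        return cast(size_t) packed; // "cheap" hash: the value itself
    }

    bool opEquals(const Key other) const @safe pure nothrow @nogc
    {
        return packed == other.packed;
    }
}
```

Even with a hash this cheap, every lookup still pays for the 
druntime `_aaInX` routine itself (bucket probing and equality 
checks), which is why it can still dominate the profile.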





Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 16, 2024 at 10:15:04PM +, Siarhei Siamashka via 
Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:
[...]
> > ... what I am really curious about is what the code I wrote is doing
> > wrong that causes it to run 4x slower than Rust despite doing "the
> > same thing"...
> 
> It's a GC allocations fest.

Indeed.

I have just completed 2 rounds of optimizations of my version of the
code, and both times the profiler also showed the problem to be
excessive allocations in the inner loop.  So, I did the following
optimizations:

1) Get rid of .format in the inner loop. Not only does .format cause a
lot of allocations, it is also a known performance hog.  So instead
of constructing the output string in the search function, I changed it
to take a delegate instead, and the delegate either counts the result or
prints it directly (bypassing the construction of an intermediate
string).  This improved performance quite a bit for the count-only runs,
but also wins some performance even when output is generated.  Overall,
however, this optimization only gave me some minor savings.

2) Changed the `path` parameter from string[] to string, since I didn't
really need it to be an array of strings anyway. This in itself only
improved performance marginally, barely noticeable, but it led to (3),
which gave a huge performance boost.

3) Basically, in the earlier version of the code, the `path` parameter
was appended to every time I recursed, and furthermore the same initial
segment gets appended to many times with different trailers as the
algorithm walks the trie. As a result, this triggers a lot of array
reallocations to store the new strings.  Most of these allocations are
unnecessary, because we already know that the initial segment of the
string will stay constant, only the tail end changes. Furthermore, we
only ever have a single instance of .path at any point in time in the
algorithm.  So we could use a single buffer to hold all of these
instances of .path, and simply return slices to it as we go along,
overwriting the tail end each time we need to append something.

This significantly cut down on the number of allocations, and along
with (1) and (2), performance improved by about 3x (!).  It didn't
completely remove all allocations, but I'm reasonably happy with the
performance now that I probably won't try to optimize it more unless
it's still losing out to another language. ;-)

(I'm especially curious to see if this beats the Rust version. :-P)
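
To make (1) and (3) concrete before the full listing, here is a 
minimal sketch with invented names (not the actual code, which 
follows below):

```d
char[1024] pathBuf; // assumed big enough for any single encoding

// (3): `path` is always pathBuf[0 .. path.length], so appending a word
// only overwrites the tail of the shared buffer and returns a longer
// slice: no reallocation, no copying of the prefix.
const(char)[] appendWord(const(char)[] path, const(char)[] word)
{
    auto end = path.length + word.length;
    pathBuf[path.length .. end] = word;
    return pathBuf[0 .. end];
}

// (1): results are reported through a delegate, so a counting run
// never materializes an output string at all.
void reportMatch(const(char)[] path, void delegate(const(char)[]) sink)
{
    sink(path);
}
```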

Optimized version of the code:

---snip
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
digitOf = [
'E': 0,
'J': 1, 'N': 1, 'Q': 1,
'R': 2, 'W': 2, 'X': 2,
'D': 3, 'S': 3, 'Y': 3,
'F': 4, 'T': 4,
'A': 5, 'M': 5,
'C': 6, 'I': 6, 'V': 6,
'B': 7, 'K': 7, 'U': 7,
'L': 8, 'O': 8, 'P': 8,
'G': 9, 'H': 9, 'Z': 9,
];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
Trie[10] edges;
string[] words;

private void insert(string word, string suffix)
{
const(ubyte)* dig;
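// Skip over leading characters that have no digit mapping (e.g. the
// umlaut-marking quote characters in the dictionary).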
while (!suffix.empty &&
   (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
{
suffix = suffix[1 .. $];
}

if (suffix.empty)
{
words ~= word;
return;
}

auto idx = *dig;
if (edges[idx] is null)
{
edges[idx] = new Trie;
}
edges[idx].insert(word, suffix[1 .. $]);
}

/**
 * Insert a word into the Trie.
 *
 * Characters that don't map to any digit are ignored in building the Trie.
 * However, the original form of the word will be retained as-is in the
 * leaf node.
 */
void insert(string word)
{
insert(word, word[]);
}

/**
 * Iterate over all words stored in this Trie.
 */
void foreachEntry(void delegate(string path, string word) cb)
{
void impl(Trie node, string path = "")
{
if (node is null) return;
foreach (word; node.words)
{
cb(path, word);
}
foreach (i, child; node.edges)
{
impl(child, path ~ cast(char)('0' + i));
}
}
impl(this);
}
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
if (isInputRange!R && is(ElementType!R : const(char)[]))
{
Trie result = new Trie;
foreach (line; lines)
{
result.insert(line.idup);
}
return result;
}

///
unittest
{
auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Siarhei Siamashka via Digitalmars-d-learn

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:
For the record (I already posted this on GitHub)... here's the 
time of [my current fastest 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), using the same algorithm as Rust ...


[...]

... what I am really curious about is what the code I wrote is 
doing wrong that causes it to run 4x slower than Rust despite 
doing "the same thing"...


It's a GC allocations fest. Things like this make it slow:

```diff
 {
-string digit = [digits[0]];
+string digit = digits[0 .. 1];
 words.insertBack(digit);
```

And at the top is the associative array lookup (when profiling 
the handling of the "phones_1_000_000_with_empty.txt" input file):


```
36.85%  dencoder  dencoder  [.] _aaInX
12.38%  dencoder  dencoder  [.] void 
dencoder.printTranslations(immutable(char)[][][dencoder.Key], 
dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], 
std.container.array.Array!(immutable(char)[]).Array)
 6.43%  dencoder  dencoder  [.] nothrow @nogc 
scope void core.internal.gc.impl.conservative.gc.Gcx.mark!(false, 
true, 
true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange)
 4.53%  dencoder  dencoder  [.] pure @safe void 
std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult.popFront()
 4.08%  dencoder  dencoder  [.] 
_d_array_slice_copy

 2.43%  dencoder  dencoder  [.] _d_newarrayU
 2.21%  dencoder  dencoder  [.] shared nothrow 
@nogc @trusted void core.internal.spinlock.SpinLock.lock()
 2.00%  dencoder  dencoder  [.] pure @safe void 
std.array.Appender!(immutable(char)[]).Appender.put!(dchar).put(dchar)
 1.91%  dencoder  dencoder  [.] nothrow void* 
core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref 
ulong, uint, const(TypeInfo))
 1.67%  dencoder  dencoder  [.] 
_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_
 1.66%  dencoder  dencoder  [.] pure nothrow 
@nogc scope @trusted ulong 
core.internal.gc.bits.GCBits.setLocked(ulong)
 1.60%  dencoder  dencoder  [.] const pure 
nothrow @trusted ulong object.TypeInfo_Struct.getHash(scope 
const(void*))
 1.53%  dencoder  dencoder  [.] nothrow @safe 
void 
core.internal.util.array.enforceRawArraysConformable(const(char[]), const(ulong), const(void[]), const(void[]), const(bool))
 1.49%  dencoder  dencoder  [.] nothrow void* 
core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo))
 1.27%  dencoder  dencoder  [.] pure nothrow 
@safe void 
std.array.Appender!(immutable(char)[]).Appender.ensureAddable(ulong)
 0.81%  dencoder  dencoder  [.] pure @property 
@safe dchar 
std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult.front()
 0.79%  dencoder  dencoder  [.] nothrow @safe 
void core.internal.util.array._enforceNoOverlap(const(char[]), 
ulong, ulong, const(ulong))
 0.74%  dencoder  dencoder  [.] nothrow void 
core.internal.gc.impl.conservative.gc.Pool.setBits(ulong, uint)
 0.73%  dencoder  dencoder  [.] pure @safe void 
std.array.Appender!(immutable(char)[]).Appender.put!(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult).put(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult)
 0.70%  dencoder  dencoder  [.] pure @safe ulong 
std.range.primitives.walkLength!(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult).walkLength(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult)
 0.60%  dencoder  dencoder  [.] bool 
dencoder.lastItemIsDigit(std.container.array.Array!(immutable(char)[]).Array)

 0.54%  dencoder  dencoder  [.] _d_newarrayT
[...]
```


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 16, 2024 at 09:15:19PM +, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 20:34:48 UTC, H. S. Teoh wrote:
> > On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via
> > Digitalmars-d-learn wrote: [...]
> > > Anyway, I've fixed the problem, now my program produces the exact
> > > same output as Renato's repo. Code is posted below.
> > [...]
> > 
> 
> Great, I ran the benchmarks for you :)
> 
> I had to change how you accept arguments; even though you did "the
> right thing" using `getopt`, the solutions should just take a `count`
> or `print` argument first...

Oops, haha :-P


> Anyway, here's your result:
> 
> ```
> ===> ./rust
> ./rust,24133632,25
> ./rust,24739840,130
> ./rust,24477696,536
> ./rust,25247744,1064
> ./rust,8175616,6148
> ./rust,8306688,8315
> ===> src/d/dencoder
> src/d/dencoder,46055424,43
> src/d/dencoder,96337920,146
> src/d/dencoder,102350848,542
> src/d/dencoder,102268928,1032
> src/d/dencoder,40206336,99936
> ^C
> ```
> 
> It took too long with the `count` option, so I had to abort before the
> last run ended... there's probably some bug there, otherwise the Trie
> runs very fast, as I had expected.
[...]

Do you have the problematic data file handy?  I'd like to look into any
potential bugs.

Also, the profiler revealed that a lot of time was spent in the GC and
in small allocations.  The cause is in all likelihood the .format() call
for each found match, and array append being used for the recursive
calls. Getting rid of the .format ought to speed it up a bit. Will try
that now...


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- 
Christopher


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Siarhei Siamashka via Digitalmars-d-learn

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:
I can't explain why it's so incredibly fast, specially for the 
`count` case. I tried using the same hashing function on my 
solution, but that didn't really help me!


That's dynamic programming with memoization. Basically caching 
the already calculated results (in the `dp` array) and avoiding 
recalculations when the recursive function revisits the same 
state again.
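
A hedged toy sketch of the idea (invented names; the real `dp` 
usage in the linked solution is more involved): count the ways a 
digit string can be split into dictionary matches, caching the 
per-position answer so each suffix is solved at most once.

```d
import std.algorithm.searching : startsWith;
import std.stdio : writeln;

// dp[pos] == -1 means "not computed yet"; otherwise it caches the count.
long countSplits(string digits, string[] wordDigits, long[] dp, size_t pos = 0)
{
    if (pos == digits.length) return 1;
    if (dp[pos] >= 0) return dp[pos]; // cache hit: no re-recursion
    long total = 0;
    foreach (w; wordDigits)
        if (digits[pos .. $].startsWith(w))
            total += countSplits(digits, wordDigits, dp, pos + w.length);
    return dp[pos] = total; // memoize before returning
}

void main()
{
    auto digits = "562482";
    auto dp = new long[digits.length];
    dp[] = -1;
    writeln(countSplits(digits, ["562", "482", "5624", "82"], dp)); // 2
}
```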


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Renato via Digitalmars-d-learn

On Tuesday, 16 January 2024 at 20:34:48 UTC, H. S. Teoh wrote:
On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via 
Digitalmars-d-learn wrote: [...]
Anyway, I've fixed the problem, now my program produces the 
exact same output as Renato's repo. Code is posted below.

[...]



Great, I ran the benchmarks for you :)

I had to change how you accept arguments; even though you did 
"the right thing" using `getopt`, the solutions should just take 
a `count` or `print` argument first...


Anyway, here's your result:

```
===> ./rust
./rust,24133632,25
./rust,24739840,130
./rust,24477696,536
./rust,25247744,1064
./rust,8175616,6148
./rust,8306688,8315
===> src/d/dencoder
src/d/dencoder,46055424,43
src/d/dencoder,96337920,146
src/d/dencoder,102350848,542
src/d/dencoder,102268928,1032
src/d/dencoder,40206336,99936
^C
```

It took too long with the `count` option, so I had to abort 
before the last run ended... there's probably some bug there, 
otherwise the Trie runs very fast, as I had expected.


For the record (I already posted this on GitHub)... here's the 
time of [my current fastest 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), using the same algorithm as Rust (after I fixed my solution that was using `int128`, which was not valid, as @ssvb pointed out):


```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23986176,24
./rust,24346624,118
./rust,24543232,525
./rust,24346624,1027
./rust,8011776,3830
./rust,8486912,18403
===> src/d/dencoder
src/d/dencoder,13615104,30
src/d/dencoder,24723456,374
src/d/dencoder,24592384,1750
src/d/dencoder,24788992,3472
src/d/dencoder,11517952,10235
src/d/dencoder,11517952,51211
```

Not great :(

But @ssvb came up with a really fast implementation, though with 
a totally different algorithm (so don't take this as evidence of 
D being faster than Rust or anything like that):


```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23986176,24
./rust,24461312,135
./rust,24494080,514
./rust,24526848,981
./rust,8175616,3647
./rust,8011776,15404
===> src/d/dencoder
src/d/dencoder,16433152,30
src/d/dencoder,16613376,125
src/d/dencoder,16613376,468
src/d/dencoder,16613376,941
src/d/dencoder,5570560,12
src/d/dencoder,6701056,18
```

I can't explain why it's so incredibly fast, specially for the 
`count` case. I tried using the same hashing function on my 
solution, but that didn't really help me!


Pretty cool to see different solutions to the problem, but I'm 
still curious to know where the solution I wrote is being slow 
compared to Rust, which is using identical (to my understanding) 
code! I'm sure people could write incredibly fast code to solve 
this problem, but what I am really curious about is what the code 
I wrote is doing wrong that causes it to run 4x slower than Rust 
despite doing "the same thing"...


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via Digitalmars-d-learn 
wrote:
[...]
> Anyway, I've fixed the problem, now my program produces the exact same
> output as Renato's repo. Code is posted below.
[...]

Oops, forgot to actually paste the code. Here it is:

snip
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
digitOf = [
'E': 0,
'J': 1, 'N': 1, 'Q': 1,
'R': 2, 'W': 2, 'X': 2,
'D': 3, 'S': 3, 'Y': 3,
'F': 4, 'T': 4,
'A': 5, 'M': 5,
'C': 6, 'I': 6, 'V': 6,
'B': 7, 'K': 7, 'U': 7,
'L': 8, 'O': 8, 'P': 8,
'G': 9, 'H': 9, 'Z': 9,
];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
Trie[10] edges;
string[] words;

private void insert(string word, string suffix)
{
const(ubyte)* dig;
while (!suffix.empty &&
   (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
{
suffix = suffix[1 .. $];
}

if (suffix.empty)
{
words ~= word;
return;
}

auto idx = *dig;
if (edges[idx] is null)
{
edges[idx] = new Trie;
}
edges[idx].insert(word, suffix[1 .. $]);
}

/**
 * Insert a word into the Trie.
 *
 * Characters that don't map to any digit are ignored in building the Trie.
 * However, the original form of the word will be retained as-is in the
 * leaf node.
 */
void insert(string word)
{
insert(word, word[]);
}

/**
 * Iterate over all words stored in this Trie.
 */
void foreachEntry(void delegate(string path, string word) cb)
{
void impl(Trie node, string path = "")
{
if (node is null) return;
foreach (word; node.words)
{
cb(path, word);
}
foreach (i, child; node.edges)
{
impl(child, path ~ cast(char)('0' + i));
}
}
impl(this);
}
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
if (isInputRange!R && is(ElementType!R : const(char)[]))
{
Trie result = new Trie;
foreach (line; lines)
{
result.insert(line.idup);
}
return result;
}

///
unittest
{
auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

auto app = appender!(string[]);
dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
assert(app.data == [
"10: je",
"105513: jemand",
"107: neu",
"1550: Name",
"253302: Wasser",
"35: da",
"38: so",
"400: Fee",
"4021: fern",
"4034: Fest",
"482: Tor",
"4824: fort",
"4824: Torf",
"51: an",
"562: mir",
"562: Mix",
"56202: Mixer",
"78: Bo\"",
"783: bo\"s",
"7857: blau",
"7884: Boot",
"824: Ort",
"83: o\"d"
]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
if (isOutputRange!(W, string))
{
bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
{
if (node is null)
return false;

// Ignore non-digit characters in phone number
while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
suffix = suffix[1 .. $];

if (suffix.empty)
{
// Found a match, print result
foreach (word; node.words)
{
put(sink, format("%s: %-(%s %)", phoneNumber,
path.chain(only(word))));
}
return !node.words.empty;
}

bool ret;
foreach (word; node.words)
{
// Found a matching word, try to match the rest of the phone
// number.
ret = true;
if (impl(dict, suffix, path ~ word, true))
allowDigit = false;
}

if (impl(node.edges[suffix[0] - '0'], suffix[1 .. $], path, false))
{
allowDigit = false;
ret = true;
}

if (allowDigit)
{
// If we got here, it means that if we take the current node as the
// next word choice, then the following digit will have no further
// matches, and we may encode it as a single digit.
auto nextSuffix = suffix[1 .. $];

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 16, 2024 at 06:54:56PM +, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote:
[...]
> > You are not allowed to emit "1" as the first token in the output as
> > long as there are any dictionary word matches at that position. The
> > relevant paragraph from the problem statement:

Ohhh now I get it.  Initially I misunderstood that as saying that if the
rest of the phone number has at least one match, then a digit is not
allowed.  Now I see that what it's actually saying is that even if some
random dictionary word matches at that position, even if it does not
lead to any full matches, then a digit is excluded.


[...]
> > I also spent a bit of time trying to figure out this nuance when
> > implementing my solution. It doesn't make much sense visually (no
> > back-to-back digits in the output either way), but that's how it is.
> 
> Exactly, this is one of the things that make this problem a bit
> annoying to solve :)

It's a strange requirement, for sure, but I don't think it's annoying.
It makes the problem more Interesting(tm). ;-)

Anyway, I've fixed the problem, now my program produces the exact same
output as Renato's repo. Code is posted below. Interestingly enough, the
running time has now halved to about 0.9 seconds for 1 million phone
numbers. I guess that's caused by the more stringent requirement
excluding many more matching possibilities, effectively pruning away
large parts of the search tree.


> @"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's
> also what I did when I "participated" in the study. Here's [my Trie
> solution in
> Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).
> 
> These are basically the two common approaches to the problem: a Trie
> or a numeric-based table. According to the study, people who use
> scripting languages almost always go with the numeric approach, while
> people coming from lower level languages tend to use a data structure
> like Trie (if they don't know Trie, they come up with something
> similar which is fascinating), which is harder to implement but more
> efficient in general.

Interesting.  I guess my C/C++ background is showing. ;-)

I'm not sure what exactly motivated me to go this route; I guess it was
just my default preference of choosing the path of least work as far as
the algorithm is concerned: I chose the algorithm that did the least
amount of work needed to produce the right answer.  Scanning through
sections of the dictionary to find a match was therefore excluded; so my
first thought was an AA. But then how much of the initial prefix to look
up in an AA?  Since it can't be known beforehand, I'd have to gradually
lengthen the prefix to search for, which does a lot of repetitive work
(we keep looking up the first few digits repeatedly each time we search
for a longer prefix). Plus, multiple consecutive AA lookups is not
cache-friendly.  So my next thought was, what should I do such that I
don't have to look at the initial digits anymore once I already
processed it?  This line of thought naturally led to a trie structure.

Once I arrived at a trie structure, the next question was how exactly
dictionary entries would be stored in it.  Again, in the vein of doing
the least amount of work I could get away with, I thought, if I stored
words in the trie directly, with each edge encoding a letter, then
during the search I'd have to repeatedly convert letters to the
corresponding phone number digit and vice versa.  So why not do this
conversion beforehand, and store only phone digits in the trie?  This
also had the additional benefit of letting me effectively search
multiple letters simultaneously, since multiple letters map to the same
digit, so scanning a digit is equivalent to searching multiple letters
at the same time.  The output, of course, required the original form of
the words -- so the obvious solution was to attach the original words as
a list of words attached to the trie node representing the end of that
word.

Once this was all decided, the only remaining question was the search
algorithm. This turned out to take the most time in solving this
problem, due to the recursive nature of the search, I had to grapple
with where and how to make the recursive calls, and how to propagate
return values correctly.  The initial implementation only found word
matches, and did not allow the single digits.  Happily, the recursive
algorithm turned out to have enough structure to encode the single digit
requirements as well, although it took a bit of trial and error to find
the correct implementation.


> Can I ask why you didn't use the [D stdlib
> Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure
> that would've worked, but did you consider it?

Haha, I didn't even think of that. :-D  I wouldn't have wanted to use it
anyway, because it was optimized for 

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Renato via Digitalmars-d-learn
On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka 
wrote:

On Tuesday, 16 January 2024 at 15:50:35 UTC, H. S. Teoh wrote:
Unfortunately there seems to be some discrepancy between the 
output I got and the prescribed output in your repository. For 
example, in your output the number 1556/0 does not have an 
encoding, but isn't "1 Mai 0" a valid encoding according to 
your dictionary and the original problem description?


You are not allowed to emit "1" as the first token in the 
output as long as there are any dictionary word matches at that 
position. The relevant paragraph from the problem statement:


==snip==
Encodings of phone numbers can consist of a single word or of 
multiple
words separated by spaces. The encodings are built word by word 
from
left to right. If and only if at a particular point no word at 
all from
the dictionary can be inserted, a single digit from the phone 
number can
be copied to the encoding instead. Two subsequent digits are 
never
allowed, though. To put it differently: In a partial encoding 
that
currently covers k digits, digit k+1 is encoded by itself if 
and only if,
first, digit k was not encoded by a digit and, second, there is 
no word
in the dictionary that can be used in the encoding starting at 
digit k+1.

==snip==

I also spent a bit of time trying to figure out this nuance 
when implementing my solution. It doesn't make much sense 
visually (no back-to-back digits in the output either way), but 
that's how it is.


Exactly, this is one of the things that make this problem a bit 
annoying to solve :)


@"H. S. Teoh" you implemented the solution as a Trie!! Nice, 
that's also what I did when I "participated" in the study. Here's 
[my Trie solution in 
Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).


These are basically the two common approaches to the problem: a 
Trie or a numeric-based table. According to the study, people who 
use scripting languages almost always go with the numeric 
approach, while people coming from lower-level languages tend to 
use a data structure like a Trie (if they don't know the Trie, 
they come up with something similar, which is fascinating); that 
approach is harder to implement but more efficient in general.
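
As a rough illustration of the numeric-table approach (a hedged 
sketch with invented names; a real solution also needs the 
incremental prefix search and the single-digit rule around it):

```d
// Map each word's digit encoding to all words sharing that encoding.
string[][string] byDigits;

void addWord(string word, string digitsOfWord)
{
    byDigits[digitsOfWord] ~= word;
}

// Does some dictionary word encode exactly this digit sequence?
string[] wordsFor(string digitPrefix)
{
    if (auto hit = digitPrefix in byDigits)
        return *hit;
    return null;
}
```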


Can I ask why you didn't use the [D stdlib 
Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not 
sure that would've worked, but did you consider it?


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread Siarhei Siamashka via Digitalmars-d-learn

On Tuesday, 16 January 2024 at 15:50:35 UTC, H. S. Teoh wrote:
Unfortunately there seems to be some discrepancy between the 
output I got and the prescribed output in your repository. For 
example, in your output the number 1556/0 does not have an 
encoding, but isn't "1 Mai 0" a valid encoding according to 
your dictionary and the original problem description?


You are not allowed to emit "1" as the first token in the output 
as long as there are any dictionary word matches at that 
position. The relevant paragraph from the problem statement:


==snip==
Encodings of phone numbers can consist of a single word or of 
multiple
words separated by spaces. The encodings are built word by word 
from
left to right. If and only if at a particular point no word at 
all from
the dictionary can be inserted, a single digit from the phone 
number can

be copied to the encoding instead. Two subsequent digits are never
allowed, though. To put it differently: In a partial encoding that
currently covers k digits, digit k+1 is encoded by itself if and 
only if,
first, digit k was not encoded by a digit and, second, there is 
no word
in the dictionary that can be used in the encoding starting at 
digit k+1.

==snip==

I also spent a bit of time trying to figure out this nuance when 
implementing my solution. It doesn't make much sense visually (no 
back-to-back digits in the output either way), but that's how it 
is.
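
Restated as a predicate (a hedged sketch; names invented here, 
not taken from any of the posted solutions):

```d
// A digit may encode position k+1 only if the previous token was not
// a digit AND no dictionary word can start at position k+1 - even a
// word that leads to no complete encoding blocks the digit.
bool digitAllowed(bool prevTokenWasDigit, bool anyWordStartsHere)
{
    return !prevTokenWasDigit && !anyWordStartsHere;
}
```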


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
P.S. Compiling my program with `ldc -O2`, it runs so fast that I
couldn't measure any meaningful running time that's greater than startup
overhead.  So I wrote a helper program to generate random phone numbers
up to 50 characters long, and found that it could encode 1 million phone
numbers in 2.2 seconds (using the 75,000 entry dictionary from your
repository).  Counting vs. printing the results made no significant
difference to this.
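
The helper itself was not posted; a minimal sketch of such a 
generator might look like this (hedged: the real phone numbers in 
the problem may also contain '/' and '-' separators):

```d
import std.random : uniform;
import std.stdio : writeln;

void main()
{
    foreach (_; 0 .. 1_000_000)
    {
        auto len = uniform(1, 51); // 1 to 50 characters
        char[] number;
        number.reserve(len);
        foreach (i; 0 .. len)
            number ~= cast(char)('0' + uniform(0, 10));
        writeln(number);
    }
}
```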


T

-- 
People tell me that I'm skeptical, but I don't believe them.


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 16, 2024 at 07:50:35AM -0800, H. S. Teoh via Digitalmars-d-learn 
wrote:
[...]
> Unfortunately there seems to be some discrepancy between the output I
> got and the prescribed output in your repository. For example, in your
> output the number 1556/0 does not have an encoding, but isn't "1 Mai 0"
> a valid encoding according to your dictionary and the original problem
> description?
[...]

Also, found a bug in my program that misses some solutions when the
phone number has trailing non-digits. Here's the updated code.  It still
finds extra encodings compared to the output in your repo, though.  Maybe I
misunderstood part of the requirements?


--snip--
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
digitOf = [
'E': 0,
'J': 1, 'N': 1, 'Q': 1,
'R': 2, 'W': 2, 'X': 2,
'D': 3, 'S': 3, 'Y': 3,
'F': 4, 'T': 4,
'A': 5, 'M': 5,
'C': 6, 'I': 6, 'V': 6,
'B': 7, 'K': 7, 'U': 7,
'L': 8, 'O': 8, 'P': 8,
'G': 9, 'H': 9, 'Z': 9,
];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
Trie[10] edges;
string[] words;

private void insert(string word, string suffix)
{
const(ubyte)* dig;
while (!suffix.empty &&
   (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
{
suffix = suffix[1 .. $];
}

if (suffix.empty)
{
words ~= word;
return;
}

auto idx = *dig;
if (edges[idx] is null)
{
edges[idx] = new Trie;
}
edges[idx].insert(word, suffix[1 .. $]);
}

/**
 * Insert a word into the Trie.
 *
 * Characters that don't map to any digit are ignored in building the Trie.
 * However, the original form of the word will be retained as-is in the
 * leaf node.
 */
void insert(string word)
{
insert(word, word[]);
}

/**
 * Iterate over all words stored in this Trie.
 */
void foreachEntry(void delegate(string path, string word) cb)
{
void impl(Trie node, string path = "")
{
if (node is null) return;
foreach (word; node.words)
{
cb(path, word);
}
foreach (i, child; node.edges)
{
impl(child, path ~ cast(char)('0' + i));
}
}
impl(this);
}
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
if (isInputRange!R && is(ElementType!R : const(char)[]))
{
Trie result = new Trie;
foreach (line; lines)
{
result.insert(line.idup);
}
return result;
}

///
unittest
{
auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

auto app = appender!(string[]);
dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
assert(app.data == [
"10: je",
"105513: jemand",
"107: neu",
"1550: Name",
"253302: Wasser",
"35: da",
"38: so",
"400: Fee",
"4021: fern",
"4034: Fest",
"482: Tor",
"4824: fort",
"4824: Torf",
"51: an",
"562: mir",
"562: Mix",
"56202: Mixer",
"78: Bo\"",
"783: bo\"s",
"7857: blau",
"7884: Boot",
"824: Ort",
"83: o\"d"
]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
if (isOutputRange!(W, string))
{
bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
{
if (node is null)
return false;

// Ignore non-digit characters in phone number
while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
suffix = suffix[1 .. $];

if (suffix.empty)
{
// Found a match, print result
foreach (word; node.words)
{
put(sink, format("%s: %-(%s %)", phoneNumber,
path.chain(only(word))));
}
return !node.words.empty;
}

bool ret;
foreach (word; node.words)
{
// Found a matching word, try to match the rest of the phone
// number.
if (impl(dict, suffix, path ~ word, true))
{
allowDigit = false;
ret = true;
}
}

if (impl(node.edges[suffix[0] - 

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-16 Thread H. S. Teoh via Digitalmars-d-learn
On Mon, Jan 15, 2024 at 08:10:55PM +, Renato via Digitalmars-d-learn wrote:
> On Monday, 15 January 2024 at 01:10:14 UTC, Sergey wrote:
> > On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:
> > > If anyone can find any flaw in my methodology or optimise my code so
> > > that it can still get a couple of times faster, approaching Rust's
> > > performance, I would greatly appreciate that! But for now, my
> > > understanding is that the most promising way to get there would be
> > > to write D in `betterC` style?!
> > 
> > I've added a port from Rust in the PR comment. Can you please check
> > this solution?
> > Most probably it needs to be optimized with a profiler. Just
> > interested in how a close-enough port will perform.
> 
> As discussed on GitHub, the line-by-line port of the Rust code is 5x
> slower than [my latest solution using
> int128](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d),
> which is itself 3 to 4x slower than the Rust implementation (at around
> the same order of magnitude as algorithm-equivalent Java and Common
> Lisp implementations, D is perhaps 15% faster).
> 
> I did the best I could to make D run faster, but we hit a limit that's
> a bit hard to get past now. Happy to be given suggestions (see
> profiling information in previous messages), but I've run out of ideas
> myself.

This problem piqued my interest, so yesterday and today I worked on it
and came up with my own solution (I did not look at existing solutions
in order to prevent bias).  I have not profiled it or anything, but the
runtime seems quite promising.

Here it is:

-snip--
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
digitOf = [
'E': 0,
'J': 1, 'N': 1, 'Q': 1,
'R': 2, 'W': 2, 'X': 2,
'D': 3, 'S': 3, 'Y': 3,
'F': 4, 'T': 4,
'A': 5, 'M': 5,
'C': 6, 'I': 6, 'V': 6,
'B': 7, 'K': 7, 'U': 7,
'L': 8, 'O': 8, 'P': 8,
'G': 9, 'H': 9, 'Z': 9,
];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
Trie[10] edges;
string[] words;

private void insert(string word, string suffix)
{
const(ubyte)* dig;
while (!suffix.empty &&
   (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
{
suffix = suffix[1 .. $];
}

if (suffix.empty)
{
words ~= word;
return;
}

auto idx = *dig;
if (edges[idx] is null)
{
edges[idx] = new Trie;
}
edges[idx].insert(word, suffix[1 .. $]);
}

/**
 * Insert a word into the Trie.
 *
 * Characters that don't map to any digit are ignored in building the Trie.
 * However, the original form of the word will be retained as-is in the
 * leaf node.
 */
void insert(string word)
{
insert(word, word[]);
}

/**
 * Iterate over all words stored in this Trie.
 */
void foreachEntry(void delegate(string path, string word) cb)
{
void impl(Trie node, string path = "")
{
if (node is null) return;
foreach (word; node.words)
{
cb(path, word);
}
foreach (i, child; node.edges)
{
impl(child, path ~ cast(char)('0' + i));
}
}
impl(this);
}
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
if (isInputRange!R && is(ElementType!R : const(char)[]))
{
Trie result = new Trie;
foreach (line; lines)
{
result.insert(line.idup);
}
return result;
}

///
unittest
{
auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

auto app = appender!(string[]);
dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
assert(app.data == [
"10: je",
"105513: jemand",
"107: neu",
"1550: Name",
"253302: Wasser",
"35: da",
"38: so",
"400: Fee",
"4021: fern",
"4034: Fest",
"482: Tor",
"4824: fort",
"4824: Torf",
"51: an",
"562: mir",
"562: Mix",
"56202: Mixer",
"78: Bo\"",
"783: bo\"s",
"7857: blau",
"7884: Boot",
"824: Ort",
"83: o\"d"
]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-15 Thread Renato via Digitalmars-d-learn

On Monday, 15 January 2024 at 01:10:14 UTC, Sergey wrote:

On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:
If anyone can find any flaw in my methodology or optimise my 
code so that it can still get a couple of times faster, 
approaching Rust's performance, I would greatly appreciate 
that! But for now, my understanding is that the most promising 
way to get there would be to write D in `betterC` style?!


I've added a port from Rust in the PR comment. Can you please 
check this solution?
Most probably it needs to be optimized with a profiler. Just 
interested in how a close-enough port will perform.


As discussed on GitHub, the line-by-line port of the Rust code is 
5x slower than [my latest solution using 
int128](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d), which is itself 3 to 4x slower than the Rust implementation (at around the same order of magnitude as algorithm-equivalent Java and Common Lisp implementations, D is perhaps 15% faster).


I did the best I could to make D run faster, but we hit a limit 
that's a bit hard to get past now. Happy to be given suggestions 
(see profiling information in previous messages), but I've run 
out of ideas myself.


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-14 Thread Sergey via Digitalmars-d-learn

On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:
If anyone can find any flaw in my methodology or optimise my 
code so that it can still get a couple of times faster, 
approaching Rust's performance, I would greatly appreciate 
that! But for now, my understanding is that the most promising 
way to get there would be to write D in `betterC` style?!


I've added a port from Rust in the PR comment. Can you please check 
this solution?
Most probably it needs to be optimized with a profiler. Just 
interested in how a close-enough port will perform.


Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-14 Thread Renato via Digitalmars-d-learn

On Sunday, 14 January 2024 at 10:02:58 UTC, Jordan Wilson wrote:

On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
I like to use a phone encoding problem to determine the 
strengths and weaknesses of programming languages, because this 
problem is easy enough that I can write solutions in any 
language in a few hours, but complex enough to exercise lots 
of interesting parts of the language.


[...]


Hello Renato,

This seems to be quite a lot of calls:
```
 Timer frequency unknown, Times are in Megaticks 



  Num          Tree        Func        Per
  Calls        Time        Time        Call

1920496437613756   0 pure nothrow 
ref @trusted immutable(char)[][] 
core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)


1920492489573474   0 @safe void 
dencoder.printTranslations(immutable(char)[][][dencoder.Key], 
dencoder.ISolutionHandler, immutable(char)[], 
immutable(char)[], immutable(char)[][])

```

This is when using the `words-quarter.txt` input (the 
`dictionary.txt` input seems to finish much faster, although 
still slower than `java`/`rust`).


I also used only 100 phone numbers as input.

My final observation is that `words-quarter.txt` contains some 
1-letter inputs (for example, `i` or `m`)... this may result in 
a large number of encoding permutations, which may explain the 
high number of recursion calls?


Jordan


The characteristics of the dictionary impact the number of 
solutions greatly. I explored this in my blog post about [Common 
Lisp, Part 
2](https://renato.athaydes.com/posts/revenge_of_lisp-part-2).


The fact that there's a ridiculous number of function calls is 
why this problem can take minutes, even without having to 
allocate much memory or print anything.


**I've managed to improve the D code enough that it is now 
faster than Common Lisp and the equivalent algorithm in Java.**


It took some profiling to do that, though... thanks to 
@Anonymouse for the suggestion to use Valgrind... with that, I 
was able to profile the code nicely (Valgrind works nicely with 
D, it doesn't even show mangled names!).


Here's what I did.

First: the solution using int128 was much faster than the BigInt 
solution, as I had already mentioned. But when profiling that, it 
was clear that for the problems with a very large number of 
solutions, GC became a problem:


```

23,044,096,944 (100.0%)  PROGRAM TOTALS


Ir  file:function

7,079,878,197 (30.72%)  
???:core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, 
true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
2,375,100,857 (10.31%)  
???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,971,210,820 ( 8.55%)  ???:_aaInX 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,922,961,924 ( 8.34%)  ???:_d_arraysetlengthT 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,298,747,622 ( 5.64%)  ???:core.int128.mul(core.int128.Cent, 
core.int128.Cent) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,134,644,706 ( 4.92%)  
???:core.internal.gc.bits.GCBits.setLocked(ulong) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  849,587,834 ( 3.69%)  
???:core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, 
ref ulong, uint, const(TypeInfo)) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  827,407,988 ( 3.59%)  
./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
  688,845,027 ( 2.99%)  
./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
  575,415,884 ( 2.50%)  
???:_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  562,146,592 ( 2.44%)  
???:core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, 

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-14 Thread Renato via Digitalmars-d-learn

On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:

On Saturday, 13 January 2024 at 19:35:57 UTC, Renato wrote:

On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

[...]
I will have to try it... I thought that `BigInt` was to blame 
for the slowness (from what I could read from the trace logs), 
but after replacing that with basically a byte array key (see 
[commit 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0e9025b9aacdcfef5a2649be4cc82b9bc607fd6c)) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.


It is hard to find the proper version in the repo.
I've checked the Rust from the master branch and it looks a bit 
different from the D implementation.


I explicitly linked to the Rust implementation in my question:

 the [Rust 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/rust/phone_encoder/src/main.rs) uses approx. the same algorithm as used in CL and D, but instead of BigInt, it uses a Vec as key


Can you be more specific about which part of the Rust 
implementation is not roughly equivalent to the D 
implementation? There are obvious differences in style and 
syntax, but I hope that it's mostly equivalent algorithm-wise.


But to clarify: the branch where the fastest solutions are is 
called `fastest-implementations-print-or-count`.


The D branches with my alternative implementations are all called 
`dlang-*`.




I would suggest rewriting it in the same way as the Rust 
implementation. You could probably try:
* do not use BigInt from std, as it could be quite slow; try the 
GMP library from Dub instead


Right, but as I posted above, even using a byte-array just like 
in Rust, the performance was still very bad (but around 2x faster 
than using BigInt).


Also, using GMP is always suggested to me, but I can't accept 
that because my goal is not to use C libraries to achieve the 
fastest speed (which I could do by using C or FFI in any other 
language), it's to use D (and the other languages) to solve my 
problem in an idiomatic way.


I [ended up using 
`Int128`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/98dcbcf1c77d1ded3406cc03af9026e126df5b2d) (the problem description requires more than i64, but int128 is enough), which greatly improved performance over my D baseline AND over the byte-array solution.



* don't do "idup" every time
* instead of byLine, try byLineCopy


`idup` is necessary because the strings are stored in a map 
after the iteration ends. Anyway, I replaced that with 
`byLineCopy` and it became slightly slower.


* instead of "arr ~= data" try to use Appender 
(https://dlang.org/library/std/array/appender.html)


I was worried about `~=` allocating too much, though IIUC it 
shouldn't be a problem the way I used it... in any case, I 
[replaced it with 
`Appender`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/f1364d0897f1f37882f1d39b92c16b84b1abdc31) and the performance was completely unchanged - which is good, as I think it means I used `~=` correctly to avoid overallocation (or I messed up using `Appender` :D - please have a look).
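
For reference, the two appending styles being compared look like 
this (illustrative only; as noted above, they performed the same 
here):

```d
import std.array : appender;

void demo()
{
    // `~=`: druntime grows the array geometrically, so appends are
    // amortized O(1), but each growth goes through a runtime call.
    string[] arr;
    foreach (i; 0 .. 100)
        arr ~= "word";

    // Appender caches capacity bookkeeping locally, which can cut
    // down on those runtime calls when building large arrays.
    auto app = appender!(string[]);
    foreach (i; 0 .. 100)
        app.put("word");
    auto built = app.data;
}
```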


* also you could try to use splitter 
(https://dlang.org/library/std/algorithm/iteration/splitter.html) to lazily process each part of the data


This is not necessary because that would only affect the time 
spent loading the dictionary, which is the faster part of the 
problem... nearly all of the time is spent looking for solutions 
instead.


* the isLastDigit function has many checks, but I think it could 
be implemented more simply, in a Rust way


The Rust solution uses sum types for that, but that had a very 
minor effect on the overall performance (though in Rust, that's 
"cleaner" to use anyway)... I know D does have SumType in the 
stdlib, but I thought it is not as "zero cost" as in Rust, and 
because both variants wrap a String, it's awkward to use that... 
so I instead tried using a struct like this:


```d
struct WordOrDigit {
string value;
bool isDigit;
}
```

You can check [the commit 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0f6b4ce83373c46b14b5bb40c53bb0e2d0905e66).


This change made the code slightly slower.

* also consider using range functions (filter, map), as you do 
in Rust, instead of for loops


Why would for loops be slower? Or are you just saying I should 
make the D code nicer?


Anyway, thanks for the suggestions!


On Sunday, 14 January 2024 at 08:33:24 UTC, Anonymouse wrote:

On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:

I would suggest rewriting it in the same way as the Rust 
implementation. You could probably try:
[...]


I would strongly argue for profiling first instead of 
optimising based on conjecture. If you profile you have solid 
evidence on what is actually slow. 

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-14 Thread Jordan Wilson via Digitalmars-d-learn

On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
I like to use a phone encoding problem to determine the 
strengths and weaknesses of programming languages, because this 
problem is easy enough that I can write solutions in any 
language in a few hours, but complex enough to exercise lots of 
interesting parts of the language.


[...]


Hello Renato,

This seems to be quite a lot of calls:
```
 Timer frequency unknown, Times are in Megaticks 

  Num          Tree        Func        Per
  Calls        Time        Time        Call

1920496437613756   0 pure nothrow ref 
@trusted immutable(char)[][] 
core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)


1920492489573474   0 @safe void 
dencoder.printTranslations(immutable(char)[][][dencoder.Key], 
dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], 
immutable(char)[][])

```

This is when using the `words-quarter.txt` input (the 
`dictionary.txt` input seems to finish much faster, although 
still slower than `java`/`rust`).


I also used only 100 phone numbers as input.

My final observation is that `words-quarter.txt` contains some 
1-letter inputs (for example, `i` or `m`)... this may result in a 
large number of encoding permutations, which may explain the high 
number of recursion calls?


Jordan



Re: Help optimize D solution to phone encoding problem: extremely slow performace.

2024-01-14 Thread Anonymouse via Digitalmars-d-learn

On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:

I would suggest rewriting it in the same way as the Rust 
implementation. You could probably try:
[...]


I would strongly argue for profiling first instead of optimising 
based on conjecture. If you profile you have solid evidence on 
what is actually slow. If you're very good at analysing D, 
well-educated hypotheses *may* be enough, until they suddenly 
aren't and you will have spent a lot of time on the wrong problem.


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-13 Thread Sergey via Digitalmars-d-learn

On Saturday, 13 January 2024 at 19:35:57 UTC, Renato wrote:

On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

[...]
I will have to try it... I thought that `BigInt` was to blame 
for the slowness (from what I could read in the trace logs), 
but after replacing it with what is basically a byte-array key (see 
[commit 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0e9025b9aacdcfef5a2649be4cc82b9bc607fd6c)), it barely improved. It's still much slower than Common Lisp and very, very far behind Java and Rust.


It is hard to find the proper version in the repo.
I've checked the Rust code on the master branch and it looks a bit 
different from the D implementation...


I would suggest rewriting it in the same way as the Rust implementation.
You may want to try the following (see the sketch after this list):
* do not use BigInt from the stdlib - it could be quite slow; try the 
GMP library from Dub instead
* don't do `idup` every time
* instead of byLine, try byLineCopy
* instead of `arr ~= data`, try to use Appender 
(https://dlang.org/library/std/array/appender.html)
* you could also try splitter 
(https://dlang.org/library/std/algorithm/iteration/splitter.html) 
to lazily process each part of the data
* the isLastDigit function has many checks, but I think it could be 
implemented more simply, in a Rust-like way
* also consider using Range functions (filter, map) as you do in 
Rust, instead of for loops
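For example, a minimal sketch of the byLineCopy + splitter + Appender 
combination (the file name and the splitting on spaces are 
placeholders, not the actual dencoder code):

```d
import std.algorithm.iteration : splitter;
import std.array : appender;
import std.stdio : File;

void main()
{
    // Appender avoids the repeated reallocations of `arr ~= data`.
    auto words = appender!(string[]);
    // byLineCopy already yields a fresh string per line, so no idup is needed.
    foreach (line; File("dictionary.txt").byLineCopy)
    {
        // splitter is lazy: no intermediate array of parts is allocated.
        foreach (part; line.splitter(' '))
            words ~= part;
    }
}
```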


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-13 Thread Renato via Digitalmars-d-learn

On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

[...]
Not a great profiling experience :). Does anyone have a better 
suggestion to "parse" the trace file?


As a drive-by suggestion, and I hope it doesn't derail anything: 
if you have the opportunity to run it on Linux, have you 
tried profiling with callgrind instead, with {Q,K}Cachegrind to 
visualise things? Your distribution's repositories probably have 
them. (callgrind is part of valgrind.)


The wiki only mentions callgrind in passing, but it has worked 
well for me. [(example)](https://i.imgur.com/WWZAwy3.png)


Thanks for the suggestion, this looks promising as I do have a 
Linux laptop (just not my main one).


I will have to try it... I thought that `BigInt` was to blame for 
the slowness (from what I could read in the trace logs), but 
after replacing it with what is basically a byte-array key (see [commit 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0e9025b9aacdcfef5a2649be4cc82b9bc607fd6c)), it barely improved. It's still much slower than Common Lisp and very, very far behind Java and Rust.
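To illustrate what I mean by a byte-array key, here's a minimal 
sketch (the names and details are mine, not the exact code from the 
commit):

```d
// The digits themselves become the associative-array key, instead of
// packing them into a BigInt.
alias Key = immutable(ubyte)[];

void main()
{
    string[][Key] dictionary;
    Key k = [5, 6, 2]; // e.g. the phone digits that encode "mir"
    dictionary[k] ~= "mir";
    assert(dictionary[k] == ["mir"]);
}
```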


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-13 Thread Anonymouse via Digitalmars-d-learn

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

[...]
Not a great profiling experience :). Does anyone have a better 
suggestion to "parse" the trace file?


As a drive-by suggestion, and I hope it doesn't derail anything: 
if you have the opportunity to run it on Linux, have you 
tried profiling with callgrind instead, with {Q,K}Cachegrind to 
visualise things? Your distribution's repositories probably have 
them. (callgrind is part of valgrind.)


The wiki only mentions callgrind in passing, but it has worked 
well for me. [(example)](https://i.imgur.com/WWZAwy3.png)
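In case it helps, a typical session looks something like this (the 
binary path and arguments are just whatever your build produces; 
callgrind writes a `callgrind.out.<pid>` file by default):

```
valgrind --tool=callgrind src/d/dencoder [args...]
kcachegrind callgrind.out.<pid>
```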


Re: Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-13 Thread Renato via Digitalmars-d-learn

On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:


I tried to profile the D code but the profiler seems to be 
broken on my OS (Mac):




I profiled it on Linux and stored [the trace.log 
file](https://gist.github.com/renatoathaydes/fd8752ed81b0cf792ed7ef5aa3d68acd) in a public Gist.


I tried using the HTML viewer [recommended in the 
wiki](https://wiki.dlang.org/Development_tools), but that doesn't 
work... first try:


```
Running ../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/d-profile-viewer
Corrupt trace.log (can't compute ticks per second), please re-profile and try again
```

Second try with a longer trace (the one I saved in the gist):

```
Running ../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/d-profile-viewer

std.conv.ConvException@/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d(2533): Unexpected '-' when converting from type char[] to type ulong

??:? [0x55c4be9bc99e]
??:? [0x55c4be9bc612]
??:? [0x55c4be9de81e]
??:? [0x55c4be9c4fbf]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:2533 [0x55c4be98ba2f]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:2002 [0x55c4be98b6fc]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:210 [0x55c4be9665ec]
../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/source/app.d:1095 [0x55c4be965e21]
../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/source/app.d:1138 [0x55c4be96698a]
??:? [0x55c4be9c4c9c]
```

Not a great profiling experience :). Does anyone have a better 
suggestion to "parse" the trace file?




Help optimize D solution to phone encoding problem: extremely slow performance.

2024-01-13 Thread Renato via Digitalmars-d-learn
I like to use a phone encoding problem to determine the 
strengths and weaknesses of programming languages, because this 
problem is easy enough that I can write solutions in any language 
in a few hours, but complex enough to exercise lots of interesting 
parts of the language.


You can check [my initial blog 
post](https://renato.athaydes.com/posts/revisiting-prechelt-paper-comparing-languages) about this, which was an analysis of the original study by Prechelt about the productivity differences between programmers using different languages.


The original problem had a flaw when used on modern computers: the 
programs can find solutions so quickly that most of the time in 
the benchmarks was spent actually printing solutions to stdout. So 
I modified the problem to add an option to either print the 
solutions or just count them - the programs still need to do all 
the work, but are not required to print anything other than how 
many solutions were found.
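In D, the two modes can be modelled as a small handler interface. 
The name below is the one that shows up in the profiler trace 
elsewhere in this thread; its methods are illustrative guesses, not 
the actual API:

```d
// Sketch of the print-or-count design: one implementation prints every
// solution, the other only increments a counter and prints it at the end.
interface ISolutionHandler
{
    void handle(string phoneNumber, string[] words); // called per solution
    void finish();                                   // called once at the end
}
```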


Anyway, I ported the Common Lisp solution to D because, like CL, 
D has built-in data structures like associative arrays and 
`BigInt` (at least it's in the stdlib)... I thought this would 
actually give D an edge! But it turns out D performs very badly 
for larger input sets. It starts quite well on smaller inputs, 
but scales very poorly.


My initial Rust solution also performed very poorly, so I'm 
afraid the same is happening with my initial D solution, despite 
my best effort to write something "decent".


You can find my D solution [in this Pull 
Request](https://github.com/renatoathaydes/prechelt-phone-number-encoding/pull/16).


The solution is almost identical, to the extent possible, to the 
Common Lisp solution, which can be [found 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/lisp/main.lisp).


The secret to high performance for the algorithm being used is 
having a very efficient `BigInt` implementation and a fast hashing 
function for the hash table (an associative array in D, with 
`BigInt` keys). Hence, I suppose D's hash function or `BigInt` 
is not that fast (Rust's default hash is also not great due to 
security concerns, but it's very easy to swap in a custom one that 
is much faster by changing a single line of code).
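To make the keying scheme concrete, here is a minimal sketch of how 
words can be keyed by the digits they encode (the letter groups are 
the ones from the problem statement; the function names and the 
leading-1 trick for keeping keys of different lengths distinct are 
illustrative, not necessarily what the actual solution does):

```d
import std.ascii : toLower;
import std.bigint : BigInt;
import std.string : indexOf;

// Letter groups for digits 0..9, as given in the problem statement.
static immutable string[10] groups = [
    "e", "jnq", "rwx", "dsy", "ft", "am", "civ", "bku", "lop", "ghz"
];

int charToDigit(dchar c)
{
    foreach (digit, group; groups)
        if (group.indexOf(toLower(c)) >= 0)
            return cast(int) digit;
    return -1; // non-letter characters (e.g. '"') encode nothing
}

// Pack a word's digit sequence into a BigInt key.
BigInt wordToKey(string word)
{
    auto key = BigInt(1); // leading 1 keeps shorter keys distinct
    foreach (c; word)
    {
        immutable d = charToDigit(c);
        if (d >= 0)
            key = key * 10 + d;
    }
    return key;
}

void main()
{
    string[][BigInt] dictionary;
    dictionary[wordToKey("mir")] ~= "mir";
    assert(wordToKey("mir") == BigInt(1562)); // m=5, i=6, r=2
}
```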


Anyway, here's the current relative performance. The other 
languages are pretty heavily optimised; the Java solution uses a 
different algorithm, so it's not directly comparable. [The Rust 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/rust/phone_encoder/src/main.rs) uses approximately the same algorithm as the CL and D solutions, but instead of `BigInt` it uses a `Vec` as the key, as that turned out to be faster - I may try that technique in D as well. But notice that even when it was still using Rust's slower BigInt, it was orders of magnitude faster than D.


```
Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,109895680,377
java-Main,179634176,1025
java-Main,167149568,1621
java-Main,180203520,2493
java-Main,96481280,6112
java-Main,95526912,7989
===> sbcl --script src/lisp/main.fasl
sbcl,31780864,74
sbcl,79437824,888
sbcl,79613952,3991
sbcl,80654336,7622
sbcl,80420864,18623
sbcl,83402752,29503
===> ./rust
./rust,23257088,58
./rust,23437312,260
./rust,23433216,1077
./rust,23416832,2017
./rust,7106560,6757
./rust,7110656,10165
===> src/d/dencoder
src/d/dencoder,38748160,223
src/d/dencoder,75800576,3154
src/d/dencoder,75788288,14905
src/d/dencoder,75751424,30271
```

I had to abort D as it was taking too long.

The above is with the `print` option (that's normally slow when 
stdout is not buffered, but I did buffer D's writes and it's 
still very slow).
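For reference, this is roughly what I mean by buffering the writes 
(a sketch, not the actual dencoder code):

```d
import std.array : appender;
import std.stdio : stdout;

void main()
{
    auto buf = appender!string; // accumulate all output in memory
    foreach (i; 0 .. 3)
        buf ~= "a solution line\n"; // stand-in for real solutions
    stdout.write(buf.data); // one buffered write instead of many small ones
}
```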


With the `count` option, which does not print anything except a 
number at the end, D took so long I couldn't even collect its 
results (I waited several minutes)... the other languages 
finished in much less than a minute:


```
Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,124112896,7883
java-Main,107487232,9273
===> sbcl --script src/lisp/main.fasl
sbcl,82669568,25638
sbcl,83759104,33501
===> ./rust
./rust,7061504,9488
./rust,7127040,11441
===> src/d/dencoder
^C
(ldc-1.36.0)
```

I tried to profile the D code but the profiler seems to be broken 
on my OS (Mac):


```
▶ dub build -b profile
Starting Performing "profile" build using /Users/renato/dlang/ldc-1.36.0/bin/ldc2 for x86_64.
Building prechelt ~master: building configuration [application]
 Linking dencoder
(ldc-1.36.0)
prechelt-phone-number-encoding/src/d  dlang ✗  9m ◒

▶ cd ../..
(ldc-1.36.0)
programming/projects/prechelt-phone-number-encoding  dlang ✗  9m ◒

▶ src/d/dencoder
[1]