Re: Help optimize D solution to phone encoding problem: extremely slow performance.
Dne so 20. 1. 2024 21:21 uživatel Renato via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> napsal: > On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote: > > On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn > > < digitalmars-d-learn@puremagic.com> wrote: > > > >> Wow, fantastic feedback from lots of people, thanks so much! > >> ... > >> > >> > evilrat: > >> > There is another important difference, i quickly looked up D > >> > associative array implementation and the search looks like > >> > nlog(n) complexity with plain loop iteration of hashes, > >> > whereas > >> > rust's implementation is using "swisstable" algorithm > >> > implementation that has packed SIMD optimized lookups, this > >> > is > >> > likely where the 3x speed difference comes from. > >> > >> I am not using the default hash implementation in Rust (which > >> is hashbrown as you've found out). That's too slow (because > >> it's ddos secure - still Java's hash also is and it's still > >> faster). I had to replace that with a library called `ahash`. > >> > >> If you're interested in knowing more about that, please [read > >> my blog > >> post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) > about optimising the Rust code. > >> > >> So, no, that's not where the difference is coming from... in > >> fact, I am using a faster hashing function in D. > >> > >> You can [see my custom hashing function > >> here]( > >> > https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3 > ). > >> This function is not good for the general purpose case, but > >> it's probably > >> as good as it gets for this problem (see my int128 trick in a > >> previous post > >> on this thread to understand why). I did try a few variations > >> of hashing > >> before settling on this one... Rust, if anything, is at a > >> disadvantage here. > >> > > > > And here you are wrong, it is the hashing when the slowdown > > comes. 
It is not only about the speed of the hashing function. It > > is about the quality of the hashing function for this particular > > problem and it is about how the hash table (associative array) > > is implemented. > > I've explained why this function is almost a perfect hash > function for this problem (there will be almost no collisions > except for very large inputs where the shift-left will > "overflow", and even then the probability of collisions should be > very low). If you're going to claim I'm wrong, you must show > exactly where I'm wrong, and preferably you should provide a > faster implementation. I will even accept a theoretical > explanation if you can give one. What I will not accept is for > you to call me "wrong" just because you say so. That's childish > behaviour, uncalled for in a technical discussion. > If you provided info that the hash is perfect, then I am sorry; I probably missed that in this thread, my fault. Then I will need to look into this more carefully and do much more than just make assumptions. >
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote: On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: Wow, fantastic feedback from lots of people, thanks so much! ... > evilrat: > There is another important difference, i quickly looked up D > associative array implementation and the search looks like > nlog(n) complexity with plain loop iteration of hashes, > whereas > rust's implementation is using "swisstable" algorithm > implementation that has packed SIMD optimized lookups, this > is > likely where the 3x speed difference comes from. I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code. So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here]( https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). This function is not good for the general purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here. And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented. 
I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions except for very large inputs where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferably you should provide a faster implementation. I will even accept a theoretical explanation if you can give one. What I will not accept is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for in a technical discussion.
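For readers following along, the digit-chaining scheme defended above can be sketched as follows (a hypothetical reconstruction for illustration, not the actual code from the linked commit; the 4-bit shift and the `+1` offset are my assumptions):

```rust
// Hypothetical sketch of a digit-chaining hash: each digit occupies a fixed
// number of bits, so the hash is injective (a perfect hash) for short keys;
// once the shift-left wraps past the word size, collisions become possible,
// as described in the post.
fn chain_hash(hash: u64, digit: u8) -> u64 {
    // 4 bits per digit; +1 keeps a leading 0 digit from being a no-op
    hash.wrapping_shl(4) | (digit as u64 + 1)
}

fn hash_key(digits: &[u8]) -> u64 {
    digits.iter().fold(0, |h, &d| chain_hash(h, d))
}
```

With 4 bits per digit, keys of up to 16 digits are hashed injectively on a `u64`, which matches the claim that collisions only appear once the shift-left "overflows" for very large inputs.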
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: > Wow, fantastic feedback from lots of people, thanks so much! > ... > > > evilrat: > > There is another important difference, i quickly looked up D > > associative array implementation and the search looks like > > nlog(n) complexity with plain loop iteration of hashes, whereas > > rust's implementation is using "swisstable" algorithm > > implementation that has packed SIMD optimized lookups, this is > > likely where the 3x speed difference comes from. > > I am not using the default hash implementation in Rust (which is > hashbrown as you've found out). That's too slow (because it's > ddos secure - still Java's hash also is and it's still faster). I > had to replace that with a library called `ahash`. > > If you're interested in knowing more about that, please [read my > blog > post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) > about optimising the Rust code. > > So, no, that's not where the difference is coming from... in > fact, I am using a faster hashing function in D. > > You can [see my custom hashing function > here]( > https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). > This function is not good for the general purpose case, but it's probably > as good as it gets for this problem (see my int128 trick in a previous post > on this thread to understand why). I did try a few variations of hashing > before settling on this one... Rust, if anything, is at a disadvantage here. > And here you are wrong: it is the hashing where the slowdown comes from. It is not only about the speed of the hashing function. It is about the quality of the hashing function for this particular problem, and about how the hash table (associative array) is implemented.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 20 January 2024 at 13:07:39 UTC, Renato wrote: D overhead with the fastest compiler, LDC, compared with Rust:

```
1.0
1.707627119
1.919527897
1.954595186
1.351342502
1.556217748
```

Oh sorry, I only posted the rates for the Linux benchmark... On Mac M1, for some reason, D was a bit closer to Rust in some of the runs:

```
0.86
1.81944
2.018932874
2.0078125
1.069837447
1.165078285
```
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
Wow, fantastic feedback from lots of people, thanks so much! I will try to answer all points raised in the several answers I've got here since yesterday. On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote: I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together. I looked up Rust's ahash algorithm. Apparently they leverage the CPU's hardware AES instruction to compute a collision-resistant hash very quickly. Somebody should file a bug on druntime to implement this where the hardware supports it, instead of the current hashOf. For relatively small keys this would be a big performance boost. T [This is the commit](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/e757f34c3ddd16240e3515bc91564f2a134659ff) where I added "incremental hashing". I also tried what you're suggesting, but it was much slower. The reason incremental hashing is much faster than computing the hash of the whole key on each iteration is that you're doing a lot less work. To compute the hash of an array fully, you need to do something like this (it's just how hashing works):

```d
size_t currentHash = 0;
foreach (item; array)
{
    // chain the previous value through as the seed for the next element
    currentHash = hashOf(item, currentHash);
}
```

But by the structure of the problem, we know that we keep adding items to the `array` and re-computing the hash (in the `printTranslations` function, which is the hot path... so ignore the dictionary loading for now). What you're suggesting is to do the above for the full array on each iteration... what I did was optimise that so that only the hash of the newly added item is computed per iteration - which is the only work that is required. As Siarhei Siamashka mentioned: you could keep optimising this using heuristics and knowledge about the problem, at which point you might converge to a Trie implementation!
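The saving described above can be sketched like this (hypothetical names and a toy 31-multiplier hash, purely for illustration, not the custom hash from the repo): the caller carries the hash computed so far and extends it by one item, instead of re-folding the whole prefix.

```rust
// Toy incremental hash for illustration. Extending a carried hash by one
// item is O(1) per step; re-folding the whole prefix on every step is O(n)
// per step, O(n^2) overall.
fn extend(hash: u64, item: u8) -> u64 {
    hash.wrapping_mul(31).wrapping_add(item as u64)
}

// What recomputing from scratch looks like: a full fold over the prefix.
fn full_hash(items: &[u8]) -> u64 {
    items.iter().fold(0, |h, &i| extend(h, i))
}
```

Both paths produce identical hashes for every prefix; the incremental one just skips the repeated work.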
But for the purpose of this comparison, I'm happy to stop at a good enough, kind of general purpose algorithm which is comparable in multiple languages... I might compare the D Trie impl with a Rust Trie impl in the future, but for now, I've had enough of that. Hope that makes sense now... try to undo it and I believe it will run around 30% slower. One of the most important things I found is that every call to printTranslations allocates a new array (`auto keyValue = new ubyte[...];`). Given the large number of recursions involved in this algorithm, this will add up to quite a lot. Bingo! I found that before reading your suggestion :) you can see the commit where I changed that [here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/4e8508c5d16a2b623a4673d9c056a68e03b3bd87). This one-line change was the most effective change, making the D impl not only consume a lot less memory, but become quite a bit faster (I will double check later, but I believe this made the code run almost 50% faster in the longest runs)! Secondly, your hash function looks suspicious. Why are you chaining your hash per digit? I've answered that above... but I went further and instead of using the built-in `hashOf` function, I decided to use [my own hashing function](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3) which takes advantage of the input characteristics... I mentioned before the trick I did to make the int128 solution faster (by the way, the int128 solution was not acceptable, as @ssvb pointed out, which is why I had to remove it). This is a very simple hashing function that works well enough for this particular problem, and this also boosted performance noticeably. Then a minor point: I wouldn't use Array in printTranslations. It's overkill for what you need to do; a built-in array would work just fine.
Hm, sorry to break it to you but [changing that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba457d8abc3bccbc17c7f1e) to an Array was the single greatest improvement I had. IIRC it made the code 2x faster. Maybe I was using native arrays wrongly before... but it's unclear to me how I could've made that better (and the docs themselves suggest to use Array if performance matters, which I seem to have confirmed is good advice). Next, what stood out is ISolutionHandler. If I were to write this, I wouldn't use OO for this at all, and especially not interfaces, because they involve a double indirection. As mentioned by Daniel Kozak, this doesn't matter. I tried rewriting it so that there was no indirection to the extent possible (I used delegate fields within a final class so the functions to use are selected at startup, so there's no overhead choosing which functions to call at runtime) and it was actually slightly slower (but too close to call really). My solution with interfaces had two final classes used, so
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote: On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn wrote: [...] > Try addressing the points I wrote above and see if it makes a > difference. I have tried it (all of it) even before you wrote it here, because I have completely the same ideas, but to be fair it has almost zero effect on speed. There is my version (It still use OOP, but I have try it wit Printer and Counter to be structs and it has no effect at all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY The only difference in speed in the end is caused by hash implementation of dlang associative arrays and rust HashMap, actually if you modify rust to not used ahash it has almost same speed as D [...] I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together. Hash which key? The problem of the naive hashtable-based algorithm is that we don't know the length of the key. So all lengths are tried starting from 1 and up to the remaining part of the phone number. An additional optimization would be to limit this search and terminate it early. For example, stop after we exceed the length of the longest word from the dictionary. Or stop the search upon encountering certain "leaf" words, but this effectively morphs the algorithm into something that partially resembles Trie. And further evolving it we may end up with a full-fledged Trie. The only practical value of this algorithm is that it's easy to implement and has low LOC count. My initial non-optimized naive version was also similar https://gist.github.com/ssvb/abe821b3cdba7fcb7f3c3cecde864153 (66 lines including blank lines and some comments). 
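The naive prefix search described above can be sketched like this (a hypothetical simplification: the real problem also allows falling back to a single digit when no word matches, which is omitted here):

```rust
use std::collections::HashMap;

// Since the key length is unknown, try every prefix length of the remaining
// digits against the dictionary and recurse on the tail.
fn count_translations(dict: &HashMap<&str, Vec<&str>>, digits: &str) -> usize {
    if digits.is_empty() {
        return 1; // all digits consumed: one complete translation
    }
    let mut total = 0;
    for len in 1..=digits.len() {
        if let Some(words) = dict.get(&digits[..len]) {
            // every word matching this prefix multiplies the tail's count
            total += words.len() * count_translations(dict, &digits[len..]);
        }
    }
    total
}
```

Capping `len` at the length of the longest dictionary word is exactly the early-termination optimization mentioned above.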
This is somewhat comparable to the Lisp results https://flownet.com/ron/papers/lisp-java/raw-results.html or to Peter Norvig's solution http://www.norvig.com/java-lisp.html The purpose of the initial implementation "prototype" is just to have some sort of a starting point. And if the runtime performance doesn't matter, then we are already done. But if the runtime performance does matter, then a lot of things can be rapidly improved by spending a bit of time on it. Eventually one runs out of ideas and further optimization efforts yield only diminishing returns. That's when it's a good time to clean up the code, add comments, etc. and label it as the "final" version. I think that a good study, one that intends to compare different programming languages against each other, could take all these factors into consideration: time to implement the "prototype", runtime performance of the prototype, time to implement the "final" version, runtime performance of the final version, the number of lines of code and its readability/maintainability.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn wrote: [...] >> Try addressing the points I wrote above and see if it makes a >> difference. > >I have tried it (all of it) even before you wrote it here, because >I have completely the same ideas, but to be fair it has almost zero >effect on speed. >There is my version (It still use OOP, but I have try it wit >Printer and Counter to be structs and it has no effect at >all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY >The only difference in speed in the end is caused by hash >implementation of dlang associative arrays and rust HashMap, >actually if you modify rust to not used ahash it has almost same >speed as D [...] I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together. I looked up Rust's ahash algorithm. Apparently they leverage the CPU's hardware AES instruction to compute a collision-resistant hash very quickly. Somebody should file a bug on druntime to implement this where the hardware supports it, instead of the current hashOf. For relatively small keys this would be a big performance boost. T -- Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Fri, Jan 19, 2024 at 4:44 PM H. S. Teoh via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: > Taking a look at this code: > ... > Try addressing the points I wrote above and see if it makes a > difference. > > I have tried it (all of it) even before you wrote it here, because I had exactly the same ideas, but to be fair it has almost zero effect on speed. Here is my version (it still uses OOP, but I have tried it with Printer and Counter as structs and it has no effect at all): https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY The only difference in speed in the end is caused by the hash implementation of dlang associative arrays versus Rust's HashMap; actually, if you modify the Rust version to not use ahash it has almost the same speed as D
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 17:18:36 UTC, evilrat wrote: On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote: You do hash map lookup for every character in D, it's slow, whereas in Rust you do it via pattern matching, java does the same, pattern matching Yet another reason to advocate for pattern matching in D and switch as expression There is another important difference, i quickly looked up D associative array implementation and the search looks like nlog(n) complexity with plain loop iteration of hashes, whereas rust's implementation is using "swisstable" algorithm implementation that has packed SIMD optimized lookups, this is likely where the 3x speed difference comes from. Tried to look up rust implementation and it is SOOO generic that I was unable to decipher it to find the actual key search and stores. Anyway here is an interesting article about rust implementation https://faultlore.com/blah/hashbrown-tldr/ I'm not talking about the difference between the hashmap implementations, but the difference between the algorithms used to look up the characters: https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/9b1d7f026943841638a2729922cf000b1b3ce655/src/java/Main2.java#L106-L134 vs https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/d/src/dencoder.d#L10-L49 If D had pattern matching and switch as expression, the faster method would be: 1. the most obvious choice 2. the fastest by default 3. the cleanest To avoid having to write an old-school verbose `switch`, I suspect he went with a hashmap, which is slower in this case; hence why I keep advocating for that feature, or I should say, that upgrade to `switch`, which Java has adopted, as well as Rust: https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/rust/phone_encoder/src/main.rs#L146-L168
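For illustration, the match-based lookup being advocated here looks roughly like this in Rust (the standard keypad layout is an assumption for the example; the benchmark itself uses the Prechelt character mapping):

```rust
// Letter-to-digit lookup as a match expression instead of a hash map:
// the compiler can lower this to range checks or a jump table, with no
// hashing and no heap-allocated table. (Keypad layout used for illustration.)
fn char_to_digit(c: char) -> Option<u8> {
    match c.to_ascii_lowercase() {
        'a'..='c' => Some(2),
        'd'..='f' => Some(3),
        'g'..='i' => Some(4),
        'j'..='l' => Some(5),
        'm'..='o' => Some(6),
        'p'..='s' => Some(7),
        't'..='v' => Some(8),
        'w'..='z' => Some(9),
        _ => None,
    }
}
```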
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 08:57:40 UTC, Renato wrote: Do you know why the whole thread seems to have disappeared?? There's a lot of good stuff in the thread, it would be a huge shame to lose all that! I agree! Thanks for posting your benchmarks, I thought your whole benchmarking setup was pretty good, and I learnt a lot from your code and the resulting contributions in this thread and others. Jordan
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote: You do hash map lookup for every character in D, it's slow, whereas in Rust you do it via pattern matching, java does the same, pattern matching Yet another reason to advocate for pattern matching in D and switch as expression There is another important difference, i quickly looked up D associative array implementation and the search looks like nlog(n) complexity with plain loop iteration of hashes, whereas rust's implementation is using "swisstable" algorithm implementation that has packed SIMD optimized lookups, this is likely where the 3x speed difference comes from. Tried to look up rust implementation and it is SOOO generic that I was unable to decipher it to find the actual key search and stores. Anyway here is an interesting article about rust implementation https://faultlore.com/blah/hashbrown-tldr/
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 13:40:39 UTC, Renato wrote: On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote: On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote: I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well. Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC. I tried with DMD again, and yeah, it's much slower. Here's the [current implementation in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). The only "significant" difference is that in Rust, an enum `WordOrDigit` is used to represent currently known "words"... I [did try using that in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower. If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing , please let me know. Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed as it doesn't necessarily "print" anything, like it did earlier) - the rest of the code almost doesn't matter. 
Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```

You do a hash map lookup for every character in D, which is slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching. Yet another reason to advocate for pattern matching in D and switch as expression
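For clarity, the "D overhead" figures quoted in this thread are plain per-run ratios; a minimal sketch of the computation (the function name is mine):

```rust
// Each D wall-clock time divided by the Rust time for the same input size.
fn overhead(rust_ms: &[f64], d_ms: &[f64]) -> Vec<f64> {
    rust_ms.iter().zip(d_ms).map(|(r, d)| d / r).collect()
}
```

For example, the first row of the table above gives 36/31 ≈ 1.1613, the first "overhead" value.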
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Fri, Jan 19, 2024 at 01:40:39PM +0000, Renato via Digitalmars-d-learn wrote: > On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote: [...] > > Additionally if you comparing D by measuring DMD performance - > > don't. It is valuable in developing for fast iterations, but it > > lacks many modern optimization techniques, for that we have LDC and > > GDC. > > I tried with DMD again, and yeah, it's much slower. For anything where performance is even remotely important, I wouldn't even consider DMD. It's a well-known fact that it produces suboptimal executables. Its only redeeming factor is its fast turnaround time. If fast turnaround is not important, I would always use LDC or GDC instead. > Here's the [current implementation in > D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), > and the roughly [equivalent Rust > implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). Taking a look at this code: One of the most important things I found is that every call to printTranslations allocates a new array (`auto keyValue = new ubyte[...];`). Given the large number of recursions involved in this algorithm, this will add up to quite a lot. If I were optimizing this code, I'd look into ways of reducing, if not eliminating, this allocation. Observe that this allocation is needed each time printTranslations recurses, so instead of making separate allocations, you could put it on a stack. Either with alloca, or with my appendPath() trick in my version of the code: preallocate a reasonably large buffer and take slices of it each time you need a new keyValue array. Secondly, your hash function looks suspicious. Why are you chaining your hash per digit? That's a lot of hash computations. Shouldn't you just hash the entire key each time?
That would eliminate the need to store a custom hash in your key; you could just look up the entire key at once. Next, what stood out is ISolutionHandler. If I were to write this, I wouldn't use OO for this at all, and especially not interfaces, because they involve a double indirection. I'd just return a delegate instead (single indirection, no object lookup). This is a relatively small matter, but when it's being used inside a hot inner loop, it could be important. Then a minor point: I wouldn't use Array in printTranslations. It's overkill for what you need to do; a built-in array would work just fine. Take a look at the implementation of Array and you'll see lots of function calls and locks and GC root-adding and all that stuff. Most of it doesn't apply here, of course, and is compiled out. Nevertheless, it uses wrapped integer operations and range checks, etc. Again, these are all minor issues, but in a hot inner loop they do add up. Built-in arrays let you literally just bump the pointer when adding an element. Just a couple of instructions as opposed to several function calls. Important difference when you're on the hot path. Now, as I mentioned earlier w.r.t. my own code, appending to built-in arrays comes with a cost. So here's where you'd optimize by creating your own buffer and custom push/pop operations. Something like appendPath() in my version of the code would do the job. Finally, a very small point: in loadDictionary, you do an AA lookup with `n in result`, and then if that returns null, you create a new entry. This does two AA lookups, once unsuccessfully, and the second time to insert the missing key. You could use the .require operation with a delegate instead of `in` followed by `if (... is null)`, which only requires a single lookup. Probably not an important point, but for a large dictionary this might make a difference.
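In Rust terms, the single-lookup suggestion corresponds to the `entry` API, which plays the same role as D's `require`; a sketch with hypothetical helper names:

```rust
use std::collections::HashMap;

// Check-then-insert: a failed lookup followed by an insert (and another
// lookup to get the slot back) hashes and probes the key multiple times.
fn add_word_twice<'a>(dict: &mut HashMap<String, Vec<&'a str>>, key: &str, word: &'a str) {
    if dict.get(key).is_none() {
        dict.insert(key.to_string(), Vec::new());
    }
    dict.get_mut(key).unwrap().push(word);
}

// Single lookup: `entry` finds or creates the slot in one traversal,
// analogous to D's `require` with a delegate.
fn add_word<'a>(dict: &mut HashMap<String, Vec<&'a str>>, key: &str, word: &'a str) {
    dict.entry(key.to_string()).or_insert_with(Vec::new).push(word);
}
```

Both produce the same dictionary; the `entry` version just avoids re-hashing the key after the failed check.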
> The only "significant" difference is that in Rust, an enum > `WordOrDigit` is used to represent currently known "words"... I [did > try using that in > D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), > but it made the algorithm slower. > > If you see anything in D that's not as efficient as it should be, or > somehow "inferior" to what the Rust version is doing , please let me > know. Couldn't tell you, I don't know Rust. :-D > Notice that almost all of the time is spent in the for-loop inside > `printTranslations` (which is misnamed as it doesn't necessarily > "print" anything, like it did earlier) - the rest of the code almost > doesn't matter. [...] Of course, that's where your hot path is. And that loop makes recursive calls to printTranslations, so the entire body of the function could use some optimization. ;-) Try addressing the points I wrote above and see if it makes a difference. T -- The two rules of success: 1. Don't tell everything you know. -- YHL
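The buffer-reuse idea from the post above can be sketched as follows (the branching factor of 2 and the function names are purely illustrative, not the actual appendPath code):

```rust
// One preallocated Vec serves as the growing key during recursion, and
// truncate() backtracks without any per-call allocation: the buffer's
// capacity is reused across the whole search.
fn recurse(buf: &mut Vec<u8>, depth: usize, out: &mut Vec<usize>) {
    if depth == 0 {
        out.push(buf.len()); // a complete key was built; record its length
        return;
    }
    let mark = buf.len();
    for d in 0..2u8 {
        buf.push(d);
        recurse(buf, depth - 1, out);
        buf.truncate(mark); // backtrack: drop this branch's digit, keep capacity
    }
}
```

One allocation up front, then only pointer bumps and truncations on the hot path.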
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote: On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote: I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well. Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC. I tried with DMD again, and yeah, it's much slower. Here's the [current implementation in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). The only "significant" difference is that in Rust, an enum `WordOrDigit` is used to represent currently known "words"... I [did try using that in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower. If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know. Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed as it doesn't necessarily "print" anything, like it did earlier) - the rest of the code almost doesn't matter.
Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote: On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote: I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well. This is what I would like to be discussing in this thread: why is D running at Java speeds and not at D speeds when using the same algorithm? I know there's small differences in the implementations, they are different languages after all, but not enough IMO to justify anything like 3x difference from Rust. My guess is that's because int128 is not that much optimized due to being not so popular type, though to answer what's wrong would require to look at assembly code produced for both D and Rust. Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC. I am not using int128 anymore. I explained why a few posts back. I am using a byte array and computing the hash incrementally when trying different input, so that partially computed hashes are re-used on each try (this is a bit cheating, as Rust is not doing that, but I consider that to be acceptable as it's still computing hashes and looking up entries in the associative array). I used all D compilers and picked the fastest one (GDC in the case of int128, but LDC2 in the current case).
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote: On Thu, Jan 18, 2024 at 04:23:16PM +, Renato via Digitalmars-d-learn wrote: [...] Ok, last time I'm running this for someone else :D ``` Proc,Run,Memory(bytes),Time(ms) ===> ./rust ./rust,23920640,30 ./rust,24018944,147 ./rust,24068096,592 ./rust,24150016,1187 ./rust,7766016,4972 ./rust,8011776,46101 ===> src/d/dencoder src/d/dencoder,44154880,42 src/d/dencoder,51347456,87 src/d/dencoder,51380224,273 src/d/dencoder,51462144,441 src/d/dencoder,18644992,4414 src/d/dencoder,18710528,43548 ``` OK, this piqued my interest enough that I decided to install rust using rustup instead of my distro's package manager. Here are the numbers I got for my machine: ===> ./rust ./rust,22896640,35 ./rust,22896640,137 ./rust,22384640,542 ./rust,22896640,1034 ./rust,8785920,2489 ./rust,8785920,12157 ===> src/d/dencoder src/d/dencoder,1066799104,36 src/d/dencoder,1066799104,72 src/d/dencoder,1066799104,198 src/d/dencoder,1066799104,344 src/d/dencoder,1035292672,2372 src/d/dencoder,1035292672,13867 Looks like we lost out to Rust for larger inputs. :-D Probably due to environmental factors (and the fact that std.stdio is slow). I re-ran it again and got this: ===> ./rust ./rust,22896640,30 ./rust,22896640,131 ./rust,22896640,511 ./rust,22896640,983 ./rust,8785920,3102 ./rust,8785920,9912 ===> src/d/dencoder src/d/dencoder,1066799104,36 src/d/dencoder,1066799104,71 src/d/dencoder,1066799104,197 src/d/dencoder,1066799104,355 src/d/dencoder,1035292672,3441 src/d/dencoder,1035292672,9471 Notice the significant discrepancy between the two runs; this seems to show that the benchmark is only accurate up to about ±1.5 seconds. That's not correct. The discrepancy is due to the benchmark always generating different input on each run - and the characteristics of the input affects runs significantly. This affects the last two runs much more due to them using the more challenging dictionary. 
The important thing is that the relative performance between languages is reliable. If you stop re-generating the phone numbers and just run the benchmark multiple times using the same input, you'll see it's very close between runs. Anyway, oddly enough, Java seems to beat Rust on larger inputs. Maybe my Java compiler has a better JIT implementation? :-P Again, in case I haven't made it clear by repeating this multiple times: The Java code is using a Trie, the Rust code is using a numeric solution. They are completely different algorithms. Much more important than which language is being used is the algorithm, as has been shown again and again. Congratulations on beating Rust :D but remember: you're using a much more efficient algorithm! I must conclude that the Rust translation of the Trie algorithm would be much faster still, unfortunately (you may have noticed that I am on D's side here!). At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code. Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal. Java has been very fast for over a decade now; this is not new at all. Surprisingly, at least on my machine, Lisp actually performed the worst. I'd have thought it would at least beat Java, but I was quite wrong. :-D Perhaps the Lisp implementation I'm using is suboptimal, I don't know. Or perhaps modern JVMs have really overtaken Lisp. Please don't compare different algorithms in different languages and draw conclusions about each language's speed. Now I'm really curious how a Rust version of the trie algorithm would perform. Unfortunately I don't know Rust, so I wouldn't be able to write it myself.
(Hint, hint, nudge, nudge ;-)). As far as the performance of my D version is concerned, I still haven't squeezed out all the performance I could yet. Going into this, my intention was to take the lazy way of optimizing only what the profiler points out to me, with the slight ulterior motive of proving that a relatively small amount of targeted optimizations can go a long way at making the GC a non-problem in your typical D code. ;-) I haven't pulled out all the optimization guns at my disposal yet. You don't really have to... @ssvb's solution is incredibly fast at the "count" problem and I really don't think anyone can beat that implementation. The only problem is that the implementation is very C-like and nothing like the D I would write. If I were to go the next step, I'd split up the impl() function so that I get a better profile of where it's spending most of its time, and then optimize that. My current suspicion is that the traversal of the trie could be improved by caching intermediate results to eliminate a good proportion of recursive calls in impl().
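Since the thread repeatedly wonders how a Rust trie version would look, here is a minimal, illustrative digit-keyed trie. This is my own sketch, not anyone's posted code, and not a full solution to the encoding problem; the `562` example follows the letter-to-digit table used elsewhere in the thread, under which both "mir" and "Mix" map to 562:

```rust
// A minimal digit-keyed trie: each node has ten edges (one per digit)
// and the list of dictionary words that end at that node.
#[derive(Default)]
struct Trie {
    edges: [Option<Box<Trie>>; 10],
    words: Vec<String>,
}

impl Trie {
    fn insert(&mut self, digits: &[u8], word: &str) {
        match digits {
            [] => self.words.push(word.to_string()),
            [d, rest @ ..] => self.edges[(d - b'0') as usize]
                .get_or_insert_with(|| Box::new(Trie::default()))
                .insert(rest, word),
        }
    }

    fn lookup(&self, digits: &[u8]) -> &[String] {
        match digits {
            [] => &self.words,
            [d, rest @ ..] => match &self.edges[(d - b'0') as usize] {
                Some(child) => child.lookup(rest),
                None => &[],
            },
        }
    }
}

fn main() {
    let mut trie = Trie::default();
    // Under the problem's letter->digit table, "mir" and "Mix" are both 562.
    trie.insert(b"562", "mir");
    trie.insert(b"562", "Mix");
    assert_eq!(trie.lookup(b"562").join(","), "mir,Mix");
    assert!(trie.lookup(b"10").is_empty());
    println!("lookup(562) = {:?}", trie.lookup(b"562"));
}
```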
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Thu, Jan 18, 2024 at 04:23:16PM +, Renato via Digitalmars-d-learn wrote: [...] > Ok, last time I'm running this for someone else :D > > ``` > Proc,Run,Memory(bytes),Time(ms) > ===> ./rust > ./rust,23920640,30 > ./rust,24018944,147 > ./rust,24068096,592 > ./rust,24150016,1187 > ./rust,7766016,4972 > ./rust,8011776,46101 > ===> src/d/dencoder > src/d/dencoder,44154880,42 > src/d/dencoder,51347456,87 > src/d/dencoder,51380224,273 > src/d/dencoder,51462144,441 > src/d/dencoder,18644992,4414 > src/d/dencoder,18710528,43548 > ``` OK, this piqued my interest enough that I decided to install rust using rustup instead of my distro's package manager. Here are the numbers I got for my machine: ===> ./rust ./rust,22896640,35 ./rust,22896640,137 ./rust,22384640,542 ./rust,22896640,1034 ./rust,8785920,2489 ./rust,8785920,12157 ===> src/d/dencoder src/d/dencoder,1066799104,36 src/d/dencoder,1066799104,72 src/d/dencoder,1066799104,198 src/d/dencoder,1066799104,344 src/d/dencoder,1035292672,2372 src/d/dencoder,1035292672,13867 Looks like we lost out to Rust for larger inputs. :-D Probably due to environmental factors (and the fact that std.stdio is slow). I re-ran it again and got this: ===> ./rust ./rust,22896640,30 ./rust,22896640,131 ./rust,22896640,511 ./rust,22896640,983 ./rust,8785920,3102 ./rust,8785920,9912 ===> src/d/dencoder src/d/dencoder,1066799104,36 src/d/dencoder,1066799104,71 src/d/dencoder,1066799104,197 src/d/dencoder,1066799104,355 src/d/dencoder,1035292672,3441 src/d/dencoder,1035292672,9471 Notice the significant discrepancy between the two runs; this seems to show that the benchmark is only accurate up to about ±1.5 seconds. Anyway, oddly enough, Java seems to beat Rust on larger inputs. Maybe my Java compiler has a better JIT implementation? :-P > Congratulations on beating Rust :D but remember: you're using a much > more efficient algorithm! 
I must conclude that the Rust translation of > the Trie algorithm would be much faster still, unfortunately (you may > have noticed that I am on D's side here!). At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code. Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal. Surprisingly, at least on my machine, Lisp actually performed the worst. I'd have thought it would at least beat Java, but I was quite wrong. :-D Perhaps the Lisp implementation I'm using is suboptimal, I don't know. Or perhaps modern JVMs have really overtaken Lisp. Now I'm really curious how a Rust version of the trie algorithm would perform. Unfortunately I don't know Rust so I wouldn't be able to write it myself. (Hint, hint, nudge, nudge ;-)). As far as the performance of my D version is concerned, I still haven't squeezed out all the performance I could yet. Going into this, my intention was to take the lazy way of optimizing only what the profiler points out to me, with the slight ulterior motive of proving that a relatively small amount of targeted optimizations can go a long way at making the GC a non-problem in your typical D code. ;-) I haven't pulled out all the optimization guns at my disposal yet. If I were to go the next step, I'd split up the impl() function so that I get a better profile of where it's spending most of its time, and then optimize that. My current suspicion is that the traversal of the trie could be improved by caching intermediate results to eliminate a good proportion of recursive calls in impl(). Also, the `print` mode of operation is quite slow, probably because writefln() allocates. 
(It allocates less than if I had used .format like I did before, but it nevertheless still allocates.) To alleviate this cost, I'd allocate an output buffer and write to that, flushing only once it filled up. Another thing I could do is to use std.parallelism.parallel to run searches on batches of phone numbers in parallel. This is kinda cheating, though, since it's the same algorithm with the same cost, we're just putting more CPU cores to work. :-P But in D this is quite easy to do, often as easy as simply adding .parallel to your outer foreach loop. In this particular case it will need some additional refactoring due to the fact that the input is being read line by line. But it's relatively easy to load the input into a buffer by chunks instead, and just run the searches on all the numbers found in the buffer in parallel. On Thu, Jan 18, 2024 at 04:25:45PM +, Renato via Digitalmars-d-learn wrote: [...] > BTW here's you main function so it can run on the benchmark: [...] Thanks, I've adapted my code accordingly and pushed to my github repo. T -- This is a tpyo.
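The chunked-parallel idea at the end of the post above can be sketched with Rust's scoped threads. This is an illustration of the batching pattern only; `encode_one` is a made-up stand-in for the actual dictionary search:

```rust
use std::thread;

// Stand-in for the real per-number search: count the digits.
fn encode_one(phone: &str) -> usize {
    phone.chars().filter(|c| c.is_ascii_digit()).count()
}

fn main() {
    // Pretend these were loaded from the input file in one buffered read.
    let phones: Vec<String> = (0..1000).map(|i| format!("{:07}", i)).collect();
    let workers = 4;
    let chunk = (phones.len() + workers - 1) / workers;

    // Split the buffer into batches and run each batch on its own thread.
    let total: usize = thread::scope(|s| {
        phones
            .chunks(chunk)
            .map(|batch| s.spawn(move || batch.iter().map(|p| encode_one(p)).sum::<usize>()))
            .collect::<Vec<_>>() // spawn all workers first, then join
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum()
    });

    assert_eq!(total, 7000); // every generated number has 7 digits
    println!("processed {} phone numbers, {} digits total", phones.len(), total);
}
```

Same algorithm, same total cost, as the post says; the batches just run on more cores.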
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 16:54:00 UTC, H. S. Teoh wrote: On Wed, Jan 17, 2024 at 07:57:02AM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...] I'll push the code to github. [...] Here: https://github.com/quickfur/prechelt/blob/master/encode_phone.d T BTW here's your main function so it can run on the benchmark:

```d
int main(string[] args) {
    bool countOnly = args.length > 1 ? (() {
        final switch (args[1]) {
        case "count": return true;
        case "print": return false;
        }
    })() : false;
    auto dictfile = args.length > 2 ? args[2] : "tests/words.txt";
    auto input = args.length > 3 ? args[3] : "tests/numbers.txt";
    Trie dict = loadDictionary(File(dictfile).byLine);
    if (countOnly) {
        size_t count;
        encodePhoneNumbers(File(input).byLine, dict, (phone, match) { count++; });
        writefln("%d", count);
    } else {
        encodePhoneNumbers(File(input).byLine, dict, (phone, match) {
            writefln("%s: %s", phone, match);
        });
    }
    return 0;
}
```
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 16:54:00 UTC, H. S. Teoh wrote: On Wed, Jan 17, 2024 at 07:57:02AM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...] I'll push the code to github. [...] Here: https://github.com/quickfur/prechelt/blob/master/encode_phone.d T Ok, last time I'm running this for someone else :D ``` Proc,Run,Memory(bytes),Time(ms) ===> ./rust ./rust,23920640,30 ./rust,24018944,147 ./rust,24068096,592 ./rust,24150016,1187 ./rust,7766016,4972 ./rust,8011776,46101 ===> src/d/dencoder src/d/dencoder,44154880,42 src/d/dencoder,51347456,87 src/d/dencoder,51380224,273 src/d/dencoder,51462144,441 src/d/dencoder,18644992,4414 src/d/dencoder,18710528,43548 ``` Congratulations on beating Rust :D but remember: you're using a much more efficient algorithm! I must conclude that the Rust translation of the Trie algorithm would be much faster still, unfortunately (you may have noticed that I am on D's side here!).
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 16:30:08 UTC, H. S. Teoh wrote: On Wed, Jan 17, 2024 at 07:19:39AM +, Renato via Digitalmars-d-learn wrote: [...] But pls run the benchmarks yourself as I am not going to keep running it for you, and would be nice if you posted your solution on a Gist for example, pasting lots of code in the forum makes it difficult to follow. I can't. I spent half an hour trying to get ./benchmark.sh to run, but no matter what it could not compile benchmark_runner. It complains that my rustc is too old and some dependencies do not support it. I tried running the suggested cargo update command to pin the versions but none of them worked. Since I'm not a Rust user, I'm not feeling particularly motivated right now to spend any more time on this. Upgrading my rustc isn't really an option because that's the version currently in my distro and I really don't feel like spending more time to install a custom version of rustc just for this benchmark. T I've just updated the Rust version so that the benchmark monitor could work on Linux (it only worked on Mac before) :D that's probably why your rustc didn't work, though as the project is still using edition2018 I would've thought even a very old compiler would have worked... anyway, if you ever find yourself actually using Rust, you should use `rustup` (https://rustup.rs/) which makes it trivial to update Rust. About the "count" option: I had assumed you didn't call format on the count option as it's never needed; there's nothing to print.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wed, Jan 17, 2024 at 07:57:02AM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...] > I'll push the code to github. [...] Here: https://github.com/quickfur/prechelt/blob/master/encode_phone.d T -- Why do conspiracy theories always come from the same people??
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wed, Jan 17, 2024 at 07:19:39AM +, Renato via Digitalmars-d-learn wrote: [...] > But pls run the benchmarks yourself as I am not going to keep running > it for you, and would be nice if you posted your solution on a Gist > for example, pasting lots of code in the forum makes it difficult to > follow. I can't. I spent half an hour trying to get ./benchmark.sh to run, but no matter what it could not compile benchmark_runner. It complains that my rustc is too old and some dependencies do not support it. I tried running the suggested cargo update command to pin the versions but none of them worked. Since I'm not a Rust user, I'm not feeling particularly motivated right now to spend any more time on this. Upgrading my rustc isn't really an option because that's the version currently in my distro and I really don't feel like spending more time to install a custom version of rustc just for this benchmark. T -- Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wed, Jan 17, 2024 at 07:19:39AM +, Renato via Digitalmars-d-learn wrote: > On Tuesday, 16 January 2024 at 22:13:55 UTC, H. S. Teoh wrote: > > used for the recursive calls. Getting rid of the .format ought to > > speed it up a bit. Will try that now... > > > > That will make no difference for the `count` option which is where > your solution was very slow. Of course it will. Passing the data directly to the callback that bumps a counter is faster than allocating a new string, formatting the data, and then passing it to the callback that bumps a counter. It may not look like much, but avoiding unnecessary GC allocations means the GC will have less work to do later when a collection is run, thus you save time over the long term. > To run the slow test manually use the `words_quarter.txt` dictionary > (the phone numbers file doesn't matter much - it's all in the > dictionary). > > But pls run the benchmarks yourself as I am not going to keep running > it for you, and would be nice if you posted your solution on a Gist > for example, pasting lots of code in the forum makes it difficult to > follow. I'll push the code to github. T -- "No, John. I want formats that are actually useful, rather than over-featured megaliths that address all questions by piling on ridiculous internal links in forms which are hideously over-complex." -- Simon St. Laurent on xml-dev
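The delegate approach described above (pass the raw pieces to a callback, and let the caller decide whether to count or to format) can be sketched with Rust closures standing in for D delegates. The function and data here are made up for illustration; the real search would sit inside `encode_phone_numbers`:

```rust
// The search calls back with the raw pieces; the caller decides what to
// do with them, so the `count` mode never builds an output string at all.
fn encode_phone_numbers(phones: &[&str], mut on_match: impl FnMut(&str, &str)) {
    for &phone in phones {
        // Stand-in for the real search: pretend each number has one match.
        on_match(phone, "an 0 Boot");
    }
}

fn main() {
    let phones = ["562482", "107835"];

    // `count` mode: no formatting, no allocation, just bump a counter.
    let mut count = 0usize;
    encode_phone_numbers(&phones, |_, _| count += 1);
    assert_eq!(count, 2);

    // `print` mode: format only when output is actually wanted.
    encode_phone_numbers(&phones, |phone, m| println!("{}: {}", phone, m));
}
```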
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 11:56:19 UTC, evilrat wrote: On Wednesday, 17 January 2024 at 11:20:14 UTC, Renato wrote: That means the input file is still not ASCII (or UTF-8) as it should be. Java is reading files with the ASCII encoding so it should've worked fine. It seems that it only works with ASCII encoding, though. Yes, that's according to the rules - only ASCII for everything.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 11:20:14 UTC, Renato wrote: That means the input file is still not ASCII (or UTF-8) as it should be. Java is reading files with the ASCII encoding so it should've worked fine. It seems that it only works with ASCII encoding, though.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 10:50:26 UTC, evilrat wrote: On Wednesday, 17 January 2024 at 10:43:22 UTC, Renato wrote: On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote: It's not Java writing the file, it's the bash script [`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31): ``` java -cp "build/util" util.GeneratePhoneNumbers 1000 > phones_1000.txt ``` Perhaps using this option when running Java will help: ``` java -DFile.Encoding=UTF-8 ... ``` I've used powershell env var to set output to utf8, D version now works but java doesn't. ``` java -Xms20M -Xmx100M -cp build/java Main print words-quarter.txt phones_1000.txt Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 65485 out of bounds for length 10 at Trie.completeSolution(Main.java:216) at Trie.forEachSolution(Main.java:192) at PhoneNumberEncoder.encode(Main.java:132) at Main.lambda$main$1(Main.java:38) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) at Main.main(Main.java:38) ``` This is this line: ``` var digit = chars[ index ] 
- 48; ``` That means the input file is still not ASCII (or UTF-8) as it should. Java is reading files with the ASCII encoding so it should've worked fine.
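One detail worth noting: the out-of-bounds index 65485 in the stack trace above is exactly what `chars[index] - 48` produces when the decoder has substituted the Unicode replacement character U+FFFD (65533) for bytes it could not decode, which is consistent with the UTF-16 input being read with the wrong charset. A quick check of the arithmetic (in Rust, for illustration):

```rust
fn main() {
    // A decoder that hits undecodable bytes emits U+FFFD (the replacement
    // character). The Java code's `chars[index] - 48` then yields a huge
    // "digit" that is used directly as an array index.
    let replacement = '\u{FFFD}' as i32;
    let digit = replacement - 48;
    assert_eq!(digit, 65485); // matches the ArrayIndexOutOfBoundsException index
    println!("replacement char as 'digit': {}", digit);
}
```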
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 10:43:22 UTC, Renato wrote: On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote: It's not Java writing the file, it's the bash script [`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31): ``` java -cp "build/util" util.GeneratePhoneNumbers 1000 > phones_1000.txt ``` Perhaps using this option when running Java will help: ``` java -DFile.Encoding=UTF-8 ... ``` I've used powershell env var to set output to utf8, D version now works but java doesn't. ``` java -Xms20M -Xmx100M -cp build/java Main print words-quarter.txt phones_1000.txt Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 65485 out of bounds for length 10 at Trie.completeSolution(Main.java:216) at Trie.forEachSolution(Main.java:192) at PhoneNumberEncoder.encode(Main.java:132) at Main.lambda$main$1(Main.java:38) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) at Main.main(Main.java:38) ```
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote: It's not Java writing the file, it's the bash script [`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31): ``` java -cp "build/util" util.GeneratePhoneNumbers 1000 > phones_1000.txt ``` Perhaps using this option when running Java will help: ``` java -DFile.Encoding=UTF-8 ... ```
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 09:15:02 UTC, evilrat wrote: On Wednesday, 17 January 2024 at 07:11:02 UTC, Renato wrote: If you want to check your performance, you know you can run the `./benchmark.sh` yourself? Out of curiosity I've tried to manually run this on Windows and it seems that Java generator for these numbers files is "broken", the resulting count or print runs fine for both Java and D versions provided in your D branch, but fails with generated files. D version complains about bad utf8 encoding. I've opened the generated file in text editor and it is UTF-16 (little-endian with BOM). Tried with adoptium jdk 17 and 21 (former openjdk), but I guess it doesn't matter since UTF-16 is default on Windows. It's not Java writing the file, it's the bash script [`benchmark.sh`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/master/benchmark.sh#L31): ``` java -cp "build/util" util.GeneratePhoneNumbers 1000 > phones_1000.txt ``` Java is just printing to stdout. I wasn't aware that piping like this would use the OS default encoding. Unfortunately I don't have Windows here, but try to change the encoding used by the bash script maybe?
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 07:06:25 UTC, Renato wrote: On Tuesday, 16 January 2024 at 22:15:04 UTC, Siarhei Siamashka wrote: On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote: It's a GC allocations fest. Things like this make it slow:

```diff
-string digit = [digits[0]];
+string digit = digits[0 .. 1];
 words.insertBack(digit);
```

I was under the impression that `[digits[0]]` would just use a stack allocation?? The profiler does not show any GC anymore, are you sure it's a "GC allocations fest"??? nah, you are allocating a new array out of a single digit, while the latter just takes a slice. there is a 'scope' storage specifier for when you know your variable won't escape the scope, to place it on the stack (works for classes too), but I'm not sure if it will work with an array. `scope string digit = [digits[0]];`
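The distinction evilrat describes (a fresh one-element array versus a slice of the existing one) has a direct Rust analogue; a small sketch, illustrative rather than taken from the project:

```rust
fn main() {
    let digits = String::from("562482");

    // Analogous to `digits[0 .. 1]` in the D diff: slicing borrows the
    // existing bytes, so no allocation happens.
    let slice: &str = &digits[0..1];

    // Analogous to `[digits[0]]`: building a new one-element value
    // allocates, the way the D literal allocates a new array on the GC heap.
    let owned: String = digits[0..1].to_string();

    assert_eq!(slice, "5");
    assert_eq!(owned, "5");
    // The slice points into `digits`' own buffer; the new String does not.
    assert_eq!(slice.as_ptr(), digits.as_ptr());
    assert_ne!(owned.as_ptr(), digits.as_ptr());
    println!("slice and copy both equal {:?}", slice);
}
```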
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Wednesday, 17 January 2024 at 07:11:02 UTC, Renato wrote: If you want to check your performance, you know you can run the `./benchmark.sh` yourself? Out of curiosity I've tried to manually run this on Windows and it seems that Java generator for these numbers files is "broken", the resulting count or print runs fine for both Java and D versions provided in your D branch, but fails with generated files. D version complains about bad utf8 encoding. I've opened the generated file in text editor and it is UTF-16 (little-endian with BOM). Tried with adoptium jdk 17 and 21 (former openjdk), but I guess it doesn't matter since UTF-16 is default on Windows.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 22:13:55 UTC, H. S. Teoh wrote: used for the recursive calls. Getting rid of the .format ought to speed it up a bit. Will try that now... That will make no difference for the `count` option which is where your solution was very slow. To run the slow test manually use the `words_quarter.txt` dictionary (the phone numbers file doesn't matter much - it's all in the dictionary). But pls run the benchmarks yourself as I am not going to keep running it for you, and would be nice if you posted your solution on a Gist for example, pasting lots of code in the forum makes it difficult to follow.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 22:15:04 UTC, Siarhei Siamashka wrote: On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote: For the record (I already posted this on GitHub)... here's [my current fastest solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d) time using the same algorithm as Rust ... [...] ... what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"... It's a GC allocations fest. Things like this make it slow: ```diff { -string digit = [digits[0]]; +string digit = digits[0 .. 1]; words.insertBack(digit); ``` I was under the impression that `[digits[0]]` would just use a stack allocation?? The profiler does not show any GC anymore, are you sure it's a "GC allocations fest"??? And at the top is the associative array lookup (when profiling the handling of the "phones_1_000_000_with_empty.txt" input file): ``` 36.85% dencoder dencoder [.] _aaInX 12.38% dencoder dencoder [.] void Well, I know, but I did everything to make the hash "cheap" to compute so I still don't see a way to improve it.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tue, Jan 16, 2024 at 10:15:04PM +, Siarhei Siamashka via Digitalmars-d-learn wrote: > On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote: [...] > > ... what I am really curious about is what the code I wrote is doing > > wrong that causes it to run 4x slower than Rust despite doing "the > > same thing"... > > It's a GC allocations fest. Indeed. I have just completed 2 rounds of optimizations of my version of the code, and both times the profiler also showed the problem to be excessive allocations in the inner loop. So, I did the following optimizations: 1) Get rid of .format in the inner loop. Not only does .format cause a lot of allocations, it is also a known performance hog. So instead of constructing the output string in the search function, I changed it to take a delegate instead, and the delegate either counts the result or prints it directly (bypassing the construction of an intermediate string). This improved performance quite a bit for the count-only runs, but also wins some performance even when output is generated. Overall, however, this optimization only gave me some minor savings. 2) Changed the `path` parameter from string[] to string, since I didn't really need it to be an array of strings anyway. This in itself only improved performance marginally, barely noticeable, but it led to (3), which gave a huge performance boost. 3) Basically, in the earlier version of the code, the `path` parameter was appended to every time I recursed, and furthermore the same initial segment gets appended to many times with different trailers as the algorithm walks the trie. As a result, this triggers a lot of array reallocations to store the new strings. Most of these allocations are unnecessary, because we already know that the initial segment of the string will stay constant, only the tail end changes. Furthermore, we only ever have a single instance of .path at any point in time in the algorithm. 
So we could use a single buffer to hold all of these instances of .path, and simply return slices to it as we go along, overwriting the tail end each time we need to append something. This significantly cut down on the number of allocations, and along with (1) and (2), performance improved by about 3x (!). It didn't completely remove all allocations, but I'm reasonably happy with the performance now that I probably won't try to optimize it more unless it's still losing out to another language. ;-) (I'm especially curious to see if this beats the Rust version. :-P) Optimized version of the code: ---snip /** * Encoding phone numbers according to a dictionary. */ import std; /** * Table of digit mappings. */ static immutable ubyte[dchar] digitOf; shared static this() { digitOf = [ 'E': 0, 'J': 1, 'N': 1, 'Q': 1, 'R': 2, 'W': 2, 'X': 2, 'D': 3, 'S': 3, 'Y': 3, 'F': 4, 'T': 4, 'A': 5, 'M': 5, 'C': 6, 'I': 6, 'V': 6, 'B': 7, 'K': 7, 'U': 7, 'L': 8, 'O': 8, 'P': 8, 'G': 9, 'H': 9, 'Z': 9, ]; } /** * Trie for storing dictionary words according to the phone number mapping. */ class Trie { Trie[10] edges; string[] words; private void insert(string word, string suffix) { const(ubyte)* dig; while (!suffix.empty && (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null) { suffix = suffix[1 .. $]; } if (suffix.empty) { words ~= word; return; } auto node = new Trie; auto idx = *dig; if (edges[idx] is null) { edges[idx] = new Trie; } edges[idx].insert(word, suffix[1 .. $]); } /** * Insert a word into the Trie. * * Characters that don't map to any digit are ignored in building the Trie. * However, the original form of the word will be retained as-is in the * leaf node. */ void insert(string word) { insert(word, word[]); } /** * Iterate over all words stored in this Trie. 
*/ void foreachEntry(void delegate(string path, string word) cb) { void impl(Trie node, string path = "") { if (node is null) return; foreach (word; node.words) { cb(path, word); } foreach (i, child; node.edges) { impl(child, path ~ cast(char)('0' + i)); } } impl(this); } } /** * Loads the given dictionary into a Trie. */ Trie loadDictionary(R)(R lines) if (isInputRange!R & is(ElementType!R : const(char)[])) { Trie result = new Trie; foreach (line; lines) { result.insert(line.idup); } return result; } /// unittest { auto dict = loadDictionary(q"ENDDICT an blau Bo" Boot bo"s da Fee fern Fest fort je jemand mir Mix Mixer
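Optimization (3) above, one shared buffer whose tail is overwritten in place while the prefix stays put, can be sketched like this (in Rust, as an illustration; the binary `walk` is a made-up stand-in for the trie traversal):

```rust
// All versions of `path` live in one shared buffer. Each recursion step
// appends to the buffer, recurses, then truncates back on the way out,
// so the common prefix is never copied into a fresh allocation.
fn walk(depth: usize, path: &mut Vec<u8>, out: &mut Vec<String>) {
    if depth == 0 {
        out.push(String::from_utf8(path.clone()).unwrap());
        return;
    }
    for d in b'0'..=b'1' {
        let saved = path.len();
        path.push(d); // overwrite the tail in place...
        walk(depth - 1, path, out);
        path.truncate(saved); // ...and roll it back for the next branch
    }
}

fn main() {
    let mut path = Vec::new();
    let mut out = Vec::new();
    walk(3, &mut path, &mut out);
    // 2^3 "paths", all generated through a single growing/shrinking buffer.
    assert_eq!(out.len(), 8);
    assert_eq!(out[0], "000");
    assert_eq!(out[7], "111");
    println!("{:?}", out);
}
```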
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

> For the record (I already posted this on GitHub)... here's [my current fastest solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d) time using the same algorithm as Rust ... [...] ... what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"...

It's a GC allocations fest. Things like this make it slow:

```diff
 {
-    string digit = [digits[0]];
+    string digit = digits[0 .. 1];
     words.insertBack(digit);
```

And at the top is the associative array lookup (when profiling the handling of the "phones_1_000_000_with_empty.txt" input file):

```
36.85%  dencoder  dencoder  [.] _aaInX
12.38%  dencoder  dencoder  [.] void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)
 6.43%  dencoder  dencoder  [.] nothrow @nogc scope void core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange)
 4.53%  dencoder  dencoder  [.] pure @safe void std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult.popFront()
 4.08%  dencoder  dencoder  [.] _d_array_slice_copy
 2.43%  dencoder  dencoder  [.] _d_newarrayU
 2.21%  dencoder  dencoder  [.] shared nothrow @nogc @trusted void core.internal.spinlock.SpinLock.lock()
 2.00%  dencoder  dencoder  [.] pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(dchar).put(dchar)
 1.91%  dencoder  dencoder  [.] nothrow void* core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo))
 1.67%  dencoder  dencoder  [.] _DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_
 1.66%  dencoder  dencoder  [.] pure nothrow @nogc scope @trusted ulong core.internal.gc.bits.GCBits.setLocked(ulong)
 1.60%  dencoder  dencoder  [.] const pure nothrow @trusted ulong object.TypeInfo_Struct.getHash(scope const(void*))
 1.53%  dencoder  dencoder  [.] nothrow @safe void core.internal.util.array.enforceRawArraysConformable(const(char[]), const(ulong), const(void[]), const(void[]), const(bool))
 1.49%  dencoder  dencoder  [.] nothrow void* core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo))
 1.27%  dencoder  dencoder  [.] pure nothrow @safe void std.array.Appender!(immutable(char)[]).Appender.ensureAddable(ulong)
 0.81%  dencoder  dencoder  [.] pure @property @safe dchar std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult.front()
 0.79%  dencoder  dencoder  [.] nothrow @safe void core.internal.util.array._enforceNoOverlap(const(char[]), ulong, ulong, const(ulong))
 0.74%  dencoder  dencoder  [.] nothrow void core.internal.gc.impl.conservative.gc.Pool.setBits(ulong, uint)
 0.73%  dencoder  dencoder  [.] pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult).put(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult)
 0.70%  dencoder  dencoder  [.] pure @safe ulong std.range.primitives.walkLength!(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult).walkLength(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult)
 0.60%  dencoder  dencoder  [.] bool dencoder.lastItemIsDigit(std.container.array.Array!(immutable(char)[]).Array)
 0.54%  dencoder  dencoder  [.] _d_newarrayT
[...]
```
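The diff above swaps a fresh one-character array allocation for a slice over data that already exists. The same cost model can be demonstrated in Rust (used here because the thread benchmarks against it): borrowing a sub-slice is free, while materializing a new owned string allocates. A minimal illustration, not taken from either implementation:

```rust
fn main() {
    let digits = "4824";

    // Allocating: builds a new one-character String on the heap each time,
    // analogous to D's `[digits[0]]` array literal in the diff above.
    let allocated: String = digits[0..1].to_string();

    // Borrowing: just a pointer + length into the existing buffer,
    // analogous to D's `digits[0 .. 1]` slice -- no allocation at all.
    let borrowed: &str = &digits[0..1];

    assert_eq!(allocated, "4");
    assert_eq!(borrowed, "4");
    println!("ok");
}
```

In a hot recursive search, the difference between these two forms is exactly the kind of per-call garbage that shows up as GC time in the profile above.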
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tue, Jan 16, 2024 at 09:15:19PM +0000, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 20:34:48 UTC, H. S. Teoh wrote:
> > On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via
> > Digitalmars-d-learn wrote: [...]
> > > Anyway, I've fixed the problem, now my program produces the exact
> > > same output as Renato's repo. Code is posted below.
> > [...]
>
> Great, I ran the benchmarks for you :)
> I had to change how you accept arguments; even though you did "the
> right thing" using `getopt`, the solutions should just take a `count`
> or `print` argument first...

Oops, haha :-P

> Anyway, here's your result:
>
> ```
> ===> ./rust
> ./rust,24133632,25
> ./rust,24739840,130
> ./rust,24477696,536
> ./rust,25247744,1064
> ./rust,8175616,6148
> ./rust,8306688,8315
> ===> src/d/dencoder
> src/d/dencoder,46055424,43
> src/d/dencoder,96337920,146
> src/d/dencoder,102350848,542
> src/d/dencoder,102268928,1032
> src/d/dencoder,40206336,99936
> ^C
> ```
>
> It took too long with the `count` option, so I had to abort before the
> last run ended... there's probably some bug there; otherwise the Trie
> runs very fast, as I had expected.
[...]

Do you have the problematic data file handy? I'd like to look into any potential bugs.

Also, the profiler revealed that a lot of time was spent in the GC and in small allocations. The cause is in all likelihood the .format() call for each found match, and array append being used for the recursive calls. Getting rid of the .format ought to speed it up a bit. Will try that now...

T

--
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

> I can't explain why it's so incredibly fast, especially for the `count` case. I tried using the same hashing function on my solution, but that didn't really help me!

That's dynamic programming with memoization: basically caching the already-calculated results (in the `dp` array) and avoiding recalculation when the recursive function revisits the same state again.
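The memoization being described can be sketched generically: cache the answer for each starting position so every suffix of the phone number is solved at most once. This is a hedged illustration in Rust (the thread's comparison language); the dictionary is stubbed as a set of digit strings, and the problem's single-digit fallback rule is deliberately left out so only the caching structure shows:

```rust
use std::collections::{HashMap, HashSet};

/// Count the ways to split `digits` into dictionary "words" (digit strings
/// standing in for encoded words). `dp` caches the count per start
/// position, so a state revisited by the recursion is never recomputed.
fn count_encodings(digits: &str, dict: &HashSet<&str>) -> u64 {
    fn go(digits: &str, i: usize, dict: &HashSet<&str>, dp: &mut HashMap<usize, u64>) -> u64 {
        if i == digits.len() {
            return 1; // consumed everything: one complete encoding
        }
        if let Some(&cached) = dp.get(&i) {
            return cached; // state already solved earlier
        }
        let mut total = 0;
        for j in i + 1..=digits.len() {
            if dict.contains(&digits[i..j]) {
                total += go(digits, j, dict, dp);
            }
        }
        dp.insert(i, total);
        total
    }
    go(digits, 0, dict, &mut HashMap::new())
}

fn main() {
    let dict: HashSet<&str> = ["4", "48", "824", "24"].into_iter().collect();
    // "4824" splits as "4"+"824" and "48"+"24".
    assert_eq!(count_encodings("4824", &dict), 2);
    println!("ok");
}
```

For `count` mode this pays off enormously: without the cache the work is exponential in the number of split points, with it each position is computed once.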
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 20:34:48 UTC, H. S. Teoh wrote:

> On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...]
>> Anyway, I've fixed the problem, now my program produces the exact same output as Renato's repo. Code is posted below.
> [...]

Great, I ran the benchmarks for you :)

I had to change how you accept arguments; even though you did "the right thing" using `getopt`, the solutions should just take a `count` or `print` argument first... Anyway, here's your result:

```
===> ./rust
./rust,24133632,25
./rust,24739840,130
./rust,24477696,536
./rust,25247744,1064
./rust,8175616,6148
./rust,8306688,8315
===> src/d/dencoder
src/d/dencoder,46055424,43
src/d/dencoder,96337920,146
src/d/dencoder,102350848,542
src/d/dencoder,102268928,1032
src/d/dencoder,40206336,99936
^C
```

It took too long with the `count` option, so I had to abort before the last run ended... there's probably some bug there; otherwise the Trie runs very fast, as I had expected.

For the record (I already posted this on GitHub)... here's the time of [my current fastest solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d) using the same algorithm as Rust (after I fixed my solution that was using `int128`, which was not valid, as @ssvb pointed out):

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23986176,24
./rust,24346624,118
./rust,24543232,525
./rust,24346624,1027
./rust,8011776,3830
./rust,8486912,18403
===> src/d/dencoder
src/d/dencoder,13615104,30
src/d/dencoder,24723456,374
src/d/dencoder,24592384,1750
src/d/dencoder,24788992,3472
src/d/dencoder,11517952,10235
src/d/dencoder,11517952,51211
```

Not great :(

But @ssvb came up with a really fast implementation, though with a totally different algorithm (so don't take this as evidence of D being faster than Rust or anything like that):

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23986176,24
./rust,24461312,135
./rust,24494080,514
./rust,24526848,981
./rust,8175616,3647
./rust,8011776,15404
===> src/d/dencoder
src/d/dencoder,16433152,30
src/d/dencoder,16613376,125
src/d/dencoder,16613376,468
src/d/dencoder,16613376,941
src/d/dencoder,5570560,12
src/d/dencoder,6701056,18
```

I can't explain why it's so incredibly fast, especially for the `count` case. I tried using the same hashing function on my solution, but that didn't really help me!

Pretty cool to see different solutions to the problem, but I'm still curious to know where the solution I wrote is being slow compared to Rust, which is using identical (to my understanding) code! I'm sure people could write incredibly fast code to solve this problem, but what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"...
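The `int128` trick mentioned above amounts to packing a word's digit sequence into one 128-bit integer and using that as the table key, so key comparison and hashing touch a single machine-word-pair instead of a string. A hedged sketch of that general idea (the letter table is the problem's standard mapping, as in the D code elsewhere in this thread; the packing scheme is illustrative, not Renato's exact one):

```rust
/// Letter-to-digit mapping from the problem (same table as the D code in
/// this thread). Returns None for characters with no digit.
fn digit_of(c: char) -> Option<u128> {
    match c.to_ascii_uppercase() {
        'E' => Some(0),
        'J' | 'N' | 'Q' => Some(1),
        'R' | 'W' | 'X' => Some(2),
        'D' | 'S' | 'Y' => Some(3),
        'F' | 'T' => Some(4),
        'A' | 'M' => Some(5),
        'C' | 'I' | 'V' => Some(6),
        'B' | 'K' | 'U' => Some(7),
        'L' | 'O' | 'P' => Some(8),
        'G' | 'H' | 'Z' => Some(9),
        _ => None,
    }
}

/// Pack a word's digit sequence into one 128-bit key. The leading 1
/// marker keeps "E" (digit 0) and "EE" (digits 00) distinct. Note the
/// pitfall: past roughly 37 digits the key overflows and distinct
/// sequences collide -- the kind of correctness hole a reviewer can catch.
fn key_of(word: &str) -> u128 {
    word.chars()
        .filter_map(digit_of)
        .fold(1u128, |key, d| key * 10 + d)
}

fn main() {
    assert_eq!(key_of("Tor"), 1482);
    // "Torf" and "fort" share the digit sequence 4824, hence the same key.
    assert_eq!(key_of("Torf"), key_of("fort"));
    println!("ok");
}
```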
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via Digitalmars-d-learn wrote:
[...]
> Anyway, I've fixed the problem, now my program produces the exact same
> output as Renato's repo. Code is posted below.
[...]

Oops, forgot to actually paste the code. Here it is:

--snip--
```d
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }
        if (suffix.empty)
        {
            words ~= word;
            return;
        }
        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the
     * Trie. However, the original form of the word will be retained as-is
     * in the leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R && is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je", "105513: jemand", "107: neu", "1550: Name",
        "253302: Wasser", "35: da", "38: so", "400: Fee", "4021: fern",
        "4034: Fest", "482: Tor", "4824: fort", "4824: Torf", "51: an",
        "562: mir", "562: Mix", "56202: Mixer", "78: Bo\"", "783: bo\"s",
        "7857: blau", "7884: Boot", "824: Ort", "83: o\"d"
    ]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
    if (isOutputRange!(W, string))
{
    bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
    {
        if (node is null) return false;

        // Ignore non-digit characters in phone number
        while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
            suffix = suffix[1 .. $];
        if (suffix.empty)
        {
            // Found a match, print result
            foreach (word; node.words)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(only(word))));
            }
            return !node.words.empty;
        }

        bool ret;
        foreach (word; node.words)
        {
            // Found a matching word, try to match the rest of the phone
            // number.
            ret = true;
            if (impl(dict, suffix, path ~ word, true))
                allowDigit = false;
        }
        if (impl(node.edges[suffix[0] - '0'], suffix[1 .. $], path, false))
        {
            allowDigit = false;
            ret = true;
        }
        if (allowDigit)
        {
            // If we got here, it means that if we take the current node as
            // the next word choice, then the following digit will have no
            // further matches, and we may encode it as a single digit.
            auto nextSuffix = suffix[1 .. $];
```
[...]
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tue, Jan 16, 2024 at 06:54:56PM +0000, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote:
[...]
> > You are not allowed to emit "1" as the first token in the output as
> > long as there are any dictionary word matches at that position. The
> > relevant paragraph from the problem statement:

Ohhh, now I get it. Initially I misunderstood that as saying that if the rest of the phone number has at least one match, then a digit is not allowed. Now I see that what it's actually saying is that if some dictionary word matches at that position, even if it does not lead to any full matches, then a digit is excluded.

[...]
> > I also spent a bit of time trying to figure out this nuance when
> > implementing my solution. It doesn't make much sense visually (no
> > back-to-back digits in the output either way), but that's how it is.
>
> Exactly, this is one of the things that make this problem a bit
> annoying to solve :)

It's a strange requirement, for sure, but I don't think it's annoying. It makes the problem more Interesting(tm). ;-)

Anyway, I've fixed the problem; now my program produces the exact same output as Renato's repo. Code is posted below. Interestingly enough, the running time has now halved to about 0.9 seconds for 1 million phone numbers. I guess that's caused by the more stringent requirement excluding many more matching possibilities, effectively pruning away large parts of the search tree.

> @"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's
> also what I did when I "participated" in the study. Here's [my Trie
> solution in Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).
>
> These are basically the two common approaches to the problem: a Trie
> or a numeric-based table. According to the study, people who use
> scripting languages almost always go with the numeric approach, while
> people coming from lower-level languages tend to use a data structure
> like a Trie (if they don't know Tries, they come up with something
> similar, which is fascinating), which is harder to implement but more
> efficient in general.

Interesting. I guess my C/C++ background is showing. ;-) I'm not sure what exactly motivated me to go this route; I guess it was just my default preference of choosing the path of least work as far as the algorithm is concerned: I chose the algorithm that did the least amount of work needed to produce the right answer.

Scanning through sections of the dictionary to find a match was therefore excluded, so my first thought was an AA. But then how much of the initial prefix should I look up in an AA? Since that can't be known beforehand, I'd have to gradually lengthen the prefix to search for, which does a lot of repetitive work (we keep looking up the first few digits repeatedly each time we search for a longer prefix). Plus, multiple consecutive AA lookups are not cache-friendly. So my next thought was: what could I do so that I don't have to look at the initial digits anymore once I've already processed them? This line of thought naturally led to a trie structure.

Once I arrived at a trie structure, the next question was how exactly dictionary entries would be stored in it. Again, in the vein of doing the least amount of work I could get away with, I thought: if I stored words in the trie directly, with each edge encoding a letter, then during the search I'd have to repeatedly convert letters to the corresponding phone number digit and vice versa. So why not do this conversion beforehand, and store only phone digits in the trie? This also had the additional benefit of letting me effectively search multiple letters simultaneously: since multiple letters map to the same digit, scanning a digit is equivalent to searching multiple letters at the same time. The output, of course, required the original form of the words -- so the obvious solution was to attach the original words as a list on the trie node representing the end of that word.

Once this was all decided, the only remaining question was the search algorithm. This turned out to take the most time in solving this problem: due to the recursive nature of the search, I had to grapple with where and how to make the recursive calls, and how to propagate return values correctly. The initial implementation only found word matches and did not allow the single digits. Happily, the recursive algorithm turned out to have enough structure to encode the single-digit requirements as well, although it took a bit of trial and error to find the correct implementation.

> Can I ask you why you didn't use the [D stdlib
> Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure
> that would've worked, but did you consider that?

Haha, I didn't even think of that. :-D I wouldn't have wanted to use it anyway, because it was optimized for
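The design described here — edges keyed by digit rather than by letter, with the original words stored at the node where their digit sequence ends — can be sketched minimally. This Rust version is an illustration of the structure only, not a port of the D code posted in this thread:

```rust
/// Minimal digit-keyed trie: edges indexed by phone digit 0-9, original
/// words stored at the node where their digit sequence ends.
#[derive(Default)]
struct Trie {
    edges: [Option<Box<Trie>>; 10],
    words: Vec<String>,
}

impl Trie {
    /// Insert a word under its (precomputed) digit sequence, so no
    /// letter-to-digit conversion is needed during search.
    fn insert(&mut self, digits: &[u8], word: &str) {
        match digits.split_first() {
            None => self.words.push(word.to_string()),
            Some((&d, rest)) => self.edges[usize::from(d)]
                .get_or_insert_with(Box::default)
                .insert(rest, word),
        }
    }

    /// Follow a digit sequence; matching one digit implicitly matches
    /// every letter that maps to it.
    fn lookup(&self, digits: &[u8]) -> Option<&[String]> {
        match digits.split_first() {
            None => Some(&self.words[..]),
            Some((&d, rest)) => self.edges[usize::from(d)].as_deref()?.lookup(rest),
        }
    }
}

fn main() {
    let mut t = Trie::default();
    t.insert(&[4, 8, 2, 4], "fort"); // "fort" and "Torf" share path 4824
    t.insert(&[4, 8, 2, 4], "Torf");
    assert_eq!(t.lookup(&[4, 8, 2, 4]).unwrap().join(" "), "fort Torf");
    assert!(t.lookup(&[9, 9]).is_none());
    println!("ok");
}
```

Each digit of the phone number is examined exactly once on the way down, which is the "never look at the initial digits again" property the post is after.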
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote:

> On Tuesday, 16 January 2024 at 15:50:35 UTC, H. S. Teoh wrote:
>> Unfortunately there seems to be some discrepancy between the output I got and the prescribed output in your repository. For example, in your output the number 1556/0 does not have an encoding, but isn't "1 Mai 0" a valid encoding according to your dictionary and the original problem description?
>
> You are not allowed to emit "1" as the first token in the output as long as there are any dictionary word matches at that position. The relevant paragraph from the problem statement:
>
> ==snip==
> Encodings of phone numbers can consist of a single word or of multiple words separated by spaces. The encodings are built word by word from left to right. If and only if at a particular point no word at all from the dictionary can be inserted, a single digit from the phone number can be copied to the encoding instead. Two subsequent digits are never allowed, though. To put it differently: In a partial encoding that currently covers k digits, digit k+1 is encoded by itself if and only if, first, digit k was not encoded by a digit and, second, there is no word in the dictionary that can be used in the encoding starting at digit k+1.
> ==snip==
>
> I also spent a bit of time trying to figure out this nuance when implementing my solution. It doesn't make much sense visually (no back-to-back digits in the output either way), but that's how it is.

Exactly, this is one of the things that make this problem a bit annoying to solve :)

@"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's also what I did when I "participated" in the study. Here's [my Trie solution in Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).

These are basically the two common approaches to the problem: a Trie or a numeric-based table. According to the study, people who use scripting languages almost always go with the numeric approach, while people coming from lower-level languages tend to use a data structure like a Trie (if they don't know Tries, they come up with something similar, which is fascinating), which is harder to implement but more efficient in general.

Can I ask you why you didn't use the [D stdlib Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure that would've worked, but did you consider that?
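The "numeric-based table" alternative mentioned above amounts to grouping the dictionary by each word's digit translation and then probing that map with successively longer prefixes of the phone number. A hedged sketch of the grouping step (the partial digit mapping in `main` is just enough for the demo words; the full table appears in the D code elsewhere in this thread):

```rust
use std::collections::HashMap;

/// Group dictionary words by their digit translation; e.g. both "fort"
/// and "Torf" end up under the key "4824".
fn build_table(
    words: &[&str],
    digit_of: impl Fn(char) -> Option<char>,
) -> HashMap<String, Vec<String>> {
    let mut table: HashMap<String, Vec<String>> = HashMap::new();
    for w in words {
        // Characters with no digit mapping (umlaut quotes etc.) are dropped.
        let key: String = w.chars().filter_map(&digit_of).collect();
        table.entry(key).or_default().push(w.to_string());
    }
    table
}

fn main() {
    // Partial mapping, sufficient for the demo words only.
    let digit_of = |c: char| match c.to_ascii_uppercase() {
        'F' | 'T' => Some('4'),
        'O' | 'L' | 'P' => Some('8'),
        'R' | 'W' | 'X' => Some('2'),
        _ => None,
    };
    let table = build_table(&["fort", "Torf", "Tor"], digit_of);
    assert_eq!(table["4824"].len(), 2);
    assert_eq!(table["482"], vec!["Tor"]);
    println!("ok");
}
```

The trade-off versus a trie is exactly the one H. S. Teoh describes later: the table requires re-hashing every candidate prefix, while the trie consumes each digit once.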
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tuesday, 16 January 2024 at 15:50:35 UTC, H. S. Teoh wrote:

> Unfortunately there seems to be some discrepancy between the output I got and the prescribed output in your repository. For example, in your output the number 1556/0 does not have an encoding, but isn't "1 Mai 0" a valid encoding according to your dictionary and the original problem description?

You are not allowed to emit "1" as the first token in the output as long as there are any dictionary word matches at that position. The relevant paragraph from the problem statement:

==snip==
Encodings of phone numbers can consist of a single word or of multiple words separated by spaces. The encodings are built word by word from left to right. If and only if at a particular point no word at all from the dictionary can be inserted, a single digit from the phone number can be copied to the encoding instead. Two subsequent digits are never allowed, though. To put it differently: In a partial encoding that currently covers k digits, digit k+1 is encoded by itself if and only if, first, digit k was not encoded by a digit and, second, there is no word in the dictionary that can be used in the encoding starting at digit k+1.
==snip==

I also spent a bit of time trying to figure out this nuance when implementing my solution. It doesn't make much sense visually (no back-to-back digits in the output either way), but that's how it is.
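The rule quoted above can be captured in a small recursive checker: digit k+1 may stand for itself only if the previous token was not a digit and no dictionary word matches starting there. A hedged sketch of just that control flow (word matching is stubbed with a set of digit strings; the real solutions use a trie or a keyed table instead):

```rust
use std::collections::HashSet;

/// Count encodings under the quoted rule: a single digit is emitted if
/// and only if (a) the previous token was not a digit and (b) no
/// dictionary word matches at the current position.
fn count(digits: &str, i: usize, prev_was_digit: bool, dict: &HashSet<&str>) -> u64 {
    if i == digits.len() {
        return 1; // whole number covered: one valid encoding
    }
    let mut total = 0;
    let mut word_matches_here = false;
    for j in i + 1..=digits.len() {
        if dict.contains(&digits[i..j]) {
            word_matches_here = true;
            total += count(digits, j, false, dict);
        }
    }
    // The nuance from the problem statement: the digit fallback is barred
    // as soon as ANY word matches here, even one leading to a dead end.
    if !word_matches_here && !prev_was_digit {
        total += count(digits, i + 1, true, dict);
    }
    total
}

fn main() {
    let dict: HashSet<&str> = ["482"].into_iter().collect();
    // Only "482 4" works: the final 4 has no word match and follows a word.
    assert_eq!(count("4824", 0, false, &dict), 1);
    // With an empty dictionary, "12" has no encoding at all: two bare
    // digits would have to be adjacent, which the rule forbids.
    assert_eq!(count("12", 0, false, &HashSet::new()), 0);
    println!("ok");
}
```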
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
P.S. Compiling my program with `ldc -O2`, it runs so fast that I couldn't measure any meaningful running time that's greater than startup overhead. So I wrote a helper program to generate random phone numbers up to 50 characters long, and found that it could encode 1 million phone numbers in 2.2 seconds (using the 75,000 entry dictionary from your repository). Counting vs. printing the results made no significant difference to this. T -- People tell me that I'm skeptical, but I don't believe them.
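The generator program itself wasn't posted; as a rough idea of what such a helper can look like, here is a hedged sketch (not Teoh's actual tool). It uses a tiny xorshift PRNG to stay crate-free, and the '-'/'/' separators are an assumption about the input format:

```rust
/// Tiny xorshift64 PRNG so the sketch needs no external crates.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Generate one random phone number of 1..=max_len characters: digits
/// with occasional '-' and '/' separators (the non-digit characters the
/// problem's inputs may contain).
fn random_phone(rng: &mut Rng, max_len: usize) -> String {
    let len = 1 + (rng.next() as usize) % max_len;
    (0..len)
        .map(|_| match rng.next() % 12 {
            10 => '-',
            11 => '/',
            d => char::from(b'0' + d as u8),
        })
        .collect()
}

fn main() {
    let mut rng = Rng(0x9E3779B97F4A7C15); // any nonzero seed works
    for _ in 0..5 {
        println!("{}", random_phone(&mut rng, 50));
    }
}
```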
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Tue, Jan 16, 2024 at 07:50:35AM -0800, H. S. Teoh via Digitalmars-d-learn wrote:
[...]
> Unfortunately there seems to be some discrepancy between the output I
> got and the prescribed output in your repository. For example, in your
> output the number 1556/0 does not have an encoding, but isn't "1 Mai 0"
> a valid encoding according to your dictionary and the original problem
> description?
[...]

Also, found a bug in my program that misses some solutions when the phone number has trailing non-digits. Here's the updated code. It still finds extra encodings compared to the output in your repo, though. Maybe I misunderstood part of the requirements?

--snip--
```d
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }
        if (suffix.empty)
        {
            words ~= word;
            return;
        }
        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the
     * Trie. However, the original form of the word will be retained as-is
     * in the leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R && is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je", "105513: jemand", "107: neu", "1550: Name",
        "253302: Wasser", "35: da", "38: so", "400: Fee", "4021: fern",
        "4034: Fest", "482: Tor", "4824: fort", "4824: Torf", "51: an",
        "562: mir", "562: Mix", "56202: Mixer", "78: Bo\"", "783: bo\"s",
        "7857: blau", "7884: Boot", "824: Ort", "83: o\"d"
    ]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
    if (isOutputRange!(W, string))
{
    bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
    {
        if (node is null) return false;

        // Ignore non-digit characters in phone number
        while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
            suffix = suffix[1 .. $];
        if (suffix.empty)
        {
            // Found a match, print result
            foreach (word; node.words)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(only(word))));
            }
            return !node.words.empty;
        }

        bool ret;
        foreach (word; node.words)
        {
            // Found a matching word, try to match the rest of the phone
            // number.
            if (impl(dict, suffix, path ~ word, true))
            {
                allowDigit = false;
                ret = true;
            }
        }
        if (impl(node.edges[suffix[0] -
```
[...]
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Mon, Jan 15, 2024 at 08:10:55PM +0000, Renato via Digitalmars-d-learn wrote:
> On Monday, 15 January 2024 at 01:10:14 UTC, Sergey wrote:
> > On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:
> > > If anyone can find any flaw in my methodology or optimise my code so
> > > that it can still get a couple of times faster, approaching Rust's
> > > performance, I would greatly appreciate that! But for now, my
> > > understanding is that the most promising way to get there would be
> > > to write D in `betterC` style?!
> >
> > I've added a port from Rust in the PR comment. Can you please check
> > this solution? Most probably it needs to be optimized with a profiler.
> > Just interesting how a close-enough port will work.
>
> As discussed on GitHub, the line-by-line port of the Rust code is 5x
> slower than [my latest solution using
> int128](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d),
> which is itself 3 to 4x slower than the Rust implementation (at around
> the same order of magnitude as the algorithm-equivalent Java and Common
> Lisp implementations; D is perhaps 15% faster).
>
> I did the best I could to make D run faster, but we've hit a limit
> that's a bit hard to get past now. Happy to be given suggestions (see
> profiling information in previous messages), but I've run out of ideas
> myself.

This problem piqued my interest, so yesterday and today I worked on it and came up with my own solution (I did not look at existing solutions in order to prevent bias). I have not profiled it or anything, but the runtime seems quite promising. Here it is:

--snip--
```d
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }
        if (suffix.empty)
        {
            words ~= word;
            return;
        }
        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the
     * Trie. However, the original form of the word will be retained as-is
     * in the leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R && is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je", "105513: jemand", "107: neu", "1550: Name",
        "253302: Wasser", "35: da", "38: so", "400: Fee", "4021: fern",
        "4034: Fest", "482: Tor", "4824: fort", "4824: Torf", "51: an",
        "562: mir", "562: Mix", "56202: Mixer", "78: Bo\"", "783: bo\"s",
        "7857: blau", "7884: Boot", "824: Ort", "83: o\"d"
    ]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
```
[...]
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Monday, 15 January 2024 at 01:10:14 UTC, Sergey wrote:

> On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:
>> If anyone can find any flaw in my methodology or optimise my code so that it can still get a couple of times faster, approaching Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in `betterC` style?!
>
> I've added a port from Rust in the PR comment. Can you please check this solution? Most probably it needs to be optimized with a profiler. Just interesting how a close-enough port will work.

As discussed on GitHub, the line-by-line port of the Rust code is 5x slower than [my latest solution using int128](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d), which is itself 3 to 4x slower than the Rust implementation (at around the same order of magnitude as the algorithm-equivalent Java and Common Lisp implementations; D is perhaps 15% faster).

I did the best I could to make D run faster, but we've hit a limit that's a bit hard to get past now. Happy to be given suggestions (see profiling information in previous messages), but I've run out of ideas myself.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:

> If anyone can find any flaw in my methodology or optimise my code so that it can still get a couple of times faster, approaching Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in `betterC` style?!

I've added a port from Rust in the PR comment. Can you please check this solution? Most probably it needs to be optimized with a profiler. Just interesting how a close-enough port will work.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Sunday, 14 January 2024 at 10:02:58 UTC, Jordan Wilson wrote: On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote: I like to use a phone encoding problem to determine the strenghtness and weaknesses of programming languages because this problem is easy enough I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language. [...] Hello Renato, This seems to be quite a lot of calls: ``` Timer frequency unknown, Times are in Megaticks Num TreeFuncPer CallsTimeTimeCall 1920496437613756 0 pure nothrow ref @trusted immutable(char)[][] core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong) 1920492489573474 0 @safe void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][]) ``` This is when using the `words-quarter.txt` input (the `dictionary.txt` input seems to finish much faster, although still slower than `java`/`rust`). I also used only 100 phone numbers as input. My final observation is that `words-quarter.txt` contains some 1-letter inputs, (for example, `i` or `m`)...this may result in a large number of encoding permutations, which may explain the high number of recursion calls? Jordan The characteristics of the dictionary impact the number of solutions greatly. I explored this in my blog post about [Common Lisp, Part 2](https://renato.athaydes.com/posts/revenge_of_lisp-part-2). The fact there's a ridiculous amount of function calls is why this problem can take minutes even without having to allocate much memory or print anything. ** I've managed to improve the D code enough that it is now faster than Common Lisp and the equivalent algorithm in Java.** It took some profiling to do that, though... thanks to @Anonymouse for the suggestion to use Valgrind... 
With that, I was able to profile the code nicely (Valgrind works well with D; it doesn't even show mangled names!). Here's what I did.

First: the solution using int128 was much faster than the BigInt solution, as I had already mentioned. But when profiling that, it was clear that for the problems with a very large number of solutions, GC became a problem:

```
23,044,096,944 (100.0%)  PROGRAM TOTALS

Ir                       file:function
7,079,878,197 (30.72%)   ???:core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
2,375,100,857 (10.31%)   ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,971,210,820 ( 8.55%)   ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,922,961,924 ( 8.34%)   ???:_d_arraysetlengthT [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,298,747,622 ( 5.64%)   ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,134,644,706 ( 4.92%)   ???:core.internal.gc.bits.GCBits.setLocked(ulong) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  849,587,834 ( 3.69%)   ???:core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  827,407,988 ( 3.59%)   ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
  688,845,027 ( 2.99%)   ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
  575,415,884 ( 2.50%)   ???:_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  562,146,592 ( 2.44%)   ???:core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong,
```
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:
> On Saturday, 13 January 2024 at 19:35:57 UTC, Renato wrote:
>> On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:
>>> On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:
>>>> [...]
>>
>> I will have to try it... I thought that `BigInt` was to blame for the slowness (from what I could read from the trace logs), but after replacing that with basically a byte array key (see [commit here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0e9025b9aacdcfef5a2649be4cc82b9bc607fd6c)) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.
>
> In the repo it's hard to find the proper version. I've checked the Rust from the master branch and it looks a bit different from the D implementation.

I explicitly linked to the Rust implementation in my question: the [Rust solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/rust/phone_encoder/src/main.rs) uses approximately the same algorithm as used in CL and D, but instead of BigInt, it uses a Vec as key. Can you be more specific about which part of the Rust implementation is not roughly equivalent to the D implementation? There are obvious differences in style and syntax, but I hope that it's mostly equivalent algorithm-wise. But to clarify: the branch with the fastest solutions is called `fastest-implementations-print-or-count`. The D branches with my alternative implementations are all called `dlang-*`.

> I would suggest rewriting it in the same way as the Rust implementation. Probably you would like to try:
>
> * do not use BigInt from std. It could be quite slow. Try the GMP library from Dub instead.

Right, but as I posted above, even using a byte array just like in Rust, the performance was still very bad (though around 2x faster than using BigInt).
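For readers following along, the `Int128`-as-key idea discussed in this thread can be sketched roughly like this (the names and details here are illustrative, not the actual dencoder code): pack the digit sequence into a single fixed-size integer so the associative-array key is cheap to hash and compare, instead of a `BigInt` or a byte array.

```d
import std.int128 : Int128;

// Hypothetical sketch of packing a phone-number digit sequence into an
// Int128 key. A leading 1 keeps keys like "07" and "7" distinct.
Int128 digitsToKey(const(char)[] digits)
{
    Int128 key = Int128(1);
    foreach (c; digits)
    {
        assert(c >= '0' && c <= '9');
        key = key * Int128(10) + Int128(c - '0');
    }
    return key;
}

unittest
{
    assert(digitsToKey("7") == Int128(17)); // leading 1, then digit 7
    assert(digitsToKey("07") != digitsToKey("7"));
}
```

Since the problem's inputs fit in 128 bits (but not 64), this keeps the key a plain value type with no heap allocation per lookup.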
Also, using GMP is always suggested to me, but I can't accept that because my goal is not to use C libraries to achieve the fastest speed (which I could do by using C or FFI in any other language); it's to use D (and the other languages) to solve my problem in an idiomatic way. I [ended up using `Int128`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/98dcbcf1c77d1ded3406cc03af9026e126df5b2d) (the problem description requires more than i64, but int128 is enough), which greatly improved performance over my D baseline AND over the byte-array solution.

> * don't do "idup" every time
> * instead of byLine, try byLineCopy

`idup` is necessary because the strings are stored in a map after the iteration ends. Anyway, I replaced that with `byLineCopy` and it became slightly slower.

> * instead of "arr ~= data" try to use Appender (https://dlang.org/library/std/array/appender.html)

I was worried about `~=` allocating too much, though IIUC it shouldn't be a problem the way I used it... in any case, I [replaced it with `Appender`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/f1364d0897f1f37882f1d39b92c16b84b1abdc31) and the performance was completely unchanged - which is good, as I think it means I used `~=` correctly to avoid over-allocation (or I messed up using `Appender` :D - please have a look).

> * also you could try to use splitter (https://dlang.org/library/std/algorithm/iteration/splitter.html) to lazily process each part of the data

This is not necessary because that would only affect the time spent loading the dictionary, which is the faster part of the problem... nearly all of the time is spent looking for solutions instead.

> * isLastDigit function has many checks, but I think it could be implemented easier in a Rust way

The Rust solution uses sum types for that, but that had a very minor effect on the overall performance (though in Rust, that's "cleaner" to use anyway)...
I know D does have SumType in the stdlib, but I thought it was not as "zero cost" as in Rust, and because both variants wrap a string, it's awkward to use... so I instead tried using a struct like this:

```d
struct WordOrDigit {
    string value;
    bool isDigit;
}
```

You can check [the commit here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0f6b4ce83373c46b14b5bb40c53bb0e2d0905e66). This change made the code slightly slower.

> * also consider using functions from Range (filter, map), as you do in Rust, instead of for loops

Why would for loops be slower? Or are you just saying I should make the D code nicer? Anyway, thanks for the suggestions!

On Sunday, 14 January 2024 at 08:33:24 UTC, Anonymouse wrote:
> On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:
>> I would suggest to rewrite in the same way as Rust implemented. Probably you would like to try: [...]
>
> I would strongly argue for profiling first instead of optimising based on conjecture. If you profile, you have solid evidence on what is actually slow.
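For reference, the two appending styles compared above look roughly like this (an illustrative sketch, not the actual dencoder code). Both are amortized O(1) per append, which is consistent with the swap changing nothing; `Appender` merely skips the druntime capacity lookup that `~=` performs on each call.

```d
import std.array : appender;

// Sketch: append with the built-in operator vs. std.array.Appender.
string[] withOperator(const string[] words)
{
    string[] result;
    foreach (w; words)
        result ~= w; // druntime consults the GC for capacity on each append
    return result;
}

string[] withAppender(const string[] words)
{
    auto app = appender!(string[]);
    app.reserve(words.length); // optional: pre-size to avoid growth entirely
    foreach (w; words)
        app.put(w);
    return app.data;
}

unittest
{
    auto input = ["mir", "Tor", "Mix"];
    assert(withOperator(input) == withAppender(input));
}
```

The difference only shows up when appends dominate the profile; here, the profiler pointed at GC marking and hashing instead.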
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
> I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language. [...]

Hello Renato,

This seems to be quite a lot of calls:

```
Timer frequency unknown, Times are in Megaticks
  Num          Tree        Func        Per
  Calls        Time        Time        Call
1920496  437613756  0  pure nothrow ref @trusted immutable(char)[][] core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)
1920492  489573474  0  @safe void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])
```

This is when using the `words-quarter.txt` input (the `dictionary.txt` input seems to finish much faster, although still slower than `java`/`rust`). I also used only 100 phone numbers as input.

My final observation is that `words-quarter.txt` contains some 1-letter inputs (for example, `i` or `m`)... this may result in a large number of encoding permutations, which may explain the high number of recursion calls?

Jordan
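A rough way to quantify Jordan's point (an illustrative sketch, not code from the thread): if every single digit had a matching 1-letter word, an n-digit number could be split or not split at each of the n-1 gaps between digits, giving at least 2^(n-1) segmentations, before even counting the multiple words that may match each segment.

```d
// Lower bound on segmentations of an n-digit number when every digit
// can stand alone: one binary choice (break / no break) per gap.
ulong segmentations(uint digits)
{
    return digits == 0 ? 1 : 1UL << (digits - 1);
}

unittest
{
    assert(segmentations(4) == 8);        // 1|2|3|4, 12|3|4, ..., 1234
    assert(segmentations(20) == 524_288); // why the call counts explode
}
```

This matches the observed behaviour: the recursion count, not allocation or printing, is what makes the `words-quarter.txt` runs take minutes.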
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:
> I would suggest to rewrite in the same way as Rust implemented. Probably you would like to try: [...]

I would strongly argue for profiling first instead of optimising based on conjecture. If you profile, you have solid evidence on what is actually slow. If you're very good at analysing D, well-educated hypotheses *may* be enough, until they suddenly aren't and you will have spent a lot of time on the wrong problem.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 19:35:57 UTC, Renato wrote:
> On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:
>> On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:
>>> [...]
>
> I will have to try it... I thought that `BigInt` was to blame for the slowness (from what I could read from the trace logs), but after replacing that with basically a byte array key (see [commit here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0e9025b9aacdcfef5a2649be4cc82b9bc607fd6c)) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.

In the repo it's hard to find the proper version. I've checked the Rust from the master branch and it looks a bit different from the D implementation.

I would suggest rewriting it in the same way as the Rust implementation. Probably you would like to try:

* do not use BigInt from std. It could be quite slow. Try the GMP library from Dub instead
* don't do "idup" every time
* instead of byLine, try byLineCopy
* instead of "arr ~= data", try Appender (https://dlang.org/library/std/array/appender.html)
* you could also try splitter (https://dlang.org/library/std/algorithm/iteration/splitter.html) to lazily process each part of the data
* the isLastDigit function has many checks, but I think it could be implemented more simply, the Rust way
* also consider using Range functions (filter, map), as in the Rust code, instead of for loops
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:
> On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:
>> [...]
>>
>> Not a great profiling experience :). Does anyone have a better suggestion for "parsing" the trace file?
>
> As a drive-by suggestion, and I hope it doesn't derail anything: if you have the opportunity to run it on Linux, have you tried profiling with callgrind instead, with {Q,K}Cachegrind to visualise things? Your repositories probably have them. (callgrind is a part of valgrind.) The wiki only mentions callgrind in passing, but it has worked well for me. [(example)](https://i.imgur.com/WWZAwy3.png)

Thanks for the suggestion, this looks promising as I do have a Linux laptop (just not my main one). I will have to try it... I thought that `BigInt` was to blame for the slowness (from what I could read from the trace logs), but after replacing that with basically a byte array key (see [commit here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0e9025b9aacdcfef5a2649be4cc82b9bc607fd6c)) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:
> [...]
>
> Not a great profiling experience :). Does anyone have a better suggestion for "parsing" the trace file?

As a drive-by suggestion, and I hope it doesn't derail anything: if you have the opportunity to run it on Linux, have you tried profiling with callgrind instead, with {Q,K}Cachegrind to visualise things? Your repositories probably have them. (callgrind is a part of valgrind.) The wiki only mentions callgrind in passing, but it has worked well for me. [(example)](https://i.imgur.com/WWZAwy3.png)
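For anyone wanting to reproduce this, a typical callgrind session looks something like the following. The binary and input paths are placeholders; the commands themselves are standard valgrind tooling.

```shell
# Run the program under callgrind (expect a large slowdown while profiling).
valgrind --tool=callgrind ./src/d/dencoder count < phones.txt

# callgrind writes callgrind.out.<pid>; open it in a GUI...
kcachegrind callgrind.out.*      # or qcachegrind

# ...or print a sorted flat summary directly in the terminal:
callgrind_annotate callgrind.out.* | head -n 30
```

The `Ir` (instructions retired) column in the annotated output is what the PROGRAM TOTALS listing elsewhere in this thread shows.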
Re: Help optimize D solution to phone encoding problem: extremely slow performance.
On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
> I tried to profile the D code but the profiler seems to be broken on my OS (Mac):

I profiled it on Linux and stored [the trace.log file](https://gist.github.com/renatoathaydes/fd8752ed81b0cf792ed7ef5aa3d68acd) in a public Gist. I tried using the HTML viewer [recommended in the wiki](https://wiki.dlang.org/Development_tools) but that doesn't work...

First try:

```
Running ../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/d-profile-viewer
Corrupt trace.log (can't compute ticks per second), please re-profile and try again
```

Second try, with a longer trace (the one I saved in the gist):

```
Running ../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/d-profile-viewer
std.conv.ConvException@/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d(2533): Unexpected '-' when converting from type char[] to type ulong
??:? [0x55c4be9bc99e]
??:? [0x55c4be9bc612]
??:? [0x55c4be9de81e]
??:? [0x55c4be9c4fbf]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:2533 [0x55c4be98ba2f]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:2002 [0x55c4be98b6fc]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:210 [0x55c4be9665ec]
../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/source/app.d:1095 [0x55c4be965e21]
../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/source/app.d:1138 [0x55c4be96698a]
??:? [0x55c4be9c4c9c]
```

Not a great profiling experience :). Does anyone have a better suggestion for "parsing" the trace file?
Help optimize D solution to phone encoding problem: extremely slow performance.
I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language. You can check [my initial blog post](https://renato.athaydes.com/posts/revisiting-prechelt-paper-comparing-languages) about this, which was an analysis of the original study by Prechelt on the productivity differences between programmers using different languages.

The original problem has a flaw when run on modern computers: the programs find solutions so quickly that most of the benchmark time was spent actually printing solutions to stdout. So I modified the problem to add an option to either print the solutions or just count them - the programs still do all the work, but are not required to print anything other than how many solutions were found.

Anyway, I ported the Common Lisp solution to D because, like CL, D has built-in data structures like associative arrays and `BigInt` (at least it's in the stdlib)... I thought this would actually give D an edge! But it turns out D performs very badly on larger input sets. It starts quite well on smaller inputs, but scales very poorly.

My initial Rust solution also performed very poorly, so I'm afraid the same is happening with my initial D solution, despite my best effort to write something "decent". You can find my D solution [in this Pull Request](https://github.com/renatoathaydes/prechelt-phone-number-encoding/pull/16). The solution is almost identical, to the extent possible, to the Common Lisp solution, which can be [found here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/lisp/main.lisp).
The secret to high performance for this algorithm is a very efficient `BigInt` implementation and a fast hashing function for the hash table (an associative array in D, with `BigInt` keys). Hence, I suppose D's hash function or `BigInt` is not that fast (Rust's default hash is also not great due to security concerns, but it's very easy to swap in a much faster custom one by changing a single line of code).

Anyway, here's the current relative performance. The other languages are pretty heavily optimised; the Java solution uses a different algorithm, so it's not directly comparable. [The Rust solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/rust/phone_encoder/src/main.rs) uses approximately the same algorithm as CL and D, but instead of `BigInt` it uses a `Vec` as key, as that turned out to be faster (I may try that technique in D as well) - but notice that even using Rust's slower BigInt, it was orders of magnitude faster than D:

```
Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,109895680,377
java-Main,179634176,1025
java-Main,167149568,1621
java-Main,180203520,2493
java-Main,96481280,6112
java-Main,95526912,7989
===> sbcl --script src/lisp/main.fasl
sbcl,31780864,74
sbcl,79437824,888
sbcl,79613952,3991
sbcl,80654336,7622
sbcl,80420864,18623
sbcl,83402752,29503
===> ./rust
./rust,23257088,58
./rust,23437312,260
./rust,23433216,1077
./rust,23416832,2017
./rust,7106560,6757
./rust,7110656,10165
===> src/d/dencoder
src/d/dencoder,38748160,223
src/d/dencoder,75800576,3154
src/d/dencoder,75788288,14905
src/d/dencoder,75751424,30271
```

I had to abort D as it was taking too long. The above is with the `print` option (that's normally slow when stdout is not buffered, but I did buffer D's writes and it's still very slow).
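For context, "buffering D's writes" here means accumulating output and handing stdout one large write instead of one `write` call per solution. A minimal sketch of that pattern (illustrative, not the actual dencoder code):

```d
import std.array : appender;
import std.stdio : stdout;

// Collect all solution lines into a single buffer, then write once.
string bufferSolutions(const string[] solutions)
{
    auto buf = appender!string;
    foreach (s; solutions)
    {
        buf.put(s);
        buf.put('\n');
    }
    return buf.data;
}

void main()
{
    // One large write instead of thousands of small ones.
    stdout.write(bufferSolutions(["5624-82: mir Tor", "5624-82: Mix Tor"]));
}
```

As the benchmarks suggest, this removes per-call I/O overhead but cannot help when the time is spent searching for solutions rather than printing them.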
With the `count` option, which does not print anything except a number at the end, D took so long I couldn't even collect its results (I waited several minutes)... the other languages finished in much less than a minute:

```
Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,124112896,7883
java-Main,107487232,9273
===> sbcl --script src/lisp/main.fasl
sbcl,82669568,25638
sbcl,83759104,33501
===> ./rust
./rust,7061504,9488
./rust,7127040,11441
===> src/d/dencoder
^C
(ldc-1.36.0)
```

I tried to profile the D code but the profiler seems to be broken on my OS (Mac):

```
▶ dub build -b profile
    Starting Performing "profile" build using /Users/renato/dlang/ldc-1.36.0/bin/ldc2 for x86_64.
    Building prechelt ~master: building configuration [application]
     Linking dencoder
(ldc-1.36.0)
prechelt-phone-number-encoding/src/d dlang ✗ 9m ◒
▶ cd ../..
(ldc-1.36.0)
programming/projects/prechelt-phone-number-encoding dlang ✗ 9m ◒
▶ src/d/dencoder
[1]
```