On 1/31/11 3:37 PM, bearophile wrote:
This is a high-level implementation of Levenshtein distance in Haskell:


levenshtein s1 s2 = last $ foldl transform [0 .. length s1] s2
   where transform ns@(n:ns') c = scanl calc (n+1) $ zip3 s1 ns ns'
           where calc z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' /= c)]

main = print (levenshtein "kitten" "sitting")


(I don't explain that Haskell code because below I give a kind of D translation 
that is probably enough to understand.)
It prints 3, the code comes from the rosettacode.org site:
http://rosettacode.org/wiki/Levenshtein_distance#Haskell

Currently in std.algorithm there are higher order functions (HOFs) like "map", "filter", 
"reduce" etc, reduce is the right fold.

That Haskell code uses a foldl (fold left), and scanl (like foldl, but keeps 
and returns all intermediate values too). This is a brutal translation to D2:


import std.stdio, std.range, std.array, std.algorithm, std.typecons;

/// foldl f z [x1, x2, ..., xn] == (...((z `f` x1) `f` x2) `f`...) `f` xn
T1 foldl(F, T1, Range)(F f, T1 z, Range xs) {
     foreach (x; xs)
         z = f(z, x);
     return z;
}

/// scanl f z [x1, x2, ...] == [z, z `f` x1, (z `f` x1) `f` x2, ...]
T1[] scanl(F, T1, Range)(F f, T1 z, Range xs) {
     auto result = [z];
     foreach (x; xs) {
         z = f(z, x);
         result ~= z;
     }
     return result;
}

auto levenshtein(string s1, string s2) {
     int[] transform(int[] ns, char c) {
         auto n = ns[0];
         auto ns2 = ns[1..$];
         int compute(int z, Tuple!(dchar, int, int) t) {
             return min(t[2]+1, z+1, t[1] + (t[0] != c));
         }
         return scanl(&compute, n+1, zip(s1, ns, ns2));
     }
     return foldl(&transform, array(iota(cast(int)s1.length+1)), s2)[$-1];
}

void main() {
     string add(string x, string y) { return x ~ y; }
     writeln(foldl(&add, "", ["ab", "cd", "ef"])); // "abcdef"
     writeln(scanl(&add, "", ["ab", "cd", "ef"])); // ["", "ab", "abcd", 
"abcdef"]

     writeln(levenshtein("kitten", "sitting")); // 3
}


Haskell has tons of built-in HOFs (monadic ones too, like foldM), but D is less functional than Haskell, and 
reading highly functional-style code is not easy for many programmers (and D syntax limits make such code 
noisy). What's the right number of HOFs to put in std.algorithm+std.range? Are (better implementations of) 
scanl and foldl fit for Phobos? (I am not suggesting to rename "reduce" to "foldr". You 
may add an "alias reduce foldr;" if you want, but that's all).

Hours ago I have added this suggestion to bugzilla:
http://d.puremagic.com/issues/show_bug.cgi?id=5510

Bye,
bearophile

Thanks for this work.

The Haskell implementation doesn't scale. My testbed:

levenshtein s1 s2 = last $ foldl transform [0 .. length s1] s2
  where transform ns@(n:ns') c = scanl calc (n+1) $ zip3 s1 ns ns'
where calc z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' /= c)]

n = 4
s = concat $ replicate n "kitten"
t = concat $ replicate n "sitting"

main = print (levenshtein s t)

I assigned n doubling values: 4 8 16 32 64 ... and measured the running times gave me:

4 -> 3ms
8 -> 4ms
16 -> 8ms
32 -> 32ms
64 -> 137ms
128 -> 607ms
256 -> 2.6s
512 -> 15.8s
1024 -> 78s
2048 -> doesn't finish with -O, crashes without -O

I used ghc -O on a top-of-the-line server (8 cores, 64 GB RAM).

Then I ran your implementation on a MacBook Pro with dmd -O -release -inline:

4 -> 4ms
8 -> 4ms
16 -> 6ms
32 -> 11ms
64 -> 36ms
128 -> 108ms
256 -> 402ms
512 -> 1.59s
1024 -> 6.36s
2048 -> 25.43s
4096 -> 100.2s

Finally, on the same laptop and with the same options I ran the std-provided levenshteinDistance, obtaining:

4 -> 5ms
8 -> 5ms
16 -> 5ms
32 -> 6ms
64 -> 11ms
128 -> 33ms
256 -> 112ms
512 -> 431ms
1024 -> 1.8s
2048 -> 7.9s
4096 -> 37.9s

Note that levenshteinDistance is more general (works with any forward ranges, automatically performs correct UTF decoding, accepts custom comparison).

The trend of the two D implementations is quadratic: time required is 4x upon doubling the size of the input. The trend of the Haskell implementation is super-quadratic.

Note that reduce is akin to foldl, not foldr. To answer your question, adding functional primitives to Phobos is good but should be done in moderation; a staunch functional approach is not always the most recommendable.


Andrei

Reply via email to