I'm writing a package that computes various string distances. Some 
distances require comparing the sets of n-grams (contiguous sequences of n 
characters) of the two strings.
For now, my implementation of these distances is quite slow. I've put the 
Jaccard function in a gist here: <https://goo.gl/S4dkb1>.
The function is 10x slower than the R package stringdist 
<https://github.com/markvanderloo/stringdist> (written in C and based on 
binary trees rather than hash tables). Profiling shows that most of the time 
is spent creating the Set of q-grams for each string. Can you think of a way 
to improve its performance?
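
For reference, the computation in question can be sketched as follows. This is only an illustration in Python, not the code from the gist: the names `qgrams` and `jaccard_distance` are hypothetical, and returning 0.0 when neither string has any q-gram is an assumed convention.

```python
def qgrams(s, q):
    """Return the set of all contiguous substrings of length q in s."""
    # Building this set is the step the profiling points at.
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(s1, s2, q=2):
    """Jaccard distance between the q-gram sets of s1 and s2."""
    a = qgrams(s1, q)
    b = qgrams(s2, q)
    union_size = len(a | b)
    if union_size == 0:
        # Assumed convention: two strings with no q-grams are at distance 0.
        return 0.0
    return 1.0 - len(a & b) / union_size
```

With q = 2, "abcd" and "bcde" share the q-grams "bc" and "cd" out of four distinct ones, so `jaccard_distance("abcd", "bcde")` is 0.5.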
