David Su added the comment:

I am trying to implement the following algorithm:

http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

This algorithm is used for fast nearest neighbor queries for spell correction. 

Given a corpus of strings, for each string s, I generate all subsequences 
(Note: subsequence, not substring) of that string length (len(n) - k), call 
that subsequence t, where k is a tunable parameter. Then, I create a dict 
mapping the subsequences t to the original string s.

This process creates a very large dict from a relatively small corpus. For 
example, my corpus of ~500k strings blew up to a dict of ~10M keys, with k=3.

Then, for a query string q, you generate all subsequences of (len(q) - k), and 
perform an intersection with the keys of the dict. Then, the values of the dict 
corresponding to the key intersection will be possible misspelling of the query 
string q.

In pseudo-python:

def get_subseq(s: str, k: int):
    returns all subsequences of s of length len(s) - k

def build_dict(corpus: List[str], k: int):
    d = defaultdict(list)

    for orig_str in corpus:
        for subseq in get_subseq(orig_str, k):
            d[subseq].append(orig_str)

    return d

def query(d: dict, q: str, k:int):
    q_subseq = set(get_subseq(q, k))
    intersection = d.viewkeys() & q_subseq # this is slow when len(d) is large
    return [orig_str for k in intersection for orig_str in d[k]]
    

By exposing an intersection interface, a certain degree of performance is 
expected by the user. It took me a considerable amount of debugging to realize 
that the dict.viewkeys() intersection was the performance bottleneck. I finally 
decided to dig into the CPython source code before discovering the root cause 
of the issue. While I am familiar with C and found the issue relatively 
quickly, for those that aren't, it should not be necessary to dig into the 
source code for a mainstream programming language like Python.


I am currently looking at how to potentially fix this in the CPython 
repository. I may submit a patch in the near future if I find a good solution.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to