for the k=1 case in my mind your last comment might not really be that much slower than storing the additional data... sounds worth investigating
On Tue, Jan 6, 2009 at 8:04 PM, robert engels <reng...@ix.netcom.com> wrote: > I think you would need to store the position in the stream using position > == to the k factor. Pretty straightforward, both for indexing and for > searching. > I think if you want the utmost in performance this is the way to go. > > If you don't want to store all of the additional data, I still think a > better fuzzy search can be done without the external index entirely. As I > see it, the external index's sole purpose (in your case) is to provide > "indexed" words at a certain edit distance given a certain source word. > Using a combination of inverting the alg, and a binary selective search on > the term index. > > On Jan 6, 2009, at 5:44 PM, Robert Muir wrote: > > robert theres only one problem i see: i don't see how you can do a single > search since fastssWC returns some false positives (with k=1 it will still > return some things with ED of 2). maybe if you store the deletion position > information as a payload (thus using original fastss where there are no > false positives) it would work though. i looked at storing position > information but it appeared like it might be complex and the api was (is) > still marked experimental so i didn't go that route. > > i also agree lucene index might not be the best possible data structure... > just convenient thats all. i used it because i store other things related to > the term besides deletion neighborhoods for my fuzzy matching. > > i guess i'll also mention that i do think storage size should be a big > consideration. you really don't need this kind of stuff unless you are > searching pretty big indexes in the first place (for <= few million docs the > default fuzzy is probably just fine for a lot of people). > > for me, the whole thing was about turning 30second queries into 1 second > queries by removing a linear algorithm, i didn't really optimize much beyond > that because i was just very happy to have reasonable performance.. > > On Tue, Jan 6, 2009 at 6:26 PM, robert engels <reng...@ix.netcom.com>wrote: > >> I understand now. >> The index in my case would definitely be MUCH larger, but I think it would >> perform better, as you only need to do a single search - for obert (if you >> assume it was a misspelling). >> >> In your case you would eventually do an OR search in the lucene index for >> all possible matches (robert, roberta, roberto, ...) which could be much >> larger with some commonly prefixed/postfixed words). >> >> Classic performance vs. size trade-off. In your case where it is not for >> misspellings, the performance difference might be worthwhile. >> >> Still, in your case, I am not sure using a Lucene index as the external >> index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit >> permutation) with a a key on both would allow easy search and update. If >> implemented as a service, some intelligent caching of common misspellings >> could really improve the performance. >> >> On Jan 6, 2009, at 4:29 PM, Robert Muir wrote: >> >> >> >> On Tue, Jan 6, 2009 at 5:15 PM, robert engels <reng...@ix.netcom.com>wrote: >> >>> It is definitely going to increase the index size, but not any more than >>> than the external one would (if my understanding is correct). >>> The nice thing is that you don't have to try and keep documents numbers >>> in sync - it will be automatic. >>> >>> Maybe I don't understand what your external index is storing. Given that >>> the document contains 'robert' but the user enters' obert', what is the >>> process to find the matching documents? >>> >> >> heres a simple example. neighborhood stored for robert is 'robert obert >> rbert roert ...' this is indexed in a tokenized field. >> >> at query time user typoes robert and enters 'tobert'. again neighborhood >> is generated 'tobert obert tbert ...' >> the system does a query on tobert OR obert OR tbert ... and robert is >> returned because 'obert' is present in both neighborhoods. >> in this example, by storing k=1 deletions you guarantee to satisfy all >> edit distance matches <= 1 without linear scan. >> you get some false positives too with this approach, thats why what comes >> back is only a CANDIDATE and true edit distance must be used to verify. this >> might be tricky to do with your method, i don't know. >> >> >> >> >>> >>> Is the external index essentially a constant list, that given obert, the >>> source words COULD BE robert, tobert, reobert etc., and it contains no >>> document information so: >>> >> >> no. see above, you generate all possible 1-character deletions of the >> index term and store them, then at query time you generate all possible >> 1-character deletions of the query term. basically, LUCENE and LUBENE are 1 >> character different, but they are the same (LUENE) if you delete 1 character >> from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just >> store LUENE. >> >>> >>> given the source word X, and an edit distance k, you ask the external >>> dictionary for possible indexed words, and it returns the list, and then use >>> search lucene using each of those words? >>> >>> If the above is the case, it certainly seems you could generate this list >>> in real-time rather efficiently with no IO (unless the external index only >>> stores words which HAVE BEEN indexed). >>> >>> I think the confusion may be because I understand Otis's comments, but >>> they don't seem to match what you are stating. >>> >>> Essentially performing any term match requires efficient >>> searching/matching of the term index. If this is efficient enough, I don't >>> think either process is needed - just an improved real-time fuzzy >>> possibilities word generator. >>> >>> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote: >>> >>> i see, your idea would definitely simplify some things. >>> >>> What about the index size difference between this approach and using >>> separate index? Would this separate field increase index size? >>> >>> I guess my line of thinking is if you have 10 docs with robert, with >>> separate index you just have robert, and its deletion neighborhood one time. >>> with this approach you have the same thing, but at least you must have >>> document numbers and the other inverted index stuff with each neighborhood >>> term. would this be a significant change to size and/or performance? and >>> since the documents have multiple terms there is additional positional >>> information for slop factor for each neighborhood term... >>> >>> i think its worth investigating, maybe performance would actually be >>> better, just curious. i think i boxed myself in to auxiliary index because >>> of some other irrelevant thigns i am doing. >>> >>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <reng...@ix.netcom.com>wrote: >>> >>>> I don't think that is the case. You will have single deletion >>>> neighborhood. The number of unique terms in the field is going to be the >>>> union of the deletion dictionaries of each source term. >>>> For example, given the following documents A which have field 'X' with >>>> value best, and document B with value jest (and k == 1). >>>> >>>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes >>>> >>>> so field FieldXFuzzy contains >>>> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes) >>>> >>>> I don't think the storage requirement is any greater doing it this way. >>>> >>>> >>>> 3.2.1 Indexing >>>> For all words in a dictionary, and a given number of edit operations k, >>>> FastSS >>>> generates all variant spellings recursively and save them as tuples of >>>> type >>>> vā² ā Ud (v, k) ā (v, x) where v is a dictionary word and x a list of >>>> deletion >>>> positions. >>>> >>>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants >>>> for n >>>> dictionary words of length m with k mismatches. >>>> >>>> >>>> 3.2.2 Retrieval >>>> For a query p and edit distance k, first generate the neighborhood Ud (p, >>>> k). >>>> Then compare the words in the neighborhood with the index, and find >>>> matching candidates. Compare deletion positions for each candidate with >>>> the deletion positions in U(p, k), using Theorem 4. >>>> >>>> >>>> >>> >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> >>> >> >> >> -- >> Robert Muir >> rcm...@gmail.com >> >> >> > > > -- > Robert Muir > rcm...@gmail.com > > > -- Robert Muir rcm...@gmail.com