i see what you are saying here. this is different than fastss but sounds nice for spelling correction.
i suppose one reason why i like fastss is for my application i need the true complete edit distance, i'm actually not using it for spelling correction but as a first step for other tasks. but maybe fastss is too complex for the general case (spelling correction) and we should do what you mention below. i think either way it would be nice to have some kind of fuzzy matching in lucene that isn't a linear scan. On Tue, Jan 6, 2009 at 5:29 PM, robert engels <reng...@ix.netcom.com> wrote: > To clarify a statement in the last email. > To generate the 'possible source words' in real-time is not a difficult as > first seems, if you assume some sort of first character prefix (which is > what it appears google does). > > For example, assume the user typed 'robrt' instead of 'robert'. You see > that this word has very low frequency (or none), so you want to find > possible misspellings, so do a fuzzy search starting with r. But this search > can be optimized, because as the edit/delete position moves to the right, > the prefix remains the same, so these possibilities can be quickly skipped. > > If you don't find any words with high enough frequency as possible edit > distances, try [a-z]robrt, assuming the user may have dropped the first > character (possibly try this in "know combination" order, rather than alpha > (i.e. try sr before nr). > > For example, searching google for 'robrt engels' works. So does 'obert > engels', so does 'robt engels', all ask me if I meant 'robert engels', but > searching for 'obrt engels' does not. > > On Jan 6, 2009, at 4:15 PM, robert engels wrote: > > It is definitely going to increase the index size, but not any more than > than the external one would (if my understanding is correct). > The nice thing is that you don't have to try and keep documents numbers in > sync - it will be automatic. > > Maybe I don't understand what your external index is storing. Given that > the document contains 'robert' but the user enters' obert', what is the > process to find the matching documents? > > Is the external index essentially a constant list, that given obert, the > source words COULD BE robert, tobert, reobert etc., and it contains no > document information so: > > given the source word X, and an edit distance k, you ask the external > dictionary for possible indexed words, and it returns the list, and then use > search lucene using each of those words? > > If the above is the case, it certainly seems you could generate this list > in real-time rather efficiently with no IO (unless the external index only > stores words which HAVE BEEN indexed). > > I think the confusion may be because I understand Otis's comments, but they > don't seem to match what you are stating. > > Essentially performing any term match requires efficient searching/matching > of the term index. If this is efficient enough, I don't think either process > is needed - just an improved real-time fuzzy possibilities word generator. > > On Jan 6, 2009, at 3:58 PM, Robert Muir wrote: > > i see, your idea would definitely simplify some things. > > What about the index size difference between this approach and using > separate index? Would this separate field increase index size? > > I guess my line of thinking is if you have 10 docs with robert, with > separate index you just have robert, and its deletion neighborhood one time. > with this approach you have the same thing, but at least you must have > document numbers and the other inverted index stuff with each neighborhood > term. would this be a significant change to size and/or performance? and > since the documents have multiple terms there is additional positional > information for slop factor for each neighborhood term... > > i think its worth investigating, maybe performance would actually be > better, just curious. i think i boxed myself in to auxiliary index because > of some other irrelevant thigns i am doing. > > On Tue, Jan 6, 2009 at 4:42 PM, robert engels <reng...@ix.netcom.com>wrote: > >> I don't think that is the case. You will have single deletion >> neighborhood. The number of unique terms in the field is going to be the >> union of the deletion dictionaries of each source term. >> For example, given the following documents A which have field 'X' with >> value best, and document B with value jest (and k == 1). >> >> A will generate est bst, bet, bes, B will generate est, jest, jst, jes >> >> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes) >> >> I don't think the storage requirement is any greater doing it this way. >> >> >> 3.2.1 Indexing >> For all words in a dictionary, and a given number of edit operations k, >> FastSS >> generates all variant spellings recursively and save them as tuples of >> type >> vā² ā Ud (v, k) ā (v, x) where v is a dictionary word and x a list of >> deletion >> positions. >> >> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for >> n >> dictionary words of length m with k mismatches. >> >> >> 3.2.2 Retrieval >> For a query p and edit distance k, first generate the neighborhood Ud (p, >> k). >> Then compare the words in the neighborhood with the index, and find >> matching candidates. Compare deletion positions for each candidate with >> the deletion positions in U(p, k), using Theorem 4. >> >> >> > > > -- > Robert Muir > rcm...@gmail.com > > > > -- Robert Muir rcm...@gmail.com