Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Robert Muir Tue, 06 Jan 2009 14:39:03 -0800

i see what you are saying here. this is different than fastss but sounds
nice for spelling correction.


i suppose one reason why i like fastss is for my application i need the true
complete edit distance, i'm actually not using it for spelling correction
but as a first step for other tasks.

but maybe fastss is too complex for the general case (spelling correction)
and we should do what you mention below. i think either way it would be nice
to have some kind of fuzzy matching in lucene that isn't a linear scan.

On Tue, Jan 6, 2009 at 5:29 PM, robert engels <[email protected]> wrote:

> To clarify a statement in the last email.
> To generate the 'possible source words' in real-time is not a difficult as
> first seems, if you assume some sort of first character prefix (which is
> what it appears google does).
>
> For example, assume the user typed 'robrt' instead of 'robert'. You see
> that this word has very low frequency (or none), so you want to find
> possible misspellings, so do a fuzzy search starting with r. But this search
> can be optimized, because as the edit/delete position moves to the right,
> the prefix remains the same, so these possibilities can be quickly skipped.
>
> If you don't find any words with high enough frequency as possible edit
> distances, try [a-z]robrt, assuming the user may have dropped the first
> character (possibly try this in "know combination" order, rather than alpha
> (i.e. try sr before nr).
>

> For example, searching google for 'robrt engels' works. So does 'obert
> engels', so does 'robt engels', all ask me if I meant 'robert engels', but
> searching for 'obrt engels' does not.
>
> On Jan 6, 2009, at 4:15 PM, robert engels wrote:
>
> It is definitely going to increase the index size, but not any more than
> than the external one would (if my understanding is correct).
> The nice thing is that you don't have to try and keep documents numbers in
> sync - it will be automatic.
>
> Maybe I don't understand what your external index is storing. Given that
> the document contains 'robert' but the user enters' obert', what is the
> process to find the matching documents?
>
> Is the external index essentially a constant list, that given obert, the
> source words COULD BE robert, tobert, reobert etc., and it contains no
> document information so:
>
> given the source word X, and an edit distance k, you ask the external
> dictionary for possible indexed words, and it returns the list, and then use
> search lucene using each of those words?
>
> If the above is the case, it certainly seems you could generate this list
> in real-time rather efficiently with no IO (unless the external index only
> stores words which HAVE BEEN indexed).
>
> I think the confusion may be because I understand Otis's comments, but they
> don't seem to match what you are stating.
>
> Essentially performing any term match requires efficient searching/matching
> of the term index. If this is efficient enough, I don't think either process
> is needed - just an improved real-time fuzzy possibilities word generator.
>
> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>
> i see, your idea would definitely simplify some things.
>
> What about the index size difference between this approach and using
> separate index? Would this separate field increase index size?
>
> I guess my line of thinking is if you have 10 docs with robert, with
> separate index you just have robert, and its deletion neighborhood one time.
> with this approach you have the same thing, but at least you must have
> document numbers and the other inverted index stuff with each neighborhood
> term. would this be a significant change to size and/or performance? and
> since the documents have multiple terms there is additional positional
> information for slop factor for each neighborhood term...
>
> i think its worth investigating, maybe performance would actually be
> better, just curious. i think i boxed myself in to auxiliary index because
> of some other irrelevant thigns i am doing.
>
> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <[email protected]>wrote:
>
>>  I don't think that is the case. You will have single deletion
>> neighborhood. The number of unique terms in the field is going to be the
>> union of the deletion dictionaries of each source term.
>> For example, given the following documents A which have field 'X' with
>> value best, and document B with value jest (and k == 1).
>>
>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>
>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>
>> I don't think the storage requirement is any greater doing it this way.
>>
>>
>> 3.2.1 Indexing
>> For all words in a dictionary, and a given number of edit operations k,
>> FastSS
>> generates all variant spellings recursively and save them as tuples of
>> type
>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>> deletion
>> positions.
>>
>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
>> n
>> dictionary words of length m with k mismatches.
>>
>>
>> 3.2.2 Retrieval
>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>> k).
>> Then compare the words in the neighborhood with the index, and find
>> matching candidates. Compare deletion positions for each candidate with
>> the deletion positions in U(p, k), using Theorem 4.
>>
>>
>>
>
>
> --
> Robert Muir
> [email protected]
>
>
>
>


-- 
Robert Muir
[email protected]

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Reply via email to