Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Robert Muir Tue, 06 Jan 2009 17:30:02 -0800

for the k=1 case in my mind your last comment might not really be that much
slower than storing the additional data... sounds worth investigating


On Tue, Jan 6, 2009 at 8:04 PM, robert engels <reng...@ix.netcom.com> wrote:

> I think you would need to store the position in the stream using position
> == to the k factor. Pretty straightforward, both for indexing and for
> searching.
> I think if you want the utmost in performance this is the way to go.
>
> If you don't want to store all of the additional data, I still think a
> better fuzzy search can be done without the external index entirely. As I
> see it, the external index's sole purpose (in your case) is to provide
> "indexed" words at a certain edit distance given a certain source word.
>  Using a combination of inverting the alg, and a binary selective search on
> the term index.
>
> On Jan 6, 2009, at 5:44 PM, Robert Muir wrote:
>
> robert theres only one problem i see: i don't see how you can do a single
> search since fastssWC returns some false positives (with k=1 it will still
> return some things with ED of 2). maybe if you store the deletion position
> information as a payload (thus using original fastss where there are no
> false positives) it would work though. i looked at storing position
> information but it appeared like it might be complex and the api was (is)
> still marked experimental so i didn't go that route.
>
> i also agree lucene index might not be the best possible data structure...
> just convenient thats all. i used it because i store other things related to
> the term besides deletion neighborhoods for my fuzzy matching.
>
> i guess i'll also mention that i do think storage size should be a big
> consideration. you really don't need this kind of stuff unless you are
> searching pretty big indexes in the first place (for <= few million docs the
> default fuzzy is probably just fine for a lot of people).
>
> for me, the whole thing was about turning 30second queries into 1 second
> queries by removing a linear algorithm, i didn't really optimize much beyond
> that because i was just very happy to have reasonable performance..
>
> On Tue, Jan 6, 2009 at 6:26 PM, robert engels <reng...@ix.netcom.com>wrote:
>
>> I understand now.
>> The index in my case would definitely be MUCH larger, but I think it would
>> perform better, as you only need to do a single search - for obert (if you
>> assume it was a misspelling).
>>
>> In your case you would eventually do an OR search in the lucene index for
>> all possible matches (robert, roberta, roberto, ...) which could be much
>> larger with some commonly prefixed/postfixed words).
>>
>> Classic performance vs. size trade-off.  In your case where it is not for
>> misspellings, the performance difference might be worthwhile.
>>
>> Still, in your case, I am not sure using a Lucene index as the external
>> index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit
>> permutation) with a a key on both would allow easy search and update. If
>> implemented as a service, some intelligent caching of common misspellings
>> could really improve the performance.
>>
>> On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
>>
>>
>>
>> On Tue, Jan 6, 2009 at 5:15 PM, robert engels <reng...@ix.netcom.com>wrote:
>>
>>>  It is definitely going to increase the index size, but not any more than
>>> than the external one would (if my understanding is correct).
>>> The nice thing is that you don't have to try and keep documents numbers
>>> in sync - it will be automatic.
>>>
>>> Maybe I don't understand what your external index is storing. Given that
>>> the document contains 'robert' but the user enters' obert', what is the
>>> process to find the matching documents?
>>>
>>
>> heres a simple example. neighborhood stored for robert is 'robert obert
>> rbert roert ...'  this is indexed in a tokenized field.
>>
>> at query time user typoes robert and enters 'tobert'. again neighborhood
>> is generated 'tobert obert tbert ...'
>> the system does a query on tobert OR obert OR tbert ... and robert is
>> returned because 'obert' is present in both neighborhoods.
>> in this example, by storing k=1 deletions you guarantee to satisfy all
>> edit distance matches <= 1 without linear scan.
>> you get some false positives too with this approach, thats why what comes
>> back is only a CANDIDATE and true edit distance must be used to verify. this
>> might be tricky to do with your method, i don't know.
>>
>>
>>
>>
>>>
>>> Is the external index essentially a constant list, that given obert, the
>>> source words COULD BE robert, tobert, reobert etc., and it contains no
>>> document information so:
>>>
>>
>> no. see above, you generate all possible 1-character deletions of the
>> index term and store them, then at query time you generate all possible
>> 1-character deletions of the query term. basically, LUCENE and LUBENE are 1
>> character different, but they are the same (LUENE) if you delete 1 character
>> from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
>> store LUENE.
>>
>>>
>>> given the source word X, and an edit distance k, you ask the external
>>> dictionary for possible indexed words, and it returns the list, and then use
>>> search lucene using each of those words?
>>>
>>> If the above is the case, it certainly seems you could generate this list
>>> in real-time rather efficiently with no IO (unless the external index only
>>> stores words which HAVE BEEN indexed).
>>>
>>> I think the confusion may be because I understand Otis's comments, but
>>> they don't seem to match what you are stating.
>>>
>>> Essentially performing any term match requires efficient
>>> searching/matching of the term index. If this is efficient enough, I don't
>>> think either process is needed - just an improved real-time fuzzy
>>> possibilities word generator.
>>>
>>> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>>>
>>> i see, your idea would definitely simplify some things.
>>>
>>> What about the index size difference between this approach and using
>>> separate index? Would this separate field increase index size?
>>>
>>> I guess my line of thinking is if you have 10 docs with robert, with
>>> separate index you just have robert, and its deletion neighborhood one time.
>>> with this approach you have the same thing, but at least you must have
>>> document numbers and the other inverted index stuff with each neighborhood
>>> term. would this be a significant change to size and/or performance? and
>>> since the documents have multiple terms there is additional positional
>>> information for slop factor for each neighborhood term...
>>>
>>> i think its worth investigating, maybe performance would actually be
>>> better, just curious. i think i boxed myself in to auxiliary index because
>>> of some other irrelevant thigns i am doing.
>>>
>>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <reng...@ix.netcom.com>wrote:
>>>
>>>>  I don't think that is the case. You will have single deletion
>>>> neighborhood. The number of unique terms in the field is going to be the
>>>> union of the deletion dictionaries of each source term.
>>>> For example, given the following documents A which have field 'X' with
>>>> value best, and document B with value jest (and k == 1).
>>>>
>>>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>>>
>>>> so field FieldXFuzzy contains
>>>> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>>>
>>>> I don't think the storage requirement is any greater doing it this way.
>>>>
>>>>
>>>> 3.2.1 Indexing
>>>> For all words in a dictionary, and a given number of edit operations k,
>>>> FastSS
>>>> generates all variant spellings recursively and save them as tuples of
>>>> type
>>>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>>>> deletion
>>>> positions.
>>>>
>>>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants
>>>> for n
>>>> dictionary words of length m with k mismatches.
>>>>
>>>>
>>>> 3.2.2 Retrieval
>>>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>>>> k).
>>>> Then compare the words in the neighborhood with the index, and find
>>>> matching candidates. Compare deletion positions for each candidate with
>>>> the deletion positions in U(p, k), using Theorem 4.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>>
>>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>>
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
>
>


-- 
Robert Muir
rcm...@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Reply via email to