Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

robert engels Tue, 06 Jan 2009 13:03:06 -0800

Why not just create a new field for this? That is, if you have FieldA, create a field FieldAFuzzy and put the various permutations there.

The fuzzy scorer/parser can be changed to automatically use the XXXXFuzzy field when required.

You could also store positions, so that the first term is the "closest", the next is the second closest, etc., to add support for a slop factor.

This is similar to the way fast phonetic searches can be implemented.

If you do it this way, you don't have any of the synchronization issues between the index and the external "fuzzy" index.



On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314 ]
Robert Muir commented on LUCENE-1513:
-------------------------------------

Otis, the discussion was on java-user.
Again, I apologize for the messy code. As mentioned there, my setup is very specific to exactly what I am doing, and in no way is this code ready. But since I'm currently pretty busy with other things at work, I just wanted to put something up here for anyone else interested.
There are the issues you mentioned, and also some I mentioned on java-user: for example, how to handle updates to indexes that introduce new terms (they must be added to the auxiliary index), or even whether an auxiliary index is the best approach.
The general idea is that instead of enumerating terms to find matching terms, the deletion neighborhood as described in the paper is used. This way search time is not linear in the number of terms. Yes, you are correct that it can only guarantee edit distances of up to K, which is fixed at index time. Perhaps this should be configurable, but I hardcoded k=1 for simplicity. I think k=1 covers something like 80% of typos...
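A minimal sketch of the deletion neighborhood idea, in plain Java with no Lucene dependency. The class name `DeletionNeighborhood` and its methods are hypothetical and are not taken from the attached fastSSfuzzy.zip; the sketch only assumes that a term's neighborhood is the term plus every string obtained by deleting up to k characters, and that two terms become candidates when their neighborhoods share a variant.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class DeletionNeighborhood {

    /** Returns the term itself plus every string obtained by deleting up to k characters. */
    static Set<String> of(String term, int k) {
        Set<String> result = new LinkedHashSet<>();
        collect(term, k, result);
        return result;
    }

    private static void collect(String s, int remaining, Set<String> out) {
        // A variant's length fixes how many deletions remain, so a string seen
        // before never needs to be expanded a second time.
        if (!out.add(s) || remaining == 0) {
            return;
        }
        for (int i = 0; i < s.length(); i++) {
            collect(s.substring(0, i) + s.substring(i + 1), remaining - 1, out);
        }
    }

    /** True if the two terms share at least one variant, i.e. they are match candidates. */
    static boolean shareVariant(String a, String b, int k) {
        Set<String> common = of(a, k);
        common.retainAll(of(b, k));
        return !common.isEmpty();
    }
}
```

With k=1, `DeletionNeighborhood.of("cat", 1)` yields `[cat, at, ct, ca]`. Indexing these variants for every term lets a query look up its own variants directly instead of scanning the term dictionary. Note that "ab" and "ba" share the variants "a" and "b" even though their edit distance is 2, which is why candidates still need a verification step.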
As I mentioned on the list, another idea is that you could implement FastSS (not the wC variant) with deletion positions, maybe by using payloads. This would require more space but eliminate the candidate verification step. Maybe it would be nice to have some of their other algorithms, such as block-based, etc., available also.
fastss fuzzyquery
-----------------

                Key: LUCENE-1513
                URL: https://issues.apache.org/jira/browse/LUCENE-1513
            Project: Lucene - Java
         Issue Type: New Feature
         Components: contrib/*
           Reporter: Robert Muir
           Priority: Minor
        Attachments: fastSSfuzzy.zip


code for doing fuzzyqueries with fastssWC algorithm.
FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. This list is then verified with the Levenshtein algorithm.
Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested. But at least you can see what's going on or fix it up.
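The verification step described above can be sketched roughly as follows, in plain Java. The class `CandidateVerifier` and its method names are hypothetical and do not come from the attached code; the sketch assumes the candidate list has already been retrieved from the auxiliary index and simply filters it with a standard two-row Levenshtein computation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class CandidateVerifier {

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Keeps only the candidates whose edit distance to the query is at most k. */
    static List<String> verify(String query, Iterable<String> candidates, int k) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (levenshtein(query, candidate) <= k) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

For the query "cat" with k=1, the candidates "cart" and "cot" pass verification, while "act" is rejected because plain Levenshtein counts a transposition as two edits.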
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

