Why not just create a new field for this? That is, if you have FieldA, create field FieldAFuzzy and put the various permutations there.

The fuzzy scorer/parser can be changed to automatically use the XXXXFuzzy field when required.

You could also store positions, and allow that the first term is the "closest", next is the second closest, etc. to add support for a slop factor.

This is similar to the same way fast phonetic searches can be implemented.

If you do it this way, you don't have any of the synchronization issues between the index and the external "fuzzy" index.


On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-1513? page=com.atlassian.jira.plugin.system.issuetabpanels:comment- tabpanel&focusedCommentId=12661314#action_12661314 ]

Robert Muir commented on LUCENE-1513:
-------------------------------------

otis, discussion was on java-user.

again, I apologize for the messy code. as mentioned there, my setup is very specific to exactly what I am doing and in no way is this code ready. But since i'm currently pretty busy with other things at work I just wanted to put something up here for anyone else interested.

theres the issues you mentioned, and also some i mentioned on java- user. for example how to handle updates to indexes that introduce new terms (they must be added to auxiliary index), or even if auxiliary index is the best approach.

the general idea is that instead of enumerating terms to find terms, the deletion neighborhood as described in the paper is used instead. this way search time is not linear based on number of terms. yes you are correct that it only can guarantee edit distances of K which is determined at index time. perhaps this should be configurable, but i hardcoded k=1 for simplicity. i think its something like 80% of typos...

as i mentioned on the list another idea is you could implement FastSS (not the wC variant) with deletion positions maybe by using payloads. This would require more space but eliminate the candidate verification step. maybe it would be nice to have some of their other algorithms such as block-based,etc available also.



fastss fuzzyquery
-----------------

                Key: LUCENE-1513
URL: https://issues.apache.org/jira/browse/ LUCENE-1513
            Project: Lucene - Java
         Issue Type: New Feature
         Components: contrib/*
           Reporter: Robert Muir
           Priority: Minor
        Attachments: fastSSfuzzy.zip


code for doing fuzzyqueries with fastssWC algorithm.
FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm. sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to