Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for the insert/delete/transposition changes...
Is the number of edits smallish? I.e., you're not concerned about combinatoric explosion of step 1?

For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use BasicAutomata.makeString(String) on each of your expanded terms, then BasicOperations.union on all of those automata to make a single automaton accepting all your expanded terms, then likely call .determinize() on the resulting automaton (maybe also .minimize() but I think that may not help). Then pass that automaton to AutomatonQuery.

We don't yet have a way to drive a query from an FST, but that would be an interesting addition. E.g., you could then support weights as well, to decide how the terms are scored (if certain OCR errors are more likely than others).

Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward <alan.woodw...@romseysoftware.co.uk> wrote:
> Hello,
>
> I'm trying to create a Lucene Query that will take a term and expand it to
> include common OCR errors (for example, 'cl' is often misread as 'd', so a
> search for 'clog' should also hit 'dog'). My plan is to do this by
> generating all the possible variants of a term, using an existing list of
> errors, and then somehow mapping this into an AutomatonQuery. I've been
> looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk,
> and I *think* that this is possible, but I'm so far failing to work out how
> to put the various bits together.
>
> I'm thinking it should work like this:
> 1) expand query term to sorted list of possible matches
> 2) create an FST over those matches
> 3) plug this FST into an AutomatonQuery subclass.
>
> 1) is easy. It's 2) and 3) I'm having trouble with.
>
> All help gratefully received!
>
> Thanks,
>
> Alan Woodward
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
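
For reference, step 1 from the thread (expanding a term into a sorted set of OCR variants) might look roughly like the sketch below. It uses only the JDK; the class name OcrVariants, the expand method, the confusions table and the maxEdits cap are all hypothetical names made up for illustration, not anything from Lucene. The cap is there because of the combinatoric-explosion question raised in the reply. The trailing comment only restates the automaton calls named in the reply; it has not been compiled against any particular Lucene version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class OcrVariants {

    /**
     * Returns the term itself plus every variant reachable by applying at
     * most maxEdits substitutions from the confusion table (hypothetical
     * format: correct text -> list of possible misreadings).
     */
    public static SortedSet<String> expand(String term,
                                           Map<String, List<String>> confusions,
                                           int maxEdits) {
        SortedSet<String> seen = new TreeSet<>();
        seen.add(term);
        List<String> frontier = List.of(term);
        // Each pass applies one more substitution to the previous pass's results.
        for (int edit = 0; edit < maxEdits && !frontier.isEmpty(); edit++) {
            List<String> next = new ArrayList<>();
            for (String current : frontier) {
                for (Map.Entry<String, List<String>> e : confusions.entrySet()) {
                    String from = e.getKey();
                    if (from.isEmpty()) {
                        continue; // an empty pattern would match everywhere
                    }
                    // Try the substitution at every position where 'from' occurs.
                    for (int i = current.indexOf(from); i >= 0;
                         i = current.indexOf(from, i + 1)) {
                        for (String to : e.getValue()) {
                            String variant = current.substring(0, i) + to
                                    + current.substring(i + from.length());
                            if (seen.add(variant)) {
                                next.add(variant);
                            }
                        }
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        // 'cl' is often misread as 'd', so 'clog' should also match 'dog'.
        Map<String, List<String>> confusions = Map.of("cl", List.of("d"));
        System.out.println(expand("clog", confusions, 2)); // prints [clog, dog]

        // Per the reply, each variant would then go through
        // BasicAutomata.makeString(String), the results through
        // BasicOperations.union, then .determinize(), and the resulting
        // automaton would be passed to AutomatonQuery.
    }
}
```

Keeping maxEdits small bounds the size of the set, and so the number of automata that get unioned afterwards.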