Re: LevenshteinFilter proposal

Robert Muir Mon, 26 Jul 2010 08:35:52 -0700

On Mon, Jul 26, 2010 at 11:30 AM, <[email protected]> wrote:

>  It occurs to me that the proper static initializer code might well be
> able to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
minutes or hours to compute...


> Nevertheless, the real question of import to me right now is: what
> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
> maximum “minimumDistance” the query accepts is 1.0.
>
right, fuzzyquery uses a strange formula that doesnt tie directly to edit
distance (unless you factor in length).

you can do this easier like this:
LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
Automaton ed2 = builder.toAutomaton(2);
// the term doesnt really matter, just for the field name and toString, etc.
AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
ed2);


>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:[email protected]]
> *Sent:* Friday, July 23, 2010 11:22 AM
>
> *To:* [email protected]
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> Well, there are two main things involved:
>
> 1. number of terms seeked to in the term enum
>
> 2. expense of the comparison itself.
>
>
>
> one challenge is the construction of a DFA LevK(x) that recognizes all
> words within edit distance <= k of x is an expensive operation.
>
> This is because of the nature of the computation (traditionally its dynamic
> programming with nasty runtime).
>
>
>
> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
> offline the nasty part so its linear time: O(n) where n is the length of the
> query term.
>
> But these offline-computated tables get pretty large quick, and so we only
> include 1,2 to keep the lucene jar file down.
>
>
>
> additionally for n > 2 it starts to be a ton of termenum seeks... at some
> of these high distances its faster to just next() thru the terms and
> compare.
>
>
>
> we might be able to speed it up more by including say a table for n=3 (at
> the expense of increased jar size), but not using this n=3 table for #1,
> only to speed up #2. but its diminishing returns here...
>
>
>
> On Fri, Jul 23, 2010 at 11:09 AM, <[email protected]> wrote:
>
> Glad I asked.
>
>
>
> I would think that the automaton would be superior even for larger edit
> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
> don’t understand something. ;-)
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:[email protected]]
> *Sent:* Friday, July 23, 2010 11:05 AM
>
>
> *To:* [email protected]
>
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> this is actually done in trunk.
>
>
>
> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
> automaton.
>
>
>
> for higher distances it uses the crappy "brute force" method.
>
> but, higher distances still get accelerated if you use a reasonable
> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>
>
>
>
>
> On Fri, Jul 23, 2010 at 10:59 AM, <[email protected]> wrote:
>
> Thanks!
>
>
>
> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
> Automaton, no?  I understand that this would be a trunk project.
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Uwe Schindler [mailto:[email protected]]
> *Sent:* Friday, July 23, 2010 10:45 AM
>
>
> *To:* [email protected]
>
> *Subject:* RE: LevenshteinFilter proposal
>
>
>
> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: [email protected]
>
>
>
> *From:* [email protected] [mailto:[email protected]]
> *Sent:* Friday, July 23, 2010 4:25 PM
> *To:* [email protected]
> *Subject:* LevenshteinFilter proposal
>
>
>
> Hi Folks,
>
>
> I’m very interested in using (or developing!) a Levenshtein Filter within
> the family of Solr Filter objects. I don’t see such a class today anywhere.
> I see how the AutomatonQuery object would permit such a thing to be built,
> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
> when I look for it in the javadocs for Lucene – can you point me in the
> correct direction?
>
> Thanks!
> Karl
>
>
>
>
>
>
>
>
> --
> Robert Muir
> [email protected]
>
>
>
>
> --
> Robert Muir
> [email protected]
>



-- 
Robert Muir
[email protected]

Re: LevenshteinFilter proposal

Reply via email to