I didn’t integrate it directly with a Lucene index. I used the FST class (and related utilities) to build a stand-alone spellchecker.
However, the FST for the term dictionary is modified at index time only. At query time, we need to create a variation of the Levemshtein automata and match it with the FST accordingly. I need to take a closer look at the Lucene code to see how can we integrate it with the existing modules. I’ll do that this week. On Tue, Jul 9, 2019 at 03:18 Michael McCandless <[email protected]> wrote: > This sounds interesting! > > Did your patch build the FST at index time, so at search time all that was > needed was to load the FST from the index directory? > > Is the FST per-segment? > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo <[email protected]> > wrote: > >> Hi Michael, >> >> I guess that I should have added more details :-) >> >> The main benefit from the technique is to do approximate search in >> large dictionaries. That is: find all the entries that are within 1 or >> 2 edit steps from the query. It's more useful for longer (starting >> from ~7 characters, iirc), but in general the technique can be applied >> to queries of any length. The main requirement is that we build the >> dictionary as an FST, using the backwards (reversed) keys of the >> original dictionary. >> >> I initially used the technique to implement a stand-alone >> spellchecker, but I think that it can also be used to optimize the >> Lucene fuzzy queries (e.g. for the spelling/suggest module). However, >> I'll need to look how can it be integrated with the part that creates >> the dictionary. >> >> I'll take a look at the code this week and I'll try to publish it in a >> public repository so that we can discuss about this with more concrete >> details. >> >> I skimmed the paper trying to understand possible applications of the >> technique. It sounds like efficient approximate (ie with some edits) >> substring search is the main idea? I don't believe such a query exists >> today in Lucene (nor any Suggester as far as I know). It sounds as if >> this would be useful for searching within large strings, eg DNA >> sequences or something like that, and maybe less applicable to typical >> "full text" (ie tokenized) search where the strings being searched are >> relatively shorter - does that sound right? >> >> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <[email protected]> >> wrote: >> > >> > Juan, that sounds intriguing. >> > >> > I skimmed the paper trying to understand possible applications of the >> > technique. It sounds like efficient approximate (ie with some edits) >> > substring search is the main idea? I don't believe such a query exists >> > today in Lucene (nor any Suggester as far as I know). It sounds as if >> > this would be useful for searching within large strings, eg DNA >> > sequences or something like that, and maybe less applicable to typical >> > "full text" (ie tokenized) search where the strings being searched are >> > relatively shorter - does that sound right? >> > >> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <[email protected]> >> wrote: >> > > >> > > Hello, >> > > >> > > I've been working on a project for extending LevenshteinAutomata and >> > > I'd like to know if it would be useful to add it to Lucene. >> > > >> > > I've implemented the 'backwards dictionary' technique (see [1], >> > > section 6) for speeding up approximate search. This technique allows >> > > us to narrow down the search and, therefore, reduce the running time >> > > (at the expense of using more memory). >> > > >> > > I implemented it quite some time ago using an older version of Lucene, >> > > so I need to revisit the code. However, the implementation was >> > > relatively simple and it didn't require major changes to the core >> > > classes. I can share the code in a public repository and iterate on >> > > it, while I make it compatible for new Lucene APIs, add benchmarks, >> > > and more unit tests. >> > > >> > > Ideally, I'd like to contribute to Lucene, either as part of core, >> > > suggest or a different module. >> > > >> > > What do you think? >> > > >> > > [1] >> https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf >> > > >> > > --------------------------------------------------------------------- >> > > To unsubscribe, e-mail: [email protected] >> > > For additional commands, e-mail: [email protected] >> > > >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: [email protected] >> > For additional commands, e-mail: [email protected] >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >>
