Re: New feature idea - Backwards (FST) dictionary for approximate string search
I didn’t integrate it directly with a Lucene index. I used the FST class (and related utilities) to build a stand-alone spellchecker. However, the FST for the term dictionary is modified at index time only. At query time, we need to create a variation of the Levemshtein automata and match it with the FST accordingly. I need to take a closer look at the Lucene code to see how can we integrate it with the existing modules. I’ll do that this week. On Tue, Jul 9, 2019 at 03:18 Michael McCandless wrote: > This sounds interesting! > > Did your patch build the FST at index time, so at search time all that was > needed was to load the FST from the index directory? > > Is the FST per-segment? > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo > wrote: > >> Hi Michael, >> >> I guess that I should have added more details :-) >> >> The main benefit from the technique is to do approximate search in >> large dictionaries. That is: find all the entries that are within 1 or >> 2 edit steps from the query. It's more useful for longer (starting >> from ~7 characters, iirc), but in general the technique can be applied >> to queries of any length. The main requirement is that we build the >> dictionary as an FST, using the backwards (reversed) keys of the >> original dictionary. >> >> I initially used the technique to implement a stand-alone >> spellchecker, but I think that it can also be used to optimize the >> Lucene fuzzy queries (e.g. for the spelling/suggest module). However, >> I'll need to look how can it be integrated with the part that creates >> the dictionary. >> >> I'll take a look at the code this week and I'll try to publish it in a >> public repository so that we can discuss about this with more concrete >> details. >> >> I skimmed the paper trying to understand possible applications of the >> technique. It sounds like efficient approximate (ie with some edits) >> substring search is the main idea? I don't believe such a query exists >> today in Lucene (nor any Suggester as far as I know). It sounds as if >> this would be useful for searching within large strings, eg DNA >> sequences or something like that, and maybe less applicable to typical >> "full text" (ie tokenized) search where the strings being searched are >> relatively shorter - does that sound right? >> >> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov >> wrote: >> > >> > Juan, that sounds intriguing. >> > >> > I skimmed the paper trying to understand possible applications of the >> > technique. It sounds like efficient approximate (ie with some edits) >> > substring search is the main idea? I don't believe such a query exists >> > today in Lucene (nor any Suggester as far as I know). It sounds as if >> > this would be useful for searching within large strings, eg DNA >> > sequences or something like that, and maybe less applicable to typical >> > "full text" (ie tokenized) search where the strings being searched are >> > relatively shorter - does that sound right? >> > >> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo >> wrote: >> > > >> > > Hello, >> > > >> > > I've been working on a project for extending LevenshteinAutomata and >> > > I'd like to know if it would be useful to add it to Lucene. >> > > >> > > I've implemented the 'backwards dictionary' technique (see [1], >> > > section 6) for speeding up approximate search. This technique allows >> > > us to narrow down the search and, therefore, reduce the running time >> > > (at the expense of using more memory). >> > > >> > > I implemented it quite some time ago using an older version of Lucene, >> > > so I need to revisit the code. However, the implementation was >> > > relatively simple and it didn't require major changes to the core >> > > classes. I can share the code in a public repository and iterate on >> > > it, while I make it compatible for new Lucene APIs, add benchmarks, >> > > and more unit tests. >> > > >> > > Ideally, I'd like to contribute to Lucene, either as part of core, >> > > suggest or a different module. >> > > >> > > What do you think? >> > > >> > > [1] >> https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf >> > > >> > > - >> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> > > For additional commands, e-mail: dev-h...@lucene.apache.org >> > > >> > >> > - >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> > For additional commands, e-mail: dev-h...@lucene.apache.org >> > >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >>
Re: New feature idea - Backwards (FST) dictionary for approximate string search
Hi Michael, I guess that I should have added more details :-) The main benefit from the technique is to do approximate search in large dictionaries. That is: find all the entries that are within 1 or 2 edit steps from the query. It's more useful for longer (starting from ~7 characters, iirc), but in general the technique can be applied to queries of any length. The main requirement is that we build the dictionary as an FST, using the backwards (reversed) keys of the original dictionary. I initially used the technique to implement a stand-alone spellchecker, but I think that it can also be used to optimize the Lucene fuzzy queries (e.g. for the spelling/suggest module). However, I'll need to look how can it be integrated with the part that creates the dictionary. I'll take a look at the code this week and I'll try to publish it in a public repository so that we can discuss about this with more concrete details. I skimmed the paper trying to understand possible applications of the technique. It sounds like efficient approximate (ie with some edits) substring search is the main idea? I don't believe such a query exists today in Lucene (nor any Suggester as far as I know). It sounds as if this would be useful for searching within large strings, eg DNA sequences or something like that, and maybe less applicable to typical "full text" (ie tokenized) search where the strings being searched are relatively shorter - does that sound right? On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov wrote: > > Juan, that sounds intriguing. > > I skimmed the paper trying to understand possible applications of the > technique. It sounds like efficient approximate (ie with some edits) > substring search is the main idea? I don't believe such a query exists > today in Lucene (nor any Suggester as far as I know). It sounds as if > this would be useful for searching within large strings, eg DNA > sequences or something like that, and maybe less applicable to typical > "full text" (ie tokenized) search where the strings being searched are > relatively shorter - does that sound right? > > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo > wrote: > > > > Hello, > > > > I've been working on a project for extending LevenshteinAutomata and > > I'd like to know if it would be useful to add it to Lucene. > > > > I've implemented the 'backwards dictionary' technique (see [1], > > section 6) for speeding up approximate search. This technique allows > > us to narrow down the search and, therefore, reduce the running time > > (at the expense of using more memory). > > > > I implemented it quite some time ago using an older version of Lucene, > > so I need to revisit the code. However, the implementation was > > relatively simple and it didn't require major changes to the core > > classes. I can share the code in a public repository and iterate on > > it, while I make it compatible for new Lucene APIs, add benchmarks, > > and more unit tests. > > > > Ideally, I'd like to contribute to Lucene, either as part of core, > > suggest or a different module. > > > > What do you think? > > > > [1] > > https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > > For additional commands, e-mail: dev-h...@lucene.apache.org > > > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
New feature idea - Backwards (FST) dictionary for approximate string search
Hello, I've been working on a project for extending LevenshteinAutomata and I'd like to know if it would be useful to add it to Lucene. I've implemented the 'backwards dictionary' technique (see [1], section 6) for speeding up approximate search. This technique allows us to narrow down the search and, therefore, reduce the running time (at the expense of using more memory). I implemented it quite some time ago using an older version of Lucene, so I need to revisit the code. However, the implementation was relatively simple and it didn't require major changes to the core classes. I can share the code in a public repository and iterate on it, while I make it compatible for new Lucene APIs, add benchmarks, and more unit tests. Ideally, I'd like to contribute to Lucene, either as part of core, suggest or a different module. What do you think? [1] https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org