The WhitespaceAnalyzer breaks up text on spaces, tabs, and newlines. After that, you can use wildcard queries. This will use very little space. I believe leading and trailing wildcards are supported now, right?
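
To make the idea concrete, here is a rough sketch in plain Python rather than Lucene code. It splits text on whitespace only and then matches wildcard patterns against the resulting terms. The function names (whitespace_terms, match_hotlist) and the sample data are made up for illustration, and fnmatchcase only stands in for what a wildcard query would do; it is not how Lucene implements it.

```python
from fnmatch import fnmatchcase


def whitespace_terms(text):
    """Split on runs of whitespace (spaces, tabs, newlines) and nothing else.

    No stemming, lowercasing, or other language-specific processing is applied,
    so any sequence of non-whitespace Unicode characters becomes a term.
    """
    return set(text.split())


def match_hotlist(text, patterns):
    """Return, for each wildcard pattern, the sorted terms in text that match it.

    Hypothetical helper: '*' matches any run of characters, so '*foo*' has both
    a leading and a trailing wildcard. Note that fnmatchcase also treats '?'
    and '[...]' as special, so hotlist words containing them need escaping.
    """
    terms = whitespace_terms(text)
    return {p: sorted(t for t in terms if fnmatchcase(t, p)) for p in patterns}


# Sample document mixing English, digits, and non-Latin text (made-up data).
doc = "alpha\tbeta42 gamma\nκόσμος 12345"
hits = match_hotlist(doc, ["*eta*", "123*", "*σμος", "zzz*"])

for pattern, matched in hits.items():
    print(pattern, matched)
```

Only the set of distinct terms is kept, which is why this takes little space. A real index would also store which file each term came from.
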

On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin <[email protected]> wrote:
> The user uploads a set of text files, either all of them at once or one at a
> time, and then they will be searched locally on the phone against a set of
> "hotlist" words. This assumes no connection to any sort of server so
> everything must be done locally.
>
> I already have Lucene integrated so I might want to try the n-gram approach.
> But I just want to double-check first that it will work with any Unicode
> string, be it an English word, a foreign word, a sequence of digits or any
> random sequence of Unicode characters. In other words, this is not in any way
> language-dependent/-specific.
>
> Thanks,
>
> Ilya
>
> -----Original Message-----
> From: Dawid Weiss [mailto:[email protected]]
> Sent: Sunday, August 26, 2012 3:55 AM
> To: [email protected]
> Subject: Re: Efficient string lookup using Lucene
>
>> Does Lucene support this type of structure, or do I need to somehow
>> implement it outside Lucene?
>
> You'd have to implement it separately but it'd be much, much smaller than
> Lucene itself (even obfuscated).
>
>> By the way, I need this to run on an Android phone so size of memory might
>> be an issue...
>
> How large is your input? Do you need to index on the android or just read the
> index on it? These are all factors to take into account. I mentioned suffix
> trees and suffix arrays because these two are "canonical" data structures to
> perform any substring lookups in constant time (in fact, the lookup takes the
> number of elements of the matched input string, building the suffix
> tree/array is O(n), at least in theory).
>
> If you already have Lucene integrated in your pipeline then that n-gram
> approach will also work. If you know your minimum match substring length to
> be p then index p-sized shingles. For strings longer than p you can create a
> query which will search for all n-gram occurrences and take into account
> positional information to remove false matches.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

--
Lance Norskog
[email protected]
