Re: Efficient string lookup using Lucene

Lance Norskog Sun, 26 Aug 2012 12:14:14 -0700

The WhitespaceAnalyzer breaks up text on spaces, tabs, and newlines.
After that, you can use wildcard queries on the resulting terms. This
will use very little space. I believe leading & trailing wildcards are
supported now, right?
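
To make the idea concrete, here is a rough sketch in plain Java. It is
not the Lucene API: it only mimics the two steps (split on whitespace,
then match each token against a wildcard pattern). The class and method
names are made up for illustration, and the split uses Java's \s, which
only approximates what WhitespaceAnalyzer treats as whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WildcardSketch {
    // Convert a wildcard pattern ('*' = any run of characters, '?' = exactly one)
    // into a regex, quoting every other code point so it is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wildcard.length(); ) {
            int cp = wildcard.codePointAt(i);
            if (cp == '*') sb.append(".*");
            else if (cp == '?') sb.append('.');
            else sb.append(Pattern.quote(new String(Character.toChars(cp))));
            i += Character.charCount(cp);
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // Split the text on whitespace and return the tokens the wildcard matches in full.
    static List<String> matches(String text, String wildcard) {
        Pattern p = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && p.matcher(token).matches()) hits.add(token);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Leading and trailing wildcard around "hot".
        System.out.println(matches("call 555-0199 about the hotlist\titem", "*hot*"));
        // prints [hotlist]
    }
}
```

Nothing here depends on the language of the text: tokens are compared
as raw character sequences, so digits and non-English words are handled
the same way.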


On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin <[email protected]> wrote:
> The user uploads a set of text files, either all of them at once or one at a 
> time, and then they will be searched locally on the phone against a set of 
> "hotlist" words. This assumes no connection to any sort of server so 
> everything must be done locally.
>
> I already have Lucene integrated so I might want to try the n-gram approach. 
> But I just want to double-check first that it will work with any Unicode 
> string, be it an English word, a foreign word, a sequence of digits or any 
> random sequence of Unicode characters. In other words, this is not in any way 
> language-dependent/-specific.
>
> Thanks,
>
> Ilya
>
> -----Original Message-----
> From: Dawid Weiss [mailto:[email protected]]
> Sent: Sunday, August 26, 2012 3:55 AM
> To: [email protected]
> Subject: Re: Efficient string lookup using Lucene
>
>> Does Lucene support this type of structure, or do I need to somehow 
>> implement it outside Lucene?
>
> You'd have to implement it separately but it'd be much, much smaller than 
> Lucene itself (even obfuscated).
>
>> By the way, I need this to run on an Android phone so size of memory might 
>> be an issue...
>
> How large is your input? Do you need to index on the android or just read the 
> index on it? These are all factors to take into account. I mentioned suffix 
> trees and suffix arrays because these two are "canonical" data structures to 
> perform any substring lookups in constant time (in fact, the lookup takes the 
> number of elements of the matched input string, building the suffix tree/ 
> array is O(n), at least in theory).
>
> If you already have Lucene integrated in your pipeline then that n-gram 
> approach will also work. If you know your minimum match substring length to 
> be p then index p-sized shingles. For strings longer than p you can create a 
> query which will search for all n-gram occurrences and take into account 
> positional information to remove false matches.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
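For reference, the suffix array lookup Dawid mentions in the quoted
message can be sketched roughly as below. This is an illustration
under simplifying assumptions, not his implementation: the array is
built by naively sorting the suffixes (far from the linear-time
construction he refers to), and the lookup is a binary search over the
sorted suffixes. SuffixArraySketch and its methods are hypothetical
names.

```java
import java.util.Arrays;

class SuffixArraySketch {
    private final String text;
    private final Integer[] sa; // start offsets of all suffixes, in sorted order

    SuffixArraySketch(String text) {
        this.text = text;
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction: sort the offsets by comparing the suffixes themselves.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // Compare the suffix at 'start', truncated to the query length, with the query.
    private int comparePrefix(int start, String q) {
        int end = Math.min(text.length(), start + q.length());
        return text.substring(start, end).compareTo(q);
    }

    // True if q occurs anywhere in the text: some suffix must start with q.
    boolean contains(String q) {
        if (q.isEmpty()) return true;
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = comparePrefix(sa[mid], q);
            if (c == 0) return true;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```

Comparison is done on raw UTF-16 code units, so it makes no assumption
about the language of the text.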


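The n-gram approach from the quoted message (index p-sized grams, then
use positions to discard false matches) might look something like the
following. It is a self-contained sketch that does not use Lucene; the
class name, the in-memory postings map, and the use of UTF-16 code
units as "characters" are all assumptions made for illustration.

```java
import java.util.*;

class NGramIndexSketch {
    private final int p; // minimum match length = size of the indexed grams
    // gram -> docId -> start offsets at which the gram occurs in that document
    private final Map<String, Map<Integer, Set<Integer>>> postings = new HashMap<>();

    NGramIndexSketch(int p) { this.p = p; }

    // Index every p-sized gram of the text together with its offset.
    void add(int docId, String text) {
        for (int i = 0; i + p <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + p), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new HashSet<>())
                    .add(i);
        }
    }

    // Return the ids of documents that contain the query as a substring.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        if (query.length() < p) return hits; // shorter than the indexed gram size
        Map<Integer, Set<Integer>> first = postings.get(query.substring(0, p));
        if (first == null) return hits;
        for (Map.Entry<Integer, Set<Integer>> e : first.entrySet()) {
            int doc = e.getKey();
            for (int start : e.getValue()) {
                if (allGramsAligned(doc, start, query)) { hits.add(doc); break; }
            }
        }
        return hits;
    }

    // Positional check: the i-th gram of the query must occur at offset start + i.
    private boolean allGramsAligned(int doc, int start, String query) {
        for (int i = 1; i + p <= query.length(); i++) {
            Map<Integer, Set<Integer>> byDoc = postings.get(query.substring(i, i + p));
            if (byDoc == null) return false;
            Set<Integer> offsets = byDoc.get(doc);
            if (offsets == null || !offsets.contains(start + i)) return false;
        }
        return true;
    }
}
```

With p = 3, a document "abcxbcd" contains both grams of the query
"abcd" ("abc" and "bcd") but not at adjacent offsets, so the positional
check rejects it, while "xxabcdxx" is accepted.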

-- 
Lance Norskog
[email protected]
