Efficient string lookup using Lucene

Ilya Zavorin Fri, 24 Aug 2012 12:49:17 -0700

Hi Everyone,

I have the following task. I have a set of documents in multiple languages, and 
I don't know what those languages are. Any given doc may mix text in several 
languages, so to me these are just a bunch of Unicode text files.


What I need is to implement an efficient EXACT string lookup. That is, I need 
to be able to find ANY Unicode string exactly as it appears. I do not care 
about language-specific modifications of the string. That is, if I search for a 
string "run", I do not need to find "ran" but I do want to find it in all of 
these strings below:

Fox is running fast
!%#^&$run!$!%@&$#
run,run

Is there a way of using StandardAnalyzer or any other analyzer and the 
corresponding query parser to find these? Again, my queries might be more or 
less random Unicode sequences and I need to find all their occurrences in the 
text.

Essentially, what I am trying to do is implement substring matching more 
efficiently than with Java's standard substring matching methods.
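
For reference, the baseline I mean can be sketched roughly as below. This is 
only a minimal illustration, assuming each document is already loaded into a 
Java String; the class and method names (NaiveSubstringSearch, findAll) are 
made up for this example. It scans the whole text for every query, which is 
the cost I would like an index to avoid.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical baseline: exact substring lookup with String.indexOf.
public class NaiveSubstringSearch {

    // Returns the start offsets (in UTF-16 code units) of every occurrence
    // of query in text, including overlapping occurrences.
    public static List<Integer> findAll(String text, String query) {
        List<Integer> offsets = new ArrayList<>();
        if (query.isEmpty()) {
            return offsets; // an empty query would match everywhere; skip it
        }
        int from = 0;
        while (true) {
            int hit = text.indexOf(query, from);
            if (hit < 0) {
                break; // no further occurrences
            }
            offsets.add(hit);
            from = hit + 1; // advance by one so overlapping matches are found
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The three sample strings from this message.
        String[] docs = {
            "Fox is running fast",
            "!%#^&$run!$!%@&$#",
            "run,run"
        };
        for (String doc : docs) {
            System.out.println(doc + " -> " + findAll(doc, "run"));
        }
    }
}
```

The match is purely on the character sequence, with no tokenization or 
stemming, so "run" is found inside "running" and between punctuation alike.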

Thanks!

Ilya Zavorin
