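What you need is a suffix tree or a suffix array. Both data structures
let you test for existence (and find every occurrence) of any input
pattern in time that depends mainly on the pattern length m rather than
on the text length n: roughly O(m) for a suffix tree and O(m log n) for
a plain binary search over a suffix array. Depending on how much text
you have on the input, building one may either be a simple task -- see
here: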

http://labs.carrotsearch.com/jsuffixarrays.html

or a complicated task if your input does not fit in memory. Either way,
search for "suffix tree" and "suffix array"; that is the kind of data
structure to use here.
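
To make the idea concrete, here is a minimal sketch of a suffix array
with binary-search lookup. It does not use the jsuffixarrays API; the
class name SuffixArraySketch and its methods are made up for
illustration. It assumes the whole text fits in memory, compares raw
UTF-16 chars (so no Unicode normalization or case folding), and builds
the array with a plain comparison sort, which is far slower than the
dedicated construction algorithms a real library would use.

```java
import java.util.Arrays;

/** Naive in-memory suffix array over a single text (illustrative only). */
final class SuffixArraySketch {
    private final String text;
    // Start offsets of all suffixes, sorted by the suffix text they begin.
    private final Integer[] suffixes;

    SuffixArraySketch(String text) {
        this.text = text;
        suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Simple comparison sort; O(n^2 log n) char comparisons in the worst case.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    /** Lexicographic comparison (by UTF-16 char) of the suffixes starting at a and b. */
    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int diff = text.charAt(a) - text.charAt(b);
            if (diff != 0) {
                return diff;
            }
            a++;
            b++;
        }
        // The suffix that ran out first is the shorter one and sorts first.
        return b - a;
    }

    /** Negative, zero, or positive; zero means the suffix at start begins with pattern. */
    private int comparePattern(String pattern, int start) {
        for (int i = 0; i < pattern.length(); i++) {
            if (start + i >= text.length()) {
                return 1; // suffix is a proper prefix of pattern, so pattern is greater
            }
            int diff = pattern.charAt(i) - text.charAt(start + i);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    /**
     * Binary search for the first slot whose suffix is not below the pattern
     * (strict == false) or is above every suffix starting with it (strict == true).
     */
    private int bound(String pattern, boolean strict) {
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePattern(pattern, suffixes[mid]);
            if (cmp > 0 || (strict && cmp == 0)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Start offsets of every exact occurrence of pattern, in ascending order. */
    int[] occurrences(String pattern) {
        int from = bound(pattern, false);
        int to = bound(pattern, true);
        int[] result = new int[to - from];
        for (int i = from; i < to; i++) {
            result[i - from] = suffixes[i];
        }
        Arrays.sort(result);
        return result;
    }

    boolean contains(String pattern) {
        return bound(pattern, false) < bound(pattern, true);
    }

    public static void main(String[] args) {
        SuffixArraySketch index = new SuffixArraySketch("run,run");
        System.out.println(Arrays.toString(index.occurrences("run"))); // prints [0, 4]
        System.out.println(index.contains("ran"));                     // prints false
    }
}
```

Each lookup costs O(m log n) char comparisons once the array is built.
For many documents you would index their concatenation and map offsets
back to documents yourself; that bookkeeping is left out here.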

Dawid

On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <izavo...@caci.com> wrote:
> Hi Everyone,
>
> I have the following task. I have a set of documents in multiple languages. I 
> don't know what these languages are. Any given doc may contain text in 
> several languages mixed up. So to me these are just a bunch of Unicode text 
> files.
>
> What I need is to implement an efficient EXACT string lookup. That is, I need 
> to be able to find ANY Unicode string exactly as it appears. I do not care 
> about language-specific modifications of the string. That is, if I search for 
> a string "run", I do not need to find "ran" but I do want to find it in all 
> of these strings below:
>
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run
>
> Is there a way of using StandardAnalyzer or any other analyzer and the
> corresponding query parser to find these? Again, my queries might be more or
> less random Unicode sequences and I need to find all their occurrences in the
> text.
>
> Essentially, what I am trying to do is implement substring matching more
> efficiently than using Java's standard substring matching methods.
>
> Thanks!
>
> Ilya Zavorin
