I can't speak for any non-Latin languages, but how about simply using the
StandardAnalyzer plus the EdgeNGramFilter at index time (but not at query
time)? The latter would allow a query of "run" to match "running".
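
To make the idea concrete, here is a rough sketch in plain Java. It does not
use any Lucene classes: the class name EdgeNGramSketch, its methods, and the
regex-based tokenizer are all made up for illustration and only approximate
what StandardAnalyzer and an edge n-gram filter would do. Every leading
prefix of each token is stored at index time, while the query is tokenized
but not n-grammed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; it does not call any Lucene API.
class EdgeNGramSketch {
    // Maps each indexed prefix to the ids of documents that contain
    // a token starting with that prefix.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Crude stand-in for an analyzer (an assumption, not StandardAnalyzer's
    // real rules): split on anything that is not a letter or digit, lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Index time: emit every leading prefix (edge n-gram) of every token.
    void add(int docId, String text) {
        for (String token : tokenize(text)) {
            int end = 0;
            while (end < token.length()) {
                // Advance by one code point so surrogate pairs are not split.
                end = token.offsetByCodePoints(end, 1);
                postings.computeIfAbsent(token.substring(0, end),
                        k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Query time: tokenize only, no n-grams; every query token must match.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : tokenize(query)) {
            Set<Integer> hits = postings.getOrDefault(token, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        EdgeNGramSketch index = new EdgeNGramSketch();
        index.add(0, "Fox is running fast");
        index.add(1, "!%#^&$run!$!%@&$#");
        index.add(2, "run,run");
        System.out.println(index.search("run"));
    }
}
```

Note that this only matches from the start of a token: in the sketch, "run"
finds "running", but "unning" finds nothing. Matching in the middle of a
token would need full n-grams rather than edge n-grams, at the cost of a
larger index.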
-- Jack Krupansky
-----Original Message-----
From: Ilya Zavorin
Sent: Friday, August 24, 2012 3:48 PM
To: java-user@lucene.apache.org
Subject: Efficient string lookup using Lucene
Hi Everyone,
I have the following task. I have a set of documents in multiple languages.
I don't know what these languages are. Any given doc may contain text in
several languages mixed up. So to me these are just a bunch of Unicode text
files.
What I need is to implement an efficient EXACT string lookup. That is, I
need to be able to find ANY Unicode string exactly as it appears. I do not
care about language-specific modifications of the string. That is, if I
search for a string "run", I do not need to find "ran" but I do want to find
it in all of these strings below:
Fox is running fast
!%#^&$run!$!%@&$#
run,run
Is there a way of using StandardAnalyzer or any other analyzer and the
corresponding query parser to find these? Again, my queries might be more or
less random Unicode sequences and I need to find all their occurrences in
the text.
Essentially, what I am trying to do is implement substring matching more
efficiently than using Java's standard substring matching methods.
Thanks!
Ilya Zavorin