Re: Efficient string lookup using Lucene

Jack Krupansky Fri, 24 Aug 2012 13:53:06 -0700

I can't speak for any non-Latin languages, but how about simply using the StandardAnalyzer plus the EdgeNGramFilter for indexing (but not at query time)? The latter would allow a query of "run" to match "running".
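
To make the idea concrete, here is a minimal plain-Java sketch of what index-time edge n-grams buy you. It deliberately does not use the Lucene API: the class EdgeNGramSketch and its methods add, search and tokenize are hypothetical names made up for illustration, and the regex tokenizer is only a crude stand-in for StandardAnalyzer (assumed here to lowercase and split on anything that is not a letter or digit). Every leading prefix of every token is stored at index time, while the query is looked up as-is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's analyzer or filter classes.
public class EdgeNGramSketch {
    // term -> ids of documents containing a token that starts with that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    // Crude stand-in for an analyzer: lowercase, split on non-letter/non-digit runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index time: store every leading prefix (edge n-gram) of each token.
    public int add(String text) {
        int id = nextId++;
        for (String token : tokenize(text)) {
            int end = 0;
            while (end < token.length()) {
                // Advance by whole code points so surrogate pairs are never split.
                end += Character.charCount(token.codePointAt(end));
                postings.computeIfAbsent(token.substring(0, end), k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Query time: no n-grams, just an exact lookup of each query token,
    // intersected across tokens.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : tokenize(query)) {
            Set<Integer> hits = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        EdgeNGramSketch index = new EdgeNGramSketch();
        index.add("Fox is running fast");
        index.add("!%#^&$run!$!%@&$#");
        index.add("run,run");
        System.out.println(index.search("run"));
        System.out.println(index.search("unn"));
    }
}
```

Note the limitation this makes visible: edge n-grams only cover prefixes of tokens, so a query for "run" finds "running", but a query for "unn" finds nothing, and anything the tokenizer throws away (punctuation, symbols) cannot be searched for at all.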


-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin

Sent: Friday, August 24, 2012 3:48 PM
To: [email protected]
Subject: Efficient string lookup using Lucene

Hi Everyone,

I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files.

What I need is to implement an efficient EXACT string lookup. That is, I need to be able to find ANY Unicode string exactly as it appears. I do not care about language-specific modifications of the string. That is, if I search for a string "run", I do not need to find "ran" but I do want to find it in all of these strings below:


Fox is running fast
!%#^&$run!$!%@&$#
run,run

Is there a way of using StandardAnalyzer or any other analyzer and the corresponding query parser to find these? Again, my queries might be more or less random Unicode sequences and I need to find all their occurrences in the text.

Essentially, what I am trying to do is implement substring matching more efficiently than using Java's standard substring matching methods.


Thanks!

Ilya Zavorin


