There are several ways of handling this in Lucene. The most effective is
to write your own Analyzer and/or Tokenizer and/or TokenFilter.  You
don't say which analyzer you are using when you set up the index writer.

A basic guide to what these do is at http://www.darksleep.com/lucene/
and http://mext.at/?p=26

"The tokenizers take care of the actual rules for where to break the
text up into words (typically whitespace)"

You basically need to write a custom analyzer which makes use of your
custom tokenizer, which decides where to break "a word" (token), what
"words" (tokens) to filter out, and/or what transformations to apply to
each "word" (token).
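As a sketch against the Lucene 2.4 Java API (the class name here is made up): an analyzer that tokenizes only on whitespace, so punctuation such as brackets survives inside tokens, then lower-cases each token.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical analyzer: split on whitespace only, so "asdf23(4)"
// stays one token, then lower-case it. Requires the Lucene 2.4 jar.
public class KeepPunctuationAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```

You would pass an instance of this analyzer to both the IndexWriter and the QueryParser.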

Note that if you need to analyze/tokenize different fields in different
ways, use the PerFieldAnalyzerWrapper:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
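For example (a sketch against the Lucene 2.4 Java API; the "partNumber" field name is invented), you can keep standard analysis as the default and switch to whitespace-only tokenization for the one field that must preserve special characters:

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// StandardAnalyzer for every field by default; whitespace-only
// tokenization for the hypothetical "partNumber" field, so characters
// like "(" and ")" are not stripped. Requires the Lucene 2.4 jar.
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("partNumber", new WhitespaceAnalyzer());
```

The wrapper is then passed to the IndexWriter in place of a single analyzer.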

Note also that you should use the same analyzer for the QueryParser as
you use for indexing.
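Concretely (a sketch against the Lucene 2.4 Java API; directory and field name are assumptions), the same analyzer instance goes to both sides:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

// One analyzer shared by indexing and searching, so the query terms
// are tokenized exactly as the indexed text was.
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(new RAMDirectory(), analyzer, true);
// ... add documents, close the writer ...

// Search time: build the parser with the same analyzer.
QueryParser parser = new QueryParser("contents", analyzer);
Query query = parser.parse("asdf23\\(4\\)");
```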

I don't have time to go into detail of how to do this, but if you search
for "TokenFilter Lucene" you should find plenty of material.

Yours,
Moray
------------------------------------- 
Moray McConnachie
Director of IT    +44 1865 261 600
Oxford Analytica  http://www.oxan.com

-----Original Message-----
From: Trevor Watson [mailto:[email protected]] 
Sent: 28 August 2009 16:18
To: [email protected]
Subject: Searching with Special Characters

Hello folks,

We are currently attempting to use Lucene.Net to search a Lucene index
built from a MySQL database.  The index builds correctly and searching
it is going quite well.  However, we are attempting to search for
characters that Lucene strips out automatically.

For example, "asdf23(4)" becomes two separate terms, "asdf23" and "4".
When searching for "asdf23\(4\)" (backslashes included so the brackets
survive in the search query), we receive no results.  This is because,
when the text is added to the index, the brackets are stripped out and
the remainder is split into separate terms.
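The splitting can be reproduced outside Lucene with a plain regex split on non-alphanumeric characters (only a rough stand-in for the analyzer's actual rules, not Lucene code):

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // Rough approximation of the analyzer's behaviour: punctuation
        // such as parentheses is dropped, leaving the alphanumeric runs
        // as separate terms.
        String[] terms = "asdf23(4)".split("[^A-Za-z0-9]+");
        System.out.println(Arrays.toString(terms)); // prints [asdf23, 4]
    }
}
```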

Is there a way to stop Lucene from splitting that into individual terms?


The code we use to add documents is as follows:
[start code]
string[] sReplace = new string[] { "\\", "+", "-", "&&", "||", "!", "(",
    ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":" };
foreach (string sReplaceTerm in sReplace)
    sInsert = sInsert.Replace(sReplaceTerm, "\\" + sReplaceTerm);

doc.Add(new Lucene.Net.Documents.Field(dr["FieldName"].ToString(),
    sInsert, Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
[end code]

Thanks in advance,

Trevor Watson
