There are several ways of handling this in Lucene. The most effective is to write your own Analyzer and/or Tokenizer and/or TokenFilter. You don't say which analyzer you are using when you set up the index writer.
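The choice of analyzer is exactly what determines whether "asdf23(4)" stays one term or becomes two. Lucene.Net code won't run outside the library, so here is a language-neutral sketch in Python (a simplification, not Lucene's actual implementation) contrasting a punctuation-splitting rule, roughly what StandardAnalyzer does, with a whitespace-only rule like WhitespaceAnalyzer's:

```python
import re

def standard_like_tokens(text):
    # Roughly mimics an analyzer that breaks on any non-alphanumeric
    # character (a simplification of StandardAnalyzer's behavior).
    return re.findall(r"\w+", text)

def whitespace_tokens(text):
    # A whitespace-only rule keeps punctuation inside tokens,
    # similar to Lucene's WhitespaceAnalyzer.
    return text.split()

print(standard_like_tokens("asdf23(4)"))  # ['asdf23', '4']
print(whitespace_tokens("asdf23(4)"))     # ['asdf23(4)']
```

If the whitespace rule matches what you want to search for, switching analyzers may be enough; otherwise a custom tokenizer gives you full control over where tokens break.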
A basic guide to what these do is at http://www.darksleep.com/lucene/ and http://mext.at/?p=26 : "The tokenizers take care of the actual rules for where to break the text up into words (typically whitespace)."

You basically need to write a custom analyzer that makes use of your custom tokenizer, which decides when to break "a word" (token), what "words" (tokens) to filter out, and what transformations to apply to each "word" (token).

Note that if you need to analyze/tokenize different fields in different ways, use the PerFieldAnalyzerWrapper:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

Note also that you should use the same analyzer for the query parser as you use for indexing.

I don't have time to go into detail on how you can do this, but if you search for "TokenFilter Lucene" you should find plenty of material.

Yours,
Moray

-------------------------------------
Moray McConnachie
Director of IT        +44 1865 261 600
Oxford Analytica      http://www.oxan.com

-----Original Message-----
From: Trevor Watson [mailto:[email protected]]
Sent: 28 August 2009 16:18
To: [email protected]
Subject: Searching with Special Characters

Hello folks,

We are currently attempting to use Lucene.Net to do some searching of a Lucene index built off of a MySQL database. The index is built, and searching on it is going quite well. However, we are attempting to search for characters that Lucene trims out automatically. For example, "asdf23(4)" becomes two separate terms, "asdf23" and "4". When searching for "asdf23\(4\)" (backslashes included to allow the brackets to remain in the search query), we receive no results. This is because when the text is added to the index, the analyzer strips out the brackets and splits it into individual terms.

Is there a way to stop Lucene from splitting that into individual terms?
The code we use to add documents is as follows:

[start code]
string[] sReplace = new string[] { "\\", "+", "-", "&&", "||", "!", "(", ")",
                                   "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":" };
foreach (string sReplaceTerm in sReplace)
    sInsert = sInsert.Replace(sReplaceTerm, "\\" + sReplaceTerm);

doc.Add(new Lucene.Net.Documents.Field(
    dr["FieldName"].ToString(),
    sInsert,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
[end code]

Thanks in advance,
Trevor Watson
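One thing worth noting about the escaping loop above: backslash-escaping protects special characters from the *query parser*, but it does not change how the *analyzer* tokenizes the stored text, so a TOKENIZED field will still be split on the brackets. A Python sketch of that effect (the escape loop mirrors the C# code; the tokenizer is a simplified stand-in, not Lucene's actual code):

```python
import re

# Same special-character list as the C# snippet above.
SPECIALS = ["\\", "+", "-", "&&", "||", "!", "(", ")", "{", "}",
            "[", "]", "^", "\"", "~", "*", "?", ":"]

def escape(text):
    # Mirrors the C# foreach loop: prefix each special with a backslash.
    for s in SPECIALS:
        text = text.replace(s, "\\" + s)
    return text

def punctuation_splitting_tokens(text):
    # Simplified stand-in for an analyzer that breaks on
    # non-alphanumeric characters.
    return re.findall(r"\w+", text)

escaped = escape("asdf23(4)")
print(escaped)                               # asdf23\(4\)
print(punctuation_splitting_tokens(escaped)) # ['asdf23', '4'] -- escaping didn't help
```

In other words, the escaping and the tokenization are independent steps, which is why the fix has to happen at the analyzer level, as the reply above describes.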
