Unicode Tokenizer problem with Registered Trademark Search

Bruce.Nawrocki Wed, 02 Apr 2008 13:59:11 -0700

I am having a problem searching for certain Unicode characters, such as 
the registered trademark symbol (Unicode character 00AE). The same problem 
occurs when searching for the Japanese yen symbol (Unicode character 00A5).


I'm using the Lucene 2.0.0 jar file. We previously used the Lucene 1.4.2 jar 
file, where these searches worked, but Lucene 2.0.0 behaves differently.

I can see that the registered trademark symbol is in the Lucene index file, 
so indexing looks fine. The problem appears only when I search for these 
characters.

I see that my query starts off OK, as this:

( (Locale:en) AND ( productName:(Digital¥^95) ) )

(If you cannot see the Japanese yen symbol, it comes directly after 
"Digital".)

Note: the "^95" is just a boost factor, and is OK.

I'm using StandardAnalyzer and StandardTokenizer to create a new QueryParser, 
and after I call the "parse" method of the QueryParser, my query becomes this:

 +Locale:en +productName:digital^95.0

Notice that the Japanese yen symbol is gone! I think this is because the 
grammar in the StandardTokenizer.jj file doesn't handle this character, so 
the tokenizer throws it away.
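
To sanity-check that guess outside of Lucene, here is a minimal plain-Java 
sketch (no Lucene calls; the class name SymbolCategories and the category 
helper are just names I made up). It only shows that the JDK classifies both 
characters as symbols rather than letters or digits, which would explain the 
behavior if the tokenizer grammar keeps only letter and digit ranges. I have 
not confirmed that against StandardTokenizer.jj itself.

```java
public class SymbolCategories {
    // Hypothetical helper: maps a code point to a short Unicode
    // general-category label, for the few categories of interest here.
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CURRENCY_SYMBOL:      return "Sc";
            case Character.OTHER_SYMBOL:         return "So";
            case Character.UPPERCASE_LETTER:     return "Lu";
            case Character.LOWERCASE_LETTER:     return "Ll";
            case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
            default:                             return "other";
        }
    }

    public static void main(String[] args) {
        // Yen sign, registered sign, and an ordinary letter for comparison.
        int[] samples = { 0x00A5, 0x00AE, 'D' };
        for (int cp : samples) {
            System.out.printf("U+%04X category=%s letterOrDigit=%b%n",
                    cp, category(cp), Character.isLetterOrDigit(cp));
        }
    }
}
```

Both U+00A5 and U+00AE come back as symbol categories with letterOrDigit 
false, while 'D' is an uppercase letter.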

Is there an existing Analyzer and/or Tokenizer that I could use instead, 
rather than building my own?

And if I created my Lucene indexes with the StandardAnalyzer, must I also use 
the StandardAnalyzer and StandardTokenizer to search the index?
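
To make the concern behind these two questions concrete, here is a rough 
plain-Java sketch. It does not use Lucene at all: lettersAndDigits and 
whitespaceOnly are hypothetical stand-ins I wrote for "a tokenizer that keeps 
only letters and digits" and "a tokenizer that splits only on whitespace", 
and "Digital¥ Camera" is a made-up product name. It is only meant to show 
how the two strategies can produce different terms for the same text, so that 
terms produced at search time might not match terms produced at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizeSketch {
    // Hypothetical stand-in: keeps runs of letters/digits, lowercased,
    // and silently drops every other character (including symbols).
    static List<String> lettersAndDigits(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical stand-in: splits on whitespace only and lowercases,
    // so symbols stay attached to the surrounding token.
    static List<String> whitespaceOnly(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String name = "Digital\u00A5 Camera"; // made-up product name
        System.out.println(lettersAndDigits(name));
        System.out.println(whitespaceOnly(name));
    }
}
```

With the first strategy the yen symbol disappears and the first term is just 
"digital"; with the second it survives as part of the term "digital¥". A 
search term produced one way would not equal an indexed term produced the 
other way.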

Thanks.
