On Wed 18/09/2002 at 22:55, Ian Parkin wrote:
> Hello all,
> 
> I suspect my answer will involve unicode, but I'd like to make sure that I 
> am going down the right path here.
> 
> I have 100,000+ small HTML files that are mainly in the English language. I 
> just noticed that we have some user names with umlauts. These are seemingly 
> stored and searchable as the '?' character.
> 
> My code is based on the demo code that is provided with Lucene, under the 
> 'demo' directory.
> 
> I am wondering what changes I will need to make to handle such characters as 
> umlauts within English text?

Try using an analyzer such as StandardAnalyzer in both directions: when
indexing and when searching.
I don't remember whether any of the filters used by StandardAnalyzer modify
accented letters; if not, you will have to write one yourself.

The idea is to transform every word into a normalized form (removing
common words, removing accents [ü => u], lowercasing the word)
before indexing it and before searching for it. That way,
someone looking for ümlaut will get the same results as someone
looking for umlaut. (And knowing the laziness of most users,
they will thank you for making that possible :-)
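
As a rough illustration of the normalization step described above, here
is a minimal sketch in plain Java. It is not Lucene code: the class name
TermNormalizer and its normalize method are hypothetical, and it assumes
a JDK that provides java.text.Normalizer (Java 6 or later). It only
covers accent removal and lowercasing, not removal of common words.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper (not part of Lucene): folds a term to an unaccented, lowercase form. */
final class TermNormalizer {
    private TermNormalizer() {}

    static String normalize(String term) {
        // Decompose accented characters into base letter + combining mark (NFD form).
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Drop the combining marks (Unicode category M), leaving the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase with a fixed locale so results do not depend on the machine's settings.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u00dc is U-umlaut, \u00fc is u-umlaut (escaped to avoid source-encoding issues).
        System.out.println(normalize("\u00dcmlaut"));
        System.out.println(normalize("M\u00fcller"));
    }
}
```

The same function would have to be applied to the text being indexed and
to the query terms, otherwise the two sides will not match. Characters
that have no canonical decomposition, such as ß, pass through this
sketch unchanged and would need explicit handling.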

It's quite easy to implement a subclass of
org.apache.lucene.analysis.TokenFilter that meets your needs
and to use it in a subclass of org.apache.lucene.analysis.Analyzer,
together with any other filters you need.
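
The filter-wrapping idea can be sketched without depending on any
particular Lucene version. The example below deliberately does NOT use
Lucene's classes: SimpleTokenStream, WhitespaceStream, FoldingFilter and
FilterChainDemo are all hypothetical stand-ins that only show the shape
of the pattern (a filter wraps another stream and rewrites each token).
A real implementation would extend TokenFilter and Analyzer instead, and
their exact method signatures should be checked against the Lucene
release in use.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a token stream; NOT Lucene's API. Returns null when exhausted. */
interface SimpleTokenStream {
    String next();
}

/** Source stream that splits the input text on whitespace. */
final class WhitespaceStream implements SimpleTokenStream {
    private final Iterator<String> words;

    WhitespaceStream(String text) {
        words = Arrays.asList(text.trim().split("\\s+")).iterator();
    }

    public String next() {
        return words.hasNext() ? words.next() : null;
    }
}

/** Filter that wraps another stream and strips accents and case from each token. */
final class FoldingFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;

    FoldingFilter(SimpleTokenStream input) {
        this.input = input;
    }

    public String next() {
        String token = input.next();
        if (token == null) {
            return null; // end of the wrapped stream
        }
        // Decompose, remove combining marks, then lowercase.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }
}

/** Plays the role of the analyzer: builds the same chain for indexing and for searching. */
final class FilterChainDemo {
    static List<String> analyze(String text) {
        SimpleTokenStream stream = new FoldingFilter(new WhitespaceStream(text));
        List<String> tokens = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

Because both the indexed text and the query go through the same
analyze call, a user name stored with an umlaut and a query typed
without one end up as the same term.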

Remy

-- 
E-mail : [EMAIL PROTECTED]
Kelkoo R&D Director (http://www.kelkoo.com/)

