On Wed 18/09/2002 at 22:55, Ian Parkin wrote:
> Hello all,
>
> I suspect my answer will involve unicode, but I'd like to make sure that I
> am going down the right path here.
>
> I have 100,000+ small HTML files that are mainly in the English language. I
> just noticed that we have some user names with umlauts. These are seemingly
> stored and searchable as the '?' character.
>
> My code is based on the demo code that is provided with Lucene, under the
> 'demo' directory.
>
> I am wondering what changes I will need to make to handle such characters as
> umlauts within English text?

Try using an analyzer like StandardAnalyzer in both directions: while
indexing and while searching. I don't remember whether any of the filters
used by StandardAnalyzer modifies accented letters; if not, you will have to
create one.

The idea is to transform every word to a normalized form (removing common
words, removing accents [ü => u], making the word lowercase) before indexing
the word and before searching for it. That way, someone looking for ümlaut
will get the same results as someone looking for umlaut. (And knowing the
laziness of most common users, they will thank you for making that
possible :-)

It's quite easy to implement a subclass of
org.apache.lucene.analysis.TokenFilter that meets your needs, and to use it
in a subclass of org.apache.lucene.analysis.Analyzer together with any
additional filters you need.

Remy
--
E-mail : [EMAIL PROTECTED]
Kelkoo R&D Director (http://www.kelkoo.com/)
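
As a rough illustration of the normalization step described above (removing accents and lowercasing), here is a minimal sketch in plain Java. It is not Lucene code: the class name AccentFolder and the method fold are hypothetical, and it assumes a JDK that provides java.text.Normalizer. A custom TokenFilter would apply something like this to the text of each token.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: folds a term to a lowercase,
// accent-free form so that accented and unaccented spellings match.
final class AccentFolder {
    private AccentFolder() {}

    static String fold(String term) {
        // Decompose each accented letter into base letter + combining mark (NFD).
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Remove the combining diacritical marks, keeping only the base letters.
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Lowercase with a fixed locale so the result does not depend on JVM defaults.
        return stripped.toLowerCase(Locale.ROOT);
    }
}
```

The same fold would have to run on both sides, at indexing time and on the query terms at search time; otherwise the indexed form and the searched form would not match.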
