Basis Technology has a commercial product, Rosette Language Identifier, that identifies the input language. If you are interested, you can send email to [EMAIL PROTECTED]
-zhaohui

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 10:37 AM
To: Lucene Developers List
Subject: Re: special character with lucene

Usually the text is in one specific language: English, German, Spanish, French, ... However, I don't really have a runtime identifier that tells me which language it is. I could only pick a few words and decide from there, if that is a good idea. Is there a tool in Lucene that helps decide what language a specific text is in?

In a simple test I noticed that StandardAnalyzer removes special characters like ä, ö, ... If I leave the characters the way they are, I can no longer find, for example, the German word "Äpfel".

So it looks like there are only two solutions:

a)
- decide which language it is by choosing a few words from the text
- use the language-specific analyzer; where do I find Spanish and French analyzers?

b)
- replace each special character (ä, ö, ...) with some code (&iuml;, ...). There is no stemming then.

Any help is appreciated.

Greetings,
Philipp

Erik Hatcher <[EMAIL PROTECTED]>
28.02.2005 16:17
Please reply to: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Cc:
Subject: Re: special character with lucene

On Feb 28, 2005, at 10:01 AM, [EMAIL PROTECTED] wrote:
> Hello,
> I would like to build a search engine using several different
> languages, e.g. Spanish names, French names, ...

Will your text be a mix of languages within a single field? Or would each document (or field) be a single language?

> - Using a different analyzer for each language would be one solution.

You will most likely have to use a different analyzer for each language, though that depends on the answers to the above.

> - But how about replacing each special character (umlauts, ...ä, ö, ...)
> with its HTML special character before indexing and doing the same with
> each search query before searching?

An HTML entity is more than one character. The simplest is to leave the characters as-is, in Unicode.

> This seems to me the simplest approach to handling this issue?
>
> What are the drawbacks? No stem search? Other considerations?

Stemming is language-specific, which factors into your choice of analyzer(s).

Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
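
The thread weighs replacing special characters before indexing, and applying the same replacement to each query, against leaving the text as plain Unicode. As a rough illustration of the "same transformation on both sides" idea, here is a minimal sketch that folds accented characters to their base letters using only the JDK's java.text.Normalizer. It is not part of Lucene and is not what anyone in the thread proposed verbatim; the class name FoldingSketch and the method fold are hypothetical, and the choice to drop diacritics (rather than, say, map "ä" to "ae") is an assumption made for the example.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not a Lucene API: folds text so that accented and
// unaccented spellings compare equal.
final class FoldingSketch {

    // Decompose to NFD (base letter + combining mark), drop the combining
    // marks, then lower-case. Must be applied to indexed text AND queries.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\u00C4pfel" is "Äpfel"; escapes avoid source-encoding surprises.
        String indexedTerm = fold("\u00C4pfel");
        String queryTerm = fold("\u00E4pfel");
        System.out.println(indexedTerm + " matches: " + indexedTerm.equals(queryTerm));
    }
}
```

Folding like this loses distinctions that matter in some languages, and characters without a canonical decomposition (such as "ß") pass through unchanged. It also does nothing for stemming, which, as Erik notes, remains language-specific.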