Basis Technology has a commercial product, Rosette Language Identifier, that identifies the input language. If you are interested, you can send email to [EMAIL PROTECTED]
-zhaohui

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 10:37 AM
To: Lucene Developers List
Subject: Re: special character with lucene

Usually the text is in one specific language: English, German, Spanish, French, ...
However, I don't really have a runtime identifier that tells me which language it is. I could only pick a few words and decide from there, if this is a good idea.
Is there a tool in Lucene that helps decide what language a specific text is in?

In a simple test I noticed that StandardAnalyzer removes special characters like ä, ö, ...
If I leave the characters the way they are, I no longer find, for example, the German word "Äpfel".

So it looks like there are only two solutions:
a)
- decide which language it is by choosing a few words from the text
- use the language-specific analyzer; where do I find Spanish and French analyzers?
b)
- replace each special character (ä, ö, ...) with some code (&iuml;, ...). There is no stemming then.

Any help is appreciated,
Greetings,
Philipp

Erik Hatcher <[EMAIL PROTECTED]>
28.02.2005 16:17
Please reply to: "Lucene Developers List" <[email protected]>
To: "Lucene Developers List" <[email protected]>
Cc:
Subject: Re: special character with lucene

On Feb 28, 2005, at 10:01 AM, [EMAIL PROTECTED] wrote:
> Hello,
> I would like to build a search engine using several different
> languages, e.g. Spanish names, French names, ...

Will your text be a mix of languages within a single field? Or would each document (or field) be a single language?

> - Using a different analyzer for each language would be one solution.

You will most likely have to use a different analyzer for each language, though that depends on the answers to the above.

> - But how about replacing each special character (Umlaute, ... ä, ö, ...)
> with its html special character before indexing and doing the same with
> each search query before searching??

An HTML entity is more than one character.
The simplest is to leave the characters as-is, in Unicode.

> This seems to me the simplest approach to handling these issues - ?
>
> What are the drawbacks? No stem search? Other considerations?

Stemming is language-specific, which factors into your choice of analyzer(s).

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
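
The "pick a few words and decide from there" idea raised in this thread can be sketched roughly as follows. This is a minimal illustration in Python, not Lucene code, and everything in it is an assumption: the tiny stopword lists, the function names tokenize and guess_language, and the fallback to English are all hypothetical. The sketch also normalizes text to Unicode NFC, so a word like "Äpfel" produces the same token whether the umlaut arrives precomposed or as a base letter plus a combining mark. That is one way to keep the characters as-is in Unicode at both index time and query time.

```python
import re
import unicodedata

# Tiny hand-picked stopword samples. Illustrative only, not a real language model.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "es": {"el", "la", "los", "y", "es", "que", "una", "por"},
    "fr": {"le", "les", "et", "est", "une", "des", "que", "pour"},
}


def tokenize(text):
    """Normalize to NFC, lowercase, and split into word tokens.

    Accented letters are kept; they are not stripped or replaced.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", normalized)


def guess_language(text, default="en"):
    """Return the language whose stopword sample matches the most tokens.

    Falls back to `default` when no stopword matches at all.
    """
    tokens = tokenize(text)
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    samples = [
        "Die Äpfel sind nicht frisch und der Saft ist alt",
        "The apples are in the basket and it is full",
        "Las manzanas y el pan que compra la familia",
        "Les pommes et le pain sont pour une tarte",
    ]
    for sample in samples:
        print(guess_language(sample), "<-", sample)
```

The returned code could then be used to choose a language-specific analyzer for the document and for queries against it. Short texts such as names produce few stopword hits, so a guess like this is likely to be much less reliable than a dedicated language identifier.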
