RE: special character with lucene

Zhaohui Li Mon, 28 Feb 2005 10:54:49 -0800

Basis Technology has a commercial product, Rosette Language Identifier, that 
identifies the input language. If you are interested, you can send email to 
[EMAIL PROTECTED]

-zhaohui

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 28, 2005 10:37 AM
To: Lucene Developers List
Subject: Re: special character with lucene

Usually the text is in one specific language: English, German, Spanish, 
French, ... 
However, I don't have a way to identify the language at runtime. I 
could only pick a few words and decide from there. Is this a good 
idea?
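
The "pick a few words and decide from there" idea can be sketched as a stopword count. This is only a rough illustration in Python, not a Lucene facility: the word lists, the `STOPWORDS` table, and the `guess_language` function are all hypothetical, and a usable identifier would need much larger word lists and proper tokenization.

```python
# Toy stopword lists (assumed for illustration; far too small for real use).
STOPWORDS = {
    "english": {"the", "and", "of", "is", "in", "to", "a"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein"},
    "spanish": {"el", "la", "los", "y", "es", "de", "que"},
    "french": {"le", "les", "et", "est", "des", "une", "dans"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    # Whitespace splitting only; punctuation attached to words is not removed.
    words = text.lower().split()
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword hit at all means there is no evidence for any language.
    return best if scores[best] > 0 else None


print(guess_language("Die \u00c4pfel und das Haus"))
print(guess_language("the house is in the garden"))
```

Short inputs such as single names will often contain no stopwords at all, in which case this sketch returns `None` rather than guessing.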

Is there a tool in Lucene that helps decide what language a 
specific text is in?

In a simple test I noticed that StandardAnalyzer removes special 
characters such as umlauts. If I leave the characters the way they are, I 
no longer find, e.g., the German word "Äpfel". So it looks like there are 
only two solutions:

a)      - decide which language it is by choosing a few words from the 
text
        - use the language-specific analyzer; where do I find Spanish and 
French analyzers?

b)      - replace each special character (umlauts, ...) with some code (&#239, 
...). There is no stemming then.

Any help is appreciated,
Greetings,
Philipp

Erik Hatcher <[EMAIL PROTECTED]> 
28.02.2005 16:17
Please reply to
"Lucene Developers List" <[EMAIL PROTECTED]>

To
"Lucene Developers List" <[EMAIL PROTECTED]>
Cc

Subject
Re: special character with lucene

On Feb 28, 2005, at 10:01 AM, [EMAIL PROTECTED] wrote:
> Hello,
> I would like to build a search engine that handles several different 
> languages, e.g. Spanish names, French names, ...

Will your text be a mix of languages within a single field?  Or would 
each document (or field) be a single language?

> - Using a different analyzer for each language would be one solution.

You will most likely have to use a different analyzer for each 
language, though that depends on the answers to the above.

> - But how about replacing each special character (umlauts, ...)
> with its HTML entity before indexing and doing the same with
> each search query before searching?

An HTML entity is more than one character.  The simplest is to leave 
the characters as-is, in Unicode.
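
A small sketch of the difference, in Python rather than Lucene code: the same umlaut is one character in Unicode but six as a numeric character reference. The `normalize` helper is a hypothetical addition, not something from this thread; it assumes you also want composed and decomposed spellings of the same letter to compare equal at index and query time.

```python
import unicodedata

word = "\u00c4pfel"  # "Äpfel" with a precomposed A-umlaut

# As Unicode, the umlaut is a single character.
print(len(word))  # 5

# As a numeric character reference, the same letter takes six characters.
entity_form = word.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(entity_form)  # &#196;pfel


def normalize(text):
    """Bring text to one canonical Unicode form (NFC) and lowercase it."""
    return unicodedata.normalize("NFC", text).lower()


# "A" followed by a combining diaeresis renders the same as the precomposed form.
decomposed = "A\u0308pfel"
print(normalize(decomposed) == normalize(word))  # True
```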

> This seems to me the simplest approach to handling these issues.
>
> What are the drawbacks? No stemmed search? Other considerations?

Stemming is language-specific, which factors into your choice of 
analyzer(s).
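
To illustrate why stemming has to follow the language, here is a deliberately naive suffix stripper in Python. The `SUFFIXES` table and `toy_stem` function are made up for this example and bear no relation to the stemmers shipped with Lucene's language analyzers; real stemmers use far more elaborate rules.

```python
# Toy suffix rules per language (assumed for illustration only).
SUFFIXES = {
    "english": ["s"],
    "german": ["en", "n"],
}


def toy_stem(word, language):
    """Strip the first matching suffix for the given language."""
    for suffix in SUFFIXES.get(language, []):
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(toy_stem("apples", "english"))       # apple
print(toy_stem("\u00e4pfeln", "german"))   # äpfel
print(toy_stem("haus", "english"))         # hau (wrong rules for a German word)
print(toy_stem("haus", "german"))          # haus
```

Applying the English rule to the German word "haus" produces a bogus stem, which is the practical reason each language needs its own analyzer.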

                 Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

