Fw: [fw-general] Zend_Search_Lucene UTF-8 encoding

Sebi Sat, 23 Dec 2006 01:53:55 -0800

Sorry for the mistake in my last reply. I meant to say "I think I will 
create a list of all non-ASCII Latin characters,
each paired with an ASCII (alphanumeric) pattern".



>'&' is used as a part of the query syntax.
>But the Analyzer is applied after query recognition, to process lexemes or 
>phrases. So htmlentities() may be used.

I will try replacing it with some alphanumeric pattern.

>On the other hand, it doesn't help with the problem we have with 
>full UTF-8 support.
>The index manipulation engine can work with UTF-8 characters, but we can't 
>recognize whether a character is alpha, digit or some other type. Thus 
>input text can't be tokenized correctly.
>This doesn't depend on the format (UTF-8, HTML encoded, URL encoded and so on).

>The current solution relies on iconv's transliteration intelligence. In 
>principle, it should translate white space to ASCII white space and 
>letters to ASCII letters.

In my case iconv failed to convert the UTF-8 letters to plain ASCII letters. I 
think I will create a list of all non-ASCII Latin characters,
each paired with an ASCII pattern. This seems the most elegant way to implement 
it. After that I will call iconv() on the 
replaced text. This way I will help iconv's intelligence. :)
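
To make the plan concrete, here is a minimal sketch of it: replace the known 
non-ASCII Latin characters from a table first, then hand the result to iconv(). 
The function name sketch_fold_latin() and the (very incomplete) table are my 
own assumptions for illustration; nothing here is part of Zend_Search_Lucene.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene.
// Keys are UTF-8 byte sequences written as escapes, so the encoding of
// this source file does not matter. The table is illustrative only.
function sketch_fold_latin($utf8Text)
{
    static $map = array(
        "\xC3\xA1" => 'a',  // á
        "\xC3\xA9" => 'e',  // é
        "\xC3\xAD" => 'i',  // í
        "\xC3\xB3" => 'o',  // ó
        "\xC3\xBA" => 'u',  // ú
        "\xC3\xB1" => 'n',  // ñ
        "\xC3\xBC" => 'ue', // ü
        "\xC3\x9F" => 'ss', // ß
    );

    // Step 1: replace every character we know about with its ASCII pattern.
    $folded = strtr($utf8Text, $map);

    // Step 2: let iconv deal with whatever is left. What //TRANSLIT produces
    // for characters outside the table depends on the iconv implementation.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $folded);

    // If iconv fails, fall back to the table-only result.
    return $ascii === false ? $folded : $ascii;
}

echo sketch_fold_latin("Animaci\xC3\xB3n"), "\n"; // prints "Animacion"
```

The same function has to be applied to the indexed text and to the query 
terms, otherwise a search for the original spelling will not match.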

>I don't expect that we will have UTF-8 compatible ctype_alpha() and 
>ctype_digit() functions.
>Thus the only way I see now is to treat all non-ASCII characters as 
>letters and use ctype_...() for ASCII characters.

>I have seen a lot of UTF-8 support requests, so I plan to implement this soon.
>But it's an open question for me whether this behavior should be the default.
>From one point of view it's a solution. From another, it should be used 
>with care (non-letters may be treated as a part of words).
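
As I understand the proposal, it could look roughly like the sketch below: 
every byte of a non-ASCII UTF-8 character (0x80 and above) counts as a letter, 
and ctype_alnum() decides only for ASCII bytes. sketch_tokenize() is a 
hypothetical name of mine; this is not the actual Zend_Search_Lucene analyzer.

```php
<?php
// Hypothetical tokenizer sketch, not the Zend_Search_Lucene implementation.
function sketch_tokenize($utf8Text)
{
    $tokens  = array();
    $current = '';
    $length  = strlen($utf8Text);

    for ($i = 0; $i < $length; $i++) {
        $byte = $utf8Text[$i];

        // Bytes >= 0x80 belong to a multi-byte UTF-8 character and are
        // treated as letters; the check comes first so that the locale
        // cannot influence it. ASCII bytes are classified by ctype_alnum().
        if (ord($byte) >= 0x80 || ctype_alnum($byte)) {
            $current .= $byte;
        } elseif ($current !== '') {
            // Any other ASCII byte ends the current token.
            $tokens[] = $current;
            $current  = '';
        }
    }

    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

// Gives three tokens: "Animación", "3D" and "café".
print_r(sketch_tokenize("Animaci\xC3\xB3n 3D, caf\xC3\xA9"));
```

The "use with care" part shows up quickly: a non-breaking space (U+00A0) or a 
typographic quote is also non-ASCII, so sketch_tokenize("a\xC2\xA0b") returns 
one token instead of two.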


> > I want to index text in UTF-8 format. I use Latin characters.
> > Here are some examples of characters (encoded in ISO-8859-1): ó, é, á, etc.
> >
> > I used iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Animación') and got
> > Animaci'on, which contains a break character for the tokenizer.
> > Likewise, for characters like é and á I got 'e and 'a.
> >
> > One solution is to replace the `'` character with some alphanumeric pattern.
> > But what if I get other break characters for other Latin characters? Or
> > maybe I will use other UTF-8 characters, from German for example, which
> > also produce distinct break characters (non-alphanumeric characters).
> >
> > I saw that some people use htmlentities(), which produces only 2 break
> > characters ('&' and ';'). In that case I can find 2 alphanumeric patterns
> > to replace them more easily. And htmlentities() encodes all UTF-8
> > characters. Is this the best solution? Maybe there are some Analyzers I
> > can use that do not break on the '&' and ';' characters.
> >
> > Maybe someone has a better solution or some opinions on this problem.
> >
> > Thank you.









