Re: [fw-general] Zend_Search_Lucene UTF-8 encoding

Alexander Veremyev Fri, 22 Dec 2006 10:24:08 -0800

'&' is used as a part of query syntax.

But the Analyzer is used after query recognition, to process lexemes or phrases. So htmlentities() may be used.

On the other hand, it doesn't help with the problem we have with full UTF-8 support. The index manipulation engine can work with UTF-8 characters, but we can't recognize whether a character is alpha, a digit, or any other type. Thus input text can't be tokenized correctly.

It doesn't depend on the format (UTF-8, HTML encoded, URL encoded and so on).

The current solution relies on iconv's translation intelligence. In principle, it should translate white space to ASCII white space and letters to ASCII letters.
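To make the iconv step concrete, here is a minimal sketch of such a translation. The helper name foldToAscii(), the UTF-8 source charset, and the fall-back-to-original policy are assumptions for illustration, not code from Zend_Search_Lucene. What //TRANSLIT produces for a given character depends on the iconv implementation in use (the message quoted below reports Animaci'on for Animación), so no output is shown.

```php
<?php
// Hypothetical helper: approximate UTF-8 text with ASCII before tokenizing.
// Not part of Zend_Search_Lucene; the name and fallback policy are assumptions.
function foldToAscii(string $utf8Text): string
{
    // //TRANSLIT asks iconv to approximate characters that have no ASCII form.
    // The replacement it picks is implementation-dependent.
    // '@' silences the notice iconv raises on input it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Text);

    // iconv() returns false on failure; keep the original text in that case.
    return $ascii === false ? $utf8Text : $ascii;
}
```

Text that is already ASCII passes through unchanged, so the helper is safe to apply to every field before indexing.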

I don't expect that we will have UTF-8 compatible ctype_alpha(), ctype_digit() functions. Thus the only way I see now is to treat all non-ASCII characters as letters and use ctype_...() for ASCII characters.
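The idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name is hypothetical, it is not the tokenizer that ships with Zend_Search_Lucene, and it works byte by byte, relying on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or higher.

```php
<?php
// Hypothetical sketch, not the Zend_Search_Lucene implementation.
// Splits text into tokens, treating every non-ASCII byte as a letter
// and using ctype_alpha()/ctype_digit() for ASCII bytes only.
function tokenizeTreatingNonAsciiAsLetters(string $text): array
{
    $tokens  = [];
    $current = '';
    $length  = strlen($text); // byte length, not character count

    for ($i = 0; $i < $length; $i++) {
        $byte = $text[$i];

        // Bytes >= 0x80 occur only inside multi-byte UTF-8 sequences.
        $isWordByte = ord($byte) >= 0x80
            || ctype_alpha($byte)
            || ctype_digit($byte);

        if ($isWordByte) {
            $current .= $byte;
        } elseif ($current !== '') {
            // An ASCII non-alphanumeric byte ends the current token.
            $tokens[] = $current;
            $current  = '';
        }
    }

    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}
```

With this sketch, "Animación 2006, fin" splits into Animación, 2006 and fin. The cost is that non-ASCII punctuation or spacing, for example a no-break space (U+00A0) between two words, is glued into the surrounding token instead of separating it.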


I saw a lot of UTF-8 support requests, so I plan to implement this soon.
But it's still a question for me whether this behavior should be the default or not.

From one point of view it's a solution. From the other, it should be used with care (non-letters may be treated as part of words).



With best regards,
   Alexander Veremyev.

Sebi wrote:

I want to index text in UTF-8 format. I use latin characters. Here are some examples of characters (encoded in ISO-8859-1): ó, é, á, etc.
I used the iconv function iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Animación') and I got Animaci'on, which also contains a break character for the tokenizer. Also, for characters like é, á I got 'e, 'a.
The solution is to replace the `'` character with some alphanumeric pattern. But what if I get other break characters for other latin characters? Or maybe I will use other UTF-8 characters from the German language, which also produce some distinct break characters (not alphanumeric characters).
I saw that some people used htmlentities, which produces only 2 break characters ('&' and ';'). In this case I can find 2 alphanumeric patterns to match them more easily. And htmlentities encodes all UTF-8 characters. Is this the best solution? Maybe there are some Analyzers which I can use and which do not break on the '&' and ';' characters. Maybe someone has a better solution or some opinions on this problem.
Thank you.



