'&' is part of the query syntax.
However, the Analyzer runs after the query has been parsed, and it only
processes lexemes or phrases. So htmlentities() may be used.
On the other hand, that doesn't solve the underlying problem we have
with full UTF-8 support.
The index manipulation engine can work with UTF-8 characters, but we can't
tell whether a given character is a letter, a digit or something else. Thus
input text can't be tokenized correctly.
This doesn't depend on the format (UTF-8, HTML encoded, URL encoded and so on).
The current solution relies on iconv's transliteration. In
principle, it should translate whitespace characters to ASCII whitespace
and letters to ASCII letters.
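
To make the limitation concrete, here is a minimal sketch built on the
iconv() call from the original question. findBreakCharacters() is a
hypothetical helper written for this illustration, not part of any
library. It lists the characters in a transliterated string that are
neither letters, digits nor whitespace, i.e. the ones a tokenizer would
split on. What iconv() actually returns depends on the iconv
implementation and locale, so no particular output is assumed.

```php
<?php
/**
 * Hypothetical helper: return the distinct ASCII characters in $ascii that
 * are not letters, digits or whitespace. Such characters would split a
 * word when the text is tokenized.
 */
function findBreakCharacters(string $ascii): array
{
    $found = [];
    $length = strlen($ascii);
    for ($i = 0; $i < $length; $i++) {
        $char = $ascii[$i];
        if (!ctype_alnum($char) && !ctype_space($char)) {
            $found[$char] = true; // use keys to keep each character once
        }
    }
    return array_keys($found);
}

// "Animaci\xF3n" is 'Animación' encoded in ISO-8859-1.
$ascii = iconv('ISO-8859-1', 'ASCII//TRANSLIT', "Animaci\xF3n");

// iconv() returns false on failure; the transliteration itself varies
// between implementations (the original poster got "Animaci'on").
if ($ascii !== false) {
    print_r(findBreakCharacters($ascii));
}
```

For the string reported in the question, findBreakCharacters("Animaci'on")
returns a single entry, the apostrophe.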
I don't expect that we will get UTF-8 compatible ctype_alpha() and
ctype_digit() functions.
Thus the only way I see now is to treat all non-ASCII characters as
letters and use ctype_...() for ASCII characters.
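
A minimal sketch of that idea, not the actual implementation. The
function name is hypothetical, and it assumes that both letters and
digits count as word characters and that the default "C" locale is in
effect for the ctype functions. Every byte with a value of 0x80 or above
(that is, every byte of a multi-byte UTF-8 sequence) is treated as a
letter; ASCII bytes are classified with ctype_alpha() and ctype_digit().

```php
<?php
/**
 * Hypothetical helper: split a UTF-8 string into tokens, treating every
 * non-ASCII byte as a letter and classifying ASCII bytes with ctype_*().
 */
function tokenizeTreatingNonAsciiAsLetters(string $text): array
{
    $tokens = [];
    $current = '';
    $length = strlen($text); // byte length, not character count

    for ($i = 0; $i < $length; $i++) {
        $byte = $text[$i];
        $isWordByte = ord($byte) >= 0x80  // byte of a multi-byte UTF-8 character
            || ctype_alpha($byte)         // ASCII letter
            || ctype_digit($byte);        // ASCII digit

        if ($isWordByte) {
            $current .= $byte;
        } elseif ($current !== '') {
            // Any other ASCII byte ends the current token.
            $tokens[] = $current;
            $current = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }
    return $tokens;
}

print_r(tokenizeTreatingNonAsciiAsLetters('Animación 3D, é á'));
```

Because nothing is known about non-ASCII characters, a non-ASCII
separator such as a no-break space (U+00A0) is glued into the
surrounding word instead of splitting it.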
I have seen a lot of requests for UTF-8 support, so I plan to implement this soon.
It's still an open question for me whether this behavior should be the default.
On the one hand, it's a solution. On the other, it should be used
with care (non-letters may be treated as part of words).
With best regards,
Alexander Veremyev.
Sebi wrote:
I want to index text in UTF-8 format. I use Latin characters.
Here are some examples of characters (encoded in ISO-8859-1): ó, é, á, etc.
I used iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Animación') and got Animaci'on, which contains a break
character for the tokenizer. Likewise, for characters like é and á I got 'e and 'a.
One solution is to replace the `'` character with some alphanumeric pattern. But what if I get other break
characters for other Latin characters? Or I may need other UTF-8 characters, for example from German,
which also produce distinct break characters (non-alphanumeric characters).
I saw that some people use htmlentities(), which produces only two break characters ('&' and ';'). In this case I
can find two alphanumeric patterns to replace them more easily. And htmlentities() encodes all UTF-8 characters. Is this
the best solution? Maybe there are Analyzers I can use that do not break on the '&' and ';' characters.
Maybe someone has a better solution or some opinions on this problem.
Thank you.