Sorry for the mistake in my last reply. I meant to say: "I think I will
create a list of all non-ASCII Latin characters,
together with ASCII (alphanumeric) replacement patterns for them".
>'&' is used as a part of query syntax.
>But Analyzer is used after query recognition to process lexemes or
>phrases. So htmlentities() may be used.
I will try to replace them with some alphanumeric pattern.
>On the other hand, it doesn't help with the problem we have with
>full UTF-8 support.
>The index manipulation engine can work with UTF-8 characters, but we can't
>recognize whether a character is a letter, a digit, or some other type. Thus
>input text can't be tokenized correctly.
>This doesn't depend on the format (UTF-8, HTML-encoded, URL-encoded, and so on).
>The current solution relies on iconv's transliteration intelligence. In
>principle, it should translate white space to ASCII white space and
>letters to ASCII letters.
In my case iconv failed to convert the UTF-8 letters to plain ASCII letters. I think I will
create a list of all non-ASCII Latin characters,
together with ASCII replacement patterns for them. I think this is the most elegant way to implement
it. After that I will call iconv() on the
replaced text. This way I will help iconv's intelligence. :)
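
A minimal sketch of that plan, in case it helps the discussion. The function name fold_to_ascii() and the mapping table are my own assumptions for illustration (the table is far from complete), and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper: replace known accented letters with ASCII patterns,
// then hand whatever is left to iconv().
function fold_to_ascii($utf8Text)
{
    // Assumed, deliberately small mapping: UTF-8 letter => ASCII pattern.
    static $map = [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
        'ñ' => 'n', 'Ñ' => 'N',
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
    ];

    // strtr() with an array does not rescan its own output, and UTF-8
    // sequences never overlap, so multi-byte keys are matched safely.
    $replaced = strtr($utf8Text, $map);

    // Characters missing from the table are left to iconv(); what it
    // produces for them depends on the platform's iconv implementation.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $replaced);

    // iconv() returns false on failure; fall back to the replaced text.
    return $ascii === false ? $replaced : $ascii;
}

echo fold_to_ascii('Animación'), "\n";
```

For characters that are in the table the result no longer depends on iconv at all, so 'Animación' comes out as 'Animacion' with no break characters.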
>I don't expect that we will have UTF-8 compatible ctype_alpha() and
>ctype_digit() functions.
>Thus the only way I see now is to treat all non-ASCII characters as
>letters and use ctype_...() for ASCII characters.
>I have seen a lot of UTF-8 support requests, so I plan to implement this soon.
>But it's an open question for me whether this behavior should be the default or not.
>On the one hand, it's a solution. On the other, it should be used
>with care (non-letters may be treated as part of words).
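
As I understand the rule proposed above, it could be sketched roughly like this. This is only an illustration under my own assumptions, not the actual Analyzer code; tokenize_utf8() is a hypothetical name and the input is assumed to be valid UTF-8:

```php
<?php
// Hypothetical tokenizer: every byte >= 0x80 is treated as a letter,
// ASCII bytes are classified with ctype_alnum().
function tokenize_utf8($text)
{
    $tokens = [];
    $current = '';
    $length = strlen($text); // byte length, not character count

    for ($i = 0; $i < $length; $i++) {
        $byte = $text[$i];
        // All bytes of a multi-byte UTF-8 sequence are >= 0x80, so the
        // ctype check below only ever sees ASCII bytes.
        if (ord($byte) >= 0x80 || ctype_alnum($byte)) {
            $current .= $byte;
        } elseif ($current !== '') {
            $tokens[] = $current;
            $current = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }
    return $tokens;
}

print_r(tokenize_utf8('Animación 3D, música'));
```

The caveat mentioned in the quote shows up immediately: non-ASCII punctuation such as « and » is also kept, so '«hola»' stays a single token including the quotation marks.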
> > I want to index text in UTF-8 format. I use Latin characters.
> > Here are some examples of characters (encoded in ISO-8859-1): ó, é, á, etc.
> >
> > I called iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Animación')
> > and got Animaci'on, which contains a break
> > character for the tokenizer. Likewise, for characters like é and á I got 'e and 'a.
> >
> > The solution is to replace the `'` character with some alphanumeric pattern. But
> > what if I get other break
> > characters for other Latin characters? Or maybe I will use other UTF-8
> > characters, from the German language,
> > which also produce distinct break characters (non-alphanumeric
> > characters).
> >
> > I saw that some people use htmlentities(), which produces only 2 break
> > characters ('&' and ';'). In this case I
> > can find 2 alphanumeric patterns to match them more easily. And htmlentities()
> > encodes all UTF-8 characters. Is this
> > the best solution? Maybe there are some Analyzers which I can use and which
> > do not break on '&' and ';' characters.
> >
> > Maybe someone has a better solution or some opinions on this problem.
> >
> > Thank you.