php-i18n Digest 17 Nov 2003 01:41:29 -0000 Issue 202

Topics (messages 628 through 632):

Re: Few mbstring/i18n questions
        628 by: Moriyoshi Koizumi
        629 by: Ilia Alshanetsky
        630 by: Moriyoshi Koizumi
        631 by: Ilia Alshanetsky
        632 by: Moriyoshi Koizumi

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [EMAIL PROTECTED]


----------------------------------------------------------------------
--- Begin Message ---
Hi,

On 2003/11/15, at 7:22, Ilia Alshanetsky wrote:

I am trying to find answers to a few questions I have regarding i18n and
mbstring; after exhausting the usual Google/manual resources, I am hoping that
perhaps you folks could render some assistance.


1) When using mbstring, is there a way to get a list of all supported character
sets? (I am particularly interested in Simplified/Traditional Chinese
support.)

As of the current versions, there's no way to get a list of all the supported encodings, though I don't know why you would want such a list.

Since 4.3.4 all the usable encodings are enabled by default, so if your
concern is whether the encoding you'd like to use is available, you may
not have to worry about it.

2) Outside of mb_ereg() or calling mb_substr() in a loop, is there a quick way
to break a multibyte string into an array of multibyte characters?

What do you want to do exactly with this idea? I've never been in a situation
like that...


Moriyoshi
--- End Message ---
--- Begin Message ---
> As of the current versions, there's no way to get a list of all the
> supported encodings, though I don't know why you would want such a list.

I need to see if the encodings I am interested in (BIG-5, gb2312) are 
supported. According to the documentation those encodings are available since 
PHP 4.3.0, but not always enabled. So, for PHP versions 4.3.0-4.3.3 I need a 
way to determine their availability.

> What do you want to do exactly with this idea? I've never been in a 
> situation like that

This is an interesting situation: I am trying to make a search system capable 
of supporting multibyte languages. Current (non-multibyte) systems work by 
breaking the text into individual words to be indexed. This unfortunately 
won't work for multibyte languages, where there is rarely a space between 
'words'. The solution I am tinkering with involves indexing the text by 
'characters', but to do that I need a good (fast & reliable) method of 
breaking a text into individual multibyte characters.
So far my solution has been to do this:

preg_match_all('!(\W)!u', iconv("BIG-5", "UTF-8", $str), $words);

Ilia

P.S. Please CC me on your replies, I am not subscribed to the list.

--- End Message ---
--- Begin Message ---
On 2003/11/16, at 6:31, Ilia Alshanetsky wrote:

As of the current versions, there's no way to get a list of all the
supported encodings, though I don't know why you would want such a list.

I need to see if the encodings I am interested in (BIG-5, gb2312) are
supported. According to the documentation those encodings are available since
PHP 4.3.0, but not always enabled. So, for PHP versions 4.3.0-4.3.3 I need a
way to determine their availability.

Well, so basically there's no apparent solution for now... But you can
check whether a certain encoding is supported by calling mb_internal_encoding()
or similar functions that take an encoding name as an argument. With those
functions you just have to see whether the return value is false.
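
For instance, a probe along those lines might look like the sketch below (the helper name mb_encoding_available() is my own invention, not part of mbstring; on the PHP 4.3.x versions discussed here, mb_internal_encoding() returns FALSE for an unknown encoding rather than throwing an error as modern PHP does):

```php
<?php
// Sketch: probe whether mbstring supports an encoding by trying to set
// it as the internal encoding, then restoring the previous setting.
function mb_encoding_available($encoding)
{
    $saved = mb_internal_encoding();         // remember current setting
    $ok = @mb_internal_encoding($encoding);  // FALSE if unsupported
    if ($ok) {
        mb_internal_encoding($saved);        // restore
    }
    return (bool) $ok;
}

var_dump(mb_encoding_available("UTF-8"));
```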


What do you want to do exactly with this idea? I've never been in a
situation like that

This is an interesting situation: I am trying to make a search system capable
of supporting multibyte languages. Current (non-multibyte) systems work by
breaking the text into individual words to be indexed. This unfortunately
won't work for multibyte languages, where there is rarely a space between
'words'. The solution I am tinkering with involves indexing the text by
'characters', but to do that I need a good (fast & reliable) method of
breaking a text into individual multibyte characters.
So far my solution has been to do this:


preg_match_all('!(\W)!u', iconv("BIG-5", "UTF-8", $str), $words);

Perhaps you can handle it with mb_split() when it comes to Japanese encodings,
though the mbregex functions cannot deal with Chinese encodings for now. So
I think the solution you proposed is the best possible workaround.
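
A minimal sketch of that workaround, using a dot to match every character (the \W as written would capture only non-word characters) and a sample string of my own in place of the BIG-5 input:

```php
<?php
// Split a UTF-8 string into individual characters: with the /u
// modifier, '.' matches one whole multibyte character at a time.
$utf8 = "漢字kanji";              // sample text; stands in for
                                  // iconv("BIG-5", "UTF-8", $str)
preg_match_all('!.!u', $utf8, $matches);
$chars = $matches[0];             // array("漢", "字", "k", "a", "n", "j", "i")
```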


BTW, I suppose separating a set of Chinese strings into individual
characters won't suffice, because lots of Chinese words occur as a
compound of two or more characters. (The same thing applies to other
multibyte languages.) You had better look at the code out there that may
be called a morphological analyser, if you really want to do this the
right way. Things are not that simple at all.

P.S. Please CC me on your replies, I am not subscribed to the list.

Hmm, I think I did... Maybe I'm not used to Apple's mail client yet :)


Moriyoshi
--- End Message ---
--- Begin Message ---
> Well, so basically there's no apparent solution for now... But you can
> check whether a certain encoding is supported by calling
> mb_internal_encoding() or similar functions that take an encoding name as
> an argument. With those functions you just have to see whether the return
> value is false.

Sounds like a workable solution, thanks.

> BTW, I suppose separating a set of Chinese strings into individual
> characters won't suffice, because lots of Chinese words occur as a
> compound of two or more characters. (The same thing applies to other
> multibyte languages.) You had better look at the code out there that may
> be called a morphological analyser, if you really want to do this the
> right way. Things are not that simple at all.

It is not a perfect solution by any means; however, it does seem to accomplish 
the goal better than the current solution, which does not work at all. BTW, if 
you do know of any documentation about morphological analyzers, especially with 
a focus on multibyte languages, I would be grateful if you could share that 
information.

Thanks,

Ilia

P.S. Apparently Apple's mail client does not like me :( (no CC).

--- End Message ---
--- Begin Message ---
I hit on another idea later. If you don't have to use regular
expressions, you can also convert the string to UCS-4, break the
result into individual 4-octet chunks, and finally turn the chunks
back into the original encoding. This should be the fastest approach.
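
A sketch of that trick (str_split() requires PHP 5; the sample string and the UTF-8/UCS-4BE pairing are my own choices for illustration):

```php
<?php
// Every character occupies exactly 4 octets in UCS-4, so a fixed-width
// split of the converted string yields exactly one character per chunk.
$utf8  = "漢字kanji";
$ucs4  = iconv("UTF-8", "UCS-4BE", $utf8);        // 4 octets per character
$chars = array();
foreach (str_split($ucs4, 4) as $quad) {
    $chars[] = iconv("UCS-4BE", "UTF-8", $quad);  // back to UTF-8
}
// $chars === array("漢", "字", "k", "a", "n", "j", "i")
```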

It is not a perfect solution by any means; however, it does seem to accomplish
the goal better than the current solution, which does not work at all. BTW, if
you do know of any documentation about morphological analyzers, especially with
a focus on multibyte languages, I would be grateful if you could share that
information.

Please take a look through this list's archive; I already posted a pointer to the resource there.

Moriyoshi
--- End Message ---
