php-i18n Digest 17 Nov 2003 01:41:29 -0000 Issue 202
Topics (messages 628 through 632):
Re: Few mbstring/i18n questions
628 by: Moriyoshi Koizumi
629 by: Ilia Alshanetsky
630 by: Moriyoshi Koizumi
631 by: Ilia Alshanetsky
632 by: Moriyoshi Koizumi
Administrivia:
To subscribe to the digest, e-mail:
[EMAIL PROTECTED]
To unsubscribe from the digest, e-mail:
[EMAIL PROTECTED]
To post to the list, e-mail:
[EMAIL PROTECTED]
----------------------------------------------------------------------
--- Begin Message ---
Hi,
On 2003/11/15, at 7:22, Ilia Alshanetsky wrote:

> I am trying to find answers to a few questions I have regarding i18n and
> mbstring, and after exhausting the usual Google/manual resources I am
> hoping that perhaps you folks could render some assistance.
>
> 1) When using mbstring, is there a way to get a list of all supported
> character sets? (I am particularly interested in Simplified/Traditional
> Chinese support.)
As of the current versions, there's no way to get a list of all the
supported encodings, though I don't know why you would need such a list.
Since 4.3.4 all the usable encodings are enabled by default, so you may not
have to worry about it, if your concern is whether the encoding you'd like
to use is available or not.
> 2) Outside of mb_ereg() and doing mb_substr() in a loop, is there a quick
> way to break a multibyte string into an array of multibyte characters?
What do you want to do exactly with this idea? I've never been in a
situation like that...
Moriyoshi
--- End Message ---
--- Begin Message ---
> As of the current versions, there's no way to get a list of all the
> supported encodings, though I don't know why you want to know such.
I need to see if the encodings I am interested in (BIG-5, GB2312) are
supported. According to the documentation those encodings are available
since PHP 4.3.0, but not always enabled. So, for PHP versions 4.3.0-4.3.3 I
need a way to determine their availability.
> What do you want to do exactly with this idea? I've never been in a
> situation like that
This is an interesting situation. I am trying to make a search system
capable of supporting multibyte languages. Current (non-multibyte) systems
work by breaking the text into individual words to be indexed. This
unfortunately won't work for multibyte languages, where there is rarely a
space between 'words'. The solution I am tinkering with involves indexing
the text by 'characters', but to do that I need a good (fast & reliable)
method of breaking a text into individual multibyte characters.
So far my solution has been to do this:
preg_match_all('!(\W)!u', iconv("BIG-5", "UTF-8", $str), $words);
Ilia
P.S. Please CC me on your replies, I am not subscribed to the list.
--- End Message ---
--- Begin Message ---
On 2003/11/16, at 6:31, Ilia Alshanetsky wrote:

>> As of the current versions, there's no way to get a list of all the
>> supported encodings, though I don't know why you want to know such.
>
> I need to see if the encodings I am interested in (BIG-5, GB2312) are
> supported. According to the documentation those encodings are available
> since PHP 4.3.0, but not always enabled. So, for PHP versions 4.3.0-4.3.3
> I need a way to determine their availability.
Well, so basically there's no apparent solution for now... But you can
check whether a certain encoding is supported or not by calling
mb_internal_encoding() or a similar function that takes an encoding name as
its argument; you just have to see whether the return value is false.
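[Editor's note: the probe described above could be sketched as the following
hypothetical helper. `encoding_supported()` is made up for this example and
is not part of mbstring; on PHP 4/5 an unknown name makes
mb_internal_encoding() return false, while recent PHP versions throw a
ValueError instead, so both cases are handled.]

```php
<?php
// Hypothetical helper: probe an encoding name by trying to set it as
// mbstring's internal encoding, then restore the previous setting.
function encoding_supported($encoding)
{
    $saved = mb_internal_encoding();          // remember the current setting
    try {
        $ok = @mb_internal_encoding($encoding); // false on old PHP if unknown
    } catch (ValueError $e) {                   // thrown on PHP >= 8.0
        $ok = false;
    }
    mb_internal_encoding($saved);             // restore the original setting
    return (bool) $ok;
}

var_dump(encoding_supported("BIG-5")); // true where BIG-5 is compiled in
```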
>> What do you want to do exactly with this idea? I've never been in a
>> situation like that
>
> This is an interesting situation. I am trying to make a search system
> capable of supporting multibyte languages. Current (non-multibyte) systems
> work by breaking the text into individual words to be indexed. This
> unfortunately won't work for multibyte languages, where there is rarely a
> space between 'words'. The solution I am tinkering with involves indexing
> the text by 'characters', but to do that I need a good (fast & reliable)
> method of breaking a text into individual multibyte characters.
>
> So far my solution has been to do this:
>
> preg_match_all('!(\W)!u', iconv("BIG-5", "UTF-8", $str), $words);
Perhaps you can handle it with mb_split() when it comes to Japanese
encodings, though the mbregex functions cannot deal with Chinese encodings
for now. So I think the solution you proposed is the best possible
workaround.
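[Editor's note: a minimal sketch of that workaround, assuming the iconv and
PCRE extensions; it uses `.` with the `u` modifier instead of `\W` so that
ASCII word characters are split out too, and `$big5Text` is a placeholder.]

```php
<?php
// Convert the BIG-5 input to UTF-8 so PCRE's 'u' mode can operate on it.
$utf8 = iconv("BIG-5", "UTF-8", $big5Text);

// With the 'u' modifier, '.' matches exactly one UTF-8 character,
// so this yields an array of single multibyte characters.
preg_match_all('/./u', $utf8, $matches);
$chars = $matches[0];
```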
BTW, I suppose separating a set of Chinese strings into individual
characters won't suffice, because lots of Chinese words occur as a compound
of two or more characters. (The same thing applies to other multibyte
languages.) You had better look at the existing code out there known as
morphological analysers, if you really want to do this the right way.
Things are not that simple at all.
> P.S. Please CC me on your replies, I am not subscribed to the list.

Hmm, I think I did... Maybe I'm not used to the Apple mail client yet :)
Moriyoshi
--- End Message ---
--- Begin Message ---
> Well, so basically there's no apparent solution for now... But you can
> check whether a certain encoding is supported or not, by
> mb_internal_encoding() or similar functions that take an encoding name for
> its argument. With those functions you just have to see if the return
> value is false or not.
Sounds like a workable solution, thanks.
> BTW, I suppose separating a set of chinese strings into individual
> characters won't suffice, because lots of chinese words often occur as a
> compound of two or more letters. (The same thing applies to other
> multibyte languages.) You better refer to the codes out there that may
> be called as morphological analyser, if you really want to get to the
> right way. Things are not that simple at all.
It is not a perfect solution by any means; however, it does seem to
accomplish the goal better than the current solution, which does not work
at all. BTW, if you do know of any documentation about morphological
analysers, especially with a focus on multibyte languages, I would be
grateful if you could share that information.
Thanks,
Ilia
P.S. Apparently Apple's mail client does not like me :( (no CC).
--- End Message ---
--- Begin Message ---
I hit on another idea later. If you don't have to use regular expressions,
you can also convert the string to UCS-4, break the resulting string into a
collection of individual 4-octet units, and finally turn them back into the
original encoding. This should be the fastest.
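[Editor's note: the UCS-4 idea might look like this, assuming the iconv
extension. Every UCS-4 code unit is exactly 4 octets, so the converted
string can be split blindly into 4-byte chunks; UCS-4BE is used so the byte
order is fixed, and the helper name `split_chars()` is made up here.]

```php
<?php
// Split a string in any iconv-supported encoding into an array of
// single characters by round-tripping through fixed-width UCS-4BE.
function split_chars($str, $encoding)
{
    $ucs4  = iconv($encoding, "UCS-4BE", $str); // 4 octets per character
    $chars = array();
    foreach (str_split($ucs4, 4) as $unit) {
        $chars[] = iconv("UCS-4BE", $encoding, $unit);
    }
    return $chars;
}
```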
> It is not a perfect solution by any means, however it does seem to
> accomplish the goal better than the current solution which does not work
> at all. BTW if you do know of any documentation about morphological
> analyser especially with focus on multibyte languages I would be grateful
> if you could share that information.
Please take a look over this list's archive, as I posted a pointer to
the resource already.
Moriyoshi
--- End Message ---