php-i18n Digest 7 Jan 2003 01:56:33 -0000 Issue 140

Topics (messages 382 through 393):

Re: Mb_output_handler and Shift-JIS
        382 by: moriyoshi.at.wakwak.com
        383 by: David Powers
        384 by: moriyoshi.at.wakwak.com
        385 by: David Powers
        388 by: Jean-Christian Imbeault
        390 by: David Powers

Mbstring.func_overload in 4.3.0
        386 by: David Powers
        387 by: Yasuo Ohgaki
        389 by: David Powers
        393 by: David Powers

Unicodish mail?
        391 by: a.h.s. boy
        392 by: Wez Furlong

----------------------------------------------------------------------
--- Begin Message ---
Hi,

On Sat, Jan 04, 2003 at 08:58:38PM -0000, David Powers wrote:
> I upgraded to PHP 4.3.0 yesterday, and have gone through hell trying to
> get things sorted out. I hope everything is now back to normal, but
> wonder if there's an error in the PHP manual regarding the php.ini
> setting for SJIS users.

Do you mean that everything worked correctly in the previous version?

> I used the configuration shown in example 3 on
> http://www.php.net/manual/en/ref.mbstring.php, but that turned all my
> Japanese output, even in static html into garbage. Eventually, I got
> things back to normal by commenting out the line
> 
> output_handler = mb_output_handler
> 
> My local machine is Windows 2000 (English version, but with Japanese as
> default system locale); and my remote server is Linux Red Hat 6.2
> (English version), running PHP 4.3.0 on Apache 1.3.22.

I might give you more information if I can see the mbstring section of
your phpinfo().

From your description, I suspect that the pages are written in Shift_JIS
rather than EUC-JP. If mbstring.internal_encoding is set to EUC-JP and
mbstring.http_output is set to Shift_JIS, you should write your pages in
EUC-JP, because the mbstring output handler will automatically convert them
to Shift_JIS.
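
As a rough illustration of that conversion, here is a minimal PHP sketch. It
assumes the mbstring extension is loaded; the sample string and the variable
names are made up for the example. It does not go through mb_output_handler
itself. It only calls mb_convert_encoding() with the same source and target
encodings, first on bytes that really are EUC-JP, then on bytes that are
already Shift_JIS but are treated as EUC-JP.

```php
<?php
// Sketch only: assumes the mbstring extension is available.
// The sample text "日本語" is given as UTF-8 bytes so the example does not
// depend on the encoding this file is saved in.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// The same text as it would appear in a page saved as EUC-JP or as Shift_JIS.
$pageInEucJp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');
$pageInSjis  = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Page really is EUC-JP: converting EUC-JP -> SJIS gives proper Shift_JIS.
$converted = mb_convert_encoding($pageInEucJp, 'SJIS', 'EUC-JP');

// Page is already Shift_JIS but is treated as EUC-JP: the bytes are
// misread, so the result no longer matches the original Shift_JIS text.
$garbled = mb_convert_encoding($pageInSjis, 'SJIS', 'EUC-JP');

var_dump($converted === $pageInSjis); // does the correct path round-trip?
var_dump($garbled === $pageInSjis);   // does the mismatched path survive?
```

If the second comparison fails while the first succeeds, that matches the
mojibake described in this thread: the handler is converting bytes that were
never in the encoding it was told to expect.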

Note that if the page contains a meta tag like the one below, which specifies
the charset the page is written in, you have to make it match
mbstring.http_output, not mbstring.internal_encoding,
since mb_output_handler won't adjust it automatically.

<meta http-equiv="Content-Type" content="text/html; charset=****">

> Is there anywhere that explains the use of mbstring in clear terms -
> preferably in English, but I don't mind reading it in Japanese if
> there's better material there?

As one of the people involved in the manual, I admit that the mbstring
documentation lacks explanation and that its wording is often obscure.
If you find any misleading or unnatural sentences, please let us know
so that we can fix them quickly.

There is also a published book, distributed as a set of PDF files.
Check out the following URL: http://www.net-newbie.com/support/pdf2
(The text is written in Japanese.)

Moriyoshi
--- End Message ---
--- Begin Message ---
[EMAIL PROTECTED] wrote:
> Hi,
>
> I might give you more information if I can see the mbstring section of
> your phpinfo().

Hi, Moriyoshi-san

Thank you for your speedy response.

This is a cut-down version of what I now have in my php.ini. As you will
see, I have commented out the output_handler line. When enabled, all I
got was mojibake.

output_buffering = On
;output_handler = mb_output_handler
default_mimetype = "text/html"
default_charset = "Shift_JIS"

[mbstring]
; language for internal character representation.
mbstring.language = Japanese

; internal/script encoding.
mbstring.internal_encoding = EUC-JP

; http input encoding.
mbstring.http_input = auto

; http output encoding. mb_output_handler must be
; registered as output buffer to function
mbstring.http_output = SJIS

mbstring.encoding_translation = Off

; automatic encoding detection order.
; auto means
mbstring.detect_order = auto

; substitute_character used when character cannot be converted
; one from another
mbstring.substitute_character = none;

; overload(replace) single byte functions by mbstring functions.
;mbstring.func_overload = 0


> From your description, I suspect that the pages are written in
> Shift_JIS rather than EUC-JP. If mbstring.internal_encoding is set to
> EUC-JP and mbstring.http_output is set to Shift_JIS, you should write
> your pages in EUC-JP, because the mbstring output handler will
> automatically convert them to Shift_JIS.

This is the bit I don't understand. To the best of my knowledge, Windows
PCs create Japanese only in Shift_JIS. So how can I write pages in
EUC-JP? Or if I can't, what is the correct set-up?

Since most PCs in Japan also work in Shift_JIS, this must be a common
configuration, which is why I'm so confused.

> Note that if the page contains a meta tag like the one below, which
> specifies the charset the page is written in, you have to make it
> match mbstring.http_output, not mbstring.internal_encoding,
> since mb_output_handler won't adjust it automatically.
>
> <meta http-equiv="Content-Type" content="text/html; charset=****">

My meta tag specifies charset="Shift_JIS", the same as my
mbstring.http_output. With mb_output_handler enabled, it produces
mojibake. With mb_output_handler disabled, everything is OK.

>> Is there anywhere that explains the use of mbstring in clear terms -
>> preferably in English, but I don't mind reading it in Japanese if
>> there's better material there?
>
> As one of the people involved in the manual, I admit that the mbstring
> documentation lacks explanation and that its wording is often obscure.
> If you find any misleading or unnatural sentences, please let us know
> so that we can fix them quickly.

I think the real problem is the lack of basic explanation about how to
set up Japanese functionality. I will look at the PDF files you mention,
and hope they will lift some of the mystery. I see they are based on PHP
4.2.2 and PostgreSQL. Are there any major differences between 4.2.2 and
4.3.0 as far as Japanese is concerned? Also, is Postgres a better
database solution than MySQL for supporting Japanese?

David Powers

--- End Message ---
--- Begin Message ---
Hi,

On Sun, Jan 05, 2003 at 12:20:51AM -0000, David Powers wrote:
> This is a cut-down version of what I now have in my php.ini. As you will
> see, I have commented out the output_handler line. When enabled, all I
> got was mojibake.
> 
> output_buffering = On
> ;output_handler = mb_output_handler
> default_mimetype = "text/html"
> default_charset = "Shift_JIS"
> 
> [mbstring]
> ; language for internal character representation.
> mbstring.language = Japanese
> 
> ; internal/script encoding.
> mbstring.internal_encoding = EUC-JP
> 
> ; http input encoding.
> mbstring.http_input = auto
> 
> ; http output encoding. mb_output_handler must be
> ; registered as output buffer to function
> mbstring.http_output = SJIS
> 
> mbstring.encoding_translation = Off
> 
> ; automatic encoding detection order.
> ; auto means
> mbstring.detect_order = auto
> 
> ; substitute_character used when character cannot be converted
> ; one from another
> mbstring.substitute_character = none;
> 
> ; overload(replace) single byte functions by mbstring functions.
> ;mbstring.func_overload = 0

I see no problem in this.

> > From your description, I suspect that the pages are written in
> > Shift_JIS rather than EUC-JP. If mbstring.internal_encoding is set to
> > EUC-JP and mbstring.http_output is set to Shift_JIS, you should write
> > your pages in EUC-JP, because the mbstring output handler will
> > automatically convert them to Shift_JIS.
> 
> This is the bit I don't understand. To the best of my knowledge, Windows
> PCs create Japanese only in Shift_JIS. So how can I write pages in
> EUC-JP? Or if I can't, what is the correct set-up?

Nope. On both Windows and *nix machines you can compose arbitrary Japanese
text in Shift_JIS, EUC-JP, ISO-2022-JP, or UTF-8 with an appropriate
editor, and many are available online. A small selection is listed on
the following page. (Good luck!)

http://dir.yahoo.co.jp/Computers_and_Internet/Software

> Since most PCs in Japan also work in Shift_JIS, this must be a common
> configuration, which is why I'm so confused.

*Most* PCs should work with Shift_JIS as long as the software installed on
them is properly designed to cope with that charset (encoding).

But I'm afraid that PHP binaries built with the default compile-time
configuration (most RPMs, DEBs, and the Windows binaries distributed at the
official sites, AFAIK) don't support Shift_JIS-encoded scripts and pages.

If you are allowed to build and install your own PHP binary on the remote
server, then try specifying --enable-zend-multibyte in the configure parameters
and setting both mbstring.script_encoding and mbstring.internal_encoding to
Shift_JIS.

> I think the real problem is the lack of basic explanation about how to
> set up Japanese functionality. I will look at the PDF files you mention,
> and hope they will lift some of the mystery. I see they are based on PHP
> 4.2.2 and PostgreSQL. Are there any major differences between 4.2.2 and
> 4.3.0 as far as Japanese is concerned? Also, is Postgres a better
> database solution than MySQL for supporting Japanese?

No significant changes have been made between these versions. All the
mbstring developers did was fix bugs.

And I can't tell which database application is better at Japanese text
handling. I mean it's just up to your preference.

Moriyoshi
--- End Message ---
--- Begin Message ---
[EMAIL PROTECTED] wrote:
> Hi,
>
> On Sun, Jan 05, 2003 at 12:20:51AM -0000, David Powers wrote:
>> This is a cut-down version of what I now have in my php.ini. As you
>> will see, I have commented out the output_handler line. When
>> enabled, all I got was mojibake.
>>
>> output_buffering = On
>> ;output_handler = mb_output_handler
>
> I see no problem in this.

That brings me back to my original query. The PHP documentation says
SJIS users should set output_handler to mb_output_handler. Doing so
results in mojibake. Turning it off (by commenting it out with the
semi-colon) is the only way I can get my pages to display correctly. So,
either there is a mistake in the documentation or the explanation for
SJIS users needs to be clarified.

> On both Windows and *nix machines you can compose arbitrary
> Japanese text in Shift_JIS, EUC-JP, ISO-2022-JP, or UTF-8 with an
> appropriate editor, and many are available online.

This would seem to add an unnecessary level of complication. I am using
PHP in combination with MySQL to provide an online database in both
Japanese and English. All input is done through a browser interface over
the internet, and most - if not all - users are on Windows. PHP seems to
do an excellent job of conversion without adding a further layer.

> But I'm afraid that PHP binaries built with the default compile-time
> configuration (most RPMs, DEBs, and the Windows binaries distributed
> at the official sites, AFAIK) don't support Shift_JIS-encoded scripts
> and pages.

I have just downloaded the Windows binaries for PHP 4.3.0 from a UK
mirror site and installed them on my Windows 2000 machine. With the same
php.ini settings as on my Red Hat Linux 6.2 server, it has run the same
Japanese Shift_JIS pages without problem so far.

>> based on PHP 4.2.2 and PostgreSQL. Are there any major differences
>> between 4.2.2 and 4.3.0 as far as Japanese is concerned?
>
> No significant changes have been made between these versions. All the
> mbstring developers did was fix bugs.

Again, this is where I get confused - or maybe I'm misunderstanding a
vital element. The PHP documentation states that as of
4.3.0, --enable-mbstr-enc-trans has been eliminated. Under 4.2.2, I
needed to use mb_convert_encoding($_POST['variable'], "SJIS") to gather
variables submitted by a form. Now I don't need to.

Since Japanese is not my native language, it's not as easy for me to
search for information in news groups and websites as it is in English.
I intend to study the PDF files you recommended, but I see they were
written before --enable-mbstr-enc-trans was eliminated, so any guidance
on how this affects the handling of Japanese would be useful.

>> Also, is
>> Postgres a better database solution than MySQL for supporting
>> Japanese?

> And I can't tell which database application is better at Japanese text
> handling. I mean it's just up to your preference.

Understood, but if I take your meaning correctly, Postgres is worth
investigating. It's not a bad choice.

Sorry to take up so much of your time, but your answers so far have been
helpful and are much appreciated.

David Powers

--- End Message ---
--- Begin Message ---
David Powers wrote:
> Understood, but if I take your meaning correctly, Postgres is worth
> investigating. It's not a bad choice.

I would definitely investigate PostgreSQL. I was also wondering which to
use for a PHP project that needed to store Japanese, and I eventually
settled on Postgres. Setting Postgres up to use Japanese is a breeze (just
one option needs to be given at ./configure time).

Again, it is a matter of taste, but I knew nothing about databases before
starting, and learning PostgreSQL wasn't that hard at all. The ML is also
very helpful whenever you run into problems.

My two cents ...

Jc

--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault wrote:
>
> I would definitely investigate PostgreSQL. I was also wondering which
> to use for a PHP project that needed to store Japanese, and I
> eventually settled on Postgres. Setting Postgres up to use Japanese is
> a breeze (just one option needs to be given at ./configure time).

Thank you, that's useful to know. I will give Postgres a look in the
near future. Are there any books or online resources you could recommend
regarding the use of Postgres (not necessarily specifically Japanese
related)? I'm not a database or computer expert, although I've had just
over a year's experience with both PHP and MySQL.

David Powers

--- End Message ---
--- Begin Message ---
Following the problems I reported earlier with the changes to mbstring
in 4.3.0, I have now experienced unexpected behaviour with mb_send_mail.
When serving a string of Japanese and English to mb_send_mail() in a
script that had worked faultlessly under 4.2.2, the Japanese part of the
mail came out as a series of ????????. I experimented by sending the
same string to mail(), and everything went fine.

What is unexpected is that mbstring.func_overload in php.ini is set to
"0". Any explanation as to why this should happen? Does this mean
everyone on the server will have to set mb_language to English to send
mail in ISO-8859-1?

David Powers
--- End Message ---
--- Begin Message ---
David Powers wrote:
> Following the problems I reported earlier with the changes to mbstring
> in 4.3.0, I have now experienced unexpected behaviour with mb_send_mail.
> When serving a string of Japanese and English to mb_send_mail() in a
> script that had worked faultlessly under 4.2.2, the Japanese part of the
> mail came out as a series of ????????. I experimented by sending the
> same string to mail(), and everything went fine.
> 
> What is unexpected is that mbstring.func_overload in php.ini is set to
> "0". Any explanation as to why this should happen? Does this mean
> everyone on the server will have to set mb_language to English to send
> mail in ISO-8859-1?

I don't have any problems with mb_send_mail(), mail(), or overloading.

It sounds like you haven't set mb_language() to Japanese.
You have to set the appropriate language to encode mail messages
correctly. Refer to the mb_language() manual page.
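
A minimal sketch of that setup, assuming the mbstring extension is loaded.
The body text, the variable names, and the example.com address are made up
for illustration. The explicit mb_convert_encoding() call is only an
approximation of what mb_send_mail() is understood to do when the language is
Japanese (convert the body from the internal encoding to ISO-2022-JP); it is
not the function's actual implementation, and no mail is sent here.

```php
<?php
// Sketch only: assumes the mbstring extension is available.
mb_language('Japanese');        // selects the Japanese mail settings
mb_internal_encoding('EUC-JP'); // encoding the script's strings are assumed to use

// Hypothetical body: "日本語" given as UTF-8 bytes, then converted to EUC-JP
// so that it matches the internal encoding set above.
$utf8Body = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
$body = mb_convert_encoding($utf8Body, 'EUC-JP', 'UTF-8');

// Approximation of the body conversion: internal encoding -> ISO-2022-JP.
$mailBody = mb_convert_encoding($body, 'ISO-2022-JP', mb_internal_encoding());

echo mb_language(), "\n";       // current language setting
echo bin2hex($mailBody), "\n";  // hex dump of the converted bytes

// mb_send_mail('to@example.com', 'subject', $body); // placeholder; not run here
```

If the string handed to mb_send_mail() is not actually in the internal
encoding, this conversion step is where unconvertible bytes would be replaced
by the substitute character.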

--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
> I don't have any problems with mb_send_mail(), mail(), or overloading.
>
> It sounds like you haven't set mb_language() to Japanese.
> You have to set the appropriate language to encode mail messages
> correctly. Refer to the mb_language() manual page.

mb_language() is set to Japanese. Before reporting the unexpected
behaviour, I ran the following tests. First, I explicitly set
mb_language("ja"); in my script before calling mb_send_mail(). That
produced the series of ???????????. Then I did

$language = mb_language();
echo "The language being used is $language";
mb_send_mail(etc, etc)

That printed out "The language being used is Japanese" and sent a mail
that was just a series of ?????????

In desperation, I tried using mail(). The contents arrived in correctly
formed kanji and kana.

I have checked both php.ini and the output of phpinfo();
mbstring.func_overload is definitely set to "0". FWIW, the installation
of PHP 4.3.0 was compiled from the bz2 distribution dated 27 December on
the main UK mirror site.

David Powers
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
> Then you may be feeding messages in an encoding other
> than the internal encoding to mb_send_mail?

If that's the case, the question is how?

Nothing has changed in my code, except for the need to substitute mail()
for mb_send_mail(). Under 4.2.2, mb_send_mail() was needed to send
exactly the same output that now requires mail().

Here are the relevant sections of my php.ini (I am running on an English
version of Red Hat Linux 6.2 - without canna or any other Japanese
language software):

output_buffering = On
;output_handler = mb_output_handler
default_mimetype = "text/html"
default_charset = "Shift_JIS"

[mbstring]
; language for internal character representation.
mbstring.language = Japanese

; internal/script encoding.
mbstring.internal_encoding = EUC-JP

; http input encoding.
mbstring.http_input = auto

; http output encoding. mb_output_handler must be
; registered as output buffer to function
mbstring.http_output = SJIS

mbstring.encoding_translation = Off

; automatic encoding detection order.
; auto means
mbstring.detect_order = auto

; substitute_character used when character cannot be converted
; one from another
mbstring.substitute_character = none;

; overload(replace) single byte functions by mbstring functions.
;mbstring.func_overload = 0

As you will see, both output_handler and mbstring.func_overload are
commented out. The PHP documentation states that SJIS users should set
output_handler to mb_output_handler, but that produces mojibake.
Moreover, Moriyoshi-san commented about my php.ini on 5 January, "I see
no problem in this". This point needs clarification. If there is no
problem, why does the documentation say the exact opposite? The problem
may, of course, be with my understanding of what the documentation
means, but if that's so, I'm sure others are likely to encounter similar
difficulties.

The only thing I can think of is that, during configuration, I
included --enable-mbstr-enc-trans, which I have subsequently seen is no
longer used. The compilation process produced no error messages in this
regard, so I presumed its inclusion would have caused no harm. Is this a
likely cause of the unexpected behaviour?

David Powers
--- End Message ---
--- Begin Message ---
I have a site set up to use UTF-8 encoding for display and form
submission, but I want to be able to send _readable_ mail using
submitted data. With no special headers, certain accented characters
display in the email as garbage. I tried adding a

Content-type: text/plain; charset=utf-8

header appended to the mail() function, but it still doesn't display
properly in my mail client (which is perfectly capable of displaying
UTF-8). Any ideas on what I need to do to have it display properly? Do
I need to do some sort of encoding to the text before I send it?

Cheers,
spud.

-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------

--- End Message ---
--- Begin Message ---
Make sure the body really is utf-8 encoded, and also add a

Mime-Version: 1.0

header to the message.
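
A minimal sketch of both suggestions, assuming the mbstring extension is
loaded. The body text, the variable names, and the example.com address are
hypothetical, and the Content-Transfer-Encoding line is an extra assumption
that is not part of the advice above. The mail() call is left commented out
so the snippet only builds and prints the pieces.

```php
<?php
// Sketch only: assumes the mbstring extension is available.
// Hypothetical body: "Café" with the accented letter as UTF-8 bytes.
$body = "Caf\xC3\xA9";

// Check that the body really is valid UTF-8 before sending.
$bodyIsUtf8 = mb_check_encoding($body, 'UTF-8');

// Extra headers for mail()'s fourth argument, joined with CRLF.
$headers = implode("\r\n", [
    'MIME-Version: 1.0',
    'Content-Type: text/plain; charset=utf-8',
    'Content-Transfer-Encoding: 8bit', // assumption, not from the thread
]);

var_dump($bodyIsUtf8);
echo $headers, "\r\n";

// mail('recipient@example.com', 'Test', $body, $headers); // placeholder; not run here
```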

--Wez.

On Mon, 6 Jan 2003, a.h.s. boy wrote:

> I have a site set up to use UTF-8 encoding for display and form
> submission, but I want to be able to send _readable_ mail using
> submitted data. With no special headers, certain accented characters
> display in the email as garbage. I tried adding a
>
> Content-type: text/plain; charset=utf-8
>
> header appended to the mail() function, but it still doesn't display
> properly in my mail client (which is perfectly capable of displaying
> UTF-8). Any ideas on what I need to do to have it display properly? Do
> I need to do some sort of encoding to the text before I send it?
>
> Cheers,
> spud.
>
> -------------------------------------------------------------------
> a.h.s. boy
> spud(at)nothingness.org            "as yes is to if,love is to yes"
> http://www.nothingness.org/
> -------------------------------------------------------------------

--- End Message ---
