php-i18n Digest 9 Jan 2003 17:20:37 -0000 Issue 141

Topics (messages 394 through 399):

Re: Mbstring.func_overload in 4.3.0
        394 by: Yasuo Ohgaki
        396 by: David Powers

Re: Mb_output_handler and Shift-JIS
        395 by: Moriyoshi Koizumi
        397 by: David Powers
        398 by: Jean-Christian Imbeault

ctype extension mb safe?
        399 by: Jan Schneider

----------------------------------------------------------------------
--- Begin Message ---
David Powers wrote:
> Yasuo Ohgaki wrote:
> 
>>Then you may be feeding messages with encoding other
>>than internal encoding to mb_send_mail?
> 
> 
> If that's the case, the question is how?

For instance, you may be reading the mail messages from a
database or a text file, in which case they can be in any
encoding.

> Nothing has changed in my code, except for the need to substitute mail()
> for mb_send_mail(). Under 4.2.2, mb_send_mail() was needed to send
> exactly the same output that now requires mail().

Which encoding results in mojibake?
If you have a problem with ISO-8859-1, set the language to 'en'
before using mb_send_mail.

> Here are the relevant sections of my php.ini (I am running on an English
> version of Red Hat Linux 6.2 - without canna or any other Japanese
> language software):
> 
> output_buffering = On
> ;output_handler = mb_output_handler
> default_mimetype = "text/html"
> default_charset = "Shift_JIS"
> 
> [mbstring]
> ; language for internal character representation.
> mbstring.language = Japanese
> 
> ; internal/script encoding.
> mbstring.internal_encoding = EUC-JP
> 
> ; http input encoding.
> mbstring.http_input = auto
> 
> ; http output encoding. mb_output_handler must be
> ; registered as output buffer to function
> mbstring.http_output = SJIS
> 
> mbstring.encoding_translation = Off
> 
> ; automatic encoding detection order.
> ; auto means
> mbstring.detect_order = auto
> 
> ; substitute_character used when character cannot be converted
> ; one from another
> mbstring.substitute_character = none;
> 
> ; overload(replace) single byte functions by mbstring functions.
> ;mbstring.func_overload = 0
> 
> As you will see, both output_handler and mbstring.func_overload are
> commented out. The PHP documentation states that SJIS users should set
> output_handler to mb_output_handler, but that produces mojibake.
> Moreover, Moriyoshi-san commented about my php.ini on 5 January, "I see
> no problem in this". This point needs clarification. If there is no
> problem, why does the documentation say the exact opposite? The problem
> may, of course, be with my understanding of what the documentation
> means, but if that's so, I'm sure others are likely to encounter similar
> difficulties.

Moriyoshi must have overlooked the commented-out mb_output_handler line.
You need mb_output_handler to convert the output from EUC-JP to SJIS
(if you really want to output SJIS).

Many things can result in mojibake.
I suppose you have two problems:
 1. Something is wrong in the HTTP output.
 2. Something is wrong in the mail message sent by mb_send_mail.

For 1, check the actual encoding sent to the browser,
e.g. with emacs or any other tool that can detect encodings.
Also check the output buffer status using ob_get_status().
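
A minimal sketch of that buffer check, assuming the mbstring extension is
loaded and PHP 7 or later; active_handlers() is a hypothetical helper, not
part of PHP:

```php
<?php
// Sketch: list the active output handlers to see whether
// mb_output_handler is actually in effect.

function active_handlers(): array
{
    // ob_get_status(true) returns one entry per active buffer level,
    // each with a 'name' key holding the handler's name.
    return array_column(ob_get_status(true), 'name');
}

mb_http_output('SJIS');         // target encoding used by mb_output_handler
ob_start('mb_output_handler');  // roughly what output_handler = mb_output_handler does

$handlers = active_handlers();
ob_end_clean();                 // discard the buffer opened above

print_r($handlers);
```

If mb_output_handler does not appear in the list, nothing is converting
the output to the http_output encoding.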

For 2, set the appropriate language before you send a message with
mb_send_mail. mb_send_mail is smart enough to encode the mail
message appropriately according to the language setting,
i.e. change the language before you send mail.
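
A minimal sketch of that sequence, assuming the stored text is Shift_JIS
and the internal encoding is EUC-JP as in the php.ini quoted above.
prepare_body() and the address are hypothetical, and the send call is left
commented out:

```php
<?php
// Sketch only: convert stored Shift_JIS text to the internal encoding,
// then let mb_send_mail() encode the message for the selected language.

function prepare_body(string $stored, string $storedEncoding = 'SJIS'): string
{
    // mb_send_mail() treats its input as being in the internal encoding,
    // so text in any other encoding is converted first.
    return mb_convert_encoding($stored, mb_internal_encoding(), $storedEncoding);
}

mb_language('Japanese');         // Japanese mail is encoded as ISO-2022-JP (JIS)
mb_internal_encoding('EUC-JP');  // matches mbstring.internal_encoding

// Shift_JIS bytes as they might come back from the database.
$body = prepare_body("\x95\x5C");

// mb_send_mail('someone@example.com', 'Test', $body);  // uncomment to send
```

For English-only mail, mb_language('en') would be set instead before the
send call.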

(We may be forgetting to document changes, but I don't
have time to check it :)

> 
> The only thing I can think of is that, during configuration, I
> included --enable-mbstr-enc-trans, which I have subsequently seen is no
> longer used. The compilation process produced no error messages in this
> regard, so I presumed its inclusion would have caused no harm. Is this a
> likely cause of the unexpected behaviour?

No, it is not.
The --enable-mbstr-enc-trans option has no effect on PHP 4.3.0
or later builds.

--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
> For instance, you may be reading the mail messages from a
> database or a text file, in which case they can be in any
> encoding.

The mail message is created by using PHP to retrieve material stored as
Shift_JIS in a MySQL database. PHP then turns the array of retrieved
material into a long string which is then fed to the PHP mailing
routine. Under 4.2.2, mb_send_mail() handled the process perfectly. Now
it doesn't. I have to use mail() instead.

> Which encoding results in mojibake?
> If you have a problem with ISO-8859-1, set the language to 'en'
> before using mb_send_mail.

As I explained before, even when mb_language() reports "Japanese",
mb_send_mail() will not send Japanese, but mail() does.

> Moriyoshi must have overlooked the commented-out mb_output_handler line.
> You need mb_output_handler to convert the output from EUC-JP to SJIS
> (if you really want to output SJIS).

Even on 4.2.2, mb_output_handler had to be turned off. I am running an
English language version of RH 6.2 without canna or any other support
for multi-byte characters. Perhaps that is what is causing the
difference.

It's not important to me whether the output is SJIS or EUC-JP. I use PHP
to create databases in English and Japanese to supply content to dynamic
websites, and to send content from the same databases by email. I know
that Windows machines can display EUC-JP websites correctly (if the
correct charset is set in the web page), and that email is sent as JIS.

The reason I use SJIS is because the content is entered into the
databases by means of online forms. The original forms were developed by
someone else in ASP, and nearly two years' worth of material is now in
the database in SJIS. I suppose it would be possible to convert the
content, and switch to EUC-JP, but how would that affect inputting? All
the material is created in Word on Windows machines, and therefore in
SJIS. Would PHP set to EUC-JP convert such input, and would it be more
efficient that way?

> (We may be forgetting to document changes, but I don't
> have time to check it :)

I am extremely grateful that the Open Source community has created
something as powerful and useful as PHP, but if changes are not properly
documented, it will lose its attraction and value. That would be a great
pity.

David Powers
--- End Message ---
--- Begin Message ---
On Sun, Jan 05, 2003 at 03:37:47PM -0000, David Powers wrote:
> > On Sun, Jan 05, 2003 at 12:20:51AM -0000, David Powers wrote:
> >> This is a cut-down version of what I now have in my php.ini. As you
> >> will see, I have commented out the output_handler line. When
> >> enabled, all I got was mojibake.
> >>
> >> output_buffering = On
> >> ;output_handler = mb_output_handler
> >
> > I see no problem in this.
> 
> That brings me back to my original query. The PHP documentation says
> SJIS users should set output_handler to mb_output_handler. Doing so
> results in mojibake. Turning it off (by commenting it out with the
> semi-colon) is the only way I can get my pages to display correctly. So,
> either there is a mistake in the documentation or the explanation of
> SJIS users needs to be clarified.
(snip)
> This would seem to add an unnecessary level of complication. I am using
> PHP in combination with MySQL to provide an online database in both
> Japanese and English. All input is done through a browser interface over
> the internet, and most - if not all - users are on Windows. PHP seems to
> do an excellent job of conversion without adding a further layer.

As I said in the previous mail, mojibake is because you are composing your
pages in Shift_JIS whereas you are supposed to use EUC-JP actually.

In most cases PHP will process Shift_JIS-encoded pages without
problems, but sometimes it gives a buggy result and you can hardly
tell what is going wrong. This is because several (not many) Shift_JIS
kanji characters consist of a lead byte of the double-byte character set
followed by the byte for '\' (backslash / yen sign), and '\' is
also used to form escape sequences in string literals enclosed in
single or double quotes. The same problem is known to occur with
other East Asian (CJKV) character sets such as CP936
(a superset of GB2312 adopted by Microsoft; also known as GBK),
GB18030 (a huge character set defined as a Chinese national standard),
and BIG5 (used to represent traditional Chinese text).
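
A minimal sketch of the byte-level collision, written with hex escapes so
that it stays ASCII-only. It assumes the mbstring extension and PHP 7 or
later; trail_byte_is_backslash() is a hypothetical helper, and 0x95 0x5C is
just one Shift_JIS kanji picked for illustration:

```php
<?php
// Sketch: a double-byte Shift_JIS character whose second byte is 0x5C,
// the same byte value as an ASCII backslash.

function trail_byte_is_backslash(string $sjisChar): bool
{
    // Only two-byte sequences are considered; index 1 is the trail byte.
    return strlen($sjisChar) === 2 && $sjisChar[1] === "\x5C";
}

$sjis = "\x95\x5C";                                    // one kanji in Shift_JIS
$euc  = mb_convert_encoding($sjis, 'EUC-JP', 'SJIS');  // same kanji in EUC-JP

var_dump(trail_byte_is_backslash($sjis));    // does the trail byte collide with '\'?
var_dump(strpos($euc, "\x5C") === false);    // is the EUC-JP form free of 0x5C?
```

In EUC-JP both bytes of a double-byte character have the high bit set, so
no byte of a kanji can be mistaken for a backslash by the parser.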

If you haven't experienced such a "phenomenon" ever, you are definitely
lucky so far :-)

Unfortunately I don't seem to be allowed to use Japanese characters on this
list, so I can't give you an example with actual Japanese text in this mail.
I'll send some examples if you can read Japanese mail with your mail
client.

> >> based on PHP 4.2.2 and PostgreSQL. Are there any major differences
> >> between 4.2.2 and 4.3.0 as far as Japanese is concerned?
> >
> > No significant changes have been made between these versions. All that
> > the mbstring developers did is bug fixing.
> 
> Again, this is where I get confused - or maybe I'm misunderstanding a
> vital element. The PHP documentation states that as of
> 4.3.0, --enable-mbstr-enc-trans has been eliminated. Under 4.2.2, I
> needed to use mb_convert_encoding($_POST['variable'], "SJIS") to gather
> variables submitted by a form. Now I don't need to.

Sorry for the confusion. I said "no significant" from a technical point of
view. As for --enable-mbstr-enc-trans, this compile-time option was removed
for convenience and replaced by the "mbstring.encoding_translation"
runtime option. You can also use mb_parse_str() in case it's turned off.
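
When the translation is turned off, form data can also be converted by
hand. A minimal sketch using mb_convert_variables(), which is a different
function from the mb_parse_str() mentioned above and is chosen here only as
an illustration. $form stands in for $_POST and its contents are made up;
the mbstring extension is assumed:

```php
<?php
// Sketch: convert every string in a form array from Shift_JIS to EUC-JP.

// Hypothetical form data as a browser might post it from a Shift_JIS page.
$form = [
    'name'  => "\x95\x5C",           // one kanji, Shift_JIS bytes
    'email' => 'user@example.com',   // plain ASCII is unaffected
];

// Converts the array in place and returns the source encoding that was
// used, or false on failure.
$from = mb_convert_variables('EUC-JP', 'SJIS', $form);

if ($from === false) {
    echo "conversion failed\n";
}
```

With this approach the scripts and the database layer only ever see
EUC-JP, whatever encoding the form page itself used.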

> Since Japanese is not my native language, it's not as easy for me to
> search for information in news groups and websites as it is in English.
> I intend to study the PDF files you recommended, but I see they were
> written before --enable-mbstr-enc-trans was eliminated, so any guidance
> on how this affects the handling of Japanese would be useful.

Hmm... English information about Japanese text handling with PHP is very
limited, since only a small number of developers who speak English fluently
use Japanese or other East Asian languages in their projects, and since
I don't have much time to add more explanation to the manual. I think
all I can do for now is fill up this list's archive with (hopefully) detailed
mails.

Moriyoshi
--- End Message ---
--- Begin Message ---
Moriyoshi Koizumi wrote:
>
> As I said in the previous mail, mojibake is because you are composing
> your pages in Shift_JIS whereas you are supposed to use EUC-JP
> actually.

Perhaps I am misunderstanding how PHP works in Japanese. Let me explain
briefly what it is I am doing. I'm sure it's the way a lot of PHP users
in Japan are operating, but since I live in London, it's not possible
for me to visit a user group or do some tachiyomi in a computer
bookstore.

I have several websites that use PHP and MySQL to generate content in
both English and Japanese. The remote server is Red Hat Linux 6.2 with
PHP 4.3.0 and mbstring enabled. Most of the content is generated in
Microsoft Word, and entered into the database using forms generated and
processed by PHP. I also need visitors to be able to fill in online
forms that can be mailed back to me. Since most Japanese people are
likely to be using Windows, their input is likely to be in SJIS.

What is the best configuration? If I set everything to EUC-JP, how would
the input from the online forms be handled? If Japanese ISPs offer PHP
services, they must have thousands of customers using Windows machines
to create their input, or is everything like that run on ASP?

> Unfortunately I don't seem to be allowed to use Japanese characters
> on this list, so I can't give you an example with actual Japanese
> text in this mail. I'll send some examples if you can read Japanese
> mail with your mail client.

Feel free to mail me directly in Japanese.

> Hmm... English information about Japanese text handling with PHP is
> very limited, since only a small number of developers who speak
> English fluently use Japanese or other East Asian languages in their
> projects, and since I don't have much time to add more explanation to
> the manual. I think all I can do for now is fill up this list's
> archive with (hopefully) detailed mails.

I am not a computer expert, although most people would regard me as an
advanced user of HTML and CSS. If you can point me in the direction of
good quality explanations in Japanese, I will be quite happy to study
them. I've been reading Japanese for more than 20 years, so that is not
a problem. It's searching through hundreds of irrelevant messages in a
newsgroup that I don't really have time for.

Once I get things sorted out, I would be very happy to assist by
creating a brief guide in English to setting up PHP to handle Japanese.
A lot of the current PHP documentation is difficult for the non-expert
to understand. I review web design books on a regular basis, and a
leading computer publisher has told me it is very interested in
including more on i18n and l10n in its publications, so this could serve
a dual purpose.

David Powers

--- End Message ---
--- Begin Message ---
David Powers wrote:
> I also need visitors to be able to fill in online
> forms that can be mailed back to me. Since most Japanese people are
> likely to be using Windows, their input is likely to be in SJIS.
>
> What is the best configuration? If I set everything to EUC-JP, how would
> the input from the online forms be handled? If Japanese ISPs offer PHP
> services, they must have thousands of customers using Windows machines
> to create their input, or is everything like that run on ASP?

Just a quick reply. I don't remember the settings, but PHP can
automatically convert incoming SJIS form data to EUC for you in a
transparent way (i.e. it translates incoming data into whatever the
'internal encoding' is set to).

Also, if I remember correctly, if you specify a character encoding of
EUC-JP on the web page itself, then all form data is sent to you as
EUC-JP. The user enters the form data and the browser automatically
converts it to the specified encoding.

> Once I get things sorted out, I would be very happy to assist by
> creating a brief guide in English to setting up PHP to handle Japanese.

That would be great, as I agree with you completely that there is a lack
of English information on this subject. I'm even worse off than you in
that I don't read Japanese fluently. Setting up my server and scripts to
handle Japanese was a task in itself ...

Which reminds me ... I posted the very same question you are asking now a
while back (maybe 6 months ago) and got a detailed explanation of some
testing to do and the proper settings to use. Hopefully you can still find
those posts in this list's archive ...

Jc

--- End Message ---
--- Begin Message ---
The subject says it all: is the ctype extension multibyte-safe?

Cheers, Jan.

--- End Message ---
