php-i18n Digest 11 Jul 2002 07:47:43 -0000 Issue 114

Topics (messages 266 through 287):

Re: mbstring: Japanese: encoding conversion not working for me
        266 by: Yasuo Ohgaki
        267 by: Jean-Christian Imbeault
        269 by: David Emery
        270 by: Yasuo Ohgaki

mbstring: php.ini: need help understanding some settings/functions
        268 by: Jean-Christian Imbeault

Re: mbstring: Japanese: encoding conversion not
        271 by: Jean-Christian Imbeault
        272 by: David Emery
        273 by: Jean-Christian Imbeault
        274 by: Yasuo Ohgaki
        275 by: Jean-Christian Imbeault
        276 by: David Emery
        277 by: Yasuo Ohgaki
        278 by: Jean-Christian Imbeault
        279 by: Jean-Christian Imbeault
        280 by: David Emery
        281 by: David Emery
        282 by: Jean-Christian Imbeault
        283 by: Jean-Christian Imbeault
        284 by: Jean-Christian Imbeault
        285 by: Jean-Christian Imbeault
        286 by: David Emery
        287 by: Jean-Christian Imbeault


----------------------------------------------------------------------
--- Begin Message ---
Jean-Christian Imbeault wrote:
> Warning:  PostgreSQL query failed:  ERROR:  Invalid EUC_JP character
> sequence found (0x8140) in /www/htdocs/test.php on line 31

I guess you are using PostgreSQL 7.2.x.
PostgreSQL 7.2.x detects invalid multibyte character sequences, and
you are supposed to fix any invalid sequences before feeding them
to PostgreSQL.

Also check your db encoding with "psql -l"; the database encoding
should match your PHP internal encoding.
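
A minimal sketch of that clean-up step, assuming the form is known to submit Shift_JIS and the database expects EUC-JP. The helper name to_euc_jp() is hypothetical, and the type declarations and mb_check_encoding() assume a PHP release much newer than the 4.x builds discussed in this thread.

```php
<?php
// Hypothetical helper: returns $value re-encoded as EUC-JP, or null when
// $value is not a valid byte sequence in the assumed source encoding.
function to_euc_jp(string $value, string $from = 'SJIS'): ?string
{
    // Reject broken input instead of passing it on to PostgreSQL.
    if (!mb_check_encoding($value, $from)) {
        return null;
    }
    return mb_convert_encoding($value, 'EUC-JP', $from);
}

// 0x8140 is the full-width space in Shift_JIS. The same two bytes are not
// valid EUC-JP, which matches the error PostgreSQL reported.
$raw = "\x81\x40";
var_dump(mb_check_encoding($raw, 'EUC-JP')); // the raw bytes fail the EUC-JP check
var_dump(bin2hex(to_euc_jp($raw)));          // hex of the converted EUC-JP bytes
```

Only the converted value would then be passed to the query; a null result means the input should be rejected rather than stored.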

--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:

> 
> I guess you are using PostgreSQL 7.2.x.


Yes, 7.2.1.


> PostgreSQL 7.2.x detects invalid multibyte character sequences, and
> you are supposed to fix any invalid sequences before feeding them
> to PostgreSQL.


That makes sense.

 
> Also check your db encoding with "psql -l"; the database encoding
> should match your PHP internal encoding.


It is correctly set to EUC_JP.

I thought I had read that PHP 4.x with the mbstring library installed
will automatically convert any input ($_POST vars for example) into the
correct internal encoding. Is this true?

If this is true it seems not to be working for me. I have the following
diagnostics:

// The internal encoding is EUC-JP but the user input was not converted
// automatically. It is still in SJIS!

mb_internal_encoding()   : EUC-JP
$_POST["textfield"]      : 111? ?????@????????1235
mb_detect_encoding()     : SJIS

// PHP has mb_http_input set to pass even though php.ini is
// mbstring.http_input = auto and I compiled with
// --enable-mbstring-enc-trans
// also mb_convert_encoding says that my encoding is set to "pass"
// but I compiled with --enable-mbstring-enc-trans!

mb_detect_order()                    : ASCII, JIS, UTF-8, EUC-JP, SJIS
mb_http_input()                      : FALSE
mb_convert_variables say encoding is : pass
mb_http_output()                     : pass


Am I right in thinking that if I set the internal encoding then I
shouldn't need to do any conversion before putting data into my DB
(which only accepts EUC)? From the tests I have run it seems like PHP
accepts the data and *keeps* it in its original encoding ...

Thank you for all your help!

Jc
--- End Message ---
--- Begin Message ---
At 18:37 +0900 02.7.9, Jean-Christian Imbeault wrote:
>
>Am I right in thinking that if I set the internal encoding then I
>shouldn't need to do any conversion before putting data into my DB
>(which only accepts EUC)? From the tests I have run it seems like PHP
>accepts the data and *keeps* it in its original encoding ...
>

Something's fishy there. The script you put in your mail is a bit hard for 
me to follow so here's a really simple test that works as it should on my 
system. Try it on yours and see if it's ok. If not then you probably have 
some kind of configuration problem. I also included the relevant output of 
phpinfo() on my system.

HTH,
Dave

Attachment: %phptest.php
Description: application/applefile

Attachment: phptest.php
Description: Binary data

--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault wrote:
> I thought I had read that PHP 4.x with the mbstring library installed
> will automatically convert any input ($_POST vars for example) into the
> correct internal encoding. Is this true?

Half true.
You need to compile mbstring & PHP with --enable-mbstr-enc-trans.

I think you cannot just install the mbstring module via php.ini
or dl() to enable automatic encoding translation if PHP is
not compiled with mbstring and --enable-mbstr-enc-trans.

--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
I am trying to use the mbstring library so I can receive user input in 
Japanese and then put it into my pgsql DB. But the documentation is 
really sparse and sometimes confusing.

I can't understand some of the functions provided by mbstring or its 
settings in php.ini. Some functions/settings don't seem to do what they 
say, or the explanation of how to use them is missing.

For example, I thought internal_encoding was used along with 
--enable-mbstring-enc-trans so that any input would be automatically 
converted to the internal encoding.

But it isn't. When I check what charset POST data is in, it is always 
in the charset the user used, never in the internal_encoding charset. I 
have to convert it manually myself using mbstring functions.

I guess I don't understand what the settings of mbstring in php.ini and 
at compile time really do.


Can someone explain to me what is the use of:

--enable-mbstring-enc-trans at compile time
mb_http_input()
mb_http_output()
mb_internal_encoding()
mb_language()
mb_output_handler()

Thanks!

Jc

--- End Message ---
--- Begin Message ---
David Emery wrote:

>

> Something's fishy there.


Very.

> so here's a really simple test that works as it should 
> on my system. Try it on yours and see if it's ok.


It works, but there is one problem.

When I enter some Japanese into the form your program says:

"The encoding of variable test when it was received by the script was: SJIS
(should be EUC-JP when Japanese has been entered in the form field)"

As you can see the encoding should have been EUC-JP but was in fact 
SJIS. So the user input was not automatically translated into the 
internal encoding ...

Even worse ... the user input becomes mojibake (garbage). The input 
becomes ?????> and the ">" ruins the HTML code! Yuck.

So the problem seems to be that the user input is received as SJIS (the 
original user encoding) and not converted automatically to the internal 
encoding (so what is the point of having an internal encoding?)

If I try and output the value of "$test" back to the browser it comes 
out as garbage and if I try mb_convert_encoding($test, "SJIS") it still 
comes out as garbage.

My browser is set to display SJIS (possibly forced because of the 
"nihongo" string you output at the beginning of the script) so I don't 
understand why I can't output $test straight back to the browser since 
it comes in as SJIS, or why I can't display it even after conversion!

Any help and clarifications are greatly appreciated!

I've included my modifications of your code ...

Jc

PS

Here are my settings, they are just like yours except that I don't have 
(can't find) a setting for function overloading.

Multibyte (Japanese) Support enabled

output_buffering: 1
output_handler: mb_output_handler

mbstring.detect_order: auto
mbstring.func_overload: 0
mbstring.http_input: auto
mbstring.http_output: SJIS
mbstring.internal_encoding: EUC-JP
mbstring.substitute_character: no value
--- End Message ---
--- Begin Message ---
At 13:48 +0900 02.7.10, Jean-Christian Imbeault wrote:
>David Emery wrote:
>>so here's a really simple test that works as it should on my system. Try 
>>it on yours and see if it's ok.
>
>
>It works, but there is one problem.
>
>When I enter some Japanese into the form your program says:
>
>"The encoding of variable test when it was received by the script was: SJIS
>(should be EUC-JP when Japanese has been entered in the form field)"
>
>As you can see the encoding should have been EUC-JP but was in fact SJIS. 
>So the user input was not automatically translated into the internal 
>encoding ...

So it *doesn't* work on your system. Basically, if the input conversion 
doesn't happen then there is something wrong with your set-up. Maybe it's 
time to start over from scratch. Making sure you compile with both 
--enable-mbstring and --enable-mbstr-enc-trans might be the key. Or not. 
Anyway it's probably some small configuration error that you've overlooked 
along the way.

>Even worse ... the user input becomes mojibake (garbage). The input becomes 
>?????> and the ">" ruins the HTML code! Yuck.

I'd recommend forgetting about this particular error until you get past 
step one, which is getting the input conversion working - the source of 
both problems is surely the same.

Gambatte,
-dave
--- End Message ---
--- Begin Message ---
David Emery wrote:

>

> So it *doesn't* work on your system. Basically, if the input conversion 
> doesn't happen then there is something wrong with your set-up. Maybe 
> it's time to start over from scratch. Making sure you compile with both 
> --enable-mbstring and --enable-mbstr-enc-trans might be the key. Or not. 
> Anyway it's probably some small configuration error that you've 
> overlooked along the way.



Before replying to your first email I did just that, upgrading from 4.1 
to the newest 4.2 release. I did compile with --enable-mbstring and 
--enable-mbstring-enc-trans (my compile-time options are given at the end 
of this email).

I totally agree that getting input conversion to work is the key to 
solving all my Japanese-related problems. The question now though is 
what is different between my system and yours?

Any suggestions as to any other tests I can do to try and pin down what 
is wrong with my setup?

Thanks!

Jc

My php compile-time options

  './configure' '--with-pgsql' '--without-mysql' 
'--with-apache=../apache_1.3.26' '--enable-track-vars' 
'--enable-mbstring' '--enable-mbstring-enc-trans'

--- End Message ---
--- Begin Message ---
mbstring is working well for me as well as many
others.

Automatic encoding detection is not perfect since
one encoding is similar to another.
If things are set up right, all you need to
do is add a dummy input. Add something like

<input type="hidden" name="dummy" value="日本語自動認識用ダミー文字列">

in your form.
Then it may work as you want.

--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:

> mbstring is working well for me as well as many
> others.


I am not saying there is something wrong with mbstring.

 
> Automatic encoding detection is not perfect since
> one encoding is similar to another.
> If things are set up right, all you need to
> do is add a dummy input. Add something like
> 
> <input type="hidden" name="dummy" value="日本語自動認識用ダミー文字列">


I tried to enter that, but unfortunately my Linux machine and my terminal
emulator don't support Japanese input.

I believe that mbstring is working and it is my settings that are wrong
*or* some environment variable (in either PHP, Apache, or my browsers).

I am only asking for help in trying to identify and then fix the problem ^_^

Jc
--- End Message ---
--- Begin Message ---
At 17:23 +0900 02.7.10, Jean-Christian Imbeault wrote:
>I totally agree that getting input conversion to work is the key to 
>solving all my Japanese-related problems. The question now though is what 
>is different between my system and yours?
>
>Any suggestions as to any other tests I can do to try and pin down what is 
>wrong with my setup?

How about sending the entire output of phpinfo() (or putting it online 
somewhere I can see it)? I'll compare it to what comes up on my system and 
see if anything relevant is different.

-dave
--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault wrote:
> I believe that mbstring is working and it is my settings that are wrong
> *or* some environment variable (in either PHP, Apache, or my browsers).
> 
> I am only asking for help in trying to identify and then fix the problem ^_^

So far, I cannot tell what's wrong. There are common pitfalls when
you are using multi-byte chars with PHP, databases, etc.

Most mbstring users learn about these pitfalls from the Japanese PHP
users mailing list archive...

If the input string is too short, its encoding cannot be detected
correctly, since the Japanese encodings are similar to each other.

As I wrote in my previous mail, add a dummy hidden input that contains
long enough Japanese text. Then it should detect the encoding correctly.
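
The same hint field could also be used explicitly from a script. The following is only a sketch under stated assumptions: the helper name convert_form_input() and its parameters are hypothetical, the field name "dummy" is taken from the example form, and the syntax assumes a PHP release newer than 4.x.

```php
<?php
// Hypothetical helper: uses a hidden hint field containing known Japanese
// text to decide which encoding the browser used, then converts every field.
function convert_form_input(array $fields, string $hintField = 'dummy', string $internal = 'EUC-JP'): array
{
    // Candidate list mirrors the detect order shown earlier in the thread.
    $candidates = ['ASCII', 'JIS', 'UTF-8', 'EUC-JP', 'SJIS'];
    $hint = $fields[$hintField] ?? '';

    if ($hint === '') {
        return $fields; // no hint available, leave the input untouched
    }

    $detected = mb_detect_encoding($hint, $candidates, true);
    if ($detected === false || $detected === 'ASCII') {
        return $fields; // the hint did not identify a Japanese encoding
    }

    // mb_convert_variables() converts the array elements in place.
    mb_convert_variables($internal, $detected, $fields);
    return $fields;
}
```

It would be called as $clean = convert_form_input($_POST); the longer the hint text, the less likely the detection is to pick the wrong encoding.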

--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
I sent a reply to this thread last night only to find out this morning 
that messages over 40000 bytes can't make it through. So here I go again :

David Emery wrote:

 >
 > How about sending the entire output of phpinfo() (or putting it online
 > somewhere I can see it)? I'll compare it to what comes up on my system
 > and see if anything relevant is different.


Unfortunately my server is behind a firewall (testing phase now). I'm
sending you the output as an html attachment (zipped because the plain 
HTML is larger than 40k).

As an aside, my Netscape 6.2 browser won't let me view the source of a
page if it is a .php file. Would you know how I can get Netscape to let me
do a "view source"?

If you need any more data, just ask.

Many thanks for helping me out!

Jc

Attachment: phpinfo.zip
Description: Zip compressed data

--- End Message ---
--- Begin Message ---
At 10:38 +0900 02.7.11, Jean-Christian Imbeault wrote:
>I sent a reply to this thread last night only to find out this morning 
>that messages over 40000 bytes can't make it through. So here I go again :
>
>David Emery wrote:
>
>>
>> How about sending the entire output of phpinfo() (or putting it online
>> somewhere I can see it)? I'll compare it to what comes up on my system
>> and see if anything relevant is different.
>
>
>Unfortunately my server is behind a firewall (testing phase now). I'm
>sending you the output as an html attachment (zipped because the plain 
>HTML is larger than 40k).

I found this...

You have default_charset set to EUC-JP. It should be Shift_JIS. PHP will 
set the outgoing headers to this value (that's what it's for), so I think 
what is happening is that you're outputting SJIS encoded characters with a 
header saying it's EUC. Try changing that.
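
A script can also keep the header and the body in step by deriving the charset label from the output encoding. This is a minimal sketch; content_type_header() is a hypothetical helper, and the syntax assumes a PHP release newer than 4.x.

```php
<?php
// Hypothetical helper: builds a Content-Type header whose charset label
// matches the encoding the page body is actually sent in.
function content_type_header(string $outputEncoding): string
{
    // mb_preferred_mime_name() maps an mbstring encoding name to a MIME label.
    $charset = mb_preferred_mime_name($outputEncoding);
    if ($charset === false) {
        $charset = $outputEncoding; // fall back to the name as given
    }
    return 'Content-Type: text/html; charset=' . $charset;
}

// With mbstring.http_output = SJIS, the header should name the same encoding.
$line = content_type_header('SJIS');
// header($line); // would be sent before any page output
```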

-dave
--- End Message ---
--- Begin Message ---
At 11:30 +0900 02.7.11, David Emery wrote:
>>
>>Unfortunately my server is behind a firewall (testing phase now). I'm
>>sending you the output as an html attachment (zipped because the plain 
>>HTML is larger than 40k).
>
>I found this...
>
>You have default_charset set to EUC-JP. It should be Shift_JIS. PHP will 
>set the outgoing headers to this value (that's what it's for), so I think 
>what is happening is that you're outputting SJIS encoded characters with a 
>header saying it's EUC. Try changing that.
>

There's more, and this is the biggie...

'--enable-mbstring-enc-trans' should be '--enable-mbstr-enc-trans'

so the input encoding translation isn't actually compiled into your system. 
When it is you'll get an additional line in phpinfo() just below

Multibyte (Japanese) Support: enabled
http input encoding translation: enabled

That second line is missing in your case.
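
The same values can be read from a script instead of from the phpinfo() page. A minimal sketch, assuming a PHP release newer than the 4.2 build discussed here: mbstring_report() is a hypothetical helper, and mbstring.encoding_translation is the php.ini setting that later releases use in place of the compile-time flag.

```php
<?php
// Hypothetical helper: collects the mbstring values that matter for
// automatic input and output conversion.
function mbstring_report(): array
{
    return [
        'internal_encoding'    => mb_internal_encoding(),
        'http_input'           => mb_http_input(),  // false when no input was converted
        'http_output'          => mb_http_output(),
        'detect_order'         => mb_detect_order(),
        'encoding_translation' => ini_get('mbstring.encoding_translation'),
    ];
}

print_r(mbstring_report());
```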

-dave
--- End Message ---
--- Begin Message ---
David Emery wrote:

> 
> There's more, and this is the biggie...
> 
> '--enable-mbstring-enc-trans' should be '--enable-mbstr-enc-trans'
> 
> so the input encoding translation isn't actually compiled into your 
> system.


That would explain a lot. I would have thought that ./configure should 
have complained that I had passed it an invalid flag ...

Let me recompile and see what happens.

Thanks!

Jc

--- End Message ---
--- Begin Message ---
David Emery wrote:

> 
> I found this...
> 
> You have default_charset set to EUC-JP. It should be Shift_JIS. PHP will 
> set the outgoing headers to this value (that's what it's for)


So what is the difference between mbstring.internal_encoding and 
mbstring.http_output?

Jc


--- End Message ---
--- Begin Message ---
David Emery wrote:

>

> There's more, and this is the biggie...
> 
> '--enable-mbstring-enc-trans' should be '--enable-mbstr-enc-trans'


That fixed most of my problems! Thanks!

Now I just have a question concerning the use of "internal encoding".

When I receive $test it is in EUC-JP (because I have internal encoding 
set to EUC-JP?). If I echo $test back to the browser it comes out as 
SJIS. This is all good.

But if I do this:

echo(mb_convert_encoding($test, "SJIS","EUC-JP"));

I get mojibake. Why? $test is internally encoded in EUC-JP and I want to 
spew it back out as SJIS and I have mbstring.http_output set to SJIS, so 
why won't it print properly?

Thanks for all the help so far! Things seem to be working fine now. It's 
just my understanding that is a bit flaky, I think. If I can get to 
understand the purpose/use of the settings and functions it will go a 
long way in preventing future errors on my part ^_^

Jc


--- End Message ---
--- Begin Message ---
At 13:22 +0900 02.7.11, Jean-Christian Imbeault wrote:
>David Emery wrote:
>
>>
>
>>There's more, and this is the biggie...
>>
>>'--enable-mbstring-enc-trans' should be '--enable-mbstr-enc-trans'
>
>
>That fixed most of my problems! Thanks!
>
>Now I just have a question concerning the use of "internal encoding".
>
>When I receive $test it is in EUC-JP (because I have internal encoding set 
>to EUC-JP?). If I echo $test back to the browser it comes out as SJIS. 
>This is all good.
>
>But if I do this:
>
>echo(mb_convert_encoding($test, "SJIS","EUC-JP"));
>
>I get mojibake. Why? $test is internally encoded in EUC-JP and I want to 
>spew it back out as SJIS and I have mbstring.http_output set to SJIS, so 
>why won't it print properly?

You've converted the encoding from EUC to SJIS inside the script, and then 
the entire output buffer gets converted from EUC (assumed, since that's what 
the internal encoding is set to) to SJIS. That second pass tries to convert 
your already-SJIS string from EUC to SJIS, resulting in a mess.

You've set things up for encoding conversion to happen automatically, so 
you don't need to mess with it by doing things like 
mb_convert_encoding($test, "SJIS","EUC-JP");.
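
The double conversion can be reproduced without the output handler. A minimal sketch, in which the byte string stands in for $test and the second mb_convert_encoding() call plays the role of the automatic output conversion:

```php
<?php
// The three characters of the word "Japanese" (nihongo) as EUC-JP bytes,
// standing in for $test after input conversion.
$test = "\xC6\xFC\xCB\xDC\xB8\xEC";

// What the script did by hand: EUC-JP -> SJIS.
$manual = mb_convert_encoding($test, 'SJIS', 'EUC-JP');

// What the output handler then does to everything echoed: it assumes the
// bytes are still EUC-JP and converts them to SJIS a second time.
$double = mb_convert_encoding($manual, 'SJIS', 'EUC-JP');

var_dump(bin2hex($manual));     // hex of the correct SJIS bytes
var_dump($double === $manual);  // false: the second pass changed the bytes
```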


>
>Thanks for all the help so far! Things seem to be working fine now. It's 
>just my understanding that is a bit flaky, I think. If I can get to understand 
>the purpose/use of the settings and functions it will go a long way in 
>preventing future errors on my part ^_^
>
>Jc

--- End Message ---
--- Begin Message ---
Aha! Now I understand what's going on. Thanks for the lucid explanation 
David!

I'll use EUC-JP for internal since that's what my pgsql wants its input 
to be in and I'll spit out SJIS to browsers since most of them support 
that automatically.

Yippee, I'm on my way to being a PHP programmer ... eventually ;)

Jc

--- End Message ---
