php-i18n Digest 21 Feb 2003 22:30:51 -0000 Issue 153

php-i18n-digest-help Fri, 21 Feb 2003 14:29:51 -0800

Topics (messages 460 through 464):


Re: Internationalized feeding of MySQL
        460 by: a.h.s. boy
        461 by: Gary Ross
        462 by: Gary Ross
        463 by: Dennis Heuer

UTF-8 and unaltered transmissions between apache and PHP module
        464 by: Dennis Heuer

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [EMAIL PROTECTED]


----------------------------------------------------------------------

--- Begin Message ---

Dennis --

A confusing issue it is, and I've been coping with the same situation. My HTML pages and forms are set to use UTF-8, and users are inputting English, Greek, Japanese, Turkish, etc.

My MySQL default character set is still ISO-8859-1, but surprisingly, it all just works. Mostly. I was surprised to see that the garbled mess that Japanese appears to be in MySQL returned to the web display page just as it had been entered.
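The round trip described above can be sketched as follows. This is a minimal illustration in Python rather than the poster's PHP setup, and it assumes the database simply stores and returns the submitted bytes without converting them. The names user_input, stored_view and returned are made up for the example.

```python
# Text as typed by the user, and the UTF-8 bytes the browser submits.
user_input = "日本語"
submitted_bytes = user_input.encode("utf-8")

# Reading those bytes as ISO-8859-1 gives the "garbled mess" seen in the
# database: every byte is shown as one Latin-1 character.
stored_view = submitted_bytes.decode("iso-8859-1")

# Assumption: the bytes come back unchanged. Python's ISO-8859-1 codec maps
# all 256 byte values one-to-one, so the original bytes are recoverable.
returned_bytes = stored_view.encode("iso-8859-1")
returned = returned_bytes.decode("utf-8")

print(len(user_input), len(stored_view), returned == user_input)
```

Under that assumption the text survives only because no layer interprets the bytes. Anything that works on characters rather than bytes would see nine Latin-1 characters here instead of three Japanese ones.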

Fulltext index searching, however, seems to fail miserably with the Japanese text...it just plain doesn't work. No results returned ever. Anyone know anything about that?

Cheers,
spud.

On Tuesday, February 18, 2003, at 05:41 PM, Dennis Heuer wrote:
Hello -

Sorry, but I am confused by the manual. If I want to let a user enter internationalized input into an HTML form, store it in MySQL without loss, and retrieve it again for display, what is the best way to do this?

Thanks

Dennis Heuer
--
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------
--- End Message ---

--- Begin Message ---

I use Japanese, Chinese, Korean and others on my site. Essentially, UTF-8 seems to be the way to go. As for fulltext, it's on the todo list for MySQL, so it seems that for the present it won't work at all.

Gary

On Thursday, February 20, 2003, at 02:19 am, a.h.s. boy wrote:
Dennis --

A confusing issue it is, and I've been coping with the same situation. My HTML pages and forms are set to use UTF-8, and users are inputting English, Greek, Japanese, Turkish, etc.

My MySQL default character set is still ISO-8859-1, but surprisingly, it all just works. Mostly. I was surprised to see that the garbled mess that Japanese appears to be in MySQL returned to the web display page just as it had been entered.

Fulltext index searching, however, seems to fail miserably with the Japanese text...it just plain doesn't work. No results returned ever. Anyone know anything about that?

Cheers,
spud.

On Tuesday, February 18, 2003, at 05:41 PM, Dennis Heuer wrote:
Hello -

Sorry, but I am confused by the manual. If I want to let a user enter internationalized input into an HTML form, store it in MySQL without loss, and retrieve it again for display, what is the best way to do this?

Thanks

Dennis Heuer
-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------
--- End Message ---

--- Begin Message ---

Dear Spud and Gary -

Thanks for your replies; they got me thinking in the right direction...

But, first, one comment: 

Please do not CC me on replies sent to the list: Spud, I got your email twice and,
Gary, I got yours four times ;)

So, now, just to sum up:

If I set all my HTML pages and forms to use UTF-8, transfers in both
directions should work fine. None of the software involved - that is, a (modern)
browser, the (Apache) server, PHP and MySQL - seems to compromise the data, except when doing
fulltext searches in MySQL.

About the fulltext search:

I don't know about this particular problem, but if you store the Japanese keyword in a
hash table and then compare the hash table against your other tables, the
comparison could work, since both the keyword and the db content are then formatted the
same way by MySQL itself.

Just a thought.

Greetings
Dennis

--- End Message ---

--- Begin Message ---

Hello -

Some days ago I had some questions about internationalized transmissions. I had a hard
time understanding the principles of transmitting content in HTTP transfers and
such. But one thing is still unclear to me: the altered and unaltered
content exchange between Apache and the PHP module, and what this means for the
programmer. I have appended a discussion I had with someone from the W3C. After reading
it, it should be clear what I am still missing. If you think that I should post this
to another mailing list, then please let me know:

ME:
Please let me sum up here what I understand so far. The following is
written with respect to sending form content with the PUT method and in
UTF-8 encoding:

1. No matter whether the transmitted content gets converted to another
character encoding or not on _server_side_, the result is always
handled as a _string_ and may have to be converted to other types (int,
float or such) manually by the programmer.

OTHER:
Correct

ME:
2. As long as the transmitted characters are in the range of ASCII
(0-127), any subsequent conversions on _server_side_ should not
change their values (the octets).

OTHER:
Correct

ME:
3. If the character values are above 127, these characters, on
_client_side_, get converted to _ASCII_strings_ which represent the
hex values of these characters (like '%FF') - or the whole is sent in
UTF-8.

OTHER:
Correct

ME (continued):
These encoded strings will be converted by the server to the
character encoding that is expected by the receiving application (or
just set by default, probably ISO 8859-1).

OTHER:
Wrong. Because this is PUT data, the HTTP Content-Encoding header will
say what the client used - the server should use the same encoding in
reverse, so that the 'characters' come out exactly the same for all
values (or that's how I would expect/require a web server to behave -
see below).
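The '%FF' notation discussed in points 2 and 3 can be illustrated with a short sketch. It uses Python's standard urllib.parse module as a stand-in for what a browser and server do, which is an assumption made only for illustration, and the variable names are invented for the example.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

char = "ÿ"  # U+00FF, a character above the ASCII range

# The same character yields different octets depending on the encoding
# chosen before percent-encoding is applied.
as_latin1 = quote(char, encoding="iso-8859-1")  # one octet, FF
as_utf8 = quote(char, encoding="utf-8")         # two octets, C3 BF

# Unreserved ASCII characters pass through unchanged (point 2).
ascii_only = quote("abc123")

# Decoding %hh gives back octets; turning octets into characters needs
# the same encoding the sender used.
raw = unquote_to_bytes(as_utf8)
decoded_right = unquote(as_utf8, encoding="utf-8")
decoded_wrong = unquote(as_utf8, encoding="iso-8859-1")

print(as_latin1, as_utf8, ascii_only, raw, decoded_right, decoded_wrong)
```

The last two lines show why the receiver has to know which encoding was used: the octets are identical, but decoding them as ISO 8859-1 produces two unrelated characters instead of one.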

ME:
4. How an application handles UTF-8 character values above ASCII
(0-127) depends on the application itself. If it is UTF-8 aware, it
will hide any necessary conversions from the programmer. If not,
the string may contain hex escapes like '%ff' or whatever. This
depends on the conversion rules, probably on iconv, and should be
asked about on an appropriate mailing list?

OTHER:
Wrong - if we are still talking about PUT data from forms and feeding
something like a Java Servlet container. The server will have decoded
all %hh and turned the data back into 16-bit UNICODE characters. I'm
not sure what happens with CGI interfaces to weird languages like
perl - you may need to check your server spec. It is just possible
that the raw octets are fed through, so that you *do* have to do your
own decoding of %hh or even of UNICODE values above 127 (not an easy
task).  I strongly recommend you consult the documentation for /
experiment with your server and its application interface. I only know
the standard java Servlet interface.

If you _do_ have to do your own UTF-8 decoding, look for a library
that does it for you, unless you really enjoy low-level programming at
the bit-by-bit level!
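To give an idea of what "bit-by-bit" decoding involves, here is a minimal sketch in Python. The function decode_utf8_manually is a hypothetical helper written for this example. It assumes well-formed input and does none of the validation a real library performs (overlong forms, truncated sequences, surrogates), which is one reason to prefer the library.

```python
def decode_utf8_manually(data: bytes) -> str:
    """Decode well-formed UTF-8 by hand. No validation is performed."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:               # 0xxxxxxx: one octet (ASCII)
            code, extra = first, 0
        elif first >> 5 == 0b110:      # 110xxxxx: two octets
            code, extra = first & 0x1F, 1
        elif first >> 4 == 0b1110:     # 1110xxxx: three octets
            code, extra = first & 0x0F, 2
        elif first >> 3 == 0b11110:    # 11110xxx: four octets
            code, extra = first & 0x07, 3
        else:
            raise ValueError(f"unexpected lead octet at position {i}")
        # Each continuation octet (10xxxxxx) contributes six more bits.
        for octet in data[i + 1 : i + 1 + extra]:
            code = (code << 6) | (octet & 0x3F)
        chars.append(chr(code))
        i += 1 + extra
    return "".join(chars)


sample = "aÿ日😀"
by_hand = decode_utf8_manually(sample.encode("utf-8"))
by_library = sample.encode("utf-8").decode("utf-8")
print(by_hand == by_library)
```

The library call on the second-to-last line does the same job in one step and also rejects malformed input.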

ME:
5. Binary files should always be sent as binaries since they
contain octet values above 127.

OTHER:
Well it would be better to say that they should be sent unaltered.

ME (continued):
If the binary files were encoded in UTF-8 before being transmitted,
they would probably get manipulated by the server when being converted
to the local encoding or the encoding expected by the scripting
language.

OTHER:
Not really, the server will pass non-form data through unaltered,
unless it is told that it is text and its encoding is declared. Look
at the HTTP and MIME RFCs

END
---

So, if the server passes unaltered content and UTF-8 text through as is, how does a Unicode-unaware
language like PHP deal with this when it receives the unchanged 'stream' through the PHP
module for Apache? What does the content of the received _string_ look like to the
programmer? Can anybody answer this for me?

Thanks

Dennis Heuer
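
As a rough illustration of the question above, the sketch below shows what UTF-8 text looks like to code that treats a string as a plain sequence of octets. It is written in Python, with a bytes object standing in for a byte-oriented string such as PHP's (where strlen() counts bytes). That mapping is an assumption for illustration, and the sample text and variable names are made up.

```python
text = "Grüße, 日本"
received = text.encode("utf-8")  # what arrives: a sequence of octets

# A byte-oriented length counts octets, not characters.
byte_length = len(received)
char_length = len(received.decode("utf-8"))

# ASCII characters keep their single-octet values; the others become
# multi-octet sequences with every octet above 127.
high_octets = [b for b in received if b > 127]

# Byte-wise truncation can cut a character in half.
cut = received[:3]
cut_shown = cut.decode("utf-8", errors="replace")

print(byte_length, char_length, len(high_octets), cut)
```

Passing such a string through unchanged is harmless. Problems appear only with operations that assume one octet per character, such as measuring length, truncating, or changing case.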

--- End Message ---
