[PHP-I18N] UTF-8 and unaltered transmissions between apache and PHP module

Dennis Heuer Fri, 21 Feb 2003 14:29:50 -0800

Hello -

A few days ago I asked some questions about internationalized transmissions. I had a hard
time understanding the principles of transmitting content inside HTTP transfers and
such. One thing is still unclear to me: the altered and unaltered
content exchange between apache and the PHP module, and what it means for the
programmer. I have appended a discussion I had with someone from W3C. After reading
it, it should be clear what I am still missing. If you think that I should send this
to another mailing list, please let me know:


ME:
Please let me sum up here what I understand so far. The following is
written with respect to sending form content with the PUT method and in
UTF-8 encoding:

1. No matter if the transmitted content gets converted to another
character encoding or not on _server_side_, the result is always
handled as _string_ and may have to be converted to other types (int,
float or such) manually by the programmer.

OTHER:
Correct

ME:
2. As long as the transmitted characters are in the range of ASCII
(0-127), any subsequent conversions on _server_side_ should not
change their values (the octets).

OTHER:
Correct
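
To illustrate point 2, here is a minimal sketch in Python (chosen only for
illustration; it is not what apache or PHP runs internally). It assumes the
conversions in question are between ASCII-compatible encodings such as
ISO 8859-1 and UTF-8, and the sample strings are made up.

```python
# Illustrative sketch: ASCII characters map to the same octets in
# ASCII, ISO 8859-1 and UTF-8, so converting between these encodings
# leaves them untouched.
text = "Hello, world 123"

ascii_octets = text.encode("ascii")
latin1_octets = text.encode("iso-8859-1")
utf8_octets = text.encode("utf-8")

same_for_ascii = ascii_octets == latin1_octets == utf8_octets
all_below_128 = all(octet < 128 for octet in utf8_octets)
print(same_for_ascii, all_below_128)

# Above 127 the encodings diverge: one octet in ISO 8859-1,
# two octets in UTF-8 for the same character (e with acute accent).
e_acute = "\u00e9"
latin1_form = e_acute.encode("iso-8859-1")
utf8_form = e_acute.encode("utf-8")
print(latin1_form, utf8_form)
```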

ME:
3. If the character values are above 127, these characters, on
_client_side_, get converted to _ASCII_strings_ which represent the
hex values of these characters (like '%FF') - or the whole is sent in
UTF-8.

OTHER:
Correct

ME (continued):
These encoded strings will be converted by the server to the
character encoding that is expected by the receiving application (or
just set by default, probably ISO 8859-1).

OTHER:
Wrong. Because this is PUT data, the HTTP Content-Encoding header will
say what the client used - the server should use the same encoding in
reverse, so that the 'characters' come out exactly the same for all
values (or that's how I would expect/require a web server to behave -
see below).
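
The round trip discussed in point 3 can be sketched as follows. This is a
hedged illustration in Python using the standard urllib.parse module, not a
description of any particular browser or server; the sample value is made
up, and the assumption is that the client percent-encodes the UTF-8 octets
of the form value.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical form value containing characters above 127.
value = "Gr\u00fc\u00dfe"

# Client side: take the UTF-8 octets and write every octet outside the
# unreserved ASCII set as an ASCII string of the form %HH.
encoded = quote(value, safe="", encoding="utf-8")
print(encoded)

# Server side: undo the %HH step to get the raw octets back.
raw_octets = unquote_to_bytes(encoded)

# Reversing with the same encoding restores the original characters.
restored = raw_octets.decode("utf-8")

# Reversing with a different encoding (here ISO 8859-1) does not fail,
# but each UTF-8 octet becomes a separate character.
misread = raw_octets.decode("iso-8859-1")
print(restored == value, len(value), len(misread))
```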

ME:
4. How an application handles UTF-8 character values above ASCII
(0-127) depends on the application itself. If it is UTF-8 aware, it
will hide any necessary conversions from the programmer. If not,
the string may contain hex sequences like '%ff' or whatever. This
depends on the conversion rules, probably on iconv, and should be
asked about on an appropriate mailing list?

OTHER:
Wrong - if we are still talking about PUT data from forms and feeding
something like a Java Servlet container. The server will have decoded
all %hh and turned the data back into 16-bit UNICODE characters. I'm
not sure what happens with CGI interfaces to weird languages like
perl - you may need to check your server spec. It is just possible
that the raw octets are fed through, so that you *do* have to do your
own decoding of %hh or even of UNICODE values above 127 (not an easy
task).  I strongly recommend you consult the documentation for /
experiment with your server and its application interface. I only know
the standard java Servlet interface.

If you _do_ have to do your own UTF-8 decoding, look for a library
that does it for you, unless you really enjoy low-level programming at
the bit-by-bit level!

ME:
5. Binary files should always be sent as binaries since they
contain octet values above 127.

OTHER:
Well it would be better to say that they should be sent unaltered.

ME (continued):
If the binary files were encoded in UTF-8 before being transmitted,
they would probably get manipulated by the server when being converted
to the local encoding or the encoding expected by the scripting
language.

OTHER:
Not really, the server will pass non-form data through unaltered,
unless it is told that it is text and its encoding is declared. Look
at the HTTP and MIME RFCs
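
Why point 5 matters can be shown with a small sketch. It is written in
Python for illustration only, the octet values are arbitrary, and the
"conversion" is an assumed ISO 8859-1 to UTF-8 transcoding standing in for
whatever a server might do when told that the data is text.

```python
# Arbitrary binary octets, several of them above 127.
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF, 0xFE])

# Passed through unaltered: the receiver sees exactly the same octets.
passed_through = bytes(binary)

# Treated as ISO 8859-1 text and converted to UTF-8: every octet above
# 127 becomes two octets, so the file content is no longer the same.
transcoded = binary.decode("iso-8859-1").encode("utf-8")
print(len(binary), len(transcoded))

# Arbitrary octets are also not necessarily valid UTF-8 at all.
try:
    binary.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```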

END
---

So, if the server passes unaltered content and UTF-8 data through as is, how does a
unicode-unaware language like PHP deal with this when it receives the unchanged 'stream'
through the PHP module for apache? What does the content of the received _string_ look
like to the programmer? Can anybody answer this for me?
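
To make the question concrete, here is a hedged sketch of what a string of
plain octets looks like when the language does not interpret them as UTF-8.
Python's bytes type is used as a stand-in for a byte-oriented string; this
is an assumption made for illustration, not a statement about how the PHP
module hands data over, and the sample value is made up.

```python
# Hypothetical received value: five characters, sent as UTF-8 octets.
received = "Zo\u00eb \u20ac".encode("utf-8")

# A byte-oriented view counts octets, not characters.
octet_count = len(received)
char_count = len(received.decode("utf-8"))
print(octet_count, char_count)

# Cutting at an octet offset can split a multi-octet character, which
# leaves an incomplete UTF-8 sequence behind.
truncated = received[:3]
try:
    truncated.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False
print(truncated, truncation_ok)
```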

Thanks

Dennis Heuer

-- 
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
