Hello,

A few days ago I asked some questions about internationalized transmissions. I had a hard time understanding the principles of transmitting content in HTTP transfers and the like, and one thing is still unclear to me: the exchange of altered and unaltered content between Apache and the PHP module, and what this means for the programmer. I have appended a discussion I had with someone from the W3C. After reading it, it should be clear what I am still missing. If you think I should post this to another mailing list, please let me know.

ME: Please let me sum up what I understand so far. The following refers to sending form content with the PUT method in UTF-8 encoding:

ME: 1. Regardless of whether the transmitted content is converted to another character encoding on the _server_side_, the result is always handled as a _string_ and may have to be converted to other types (int, float and so on) manually by the programmer.

OTHER: Correct.

ME: 2. As long as the transmitted characters are in the ASCII range (0-127), any subsequent conversions on the _server_side_ should not change their values (the octets).

OTHER: Correct.

ME: 3. If the character values are above 127, these characters are converted on the _client_side_ to _ASCII_strings_ that represent their hex values (like '%FF'), or the whole content is sent in UTF-8.

OTHER: Correct.

ME (continued): These encoded strings will be converted by the server to the character encoding expected by the receiving application (or one set by default, probably ISO 8859-1).

OTHER: Wrong. Because this is PUT data, the HTTP Content-Encoding header will say what the client used. The server should use the same encoding in reverse, so that the 'characters' come out exactly the same for all values (or that's how I would expect/require a web server to behave; see below).

ME: 4. How an application handles UTF-8 character values above the ASCII range (0-127) depends on the application itself. If it is UTF-8 aware, it will hide any necessary conversions from the programmer. If not, the string may contain hex sequences like '%ff' or whatever. This depends on the conversion rules, probably on iconv, and should be asked about on an appropriate mailing list?

OTHER: Wrong, if we are still talking about PUT data from forms being fed to something like a Java Servlet container. The server will have decoded all %hh and turned the data back into 16-bit UNICODE characters.

I'm not sure what happens with CGI interfaces to weird languages like Perl - you may need to check your server spec. It is just possible that the raw octets are fed through, so that you *do* have to do your own decoding of %hh or even of UNICODE values above 127 (not an easy task). I strongly recommend you consult the documentation for, and experiment with, your server and its application interface. I only know the standard Java Servlet interface. If you _do_ have to do your own UTF-8 decoding, look for a library that does it for you, unless you really enjoy low-level programming at the bit-by-bit level!

ME: 5. Binary files should always be sent as binaries since they contain octet values above 127.

OTHER: Well, it would be better to say that they should be sent unaltered.

ME (continued): If binary files were encoded in UTF-8 before being transmitted, they would probably be manipulated by the server when converted to the local encoding or the encoding expected by the scripting language.

OTHER: Not really. The server will pass non-form data through unaltered, unless it is told that it is text and its encoding is declared. Look at the HTTP and MIME RFCs.

END

---

So, if the server passes the data through unaltered and leaves the UTF-8 as is, how does a Unicode-unaware language like PHP deal with this when it receives the unchanged 'stream' through the PHP module for Apache? What does the content of the received _string_ look like to the programmer?

Can anybody answer this for me?

Thanks,
Dennis Heuer

--
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
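
P.S. To make the question about the received _string_ more concrete, here is a minimal sketch of what happens to the octets at each step. It is written in Python rather than PHP, purely for illustration, and the sample text "Zürich" is an assumption of mine, not something from the discussion above. It does not show what the PHP module for Apache actually does. It only shows what a byte-oriented string would contain if the percent-decoded octets were handed through unchanged, which is an unverified assumption here.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical form value with one character above the ASCII range.
text = "Z\u00fcrich"  # "Zürich"

# What a UTF-8 form submission would put on the wire: ASCII plus %hh escapes.
wire = quote(text, encoding="utf-8")

# Percent-decoding only restores the raw octets; it does not choose a charset.
raw = unquote_to_bytes(wire)

# A byte-oriented string type counts octets, not characters.
byte_length = len(raw)

# Reading the same octets as ISO 8859-1 turns the one "ü" into two characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoding them as UTF-8 recovers the original text.
as_utf8 = raw.decode("utf-8")

print(wire)              # Z%C3%BCrich
print(raw)
print(byte_length)       # 7, although the text has 6 characters
print(ascii(as_latin1))  # ascii() keeps the output printable on any console
print(ascii(as_utf8))
```

The point of the sketch is that the same seven octets are either six characters or seven, depending only on which encoding the receiving code assumes.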