S. Isaac Dealey wrote:
> 
> This brings up a good question and something I've been thinking about for
> several months. At some point I plan to ( or expect I will be asked to )
> provide support for multiple languages / internationalization in my content
> management system.
> 
> How will character sets from the client browser affect the form or url
> submissions?

Always inconsistent and always wrong. :)


> Is it a good idea to simply declare a default character set on every page
> and then perform some sort of conversion when necessary for alternate
> languages? Will this make the app more stable or will it only make it more
> difficult to support other character sets?

First, disconnect the ideas of charset and language. ISO 8859-15 (a.k.a.
Latin-9), for instance, is a charset that is sufficient for most West
European languages. On the other hand, if you want to properly support
Japanese you need to support Hiragana, Katakana and Kanji (they are
integrated into one package in charsets such as Shift_JIS, but you get
the point).

So first you have to determine whether you actually need different 
charsets, or just different languages. If you just need different 
languages from the same charset, use that charset and all you have to 
worry about is making sure you have a translator for the content.
If you need different charsets, you have a problem. It is not possible 
to have multiple charsets on one page.

The solution to this mess is Unicode. Unicode is designed from the
ground up to be the charset that has ALL characters. Every character
from every charset. The charset that will end all charsets :)
One of the funny results is that Unicode has over 20 whitespace
characters, and all have a different meaning. But it does work, and all
characters are in Unicode (OK, maybe not Sumerian cuneiform from
4000 BC, but if not they are certainly working on it).
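
To make the charset-versus-Unicode point concrete, here is a minimal
sketch in Python (not CF, just because it is easy to run anywhere). The
helper name can_encode and the two sample strings are made up for
illustration; the codec names are the ones Python's standard library
uses.

```python
# Sketch: check which charsets can represent which text.
# can_encode and the sample strings are illustrative assumptions.

def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


french = "déjà vu, 5 €"            # fits in Latin-9
japanese = "ひらがな カタカナ 漢字"   # Hiragana, Katakana and Kanji

for charset in ("iso8859_15", "shift_jis", "utf-8"):
    print(charset, can_encode(french, charset), can_encode(japanese, charset))
```

Neither legacy charset accepts both strings, while UTF-8 (one of the
Unicode encodings) accepts both, which is why a page mixing the two
needs Unicode.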

So how do you use a specific charset? Inside CF MX it is no problem. It
will use the charset the templates are in (if detected by the BOM) or
the system locale. You can override this by using
<cfprocessingdirective> with the pageEncoding attribute in each
template.
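
For those who have not met a BOM before: it is a few marker bytes at
the very start of a file. The sketch below, in Python, shows the
general idea of "use the BOM if there is one, otherwise fall back to a
default". It is not how CF MX is implemented; the function name
guess_template_charset and the Latin-9 fallback are assumptions for the
example.

```python
import codecs

# Sketch only: BOM sniffing with a fallback, NOT the actual CF MX logic.
# UTF-32 LE is checked before UTF-16 LE because its BOM starts with the
# same two bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_template_charset(raw: bytes, default: str = "iso8859_15") -> str:
    """Return the charset announced by a BOM, or the default if none is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # No BOM: stand-in for "use the system locale" (hypothetical default).
    return default


print(guess_template_charset(codecs.BOM_UTF8 + b"<html>"))
print(guess_template_charset(b"<html>"))
```

A file without a BOM tells you nothing, which is why an explicit
override per template is the safer option.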

Databases might or might not be a problem. Many will require you to use 
N-type fields (N = national = SQL-92 name) if you want to use multi-byte 
characters. Some will just fail. Check the documentation for specifics 
(don't forget to look for a Translate() function, which is a new SQL:99 
function that could translate from one charset to another if 
implemented). Read, read, read. Test, test, test.
Then of course there is the issue of the database drivers supporting the 
required charsets. For instance, the Access drivers that come with MX 
will not support Unicode.

Forget about the webserver, it is not important for this.

So we get to the browser. First thing is that you have to tell the
browser what exactly you are sending to it. Use cfcontent for that: it
sets the charset in the HTTP Content-Type header, which has the highest
priority of all options. [1] This should solve all issues with
characters being displayed incorrectly in the browser.
If I am not mistaken, if you see a square, it means that the font
does not have the appropriate glyph and if you see question marks, the
character is not present in that charset (in which case it is time to
check if your browser is on auto-detect and run your HTML through a
validator such as http://validator.w3.org/).
(A safe font used to be Arial Unicode which has a very large collection 
of glyphs, but in the neverending push for revenue this font is no 
longer available for download from the MS website and is only 
distributed together with Office. If somebody happens to have a copy of 
the install file, please mail me off-list.)
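
What "telling the browser what you are sending" amounts to is that the
charset named in the Content-Type header and the bytes of the body have
to agree. A rough Python sketch of that idea follows; build_response is
a made-up helper for illustration, not something from CF or any
framework.

```python
from typing import Dict, Tuple


def build_response(body: str, charset: str = "utf-8") -> Tuple[Dict[str, str], bytes]:
    """Encode the body and declare the same charset in the header (sketch)."""
    headers = {"Content-Type": f"text/html; charset={charset}"}
    return headers, body.encode(charset)


headers, payload = build_response("café")
print(headers["Content-Type"])

# A browser that honours the header decodes the bytes correctly ...
print(payload.decode("utf-8"))
# ... while one that assumes Latin-9 shows mangled text instead.
print(payload.decode("iso8859_15"))
```

The last line is what you get when the declaration is missing or
ignored: the bytes are fine, but they are read with the wrong charset.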

Last is the data being returned from the browser. Use the setEncoding()
function to specify the correct charset for it.
It is possible for browsers to break this on purpose. This is a typical
case of "garbage in, garbage out": if you deliberately overrule the
charset (in your browser under View), you can send something to the
server that the server doesn't expect.
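
The "garbage in, garbage out" case can be reproduced outside the
browser. In this Python sketch the standard urllib.parse functions stand
in for the browser (quote) and the server (unquote); the sample word and
the choice of charsets are assumptions for the example.

```python
from urllib.parse import quote, unquote

# The "browser" has been forced to Latin-9 and submits the form value.
submitted = quote("café", encoding="iso8859_15")
print(submitted)

# The "server" expects UTF-8, so the accented character is lost.
as_utf8 = unquote(submitted, encoding="utf-8", errors="replace")
print(as_utf8)

# Decoding with the charset the browser really used gives the original back.
as_latin9 = unquote(submitted, encoding="iso8859_15")
print(as_latin9)
```

The server has no way to tell from the bytes alone which of the two
readings was intended, so all it can do is state what it expects and
hope the client plays along.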

I am sure some people have lots to add, but I think these are the basics.

[1] http://www.w3.org/International/O-charset.html

Links of interest:
http://www.macromedia.com/support/coldfusion/internationalization.html
http://www.unicode.org/
ftp://ftp.isi.edu/in-notes/rfc2277.txt

Jochem
