My first thought on the "String Type" and "End of World" threads was that this was another "how many angels on the head of a pin" type of discussion. However, having worked through it, I believe that there is a real issue here, and that Pascal could be improved by including the code page (for string types) as part of the string data itself rather than having to infer it.

As a programmer, I want the freedom to choose the appropriate character encoding for my application - or even to mix encodings in the same application.

- I would always choose UTF-8 for database columns as that is the best compromise between international support and compact encoding (and hope that my RDBMS was not so dumb as to allocate four times the max character width for every UTF-8 string).

- If I was doing a lot of CPU-intensive processing of strings with international support then UTF-16 is what I would want to use for the internal representation - as long as the cost of converting between UTF-8 and UTF-16 was justified when reading/writing to disk.

- On the other hand, if I am working on an in-house application that I know is always going to be working in English (or Western European languages) then use of a national character set (or more likely ISO 8859-1) seems the obvious choice.
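To put rough numbers on the trade-offs above, here is a small sketch. It is written in Python rather than Pascal purely because it is easy to run, and the sample strings are my own arbitrary choices. It only compares encoded byte counts and says nothing about processing speed.

```python
# Byte counts for the same text under three encodings.
# The sample strings are illustrative assumptions, not data from any real application.
samples = {
    "English": "Hello",
    "German": "Gr\u00fc\u00dfe",            # "Gruesse" written with u-umlaut and sharp s
    "Japanese": "\u65e5\u672c\u8a9e",       # three CJK characters
}

ENCODINGS = ("utf-8", "utf-16-le", "latin-1")


def encoded_size(text, encoding):
    """Return the byte count of text in encoding, or None if it cannot be represented."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


sizes = {
    name: {enc: encoded_size(text, enc) for enc in ENCODINGS}
    for name, text in samples.items()
}

for name, row in sizes.items():
    print(name, row)

# A character outside the Basic Multilingual Plane takes 4 bytes in UTF-8
# and also 4 bytes (a surrogate pair) in UTF-16.
astral = "\U0001F600"
astral_sizes = (len(astral.encode("utf-8")), len(astral.encode("utf-16-le")))
```

For English text, UTF-8 and the 8-bit code page are the same size and UTF-16 is double. For the Japanese sample UTF-16 is the more compact of the two Unicode encodings, and ISO 8859-1 cannot represent it at all.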

Pascal does seem to support what I want. It has the unicodestring type for UTF-16 and the string type (with code page) for UTF-8 and national character sets. However, the problem is that Pascal (or FPC) permits an ambiguity between the use of UTF-8 and national character sets.

If your program is in English and your data is in English then UTF-8 and Ansistrings (or even different 8-bit code pages) look the same, and it is very easy to get sloppy, use the basic string type all over the place, and get very confused as to what your string code page really is. The whole thing then just falls apart when you try to internationalise it.
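The effect is not specific to Pascal and is easy to demonstrate. The following is a minimal sketch in Python (chosen only because it is quick to run), using Windows-1252 and ISO 8859-1 as stand-ins for "some 8-bit code page". The sample strings are my own.

```python
# ASCII-only text produces identical bytes in UTF-8 and in common 8-bit code pages,
# so a wrong assumption about the code page goes unnoticed.
english = "Hello, world"
same_for_english = (
    english.encode("utf-8") == english.encode("cp1252") == english.encode("latin-1")
)

# As soon as a non-ASCII character appears, the byte sequences diverge.
french = "caf\u00e9"                       # "cafe" with an acute accent on the e
utf8_bytes = french.encode("utf-8")        # 5 bytes: the accented e takes two
cp1252_bytes = french.encode("cp1252")     # 4 bytes: the accented e takes one

# Reading UTF-8 bytes as if they were Windows-1252 silently gives garbage.
misread = utf8_bytes.decode("cp1252")

# Reading Windows-1252 bytes as if they were UTF-8 fails outright.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(same_for_english, utf8_bytes, cp1252_bytes, repr(misread), decode_failed)
```

Nothing in the bytes themselves says which interpretation is right, which is why the mistake only shows up once non-English data arrives.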

I would argue that this problem would be avoided if the code page was part of the string data (just as the byte count is already). Strings defined without an explicit code page could then have a string with any code page assigned to them, while strings with an explicit code page as part of their type could only be assigned a string of that code page (perhaps with automatic conversion on assignment from another code page). Also, both byte length and character length could then be returned by standard routines.
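To make the proposal concrete, here is a rough model of the semantics I have in mind. It is a sketch in Python, not a description of how FPC implements anything; `TaggedString`, `assign` and the use of Python codec names as code page identifiers are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical model: raw bytes that carry their own code page."""
    data: bytes
    codepage: str  # a Python codec name stands in for a code page number

    @classmethod
    def from_text(cls, text: str, codepage: str) -> "TaggedString":
        return cls(text.encode(codepage), codepage)

    def byte_length(self) -> int:
        return len(self.data)

    def char_length(self) -> int:
        # Counts code points, which is only an approximation of "characters".
        return len(self.data.decode(self.codepage))

    def converted_to(self, codepage: str) -> "TaggedString":
        if codepage == self.codepage:
            return self
        text = self.data.decode(self.codepage)
        return TaggedString(text.encode(codepage), codepage)


def assign(value: TaggedString, declared_codepage: Optional[str] = None) -> TaggedString:
    """Model assignment to a variable.

    declared_codepage=None models a string declared without an explicit code page:
    it accepts any value unchanged. Otherwise the value is converted on assignment.
    """
    if declared_codepage is None:
        return value
    return value.converted_to(declared_codepage)


s = TaggedString.from_text("caf\u00e9", "utf-8")
untyped = assign(s)                 # keeps the UTF-8 bytes and the UTF-8 tag
latin = assign(s, "latin-1")        # converted, and tagged as latin-1

print(s.byte_length(), s.char_length())          # byte length and character length differ
print(latin.byte_length(), latin.char_length())  # here they coincide
```

The point of the sketch is that the string value, not the variable's declaration or a global default, is what answers the question "which code page is this?".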

This is in contrast to the current situation where strings without an explicit code page setting are simply assumed to use the DefaultSystemCodePage with limited run time checking (often none).

Indeed, if the code page was part of the string data, then a single "string" type should be able to unify both wide strings and ansistrings.


_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
