[fpc-pascal] String theory

Tony Whyman Tue, 10 May 2016 03:44:50 -0700

My first thought over the "String Type" or "End of World" threads was that this is another "how many angels to the pinhead" type discussion. However, having worked through it, I believe that there is an issue here, and that Pascal could be improved by including (for string types) the code page as part of the string data itself rather than having to infer it.

As a programmer, I want the freedom to choose the appropriate character encoding for my application - or even to mix encodings in the same application.

- I would always choose UTF-8 for database columns as that is the best compromise between international support and compact encoding (and hope that my RDBMS was not so dumb as to allocate four times the max character width for every UTF-8 string).

- If I was doing a lot of intensive CPU string processing of strings with international support then UTF-16 is what I would want to use for internal representation - as long as the cost of UTF-8 to UTF-16 transliteration was justified when reading/writing to disk.

- On the other hand, if I am working on an in-house application that I know is always going to be working in English (or Western European languages) then use of a National Character set (or more likely ISO 8859-1) seems the obvious choice.

Pascal does seem to support what I want. It has the unicodestring type for UTF-16 and the string type (with code page) for UTF-8 and national character sets. However, the problem is that Pascal (or FPC) permits an ambiguity between the use of UTF-8 and national character sets.

If your program is in English and your data is in English then UTF-8 and Ansistrings (or even different 8-bit code pages) look the same, and it is very easy to get sloppy, use the basic string type all over the place, and get very confused as to what your string code page really is. The whole thing then just falls apart when you try to internationalise it.
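
To make the point concrete, here is a small sketch, written in Python rather than Pascal purely so that it can be run anywhere. The sample strings and the helper name encodings_agree are invented for illustration, and the code page names are Python codec names, not FPC constants. Pure ASCII text produces the same bytes under UTF-8, ISO 8859-1 and Windows-1252, so a wrong guess about the code page does no visible harm until the first accented character turns up:

```python
# Illustration only: Python stands in for Pascal here, and the sample
# strings and helper name are invented. Codec names are Python's own.

def encodings_agree(text, codecs=("utf-8", "iso-8859-1", "cp1252")):
    """Return True if every codec produces the same bytes for text."""
    encoded = {text.encode(codec) for codec in codecs}
    return len(encoded) == 1

english = "Hello, world"
french = "café"

print(encodings_agree(english))   # ASCII only: the bytes are identical
print(encodings_agree(french))    # 'é' is 1 byte in ISO 8859-1, 2 in UTF-8

# Reading UTF-8 bytes under the wrong assumed code page yields garbage
# rather than an error, which is why the mistake is easy to miss.
misread = french.encode("utf-8").decode("iso-8859-1")
print(misread)
```

Nothing in the byte sequence itself says which interpretation was intended, which is the ambiguity described above.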

I would argue that this problem would be avoided if the code page was part of the string data (just as the byte count is already) and that strings defined without an explicit code page could have a string with any code page assigned to them, while strings with an explicit code page as part of their type could only be assigned a string of that code page (perhaps with automatic transliteration on assignment from another code page). Also, byte length and character length could then be returned by standard routines.
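
A rough model of what I mean, again in Python as a runnable stand-in and not a description of how FPC does or should implement it. TaggedString, assign_to and the codec names are all hypothetical, and the sketch takes the "automatic transliteration on assignment" variant of the rule:

```python
# Hypothetical sketch of the proposal, not FPC's implementation.
# TaggedString, assign_to and the codec names are invented for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedString:
    data: bytes      # the raw encoded bytes
    codepage: str    # travels with the data, just as the byte count does

    @classmethod
    def from_text(cls, text: str, codepage: str) -> "TaggedString":
        return cls(text.encode(codepage), codepage)

    def byte_length(self) -> int:
        return len(self.data)

    def char_length(self) -> int:
        # Counts code points, which is only one possible meaning of "character".
        return len(self.data.decode(self.codepage))

    def assign_to(self, declared: Optional[str]) -> "TaggedString":
        """Model assignment to a variable with an optional declared code page."""
        if declared is None or declared == self.codepage:
            return self  # an untyped target accepts any code page unchanged
        # Typed target: convert on assignment. Raises UnicodeEncodeError if
        # the target code page cannot represent the text.
        text = self.data.decode(self.codepage)
        return TaggedString(text.encode(declared), declared)


s = TaggedString.from_text("café", "utf-8")
print(s.byte_length(), s.char_length())

latin1 = s.assign_to("iso-8859-1")
print(latin1.codepage, latin1.byte_length(), latin1.char_length())

untyped = s.assign_to(None)
print(untyped.codepage)
```

Because the tag is stored with the bytes, both lengths can be computed without guessing, and an untyped variable never loses track of what it holds.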

This is in contrast to the current situation where strings without an explicit code page setting are simply assumed to use the DefaultSystemCodePage with limited run time checking (often none).

Indeed, if the code page was part of the string data, then the "string" type should be able to unify both wide string and ansistrings.



_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal