> http://mortoray.com/2013/11/27/the-string-type-is-broken/
Failing at case conversion or at proper detection of code point boundaries is, of course, a bug for string types that abstract a sequence of code points. The latter may be expected behaviour for an abstraction of a sequence of UTF-16 code units, though...

But in a lot of languages the string type is not meant to be a text type. It is not an abstraction of a sequence of grapheme clusters; it deals with code points (or even smaller entities) only. Reversing a list of code points or splitting between them is just not what you want to do in most cases. So the language's string type is often simply not designed to be the right tool for text. And that is the actual problem. The string type is not broken. It just is not a text type. And most often there is no text type in the standard library, despite practical needs in a world full of text!

From here on, only selfish thoughts about my needs and what I dream of follow:

When dealing with text, in most cases I need (aside from the obligatory stripping and normalizing) functionality at grapheme cluster, word, or sentence granularity. The latter two are maybe impossible to do perfectly, even for English text. Often I need length limiting based on bytes or code points that respects grapheme cluster or higher-level boundaries. I admit that I almost always care only about languages using Latin-based scripts and am ready to accept most-often-correct behaviour, because human language processing is a hard and almost unsolved problem. I feel bad about not being able to always split words and sentences properly. But at least I do not split between Hangul syllables representing a single Korean glyph...

The string type of my dreams provides appropriate operations at code point, grapheme cluster, word, sentence, and paragraph granularity. Some would even operate on multiple, and even lower, levels at once (such as slicing, length limiting...). It would support the full Unicode standard (including the formally "optional" stuff).
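
To make the difference between code point and grapheme cluster granularity concrete, here is a minimal Python sketch using only the standard library. It is an illustration under loud assumptions, not a real text type: the helper names (approx_clusters, reverse_text, limit_bytes) are made up for this example, and the segmentation merely attaches combining marks to the preceding code point. That is far from the full Unicode segmentation rules (no Hangul jamo, no ZWJ emoji sequences, no regional indicators).

```python
import unicodedata


def approx_clusters(text):
    """Group each code point with the combining marks that follow it.

    ASSUMPTION: a crude stand-in for real grapheme cluster segmentation.
    Only code points with a nonzero canonical combining class are attached.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def reverse_text(text):
    """Reverse by (approximate) cluster instead of by code point."""
    return "".join(reversed(approx_clusters(text)))


def limit_bytes(text, max_bytes, encoding="utf-8"):
    """Keep the longest prefix of whole clusters that fits in max_bytes."""
    kept = []
    used = 0
    for cluster in approx_clusters(text):
        size = len(cluster.encode(encoding))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)


word = "noe\u0308l"  # "noel" with COMBINING DIAERESIS after the "e"

# Code point reversal moves the diaeresis onto the "l".
naive_reversed = word[::-1]

# Cluster-aware reversal keeps the mark with its "e".
cluster_reversed = reverse_text(word)

# A raw byte cut at 4 bytes splits the two-byte combining mark;
# the cluster-aware limit stops before the "e" instead.
naive_cut = word.encode("utf-8")[:4].decode("utf-8", errors="ignore")
safe_cut = limit_bytes(word, 4)
```

Here the naive byte cut silently turns the accented letter into a plain "e", while the cluster-aware version drops the whole cluster. A production implementation would use a library that follows the actual Unicode text segmentation rules.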
It would let me optionally bring my own Unicode Character Database, allowing me to conform to the latest, an old, or even a customized version of the standard. It would let me plug in my own boundary detection algorithms to improve behaviour for selected languages.

The string type of my dreams is a behemoth. It would be awesome to be able to use the same standard library string type for all basic text processing in all languages. But it is hard and time-consuming to implement so that the result is "fun" to use. From a langsec point of view, a monster like that surely has a lot of potential for unintended weirdness, especially when implemented in C or C++ (likely languages for implementing core functionality like that as a performant and reuse-friendly library to be wrapped by target-language-specific bindings).

--
Allan Wegan
Jabber: [email protected]
 OTR-Fingerprint: 97ED4E4FA9CEFAFC0EF783F8D010154829529E9E
Jabber: [email protected]
 OTR-Fingerprint: A1AAA1B9C067F9884A424D339834346929164587
ICQ: 209459114
 OTR-Fingerprint: 71DE5B5E67D6D758A93BF1CE7DA06625205AC6EC
_______________________________________________ langsec-discuss mailing list [email protected] https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
