Re: [langsec-discuss] The String Type is Broken

Allan Wegan Wed, 27 Nov 2013 23:30:42 -0800

> http://mortoray.com/2013/11/27/the-string-type-is-broken/


Failing at case conversion or at proper detection of code point
boundaries is of course a bug for a string type that abstracts a
sequence of code points. The latter may be expected behaviour for an
abstraction of a sequence of UTF-16 code units, though...
But in a lot of languages the string type is not meant to be a text
type. It is not an abstraction of a sequence of grapheme clusters; it
deals with code points (or even smaller units) only.
Reversing a list of code points or splitting between them is just not
what you want to do in most cases. So the language's string type is
often simply not designed to be the right tool for text.
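
A small sketch of that point in Python, whose str is a sequence of code
points. The sample word and the variable names are chosen only for
illustration; other languages will behave differently depending on what
their string type abstracts.

```python
import unicodedata

# "noel" with a diaeresis on the e, written as e + U+0308 (combining mark).
word = "noe\u0308l"

# Four user-perceived characters, but five code points.
code_points = len(word)

# Reversing code point by code point detaches the diaeresis from the "e"
# and leaves it sitting after the "l" instead.
reversed_word = word[::-1]

# A UTF-16 based string type counts code units instead: a single code
# point outside the BMP takes two of them.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2

print(code_points, utf16_units)
print([unicodedata.name(c) for c in reversed_word])
```

Neither result is a bug in the string type itself; both follow directly
from the unit the type is defined over.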

And that is the actual problem. The string type is not broken. It just
is not a text type. And most often there is no text type in the
standard library, despite practical needs in a world full of text!


From here on, only selfish thoughts about my needs and what I dream of follow:

When dealing with text, in most cases I need (aside from the obligatory
stripping and normalizing) functionality at grapheme cluster, word or
sentence granularity. The latter two are maybe impossible to do
perfectly, even for English text.
Often I need length limiting based on bytes or code points, but
respecting grapheme cluster or higher-level boundaries.
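
A minimal sketch of such a length limit in Python, under a loud
assumption: a "cluster" here is just a base code point plus the
combining marks that follow it. That is a simplification and not the
UAX #29 grapheme cluster algorithm (it ignores Hangul jamo, ZWJ emoji
sequences, regional indicators and more). The names simple_clusters and
limit_utf8 are hypothetical helpers, not part of any library.

```python
import unicodedata


def simple_clusters(text):
    """Yield a base code point together with the combining marks after it.

    Simplified stand-in for real grapheme cluster segmentation.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            # A non-combining code point starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


def limit_utf8(text, max_bytes):
    """Return the longest prefix of whole clusters fitting in max_bytes."""
    kept = []
    used = 0
    for cluster in simple_clusters(text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)


word = "noe\u0308l"  # e + combining diaeresis takes 3 bytes in UTF-8
print(limit_utf8(word, 4))  # the e and its diaeresis are dropped together
```

Cutting the encoded bytes at offset 4 instead would end in the middle of
the two-byte combining mark and leave invalid UTF-8 behind.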

I admit that I almost always only care about languages using
Latin-based scripts and am ready to accept most-often-correct behaviour,
because human language processing is a hard and almost unsolved problem.
I feel bad about not being able to always split words and sentences
properly. But at least I do not split between the Hangul jamo that
make up a single Korean syllable block...
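
To illustrate the Hangul case with Python's standard unicodedata module:
in decomposed form (NFD) a single syllable block consists of several
code points (conjoining jamo), so a split at a code point boundary can
land inside it. The sample syllable is my own choice for illustration.

```python
import unicodedata

syllable = "\uD55C"  # one precomposed Hangul syllable block

# NFD decomposes the block into its conjoining jamo.
jamo = unicodedata.normalize("NFD", syllable)
print(len(syllable), len(jamo))

# Cutting after two code points and recomposing yields a different,
# shorter syllable: the final consonant is silently lost.
truncated = unicodedata.normalize("NFC", jamo[:2])
print(truncated == syllable)
```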

The string type of my dreams provides appropriate operations at code
point, grapheme cluster, word, sentence and paragraph granularity. Some
would even operate on multiple and even lower levels at once (such as
slicing, length limiting...). It would support the full Unicode standard
(including the formally "optional" stuff).
It would let me optionally bring my own Unicode Character Database,
allowing me to conform to the latest, an old or even a customized
version of the standard. It would let me plug in my own boundary
detection algorithms so that I can improve behaviour for selected
languages.
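
A rough sketch of what "plug in my own boundary detection" could look
like, in Python. Everything here is hypothetical: the Text class, its
word_segmenter parameter and both segmenters are invented for
illustration, and the regular expressions are crude approximations
rather than real word boundary rules.

```python
import re
from typing import Callable, Iterable

# A segmenter takes the raw string and returns its pieces in order.
Segmenter = Callable[[str], Iterable[str]]


def default_words(text: str) -> Iterable[str]:
    # Assumption: a word is a maximal run of word characters.
    return re.findall(r"\w+", text)


def english_words(text: str) -> Iterable[str]:
    # Custom rule: keep apostrophes and hyphens inside a word.
    return re.findall(r"\w+(?:['-]\w+)*", text)


class Text:
    """Hypothetical text type with a replaceable word segmenter."""

    def __init__(self, value: str, word_segmenter: Segmenter = default_words):
        self.value = value
        self._word_segmenter = word_segmenter

    def words(self) -> list:
        return list(self._word_segmenter(self.value))


sample = "don't stop-gap"
print(Text(sample).words())
print(Text(sample, word_segmenter=english_words).words())
```

The same shape would extend to grapheme cluster and sentence
segmenters, or to a segmenter backed by a specific Unicode Character
Database version.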

The string type of my dreams is a behemoth.
It would be awesome to just be able to use the same standard library
string type for all the basic text processing in all languages. But it
is hard and time-consuming to implement it so that the result is "fun"
to use.
From a langsec point of view, a monster like that surely has a lot of
potential for unintended weirdness - especially when implemented in
C or C++ (the likely languages for implementing core functionality
like that as a performant and reuse-friendly lib to be encapsulated by
target-language-specific bindings).



-- 
Allan Wegan
Jabber: [email protected]
 OTR-Fingerprint: 97ED4E4FA9CEFAFC0EF783F8D010154829529E9E
Jabber: [email protected]
 OTR-Fingerprint: A1AAA1B9C067F9884A424D339834346929164587
ICQ: 209459114
 OTR-Fingerprint: 71DE5B5E67D6D758A93BF1CE7DA06625205AC6EC

_______________________________________________
langsec-discuss mailing list
[email protected]
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
