> To me, the crucial difference is that strings need *parsing*
> -- however simple, e.g., special handling for escape characters and
> separators -- to become interpretable text.
> And this parsing is what's broken.

Parsing is of course needed to detect grapheme cluster boundaries at least once for each string of code points while converting it to a string of grapheme clusters (text). But whether escape characters need handling depends on the source format. Length-prefixed formats do not need escaping, whether in transit or at rest, and can therefore skip the parsing needed for (un)escaping.

In a lot of cases the resulting string of grapheme clusters is itself escaped, or is expected to carry a message to be understood by the software processing it. Then of course one or more further rounds of parsing may be needed for unescaping and transformation into a symbol list or tree...

The lack of a grapheme-cluster-based string type is not necessarily harmful for (un)escaping or for parsing programming languages. The evil lies in the subtlety of the failure modes. There may be security implications in some places when it comes to sanitizing (or annotating irregularities so that a human can actually see them, as there are a lot of "invisible" or, to humans, "indistinguishable" code points in the Unicode standard). There may also be security implications when slicing text, where the meaning of a glyph, word, or sentence changes because of a split inside it.

In most cases the results of such failures just annoy users who do not find what they searched for or who look at strangely crippled truncated text. Most often it is just a "GUI issue" and not security relevant. I guess that is why most software designers do not really care. And it is one of those problems where one has to choose a non-perfect solution because human language processing is just not there yet. I would surely choose the implementation for truncating article teaser texts that detects the exact sweet spot where most readers have read just enough to be teased, if one were readily available. But instead I have to choose how much processing is enough.

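To make the slicing failure mode concrete, here is a minimal Python sketch. The function name `truncate_code_points` and the sample strings are made up for illustration. It performs truncation after N code points: nothing fails loudly, but the accent belonging to the last kept letter is silently dropped, and a flag emoji is cut down to a lone regional indicator.

```python
# Truncation by code points: cut after `limit` items of whatever the
# string type happens to store (code points, in Python's case).
def truncate_code_points(text: str, limit: int) -> str:
    return text[:limit]


# "résumé" written with combining acute accents (U+0301): 8 code points,
# but only 6 user-perceived characters.
decomposed = "re\u0301sume\u0301"

# A flag is a pair of regional indicator symbols (here D + E).
flag = "\U0001F1E9\U0001F1EA"

cut_word = truncate_code_points(decomposed, 2)  # "re", the accent is lost
cut_flag = truncate_code_points(flag, 1)        # a lone regional indicator

# No exception is raised in either case; the damage is only visible to a
# human looking at the rendered result.
print(repr(cut_word), repr(cut_flag))
```
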
The traditional answer seems to be "well, let's just take that string, truncate after N whatever-is-stored-in-that-thing and live with the result" for an arbitrarily chosen N. I go for a slightly better approach, falling back from primitive sentence and word detection to grapheme clusters as the smallest unit of text. I still use that arbitrarily chosen N as the maximum length; I would go for full text recognition instead, but cannot afford it. A lot of software designers cannot even afford to use grapheme clusters as the atomic unit, because the programming language of choice (regardless of whose choice that was) lacks support for them. So the real solution is to make it easy to do it right (grapheme cluster level) or at least better (word and sentence level).

> A classification of such kinds of "string to text" parsing might
> help properly frame and resolve this issue.

Hmm, the sort of parsing I mean is only the set of boundary detections defined in the Unicode standard.

(Un)escaping is another huge problem domain containing a lot of opportunities for failure. But I think it is a well understood field, and we already have the tools for it almost right in the big languages. (Un)escaping for processing or transportation is perfectly possible at the code point or even the byte level. It therefore does not get any better by introducing a type that deals with text at the grapheme cluster level or a more abstract one.

Human language processing is of course needed for better word and sentence boundary detection, and there is a lot of parsing going on there, I guess. But for the discussion about the need for a better base type representing text, I do not think that matters. Human language processing is a huge field where one could get lost really fast. That sort of parsing would have to be pluggable, to augment or replace the default Unicode standard algorithms.

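The fallback described above (sentence, then word, then grapheme cluster, with N as the maximum length) could be sketched roughly as follows in Python. This is only an illustration under loud assumptions: `grapheme_clusters` is a crude stand-in that glues combining marks and ZWJ sequences to the preceding code point, and is not the Unicode boundary algorithm (it ignores flags, Hangul jamo, and more). The sentence and word detection is just as primitive. The names `truncate` and `segmenter` are hypothetical; the `segmenter` parameter only hints at how a real implementation could be plugged in.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, glues emoji sequences together


def grapheme_clusters(text):
    """Crude approximation of grapheme clusters (NOT the full Unicode rules)."""
    clusters = []
    for ch in text:
        extends = (
            unicodedata.category(ch) in ("Mn", "Mc", "Me")   # combining marks
            or ch == ZWJ
            or (clusters and clusters[-1].endswith(ZWJ))     # joined by ZWJ
        )
        if clusters and extends:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate(text, max_clusters, segmenter=grapheme_clusters):
    """Keep at most `max_clusters` clusters, preferring to cut after a
    sentence, then after a word, then at a plain cluster boundary."""
    clusters = segmenter(text)
    if len(clusters) <= max_clusters:
        return text
    sentence_cut = word_cut = None
    for kept in range(1, max_clusters + 1):
        # clusters[kept] is the first cluster that would be dropped.
        if clusters[kept].isspace() and not clusters[kept - 1].isspace():
            word_cut = kept
            if clusters[kept - 1][0] in ".!?":
                sentence_cut = kept
    cut = sentence_cut or word_cut or max_clusters
    return "".join(clusters[:cut])


print(truncate("First sentence. Second sentence is much longer.", 30))
print(truncate("Hello wonderful world", 8))
print(repr(truncate("re\u0301sume\u0301", 2)))
```

With these inputs the first call cuts after the first sentence, the second after "Hello", and the third keeps the accent together with its base letter instead of splitting it off.
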
> I suppose that the reason may be that the required parsing is
> considered elementary, and elementary almost always means "dealt with
> in an ad-hoc way".

Yes, that has to be the cause. ;)

--
Allan Wegan
Jabber: [email protected]
 OTR-Fingerprint: 97ED4E4FA9CEFAFC0EF783F8D010154829529E9E
Jabber: [email protected]
 OTR-Fingerprint: A1AAA1B9C067F9884A424D339834346929164587
ICQ: 209459114
 OTR-Fingerprint: 71DE5B5E67D6D758A93BF1CE7DA06625205AC6EC
_______________________________________________ langsec-discuss mailing list [email protected] https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
