On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
> Billy Mays wrote:
>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>> 2011-07-16
>>
>> I gave it a shot. It doesn't do any of the Unicode delims, because
>> let's face it, Unicode is for goobers.
>
> Goobers... that would be one of those new-fangled slang terms that the
> young kids today use to mean its opposite, like "bad", "wicked" and
> "sick", correct?
>
> I mention it only because some people might mistakenly interpret your
> words as a childish and feeble insult against the 98% of the world who
> want or need more than the 127 characters of ASCII, rather than
> understand you meant it as a sign of the utmost respect for the richness
> and diversity of human beings and their languages, cultures, maths and
> sciences.
TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.
For as long as I have used Python (which I admit has only been 3 years),
Unicode has never appeared to be implemented correctly. I'm probably
repeating old arguments here, but whatever.
Unicode is a mess. When someone says ASCII, you know they can only mean
characters 0-127. When someone says Unicode, do they mean "real" Unicode
(and is it 2 bytes or 4 bytes per character?), or UTF-32, or UTF-16, or
UTF-8? When using the 'u' datatype with the array module, the docs don't
even tell you whether it's 2 bytes wide or 4. Which is it? I'm sure all
of these can be figured out, but the problem is that now I have to ask
every one of these questions whenever I want to use strings.
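
For what it's worth, you can answer the width question at runtime rather
than from the docs. A minimal check, assuming a CPython 2.x interpreter
(where a given build is either "narrow" or "wide"):

    import array
    import sys

    # Bytes per item in an array of type 'u': 2 on a narrow build, 4 on a wide one.
    print(array.array('u').itemsize)
    # Largest code point the interpreter supports: 0xffff (narrow) or 0x10ffff (wide).
    print(hex(sys.maxunicode))
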
Secondly, Python doesn't do Unicode exception handling correctly (though
I suspect it's a broader problem with languages in general). A good
example of this is UTF-8, where certain byte values are simply invalid
(such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already
knew that, as well as everyone else who wants to use strings for some
reason).
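
For instance (a Python 2 interactive session; the exact error text is
CPython's wording, not anything promised by the codec):

    >>> '\xc0\xaf'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: ...
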
When embedding Python in a long-running application that receives user
input, it is very easy to make a mistake that brings down the whole
program. If any user string isn't properly wrapped in a try/except, a
user can craft a malformed byte string that the UTF-8 decoder will choke
on. Using ASCII (or better, an 8-bit encoding like Latin-1, where every
byte value is valid) doesn't have these problems.
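
A rough sketch of the guard you end up writing around every piece of
user input (decode_user_bytes is my own name, nothing standard, and
replacing bad bytes is only one possible policy):

    def decode_user_bytes(raw):
        """Decode raw bytes as UTF-8 without letting bad input raise."""
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Swap undecodable bytes for U+FFFD instead of letting the
            # exception take down the embedding application.
            return raw.decode('utf-8', 'replace')
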
Another 'feature' of Unicode (this must have been a good laugh amongst
the UniDevs) is the zero-width space (U+200B, encoded in UTF-8 as 0xE2
0x80 0x8B). Any string can masquerade as any other string by placing a
few of these in it. Any word filters you might have are now defeated by
some cheesy Unicode nonsense character. Can you just check for these
characters and strip them out? Yes. Should you have to? I would say no.
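
A quick demonstration of the masquerading (the two words print
identically in most fonts, yet a naive filter sees them as different):

    banned = u'badword'
    sneaky = u'bad\u200bword'          # zero-width space in the middle
    print(sneaky == banned)            # False: the filter is defeated
    print(sneaky.replace(u'\u200b', u'') == banned)   # True, after stripping
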
Does it get better? Of course! International character sets used for
domain name encoding use yet another scheme (Punycode). Are the
following two domain names the same: tést.com and xn--tst-bma.com? Who
knows!
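
(As it happens, they are the same name; the stdlib's 'idna' codec will
do the round trip for you, if you know to reach for it. Python 2
session:)

    >>> u'tést.com'.encode('idna')
    'xn--tst-bma.com'
    >>> 'xn--tst-bma.com'.decode('idna')
    u't\xe9st.com'
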
I suppose I can gloss over the pains of using Unicode in C, where every
string needs to become a length-prefixed string (LPS) since 0x00 is a
valid code point (0x0000 in the 2-byte encodings), or else suffer the
O(n) lookup time for strlen and concatenation operations.
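
To put the NUL problem in Python terms (UTF-16 is where zero bytes
really pile up; a UTF-8 stream only contains 0x00 when the text itself
contains U+0000):

    text = u'hi'
    print(repr(text.encode('utf-16-le')))    # 'h\x00i\x00' -- strlen() would report 1
    print(repr(u'a\x00b'.encode('utf-8')))   # 'a\x00b' -- an embedded NUL survives encoding
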
Can it get even better? Yep. We also now need a Byte Order Mark (BOM) to
determine the endianness of our characters. Are they little-endian or
big-endian? (Or perhaps one of the two possible middle-endian
orderings?) Who knows? String processing with Unicode is unpleasant, to
say the least. I suppose that's what we get when things are designed by
committee.
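
The difference is easy to see from Python (a sketch; the byte order the
bare 'utf-16' codec picks follows the machine it runs on):

    import codecs

    data = u'abc'.encode('utf-16')            # the bare codec prepends a BOM...
    print(data.startswith(codecs.BOM_UTF16_LE) or
          data.startswith(codecs.BOM_UTF16_BE))    # True
    print(repr(u'abc'.encode('utf-16-le')))   # 'a\x00b\x00c\x00' -- no BOM; the reader
                                              # must already know the byte order
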
But Hey! The great thing about standards is that there are so many to
choose from.
--
Bill