Glenn Maynard said:
> On Thu, Apr 08, 2004 at 08:35:21PM -0400, Michael B Allen wrote:
>> This probably states the definitive position for text handling:
>>
>> http://www.w3.org/TR/1999/WD-charmod-19991129/#Encodings
>>
>> But even though the encoding is not clearly stated as UTF-16, the
>> Document Object Model (DOM), which is basically the document tree
>> inside a web browser and key to all HTML and XML processing including
>> JavaScript and XSLT processing, *requires* the encoding be UTF-16:
>>
>> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578
>
> "The UTF-16 encoding was chosen because of its widespread industry
> practice."
>
> Very funny; it was chosen since it's what Windows is stuck with.

Yeah, "widespread industry practice" is ridiculous. But as someone else
pointed out, Java was used almost exclusively when developing the DOM
APIs. That, coupled with IE, ensured UTF-16 was the premier encoding, but
they could have just left it open.

> That aside, "all" above is incorrect. You don't have to use DOM to
> process HTML and XML.

So what kind of processing are you talking about, exactly? You can do
basic filtering with a SAX-type API. Grep and sed might be adequate for
really simple stuff. But I think the DOM is used for really serious
manipulation.

> (Ultimately, if one *had* to use UTF-16 to process HTML,
> then something along the line is horribly wrong: a language
> specification can't legitimately make any requirements about
> transparent implementation details.)

That's exactly what I argued (I think). I claimed the string encoding
should just be left open. Why does it matter to a high-level API how
strings are encoded? You just want to iterate over characters, index,
create substrings, etc. Only one W3C DOM person answered me, and even then
it was mostly offline. I never got a straight answer, AFAICR.

Ultimately, however, a DOM implementation, whether it be for Python,
Perl, or C, must be UTF-16 *to be W3C compliant*. I suspect the
requirement is readily ignored. That's the punchline to this thread...

Mike

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
