On Saturday 11 September 2004 14:18, Phil Taylor wrote:
> On 11 Sep 2004, at 07:17, Paul Rosen wrote:
> > I just thought of something. Should we make the strings Unicode, or have
> > some mechanism for other alphabets? Perhaps that needs to be a switch passed
> > to the parser, or a field in the returned structure or something.
>
> I'm a bit hazy on the capabilities of Unicode. While it would be nice to
> support lyrics in cyrillic script, would this not mean that the abc would
> have to be written using cyrillic characters? Or do all the Unicode variants
> have plain ascii as a subset?

No and yes. The Unicode character set contains *all* characters of (nearly)
*all* alphabets, including the characters of the ASCII set. This means that
you can combine just about every language in one text document, which is
impossible with the iso-8859-x encodings.

The UTF-8 encoding of Unicode (the most widely used, at least on UNIX
platforms -- I don't know about Windows) is backwards compatible with plain
US-ASCII. More specifically, the first 128 Unicode characters are the same as
in US-ASCII and are encoded in the same way (i.e. 1 byte per character).
Other characters are encoded using 2, 3 or even 4 bytes.

This means that a parser that supports UTF-8 reads all old abc files in ASCII
without problems. However, if you pass an abc file encoded in UTF-8 that
contains characters beyond the US-ASCII range to a program like abcm2ps (that
only supports the iso-8859 family), you will see some strange things in the
output, e.g. "sÃ¶tetkÃ©k" where you expected "sötetkék" (each accented
character is encoded using 2 bytes, and each of those bytes is shown as a
separate iso-8859-1 character).

For more information about Unicode and UTF-8, see:

http://www.unicode.org/
http://www.utf-8.com/
http://www.faqs.org/rfcs/rfc3629.html

And a nice one: an article on "Joel on Software" about Unicode, titled "The
Absolute Minimum Every Software Developer Absolutely, Positively Must Know
About Unicode and Character Sets (No Excuses!)":

http://www.joelonsoftware.com/articles/Unicode.html

A quote from the article to finish off:

<<So I have an announcement to make: if you are a programmer working in 2003
and you don't know the basics of characters, character sets, encodings, and
Unicode, and I *catch* you, I'm going to punish you by making you peel onions
for 6 months in a submarine. I swear I will.>>

Cheers,
bert

-- 
Bert Van Vreckem <http://flanders.blackmill.net/>
Te audire non possum. Musa sapientum fixa est in aure.
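
P.S. For anyone who wants to try this out, here is a minimal sketch in Python of the two points above: how many bytes UTF-8 needs per character, and what happens when UTF-8 bytes are read as iso-8859-1. It is only an illustration; the function names utf8_length and misread_as_latin1 are made up for this example and have nothing to do with abcm2ps or any abc parser.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as iso-8859-1.

    This imitates (as an assumption about the failure mode) a program
    that only knows the iso-8859 family being fed a UTF-8 file.
    """
    return text.encode("utf-8").decode("iso-8859-1")


# One example character each for the 1-, 2-, 3- and 4-byte cases.
for ch in ("a", "ö", "€", "𝄞"):
    print(repr(ch), utf8_length(ch), "byte(s)")

# Pure ASCII text survives unchanged; accented letters turn into two
# iso-8859-1 characters each.
print(misread_as_latin1("X:1"))
print(misread_as_latin1("sötetkék"))
```

Plain ASCII text such as an "X:1" header line comes through unchanged, which is why old abc files are not affected.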
