On Saturday 11 September 2004 14:18, Phil Taylor wrote:
> On 11 Sep 2004, at 07:17, Paul Rosen wrote:
> > I just thought of something. Should we make the strings Unicode, or
> > have some mechanism for other alphabets? Perhaps that needs to be a
> > switch passed to the parser, or a field in the returned structure or
> > something.
>
> I'm a bit hazy on the capabilities of Unicode.  While it would be nice
> to support lyrics in cyrillic script, would this not mean that the abc
> would have to be written using cyrillic characters?  Or do all the
> Unicode variants have plain ascii as a subset?

No and yes. The Unicode character set contains *all* characters of (nearly) 
*all* alphabets, including the characters of the ASCII set. This means that 
you can combine just about every language in one text document, which is 
impossible with the iso-8859-x encodings.
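
To make that last point concrete, here is a small Python sketch (Python is 
my choice for illustration; the sample strings and the helper name 
encodable_parts are made up). It asks which parts of iso-8859 can represent 
a given text, relying on the iso8859_N codec names that ship with Python:

```python
# Parts of ISO-8859 that Python ships codecs for (there is no part 12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def encodable_parts(text):
    """Return the ISO-8859 part numbers that can represent every character."""
    parts = []
    for n in ISO_8859_PARTS:
        try:
            text.encode(f"iso8859_{n}")
        except UnicodeEncodeError:
            continue  # at least one character is missing from this part
        parts.append(n)
    return parts


hungarian = "sötetkék"
russian = "Москва"
mixed = hungarian + " " + russian

print(encodable_parts(hungarian))   # several Latin parts, including 1 and 2
print(encodable_parts(russian))     # only the Cyrillic part, 5
print(encodable_parts(mixed))       # [] : no single part covers both
print(mixed.encode("utf-8"))        # UTF-8 encodes the mixed text fine
```

Each word fits in some iso-8859 part, but the combined line fits in none of 
them, whereas a Unicode encoding such as UTF-8 handles it without trouble.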

The UTF-8 encoding of Unicode (the most widely used, at 
least on UNIX platforms -- I don't know about Windows) is backwards 
compatible with plain US-ASCII. More specifically, the first 128 Unicode 
characters are the same as in US-ASCII and are encoded in the same way (i.e. 
1 byte per character). All other characters are encoded using 2, 3 or 4 
bytes.
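
A quick way to check this for yourself, sketched in Python (the sample 
characters and the helper name utf8_length are arbitrary choices of mine):

```python
def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# One character from each length class.
for ch in ["A", "ö", "Ж", "€", "𝄞"]:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
# U+0041 takes 1 byte(s)
# U+00F6 takes 2 byte(s)
# U+0416 takes 2 byte(s)
# U+20AC takes 3 byte(s)
# U+1D11E takes 4 byte(s)

# Pure ASCII text produces exactly the same bytes in both encodings.
tune_header = "X:1\nT:Speed the Plough\nK:G\n"
print(tune_header.encode("ascii") == tune_header.encode("utf-8"))  # True
```

The last line is the backwards compatibility in practice: an abc file that 
contains only ASCII is already a valid UTF-8 file, byte for byte.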

This means that old abc files in ASCII are read without problems by a parser 
that supports Unicode. However, if you pass an abc file encoded 
in UTF-8 that contains characters beyond the US-ASCII range to a program like 
abcm2ps (that only supports the iso-8859 family), you will see some strange 
things in the output, e.g. "sÃ¶tetkÃ©k" where you expected "sötetkék" (each 
accented character is encoded using 2 bytes, and each byte is then shown as 
a separate iso-8859-1 character).
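
The effect is easy to reproduce. The Python sketch below only imitates what 
a Latin-1-only program does with UTF-8 input; it is not how abcm2ps itself 
is written, and the function name misread_as_latin1 is made up:

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then interpret those bytes as iso-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


word = "sötetkék"
print(word.encode("utf-8"))      # b's\xc3\xb6tetk\xc3\xa9k'
print(misread_as_latin1(word))   # sÃ¶tetkÃ©k

# Plain ASCII survives the same treatment unchanged.
print(misread_as_latin1("K:G"))  # K:G
```

The bytes C3 B6 that UTF-8 uses for "ö" come out as the two Latin-1 
characters "Ã" and "¶", which is exactly the kind of garbage described 
above.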

For more information about Unicode and utf-8, see:
http://www.unicode.org/
http://www.utf-8.com/
http://www.faqs.org/rfcs/rfc3629.html

And a nice one: an article on "Joel on Software" about Unicode, titled "The 
Absolute Minimum Every Software Developer Absolutely, Positively Must Know 
About Unicode and Character Sets (No Excuses!)":
http://www.joelonsoftware.com/articles/Unicode.html

A quote from the article to finish off:

<<So I have an announcement to make: if you are a programmer working in 2003 
and you don't know the basics of characters, character sets, encodings, and 
Unicode, and I *catch* you, I'm going to punish you by making you peel onions 
for 6 months in a submarine. I swear I will.>>

Cheers,

bert

-- 
Bert Van Vreckem      <http://flanders.blackmill.net/>
Te audire non possum. Musa sapientum fixa est in aure.