On Tue, Mar 18, 2008 at 1:53 PM, John Cowan <[EMAIL PROTECTED]> wrote:

> > Let's see... ASCII is valid UTF-8, so all ASCII external
> > representations wouldn't need any encoding or decoding work.
That is a huge advantage. I think unless there are some insurmountable gotchas, or it causes major efficiency problems, there are some good arguments for using UTF-8 for strings in Chicken.

> True. However, pure ASCII is less comment than people believe, as
> indicated by the 59K Google hits for "8-bit ASCII".

Less common, you mean? I think ASCII is the most common representation for everything. The popularity of XML goes to show what pains people are taking to make data human-readable. (I disagree with the need for that a lot of the time, but whatever.) Source code written by non-English speakers is usually ASCII nevertheless. (It must be harder to learn a language when you don't know what the keywords mean in English.)

My favorite editor, SciTE, BTW, supports UTF-8 nicely: it preserves the BOM if it is there, assumes ASCII if it is not, and can be told to switch to UTF-8 mode if the file has no BOM but actually is UTF-8; when you write the file it then prepends the BOM. All exactly as it should be.

I am seeing fewer web pages in other 8-bit codepages (like KOI8-R, CP1251, etc.) than there used to be, and/or modern browsers are doing a better job of detecting the codepage and making it transparent anyway. On the one hand it was nice to pick your language and still have 8-bit strings. OTOH it was really messy having four or so code pages to choose from for Cyrillic (two of which were used a lot); and it's also nice to be able to mix languages, insert the Euro symbol into any string, etc. MP3 ID tags lagged for a long time... Russian MP3s tend to have CP1251 tags (with no way to declare that that's what they are; you just have to know), but now UTF-8 can be used there too.

> > Most recent formats and protocols require or strongly recommend UTF-8
> > (see XML etc.) so those wouldn't need any encoding/decoding either.
>
> Well, there's an awful lot of content on the Internet and on local hard
> disks that is neither true ASCII nor UTF-8.
> In particular, UTF-16 is
> the usual representation of Unicode on Windows, and various non-Unicode
> character sets are the usual representation of text on Windows, and
> consequently on the Web too. UTF-8 is something of an oddity there.

I disagree. Text and HTML files you may find lying about on hard drives and web servers all over the world tend to be either ASCII or UTF-8, as far as I've seen. Windows programs may use UTF-16 for string variables in memory, and maybe for serialization to "binary" files, but not for files that are meant to be human-readable.

> I'm fine with using UTF-8 as our internal representation.

Sounds good to me.

> > Unicode/UTF8-aware string operations will perform a correct
> > replacement and insert the two extra bytes, if the source string
> > really is plain ASCII.

Insertion has a linear cost, though, because the string is a contiguous array, right? This is probably the reason Java sidestepped the issue by specifying that strings are immutable. In a fairly pure functional language that policy would make sense too (you can modify a string only by copying it and throwing the old one away; that way you see more clearly what the cost of your operations is), but we can't go breaking existing programs, can we...

So char has to be 16 or 32 bits, right? (Depending on how much of Unicode we wish to support... 16 bits is almost always enough.) When you do string-ref on a UTF-8 string it will need to return the Unicode character at that character index, not the byte at the bytewise index, right? Then unfortunately you have to iterate over the string in order to count characters; you can't just do an offset from the beginning. (This is where UTF-16 as an in-memory representation has an advantage.)

For Display Scheme I was planning to assume all strings are UTF-8, so this change will make things nice and consistent. But I had to convert on the fly to 16-bit Unicode to render characters with FreeType.
(Not a big deal because I did the rendering 1 glyph at a time anyway.)

http://dscm.svn.sourceforge.net/viewvc/dscm/src/g2d-fb16-impl.c?revision=67&view=markup
line 761 (sorry, that code isn't very presentable yet and needs some modularization)

Alternative string representations could be in an egg (16-bit Unicode, 32-bit Unicode, string-plus-codepage, EBCDIC, or whatever :-). When doing in-place modifications on strings that actually contain non-ASCII characters, actual Unicode is more efficient, so it would be nice to be able to switch to that representation when it has advantages (like Windows does for string variables).

_______________________________________________
Chicken-users mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/chicken-users
