On Mon, 7 July 2003 05:46, Wu Yongwei wrote:
> What if something occurs in the file and does not form a valid, say,
> UTF-16 sequence?
That is clearly invalid according to the specs, so an error would be detected.
But '\0' characters are valid UTF-8, so the only reason to disallow them would
be laziness, and although I am lazy, I do care about my language more :-)
An example of where they occur in what can be considered text: GNU find with
the -print0 option, whose output is usually consumed by xargs -0. They are used
as separators between filenames because they are guaranteed not to occur in a
filename.
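To illustrate the find -print0 convention, here is a minimal sketch in Python (not in the language discussed in this post). The helper name split_print0 and the sample data are made up for the example. It assumes each filename is terminated by a NUL byte, and it also shows that a NUL passes through UTF-8 decoding unchanged.

```python
def split_print0(data: bytes) -> list[bytes]:
    """Split `find -print0` style output into raw filenames (hypothetical helper)."""
    # Every name is terminated by a NUL byte, so a well-formed stream
    # ends with b"\x00" and the last piece produced by split() is empty.
    parts = data.split(b"\x00")
    if parts and parts[-1] == b"":
        parts.pop()
    return parts


# Made-up sample: names may contain spaces and even newlines, but never NUL.
sample = b"a file.txt\x00new\nline.txt\x00plain\x00"
names = split_print0(sample)

# NUL is an ordinary code point (U+0000) as far as UTF-8 is concerned.
decoded = b"a\x00b".decode("utf-8")

print(names)
print(repr(decoded))
```

A newline-separated format could not carry the second name safely, which is why NUL is the separator of choice here.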
A find or xargs written in my language in a straightforward way would break on
filenames that are not valid UTF-8 in a UTF-8 locale, though. It is the
responsibility of the system's setup to keep filenames valid in the current
locale. Well, some defensive applications might want to internally switch
their locale charset to ISO-8859-1 in order to process arbitrary bytes as
text...
Maybe there should be a way to set filename encoding separately from the
locale.
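The ISO-8859-1 trick mentioned above works because that charset maps every byte value 0..255 to exactly one code point, so decoding can never fail and re-encoding restores the original bytes. A minimal Python sketch, with made-up sample bytes, and not a statement about how any particular find or xargs behaves:

```python
# Made-up NUL-terminated names encoded in Latin-1; not valid UTF-8.
raw = b"caf\xe9\x00na\xefve\x00"

# A strict UTF-8 decoder rejects these bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 accepts any byte sequence and round-trips it losslessly.
text = raw.decode("iso-8859-1")
round_trip = text.encode("iso-8859-1")

print(utf8_ok)
print(round_trip == raw)
```

The price is that the resulting text is only meaningful as characters if the names really were Latin-1; otherwise it is just a lossless carrier for the bytes.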
> Western Visual Basic programmers often use characters to represent bytes,
> which makes applications break when the default encoding changes from Latin-1
> to UTF-8 or some DBCSs.
I do distinguish characters and bytes. I have separate types:
- String - immutable array of characters,
- CharArray - mutable and resizable array of characters, one of the ways of
  building strings from pieces (usually it's simpler to join a list of
  strings), and
- ByteArray - mutable and resizable array of bytes, used to pass binary data,
  or to pass around text stored in an unknown encoding.
A single code point is represented as a String of length 1; a single byte is
represented as an Int. The language is dynamically typed, so it's appropriate
not to make further distinctions.
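As a rough analogy only, Python happens to draw similar lines, so the distinction can be sketched there. The mapping is an assumption made for illustration: str stands in for String, a list of one-character strings for CharArray, and bytearray for ByteArray. None of this is the actual API of the language described above.

```python
s = "zażółć"                 # str: immutable sequence of code points

# Stand-in for a mutable, resizable character array.
chars = list(s)
chars.append("!")
joined = "".join(chars)      # build a new string from the pieces

# Mutable, resizable array of bytes holding the UTF-8 encoding.
buf = bytearray(s.encode("utf-8"))
buf.append(0x21)             # append the byte for '!'

one_char = s[2]              # a single code point is a str of length 1
one_byte = buf[0]            # a single byte is an int

print(len(s), len(buf))
print(type(one_char).__name__, type(one_byte).__name__)
```

The character count and the byte count differ because four of the six letters take two bytes each in UTF-8, which is exactly why the two kinds of array should not be confused.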
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/