Re: Strings in a programming language

Marcin 'Qrczak' Kowalczyk Mon, 07 Jul 2003 00:30:12 -0700

On Mon, 7 July 2003 05:46, Wu Yongwei wrote:

> What if something occurs in the file that does not form a valid, say,
> UTF-16 sequence?


That is clearly invalid according to the specs, so an error would be detected.
But '\0' characters are valid UTF-8, so the only reason to disallow them would
be laziness, and although I am lazy, I care about my language more :-)
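
To illustrate the difference between the two cases, here is a minimal sketch
in Python (not in my language; the helper name decode_utf8_strict is made up
for this example). It assumes a strict decoder: a NUL is accepted as an
ordinary code point, while a byte that cannot appear in UTF-8 is reported as
an error.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Decode bytes as UTF-8; raise UnicodeDecodeError on invalid input."""
    return data.decode("utf-8", errors="strict")


# U+0000 is encoded as the single byte 0x00, which is valid UTF-8.
with_nul = decode_utf8_strict(b"abc\x00def")

# 0xFF never occurs in well-formed UTF-8, so strict decoding must fail.
try:
    decode_utf8_strict(b"abc\xff")
    invalid_rejected = False
except UnicodeDecodeError:
    invalid_rejected = True
```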

An example where they occur in what can be considered text: GNU find with the
-print0 option, whose output is usually consumed by xargs -0. They are used as
separators between filenames because they are guaranteed not to occur in a
filename.
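
A consumer of that format might look roughly like the following Python sketch.
The function name split_nul_separated and the sample data are invented for
illustration; the only assumption about the format is that every name is
terminated by a single NUL byte.

```python
def split_nul_separated(data: bytes) -> list[bytes]:
    """Split data in the `find -print0` format into raw filenames."""
    parts = data.split(b"\0")
    # Every name ends with NUL, so splitting leaves one empty trailing piece.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts


# Spaces and even newlines inside names are harmless with this separator.
sample = b"a file.txt\0dir/with\nnewline\0"
names = split_nul_separated(sample)
```

Note that the names stay as bytes here; decoding them to text is a separate
step that can fail independently.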

A find or xargs written in my language in a straightforward way would break on
filenames containing invalid UTF-8 in a UTF-8 locale, though. It is the
responsibility of the system setup to keep filenames valid in the current
locale. Well, some defensive applications might want to switch their locale
charset internally to ISO-8859-1 in order to process arbitrary bytes as
text...
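
The reason ISO-8859-1 works for this is that it maps each of the 256 byte
values to exactly one character, so any byte sequence decodes without error
and encodes back unchanged. A small Python sketch of that property (the
filename below is a made-up example):

```python
# Every possible byte value, in order.
raw = bytes(range(256))

# ISO-8859-1 maps byte N to code point U+00NN, so decoding never fails.
as_text = raw.decode("iso-8859-1")
round_trip = as_text.encode("iso-8859-1")

# A Latin-1 encoded name that is not valid UTF-8 (0xE9 lacks continuation bytes).
name = b"caf\xe9.txt"
try:
    name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
latin1_name = name.decode("iso-8859-1")
```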

Maybe there should be a way to set filename encoding separately from the 
locale.

> Western Visual Basic programmers often use characters to represent bytes,
> which makes applications break when the default encoding changes from Latin-1
> to UTF-8 or some DBCSs.

I do distinguish characters and bytes. I have separate types:
- String - immutable array of characters,
- CharArray - mutable and resizable array of characters, one of the ways of
  building strings from pieces (usually it's simpler to join a list of
  strings), and
- ByteArray - mutable and resizable array of bytes, used to pass binary data,
  or pass around text stored in an unknown encoding.

A single code point is represented as a String of length 1, a single byte is 
represented as an Int. The language is dynamically typed so it's appropriate 
to not make further distinctions.
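
For readers who know Python, a rough analogy (this is Python, not my language,
and the correspondence is only approximate): str plays the role of String, a
list of one-character strings stands in for CharArray, and bytearray for
ByteArray. Indexing shows the same convention for single elements.

```python
# Immutable sequence of code points, like String.
text = "zażółć"

# Mutable, resizable sequence of characters, standing in for CharArray.
chars = list("abc")
chars.append("d")
chars[0] = "A"
built = "".join(chars)  # build a string from the pieces

# Mutable, resizable sequence of bytes, like ByteArray.
data = bytearray(b"\x00\xff")
data.append(0x41)

first_char = text[0]  # a str of length 1, not a separate character type
first_byte = data[0]  # a plain int
```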

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
