Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:

> - Simple one-time-use programs which assume that characters are what you get
>   when you index strings (which break paragraphs or draw ASCII tables or count
>   occurrences of characters) are broken more often. They work only for ASCII,
>   where with UTF-32 they work for all text converted from ISO-8859-x and
>   more. Sometimes it's better to write a program which works at the given
>   place (although it would break for e.g. Arabic) than not be able to write it
>   conveniently at all.
>
Such programs would break anyway on non-spacing marks and wide
characters.  I think it would be much more useful to also provide some
width-based interface, one that counts display columns rather than
characters.

> > I would generally advise against doing what perl does: the decision to
> > automatically consider all I/O to be text mode with automatic clumsy
> > conversion has been a massive bug-generator. Its better to treat I/O as
> > binary, and let the user call a conversion function if desired.
>
> No, this would be awful, because users won't remember to do the conversion or
> won't do the right one.

You can do the right thing for the current locale by default, but you
should provide a way for the programmer to override it.

> Half of the program would use UTF-8, half would use the encoding of
> the file, and it will break when they exchange non-ASCII data.
>
> I don't know what it should look like or what Perl does, but it's
> definitely not an option to pretend that UTF-8 is used internally
> and at the same time deliver file contents as unconverted strings.
> It would in essence make it the 3rd option, i.e. the encoding mess
> used internally we have today, and no reliable non-ASCII character
> properties at all.
>
I think the Python approach is optimal: there are two separate kinds
of strings, one for raw undecoded bytes and another for Unicode
characters.  They are translated automatically, but there is no way to
accidentally make a mistake such as decoding an already decoded
string or encoding an already encoded string.
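
A minimal sketch of that separation, written with the modern bytes/str
names rather than the str/unicode pair Python had at the time; the point
is that mixing the two kinds up fails loudly instead of silently
corrupting data:

    raw = b'Zo\xc3\xab'               # undecoded bytes (UTF-8 here)
    text = raw.decode('utf-8')        # bytes -> Unicode characters
    print(text, len(text))            # 'Zoë', length 3: characters, not octets
    out = text.encode('iso-8859-1')   # Unicode -> bytes in another encoding

    try:
        text.decode('utf-8')          # text has no .decode: already decoded
    except AttributeError as err:
        print('refused:', err)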

> I can't imagine anything better than attaching encodings to text
> file objects. Binary I/O will transfer arrays of bytes, text I/O
> will transfer Unicode strings (with as yet undecided internal
> representation) or arrays of bytes (when you just want to send them
> elsewhere without recoding twice). Arrays of bytes should be clearly
> distinguished from strings.
>
Good idea, see above.
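
As a sketch of how that can look in Python (modern spelling, and the
file name is just made up for the example): text mode decodes and
encodes at the boundary using the attached encoding, binary mode hands
back the raw bytes untouched.

    import io

    with io.open('greeting.txt', 'w', encoding='utf-8') as f:
        f.write(u'Gr\u00fc\u00dfe')   # Unicode in, UTF-8 bytes on disk

    with io.open('greeting.txt', 'r', encoding='utf-8') as f:
        print(f.read())               # decoded back to Unicode text

    with io.open('greeting.txt', 'rb') as f:
        print(f.read())               # the raw bytes, no recoding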

-- 
Beni Cherniavsky <[EMAIL PROTECTED]>

"Reading the documentation I felt like a kid in a toy shop."
 -- Phil Thompson on Python's standard library