Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> - Simple one-time-use programs which assume that characters are what you get
> when you index strings (which break paragraphs or draw ASCII tables or count
> occurrences of characters) are broken more often. They work only for ASCII,
> where with UTF-32 they work for all text converted from ISO-8859-x and
> more. Sometimes it's better to write a program which works at the given
> place (although it would break for e.g. Arabic) than not be able to write it
> conveniently at all.
>
Such programs would break anyway, with non-spacing marks and wide
characters.  I think it would be much more useful to also provide some
width-based interface.
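To make the width point concrete, here is a rough sketch in Python of what such an interface might look like. The name `display_width` is hypothetical, and the rule it applies (zero columns for non-spacing, enclosing and format characters, two columns for East Asian wide and fullwidth characters, one column otherwise) is only an approximation built on the standard `unicodedata` module, not a full replacement for a terminal's own width logic.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed to show `text`.

    Hypothetical helper: a simplification for illustration only.
    """
    width = 0
    for ch in text:
        # Non-spacing marks (Mn), enclosing marks (Me) and format
        # characters (Cf) are assumed to occupy no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide (W) and Fullwidth (F) characters take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


if __name__ == "__main__":
    accented = "e\u0301"              # 'e' followed by COMBINING ACUTE ACCENT
    wide = "\u65e5\u672c\u8a9e"       # three CJK ideographs
    print(len(accented), display_width(accented))  # 2 1
    print(len(wide), display_width(wide))          # 3 6
```

Both strings show why indexing by code point is not enough for drawing tables: the length and the column count disagree in opposite directions.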
> > I would generally advise against doing what perl does: the decision to
> > automatically consider all I/O to be text mode with automatic clumsy
> > conversion has been a massive bug-generator. Its better to treat I/O as
> > binary, and let the user call a conversion function if desired.
>
> No, this would be awful, because users won't remember to do the conversion or
> won't to the right one.
>
You can do the right thing for the current locale by default, but you
should provide a way for the programmer to override this.

> Half of the program would use UTF-8, half would use the encoding of
> the file, and it will break when they exchange non-ASCII data.
>
> I don't know how it should look like or what Perl does, but it's
> definitely not an option to pretend that UTF-8 is used internally
> and at the same time deliver file contents as unconverted strings.
> It would in essence make it the 3rd option, i.e. the encoding mess
> used internally we have today, and no reliable non-ASCII character
> properties at all.
>
I think the Python approach is optimal: there are two separate kinds of
strings, one for raw undecoded bytes and another for Unicode characters.
They are translated automatically, but there is no way to accidentally
make a mistake such as decoding an already decoded string or encoding
an already encoded string.

> I can't imagine anything better than attaching encodings to text
> file objects. Binary I/O will transfer arrays of bytes, text I/O
> will transfer Unicode strings (with as yet undecided internal
> representation) or arrays of bytes (when you just want to send them
> elsewhere without recoding twice). Arrays of bytes should be clearly
> distinguished from strings.
>
Good idea, see above.

--
Beni Cherniavsky <[EMAIL PROTECTED]>

"Reading the documentation I felt like a kid in a toy shop."
        -- Phil Thompson on Python's standard library

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
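As an illustration of the two-kinds-of-strings idea and of attaching an encoding to a text file object, here is a minimal sketch. It assumes present-day Python 3, which is stricter than the Python discussed in the message: nothing is converted implicitly, and mixing the two types raises an error. The sample word and the helper name `mixing_fails` are made up for the example.

```python
import io

raw = "żółw".encode("utf-8")   # bytes: what binary I/O would deliver
text = raw.decode("utf-8")     # str: a sequence of Unicode characters


def mixing_fails() -> bool:
    """Return True if combining bytes and str raises TypeError."""
    try:
        raw + text  # deliberately wrong: bytes + str
    except TypeError:
        return True
    return False


# The encoding is attached to the text-mode wrapper, not to the data:
# the underlying stream carries bytes, the wrapper hands out str.
byte_stream = io.BytesIO(raw)
reader = io.TextIOWrapper(byte_stream, encoding="utf-8")
decoded = reader.read()

if __name__ == "__main__":
    print(len(raw), len(text))       # 7 4
    print(hasattr(text, "decode"))   # False: a str cannot be decoded again
    print(hasattr(raw, "encode"))    # False: bytes cannot be encoded again
    print(mixing_fails())            # True
    print(decoded == text)           # True
```

The override for the locale default is the `encoding` argument: leaving it out lets the runtime pick a default, while passing it makes the choice explicit.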
