On Tue, Mar 27, 2007 at 10:55:32PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> > ...
> > None of this is relevant to most processing of text which is just
> > storage, retrieval, concatenation, and exact substring search.
> 
> It might be true that more-complicated processing is not relevant to those
> operations.  (I'm not 100% sure about exact substring matches, but maybe 
> if the byte string given to search for is proper (e.g., doesn't have any
> partial representations of characters), it's okay).

No character is a substring of another character in UTF-8. This is an
essential property of any sane multibyte encoding (incidentally, the
only other one is EUC-TW).
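
A quick sketch of that property (illustrative Python, not anything from the discussion itself): because lead bytes and continuation bytes live in disjoint ranges, the encoding of one character can never appear inside the encoding of another, so plain byte-wise substring search is already an exact character search.

```python
# Sketch: UTF-8 is self-synchronizing because lead bytes (0x00-0x7F,
# 0xC2-0xF4) and continuation bytes (0x80-0xBF) occupy disjoint ranges,
# so the encoding of one character never begins inside the encoding of
# another.  Plain byte-wise substring search is therefore exact.

def encodes_overlap(a: str, b: str) -> bool:
    """True if the UTF-8 encoding of `a` occurs inside that of `b`."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return ea != eb and ea in eb

# "é" (C3 A9) never matches inside "ï" (C3 AF) or "ḉ" (E1 B8 89):
assert not encodes_overlap("é", "ï")
assert not encodes_overlap("é", "ḉ")

# So searching raw bytes finds exactly the character, at the right offset:
text = "naïve café".encode("utf-8")
assert text.find("é".encode("utf-8")) == 10
```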

> Well of course you need to think in bytes when you're interpreting the
> stream of bytes as a stream of characters, which includes checking for 
> invalid UTF-8 sequences.

And what do you do if they're present? Under your philosophy, it would
be impossible for me to remove files with invalid sequences in their
names, since I could neither type the filename nor match it with glob
patterns (the filename would cause an error at the byte-to-character
conversion phase before there's even a chance to match anything). I'd
have to write specialized tools to do it...


Another, similar problem: I open a file in a text editor and it contains
illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
or a file with mixed encodings (e.g. a mail spool) or with mixed-in
binary data. I want to edit it anyway and save it back without
trashing the data that does not parse as valid UTF-8, while still
being able to edit the data that is valid UTF-8 as UTF-8.

This is easy if the data is kept as bytes and the character
interpretation is only made "just in time" when performing display,
editing, pattern searches, etc. If I'm going to convert everything to
characters, I need special hacks for encoding the invalid sequences in
a reversible way. Markus Kuhn experimented a lot with ideas for this
back in the early linux-utf8 days, and as far as I could tell it was
eventually found to be a bad idea.
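
A minimal sketch of the keep-bytes approach (illustrative Python, not any actual editor's code): the buffer stays as bytes, runs that validate as UTF-8 can be decoded for editing, and invalid bytes are never converted, so saving can't trash them.

```python
# Sketch: keep the buffer as bytes; interpret as characters only "just
# in time".  Valid UTF-8 runs can be edited as text; invalid bytes pass
# through untouched on save.

def decode_if_valid(chunk: bytes):
    """Return the chunk as str if it is valid UTF-8, else None."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None

buf = b"caf\xc3\xa9 ok \xff\xfe bad"   # mixed valid UTF-8 and junk

# The valid run decodes for character-level editing; the junk does not:
assert decode_if_valid(b"caf\xc3\xa9") == "caf\u00e9"
assert decode_if_valid(b"\xff\xfe") is None

# Saving writes the original bytes back verbatim -- nothing was
# converted to characters, so nothing can be trashed.
assert buf == b"caf\xc3\xa9 ok \xff\xfe bad"
```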

Also, I've found performance and implementation simplicity to be much
better when data is kept as UTF-8. For example, my implementation of
the POSIX fnmatch() function (used by the glob() function) is
extremely light and fast, because it performs all the matching on byte
strings and only considers characters "just in time" during bracket
expression matching (same as regex brackets). This also allows it to
accept strings with illegal sequences painlessly.
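
A toy sketch of that byte-wise structure (illustrative Python, not the actual fnmatch() implementation): literals and '*' operate on raw bytes; only '?' needs to know where a character ends, which it learns by skipping continuation bytes, and a stray invalid byte simply counts as one "character".

```python
# Sketch: byte-wise glob matching.  Only '?' is character-aware; it
# skips continuation bytes (0x80-0xBF) to find the character boundary.
# (Naive recursion on '*' -- fine for a sketch, not for production.)

def char_len(b: bytes, i: int) -> int:
    """Byte length of the (possibly invalid) character starting at i."""
    n = 1
    while i + n < len(b) and 0x80 <= b[i + n] <= 0xBF:
        n += 1
    return n

def match(pat: bytes, s: bytes) -> bool:
    if not pat:
        return not s
    if pat[0:1] == b"*":
        return any(match(pat[1:], s[i:]) for i in range(len(s) + 1))
    if pat[0:1] == b"?":
        return bool(s) and match(pat[1:], s[char_len(s, 0):])
    return s[:1] == pat[:1] and match(pat[1:], s[1:])

# '?' consumes the whole two-byte "é"; literals match byte-for-byte:
assert match("caf?".encode(), "café".encode())
assert match(b"*\xc3\xa9", "café".encode())
# An invalid byte is matched painlessly by '?':
assert match(b"x?y", b"x\xffy")
```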

> > Hardly. A byte-based regex for all case matches (e.g. "(ä|�)") will

The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...

> > work just as well even for case-insensitive matching, and literal
> > character matching is simple substring matching identical to any other
> > sane encoding. I get the impression you don't understand UTF-8..
> 
> How do you match a single character?  Would you want the programmer to 
> have to write an expression that matches a byte 0x00 through 0x7F, a
> sequence of two bytes from 0xC2 0x80 through 0xDF 0xBF, a sequence of
> three bytes from 0xE1 0xA0 0x80 through 0xEF 0xBF 0xBF, etc. [hoping I 
> got those bytes right] instead of simply "."?

No, this is the situation where a character-based regex is wanted.
Ideally, a single regex system could exist that could do both
byte-based and character-based matching together in the same string.
Sadly, that's compatible with neither POSIX BRE/ERE nor Perl, AFAIK.
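
For reference, the byte-range alternation the parent message was reaching for looks roughly like this (a sketch using Python's re over bytes; the per-lead-byte ranges exclude overlong forms and surrogates):

```python
import re

# Sketch: a byte-level regex matching exactly one well-formed UTF-8
# character.  The restricted ranges after E0/ED/F0/F4 rule out
# overlongs, surrogates, and values above U+10FFFF.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"                      # 1 byte: ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"          # 2 bytes
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"      # 3 bytes, no overlongs
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"
    rb"|\xed[\x80-\x9f][\x80-\xbf]"      # 3 bytes, no surrogates
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"   # 4 bytes, no overlongs
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}",  # 4 bytes, <= U+10FFFF
)

assert UTF8_CHAR.fullmatch("é".encode("utf-8"))
assert UTF8_CHAR.fullmatch("€".encode("utf-8"))   # E2 82 AC
assert not UTF8_CHAR.fullmatch(b"\xc0\xaf")       # overlong "/"
assert not UTF8_CHAR.fullmatch(b"\xed\xa0\x80")   # surrogate
```

Writing this once per "." is obviously untenable, which is exactly why a character-aware mode is wanted for this case.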

> >... Sometimes a byte-based regex is also useful. For
> > example my procmail rules reject mail containing any 8bit octets if
> > there's not an appropriate mime type for it. This kills a lot of east
> > asian spam. :)
> 
> Yep.
> 
> Of course, you can still do that with character-based strings if you
> can use other encodings.  (E.g., in Java, you can read the mail
> as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
> Then you can write the regular expression in terms of Unicode characters
> 0-255.  The only disadvantage there is probably some time spent
> decoding the byte stream into the internal representation of characters.)

The biggest disadvantage of it is that it's WRONG. The data is not
Latin-1, and pretending it's Latin-1 is a hideous hack. The data is
bytes with either no meaning as characters, or (more often) an
interpretation as characters that's not available to the software
processing it. I've just seen waaaay too many bugs from pretending
that bytes are characters to consider doing this reasonable. It also
perpetuates the (IMO very bad) viewpoint among new users that UTF-8 is
"sequences of Latin-1 characters making up a character" instead of
"sequences of bytes making up a character".

> >  A stream should accept bytes, and a character string
> > should always be interpreted as bytes according to the machine's
> > locale when read/written to a stream
> 
> Note that it's specific to the stream, not the machine.  (Consider 
> for example, HTTP's Content-Encoding header's charset parameter.  A
> web browser needs to handle different character encodings in different
> responses.  A MIME application needs to handle different character 
> encodings in different parts of a single multi-part message.)

Yes, clients need to. Servers can just always serve UTF-8. However, in
the examples you give, the clean solution is just to treat the bytes
as bytes, not characters, until they've been processed.

Maybe 20 years from now we'll finally be able to get rid of the
nonsense and just assume everything is UTF-8...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/