Re: perl unicode support

Daniel B. Tue, 27 Mar 2007 18:55:37 -0800

Rich Felker wrote:
> ...
> None of this is relevant to most processing of text which is just
> storage, retrieval, concatenation, and exact substring search.


It might be true that more-complicated processing is not relevant to those
operations.  (I'm not 100% sure about exact substring matches, but maybe
if the byte string being searched for is well-formed (e.g., doesn't contain
any partial representations of characters), it's okay.)
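To make the substring point concrete, here is a small sketch in Python
(my choice of language for illustration; the sample strings are made up).
It assumes the text being searched is already valid UTF-8.  A well-formed
needle can only match at a character boundary, while a partial one can
match inside unrelated characters.

```python
# Exact substring search on raw UTF-8 bytes (sample data is hypothetical).
text = "naïve café".encode("utf-8")

# Well-formed needle: the complete two-byte encoding of "é".
needle = "é".encode("utf-8")      # b'\xc3\xa9'
pos = text.find(needle)           # byte offset, not character index

# The bytes before the match still decode cleanly, so the match
# started on a character boundary.
prefix = text[:pos].decode("utf-8")

# Malformed needle: only the lead byte, which "ï" and "é" share.
partial = b"\xc3"
partial_hits = text.count(partial)  # matches inside both characters
```

The offset returned is in bytes, so it differs from the character index
whenever multi-byte characters precede the match.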

However, I think you're stretching things too much to say "most."  (I 
guess it depends on what we're calling "text processing".)

> 
> > ... But in most of the cases you have to _think_ in
> > characters, otherwise it's quite unlikely that your application will work
> > correctly.
> 
> It's the other way around, too: you have to think in terms of bytes.
> If you're thinking in terms of characters too much you'll end up doing
> noninvertible transformations and introduce vulnerabilities when data
> has been maliciously crafted not to be valid utf-8 (or just bugs due
> to normalizing data, etc.).

Well of course you need to think in bytes when you're interpreting the
stream of bytes as a stream of characters, which includes checking for 
invalid UTF-8 sequences.

Once you've checked that the bytes properly represent characters, from
then on you need to think in characters unless you're doing sufficiently
simple operations (e.g., your list above).
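A minimal sketch of that two-phase approach, in Python: validate at the
byte level once, then work with characters.  The helper name
decode_checked is hypothetical, and the malformed input is just one
example (an overlong encoding of "/").

```python
def decode_checked(raw: bytes) -> str:
    """Return the decoded text if raw is valid UTF-8, else raise ValueError."""
    try:
        # Strict decoding rejects truncated, overlong, and otherwise
        # malformed sequences instead of silently repairing them.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc


raw = "grüße".encode("utf-8")
text = decode_checked(raw)

# From here on, lengths and indexes are in characters, not bytes.
char_count = len(text)
byte_count = len(raw)

# An overlong encoding of "/" is rejected up front.
try:
    decode_checked(b"\xc0\xaf")
    rejected = False
except ValueError:
    rejected = True
```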

 
> Hardly. A byte-based regex for all case matches (e.g. "(ä|Ä)") will
> work just as well even for case-insensitive matching, and literal
> character matching is simple substring matching identical to any other
> sane encoding. I get the impression you don't understand UTF-8..

How do you match a single character?  Would you want the programmer to 
have to write an expression that matches a byte 0x00 through 0x7F, a
sequence of two bytes from 0xC2 0x80 through 0xDF 0xBF, a sequence of
three bytes from 0xE0 0xA0 0x80 through 0xEF 0xBF 0xBF, etc. [hoping I 
got those bytes right] instead of simply "."?
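For comparison, here is roughly what the byte-level expression looks
like when written out, sketched in Python.  The byte ranges follow the
well-formed UTF-8 sequences defined in RFC 3629; the names
ONE_CHAR_BYTES and ONE_CHAR and the sample string are mine.

```python
import re

# One UTF-8 encoded character, spelled out as byte ranges.
ONE_CHAR_BYTES = re.compile(
    rb"[\x00-\x7F]"                          # 1 byte:  U+0000..U+007F
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2 bytes: U+0080..U+07FF
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3 bytes, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4 bytes, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4 bytes, up to U+10FFFF
)

# The character-level equivalent.
ONE_CHAR = re.compile(".", re.DOTALL)

sample = "aé€😀"
byte_matches = ONE_CHAR_BYTES.findall(sample.encode("utf-8"))
char_matches = ONE_CHAR.findall(sample)

# Both approaches find the same four characters.
byte_lengths = [len(m) for m in byte_matches]
```

Both find the same characters, but only one of them is something you
would want to type by hand.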


>... Sometimes a byte-based regex is also useful. For
> example my procmail rules reject mail containing any 8bit octets if
> there's not an appropriate mime type for it. This kills a lot of east
> asian spam. :)

Yep.

Of course, you can still do that with character-based strings if you
can use other encodings.  (E.g., in Java, you can read the mail
as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
Then you can write the regular expression in terms of Unicode characters
0-255.  The only disadvantage there is probably some time spent
decoding the byte stream into the internal representation of characters.)
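The same trick can be sketched in Python instead of Java (the function
name has_8bit and both sample messages are made up).  Decoding as
ISO-8859-1 cannot fail, since every byte value maps to exactly one of
the characters U+0000 through U+00FF.

```python
import re

# Any character in U+0080..U+00FF, i.e. any original byte above 0x7F.
HIGH_CHAR = re.compile("[\u0080-\u00ff]")


def has_8bit(raw: bytes) -> bool:
    """Report whether raw contains any byte with the high bit set."""
    text = raw.decode("iso-8859-1")   # one byte becomes one character
    return HIGH_CHAR.search(text) is not None


# Hypothetical messages: one pure ASCII, one with bytes above 0x7F.
raw_mail = b"Subject: hello\r\n\r\nplain ascii body\r\n"
raw_spam = b"Subject: \xc4\xe3\xba\xc3\r\n\r\nbody\r\n"

mail_flagged = has_8bit(raw_mail)
spam_flagged = has_8bit(raw_spam)

# The mapping is invertible, so the original bytes are recoverable.
round_trip = raw_spam.decode("iso-8859-1").encode("iso-8859-1")
```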

Maybe the net result from your point is that one should be able to read
byte streams in encodings other than just UTF-8.  (A language might
do that by converting anything else into UTF-8, or could use a different
internal representation (e.g., as Java uses UTF-16).)


>  A stream should accept bytes, and a character string
> should always be interpreted as bytes according to the machine's
> locale when read/written to a stream

Note that it's specific to the stream, not the machine.  (Consider, 
for example, the charset parameter of HTTP's Content-Type header.  A
web browser needs to handle different character encodings in different
responses.  A MIME application needs to handle different character 
encodings in different parts of a single multi-part message.)
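A rough sketch of per-stream decoding in Python, using the standard
library's email.message.Message to pull the charset parameter out of a
Content-Type value.  The function name decode_body and the ISO-8859-1
fallback are my assumptions, not anything a particular browser or mail
client is known to do.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes,
                default: str = "iso-8859-1") -> str:
    """Decode body using the charset named in its own Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset.
    charset = msg.get_content_charset() or default   # fallback is assumed
    return body.decode(charset)


# Two "streams" carrying the same character in different encodings.
from_utf8 = decode_body("text/html; charset=UTF-8", "é".encode("utf-8"))
from_latin1 = decode_body("text/plain; charset=ISO-8859-1", b"\xe9")
from_default = decode_body("text/plain", b"\xe9")
```

An unknown charset name makes bytes.decode raise LookupError, which a
real application would have to handle.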




Daniel
-- 
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
