Rich Felker wrote:
> On Tue, Mar 27, 2007 at 10:55:32PM -0400, Daniel B. wrote:
> > Rich Felker wrote:
> > > ...
> > > None of this is relevant to most processing of text which is just
> > > storage, retrieval, concatenation, and exact substring search.
> >
> > It might be true that more-complicated processing is not relevant
> > to those operations.  (I'm not 100% sure about exact substring
> > matches, but maybe if the byte string given to search for is proper
> > (e.g., doesn't have any partial representations of characters),
> > it's okay.)
>
> No character is a substring of another character in UTF-8.
> ...

I know.  That's why I addressed avoiding _partial_ byte sequences.
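(To illustrate the point we apparently agree on -- this is just a
sketch of mine, with made-up names, not code from anywhere in this
thread: because UTF-8 lead bytes and continuation bytes (0x80-0xBF)
come from disjoint ranges, a byte-level search for a _complete_
character sequence can only match on a character boundary:

    public class Utf8SearchSketch {
        // Plain byte-level substring search; no UTF-8 decoding at all.
        static int indexOf(byte[] haystack, byte[] needle) {
            outer:
            for (int i = 0; i + needle.length <= haystack.length; i++) {
                for (int j = 0; j < needle.length; j++) {
                    if (haystack[i + j] != needle[j]) continue outer;
                }
                return i;  // can only be a character boundary
            }
            return -1;
        }

        public static void main(String[] args) throws Exception {
            // "é" is 0xC3 0xA9 in UTF-8; the full two-byte needle
            // can't match inside some other character's encoding,
            // because a continuation byte never begins a character.
            byte[] text = "café au lait".getBytes("UTF-8");
            byte[] needle = "é".getBytes("UTF-8");
            System.out.println(indexOf(text, needle));  // 3 (byte offset)
        }
    }

A needle holding a _partial_ sequence -- say the lone byte 0xA9 -- is
exactly what breaks this, which is all I meant.)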
> > Well of course you need to think in bytes when you're interpreting
> > the stream of bytes as a stream of characters, which includes
> > checking for invalid UTF-8 sequences.
>
> And what do you do if they're present?

Of course, it depends where they are present.  You seem to be
addressing relatively special cases.

> Under your philosophy, it would be impossible for me to remove files
> with invalid sequences in their names, since I could neither type
> the filename nor match it with glob patterns (due to the filename
> causing an error at the byte to character conversion phase before
> there's even a chance to match anything).
> ...

If the file name contains illegal byte sequences, then either they're
not in UTF-8 to start with or, if they're supposed to be, something
else let invalid sequences through.

If they're not always in UTF-8 (if they're sometimes in a different
encoding), then why would you be interpreting them as UTF-8 (why
would you hit the case where it seems there's an illegal sequence)?
(How do you know what encoding they're in?  Or are you dealing with
the problem of not having any specification of what the encoding
really is and having to guess?)

If they're supposed to be UTF-8 and aren't, then certainly normal
tools shouldn't have to deal with malformed sequences.  If you write
a special tool to fix malformed sequences somehow (e.g., delete files
with malformed sequences), then of course you're going to be dealing
with the byte level and not (just) the character level.

> Other similar problem: I open a file in a text editor and it
> contains illegal sequences. For example, Markus Kuhn's UTF-8
> decoder test file,

Again, you seem to be dealing with special cases.  If a UTF-8 decoder
test file contains illegal UTF-8 byte sequences, why would you expect
a UTF-8 text editor to work on it?

For the data that is parseable as a valid UTF-8 encoding of
characters, how do you propose to know whether it really is
characters encoded as UTF-8 or is characters encoded some other way?

(If you see the byte sequence 0xDF 0xBF, how do you know whether that
means the character U+07FF or the two characters U+00DF U+00BF?  For
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
if you see the UTF-8-legal byte sequence 0xDF 0xBF, how would you
know whether that 0xBF byte also represents U+00BF or is really part
of the representation of character U+07FF?)

Either edit the test file with a byte editor, or edit it with a text
editor specifying an encoding that maps bytes to characters (e.g.,
ISO 8859-1), even if those characters aren't the same characters the
UTF-8-valid parts of the file represent.

> or a file with mixed encodings (e.g. a mail spool) or with mixed-in
> binary data. I want to edit it anyway and save it back without
> trashing the data that does not parse as valid UTF-8, while still
> being able to edit the data that is valid UTF-8 as UTF-8.

If the file uses mixed encodings, then of course you can't read the
entire file in one encoding.  But when you determine that a given
section (e.g., a MIME part) is in some given encoding, why not map
that section's bytes to characters and then work with characters from
then on?

What if you're searching for a character string across multiple
sections of a mixed-encoding file like that?  You certainly can't
write a UTF-8-byte regular expression that matches other encodings.
And you can't "OR" together the regular expression for a UTF-8 byte
encoding of the character string with each other encoding's byte
sequences (since there's no way for the regular expression matcher to
know which alternative in the regular expression it should be using
in each section of the mixed-encoding file).  ...

> > > Hardly. A byte-based regex for all case matches (e.g. "(Ã¤|Ã„)")
> > > will
>
> The fact that your mailer misinterpreted my UTF-8 as Latin-1 does
> not instill faith...

Maybe you should think more clearly.  I didn't write my mailer, so
the quality of its behavior doesn't reflect my knowledge.

> > > ... Sometimes a byte-based regex is also useful. For example my
> > > procmail rules reject mail containing any 8bit octets if there's
> > > not an appropriate mime type for it. This kills a lot of east
> > > asian spam. :)
> >
> > Yep.
> >
> > Of course, you can still do that with character-based strings if
> > you can use other encodings.  (E.g., in Java, you can read the
> > mail as ISO-8859-1, which maps bytes 0-255 to Unicode characters
> > 0-255.  Then you can write the regular expression in terms of
> > Unicode characters 0-255.  The only disadvantage there is probably
> > some time spent decoding the byte stream into the internal
> > representation of characters.)
>
> The biggest disadvantage of it is that it's WRONG.

Is it any more wrong than your rejection of 1xxxxxxx bytes?  The
bytes represent characters in some encoding.  You ignore those
characters and reject based on just the byte values.

> The data is not Latin-1, and pretending it's Latin-1 is a hideous
> hack.

It's not pretending the bytes really are Latin-1 text.  It's mapping
bytes to characters in order to use methods defined on characters.

Yes, it could be misleading if it's not clear that it's a temporary
mapping only for that purpose (i.e., that the mapped-to characters
are not the characters that the byte sequence really represents).

And yes, byte-based regular expressions would be useful.
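(Here's a concrete sketch of that ISO-8859-1 mapping -- the class
name and sample bytes are mine, invented for illustration; the
pattern syntax is just standard java.util.regex:

    import java.util.Arrays;
    import java.util.regex.Pattern;

    public class ByteRegexSketch {
        public static void main(String[] args) throws Exception {
            // The 0xDF 0xBF pair from above, plus one ASCII byte.
            byte[] raw = { (byte) 0xDF, (byte) 0xBF, (byte) 'a' };

            // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto
            // U+0000-U+00FF, so this decode never fails and loses
            // nothing, whatever encoding the bytes are "really" in.
            String asChars = new String(raw, "ISO-8859-1");

            // A regex over characters U+0080-U+00FF now matches
            // exactly the 8-bit octets of the original byte stream
            // (the procmail-style "any 8bit octets?" test).
            boolean has8bit =
                Pattern.compile("[\\x80-\\xFF]").matcher(asChars).find();
            System.out.println(has8bit);  // true: 0xDF and 0xBF

            // The mapping is temporary and reversible; re-encoding
            // recovers the original bytes, so no data gets trashed.
            byte[] back = asChars.getBytes("ISO-8859-1");
            System.out.println(Arrays.equals(raw, back));  // true
        }
    }

The only cost is the decode/encode pass, as I said above.)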
And you can't "OR" together the regular expression for a UTF-8 byte encoding of the character string with each other encoding's byte sequences (since there's no way for the regular expression matcher to know which alternative in the regular expression it should be using in each section of the mixed-encoding file. ... > > > Hardly. A byte-based regex for all case matches (e.g. "(Ãf¤|Ãf?)") will > > The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not > instill faith... Maybe you should think more clearly. I didn't write my mailer, so the quality of its behavior doesn't reflect my knowledge. > > >... Sometimes a byte-based regex is also useful. For > > > example my procmail rules reject mail containing any 8bit octets if > > > there's not an appropriate mime type for it. This kills a lot of east > > > asian spam. :) > > > > Yep. > > > > Of course, you can still do that with character-based strings if you > > can use other encodings. (E.g., in Java, you can read the mail > > as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255. > > Then you can write the regular expression in terms of Unicode characters > > 0-255. The only disadvantage there is probably some time spent > > decoding the byte stream into the internal representation of characters.) > > The biggest disadvantage of it is that it's WRONG. Is it any more wrong than your rejecting of 1xxxxxx bytes? The bytes represent characters in some encoding. You ignore those characters and reject based on just the byte values. > The data is not > Latin-1, and pretending it's Latin-1 is a hideous hack. It's not pretending the data is bytes encoding characters. It's mapping bytes to characters to use methods defined on characters. Yes, it could be misleading if it's not clear that it's a temporary mapping only for that purpose (i.e., that the mapped-to characters are not the characters that the byte sequence really represents). And yes, byte-based regular expressions would be useful. > Maybe 20 years from now we'll finally be able to get rid of the > nonsense and just assume everything is UTF-8... Hopefully. Daniel -- Daniel Barclay [EMAIL PROTECTED] -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/