On Tue, 28 Feb 2006 11:58:03 +0100
Patrick Lauer <[EMAIL PROTECTED]> wrote:

> During that discussion we realized that having utf-8 not enabled by
> default and no utf8 fonts available by default causes lots of
> recompilation and reconfiguration. 
> 
> Enabling the unicode useflag in the profiles should help our
> international users and should not cause any problems. Are there any
> known bugs / problems this would trigger? Any reasons against that?

Enabling support for UTF-8 should be fine, but I'd like to sound a note
of caution about using a UTF-8 locale as a system-wide setting.  Since
UTF-8 contains "holes" in the representation (i.e. some sequences of
8-bit values are invalid), tools that are asked to parse such invalid
data can produce unexpected results.
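
To make the "holes" concrete, here is a rough Python sketch (my own
illustration, not taken from any bug report) that checks a few byte
sequences for well-formedness.  The helper name is_valid_utf8 and the
sample byte strings are hypothetical, chosen only for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Illustrative samples: two well-formed sequences, four ill-formed ones.
samples = {
    "plain ASCII": b"abc",
    "two-byte sequence (e-acute)": b"\xc3\xa9",
    "lone continuation byte": b"\x80",
    "truncated two-byte sequence": b"\xc3",
    "overlong encoding of '/'": b"\xc0\xaf",
    "byte never used in UTF-8": b"\xff",
}

for label, raw in samples.items():
    print(f"{label}: {is_valid_utf8(raw)}")
```

Any file that is not already UTF-8 (Latin-1 text, binary data) is likely
to contain sequences like the last four.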

For an example, see bug #125375 - it turns out that invalid sequences
do not match '.' in sed regular expressions (sed-4.1.4).  The other GNU
tools probably behave similarly.  Up to a point this is in line with the
Unicode standard, which says, "When a process interprets a code unit sequence
which purports to be in a Unicode character encoding form, it shall
treat ill-formed code unit sequences as an error condition, and shall
not interpret such sequences as characters." (chapter 3 para 2 rule
C12a).  This clearly means that the invalid bytes cannot match "." (or
anything else for that matter).  However, sed should either generate an
error, filter the ill-formed bytes out of its input, or replace them with
a marker (the replacement character, U+FFFD) - instead it leaves the
ill-formed bytes in place, untouched.
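
The three behaviours I'd consider acceptable can be sketched with
Python's decoder, which offers each of them as an error-handling
policy.  This only illustrates the options; it says nothing about how
sed is or should be implemented, and the function name decode_with and
the sample input are made up for the example.

```python
def decode_with(data: bytes, policy: str) -> str:
    """Decode data as UTF-8 under the given error policy (hypothetical helper)."""
    try:
        return data.decode("utf-8", errors=policy)
    except UnicodeDecodeError as exc:
        # Option 1: treat the ill-formed sequence as an error condition.
        return f"error at byte offset {exc.start}"


raw = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

print(decode_with(raw, "strict"))         # reports an error
print(decode_with(raw, "ignore"))         # option 2: filter the bad byte out
print(repr(decode_with(raw, "replace")))  # option 3: substitute U+FFFD
```

What none of these policies does is hand the bad byte back as if it
were part of valid text.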

-- 
Kevin F. Quinn
