On Tue, 28 Feb 2006 11:58:03 +0100 Patrick Lauer <[EMAIL PROTECTED]> wrote:
> During that discussion we realized that having utf-8 not enabled by
> default and no utf8 fonts available by default causes lots of
> recompilation and reconfiguration.
>
> Enabling the unicode useflag in the profiles should help our
> international users and should not cause any problems. Are there any
> known bugs / problems this would trigger? Any reasons against that?

Enabling support for utf-8 should be fine, but I'd like to sound a note
of caution about using a utf-8 locale as a system-wide setting.

Since UTF-8 contains "holes" in the representation (i.e. some sequences
of 8-bit values are invalid), when something is asked to parse such
invalid data unexpected results can ensue.

For an example, see bug #125375 - it turns out that invalid sequences do
not match '.' in sed regular expressions (sed-4.1.4). The other gnu
tools probably behave similarly.

Up to a point this is in line with the UTF-8 spec, which says, "When a
process interprets a code unit sequence which purports to be in a
Unicode character encoding form, it shall treat ill-formed code unit
sequences as an error condition, and shall not interpret such sequences
as characters." (chapter 3 para 2 rule C12a). This clearly means that
the invalid bytes cannot match "." (or anything else for that matter).
However sed should either generate an error, filter the illegal bytes
out of its input, or replace them with a marker (replacement character)
- instead it leaves the non-conformant bytes alone.

-- 
Kevin F. Quinn
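
The three behaviours suggested above for ill-formed input (raise an error, filter the bytes out, or substitute a replacement character) can be illustrated with a minimal Python sketch. This is not how sed works internally; the sample bytes and the function names are assumptions made up for the illustration, and only the standard `bytes.decode` error handlers are used.

```python
# Hypothetical sample input: the byte 0xFF never occurs in well-formed UTF-8.
data = b"abc\xffdef"


def decode_or_error(raw: bytes) -> str:
    """Option 1: treat ill-formed input as an error condition."""
    # Strict decoding raises UnicodeDecodeError on the invalid byte.
    return raw.decode("utf-8", errors="strict")


def decode_filtered(raw: bytes) -> str:
    """Option 2: silently filter the illegal bytes out of the input."""
    return raw.decode("utf-8", errors="ignore")


def decode_marked(raw: bytes) -> str:
    """Option 3: replace each ill-formed sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_or_error(data)
    except UnicodeDecodeError as exc:
        print("error:", exc.reason)
    print(repr(decode_filtered(data)))
    print(repr(decode_marked(data)))
```

Any of the three is arguably consistent with rule C12a, since none of them interprets the bad byte as a character; passing it through untouched, as described for sed-4.1.4, is the case that is not.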
