From: "Behdad Esfahbod", 10/10/2008 10:33
> Freddie Unpenstein wrote:
>> Why not just adopt the old thing of encoding NULLs and other non-UTF-8
>> characters as safe UTF-8 equivalents...?
> Because they are not valid UTF-8? And the moment we give up dealing with
> valid UTF-8 a whole other can of worms opens up.
I am aware of the trouble allowing multiple encodings of a given character can
cause. And I'm not suggesting that at all. If you're referring to anything
other than that, please expand on that a little.
My assertion here is basically this: ASCII text (defined here as characters
1-127) encodes into UTF-8 as-is. Anything else in the 0-255 set is considered
binary, and should be encoded in its shortest multi-byte UTF-8 form (so NULL
becomes the two-byte C080). No more, and no less. Call it Glib encoding.
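
To make the rule concrete, here is a rough sketch in Python. It assumes the
input is raw bytes in the 0-255 range; the name glib_encode is made up for
illustration and is not an existing GLib function.

```python
def glib_encode(data: bytes) -> bytes:
    """Encode raw bytes with the rule above (hypothetical helper, not a GLib API)."""
    out = bytearray()
    for b in data:
        if 1 <= b <= 0x7F:
            # ASCII 1-127 passes through unchanged.
            out.append(b)
        else:
            # 0 and 128-255 take the two-byte form 110xxxxx 10xxxxxx.
            # For 0 this yields C0 80, which strict UTF-8 rejects as overlong.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)
```

For example, glib_encode(b"A\x00\xe9") gives b"A\xc0\x80\xc3\xa9": the 'A' is
untouched, the NULL becomes C080, and 0xE9 becomes the ordinary UTF-8 for
U+00E9. The output never contains a zero byte.
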
I believe that differs from the UTF-8 specification ONLY in the handling of
the NULL byte, but then I've been avoiding dealing with UTF-8 for the most part
for exactly this reason. When strict UTF-8 matters, I've been using
higher-level scripting languages instead, which already deal with it natively.
(And I'm not 100% certain, but I think that's essentially what they all do.)
A "convert to UTF-8" function given a UTF-8 input with a 6-byte representation
of the character 'A' would store the regular single-byte representation.
Likewise, given a 1 or 4-byte representation of NULL, it would store the 2-byte
C080 representation. A generic "convert input to Glib" function which takes
the input data and its encoding, and produces "UTF-8 for internal use only"
(aka Glib encoding here), would assert that rule even for UTF-8 input.
Likewise a "convert Glib to output" function, asked to produce UTF-8 output,
would convert whatever it's given into STRICT UTF-8 (i.e. restore each C080
to its one-byte \0 representation). So the rule of thumb would be, "ALWAYS
convert EVERYTHING entering or leaving the application". And that's a Good
Thing that should be encouraged regardless of this issue.
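
As a sketch of what that pair of functions might look like (Python again,
purely illustrative): decode_lenient, to_internal and from_internal are
hypothetical names, and accepting 5- and 6-byte sequences is an assumption
taken from the 6-byte 'A' example above, not something strict UTF-8 allows.

```python
def decode_lenient(data: bytes) -> list:
    """Decode UTF-8-style sequences of 1 to 6 bytes, accepting overlong forms."""
    cps = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n, cp = 1, lead
        elif 0xC0 <= lead < 0xE0:
            n, cp = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            n, cp = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            n, cp = 4, lead & 0x07
        elif 0xF8 <= lead < 0xFC:
            n, cp = 5, lead & 0x03
        elif 0xFC <= lead < 0xFE:
            n, cp = 6, lead & 0x01
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        cps.append(cp)
        i += n
    return cps


def to_internal(data: bytes) -> bytes:
    """UTF-8 input -> internal form: shortest encoding, except NULL -> C0 80."""
    out = bytearray()
    for cp in decode_lenient(data):
        # chr()/encode() raise for surrogates and values above U+10FFFF.
        out += b"\xc0\x80" if cp == 0 else chr(cp).encode("utf-8")
    return bytes(out)


def from_internal(data: bytes) -> bytes:
    """Internal form -> strict UTF-8 output: C0 80 goes back to a single \\0."""
    return "".join(chr(cp) for cp in decode_lenient(data)).encode("utf-8")
```

With this, to_internal(b"\xfc\x80\x80\x80\x81\x81") (a 6-byte 'A') gives b"A",
to_internal(b"\x00") gives b"\xc0\x80", and from_internal(b"A\xc0\x80B") gives
b"A\x00B".
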
I know it's a bit of a mind-bend from where Glib/GTK is right now with
encodings, Glib/GTK developers don't like hearing from us lowly humans, and
there's always resistance to change. But specifications often change when
needed to meet practical requirements (no one has ever written a 100% perfect
specification). Personally, changing the platform and established behaviour
to suit the UTF-8 specification on this rather trivial issue (much harder and
more dangerous to attempt) seems far more wrong than breaking the UTF-8
specification slightly for internal use only. (The key being the "for
internal use only": all "convert to UTF-8" functions would still produce the
strict interpretation with \0's.) Furthermore, it seems more correct in this
day and age to bend a rule like this when doing so makes things SAFER: it
allows the old NULL-terminated string handling to keep working, and doesn't
force programmers to deal specially with length specifiers, which are all too
frequently a great source of coding mistakes. This would also make it easier
to migrate, for example, to UTF-16 at some point in time: everything would
already be converting between UTF-8 and Glib-8, so transitioning to Glib-16
would be an entirely internal affair.
Fredderic
_______________________________________________
gtk-devel-list mailing list
gtk-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-devel-list