Re: g_utf8_validate() and NUL characters

Freddie Unpenstein Thu, 09 Oct 2008 19:42:07 -0700

From: "Behdad Esfahbod", 10/10/2008 10:33

> Freddie Unpenstein wrote:
>> Why not just adopt the old thing of encoding NULLs and other non-UTF-8
>> characters as safe UTF-8 equivalents...?
> Because they are not valid UTF-8? And the moment we give up dealing with
> valid UTF-8 a whole other can of worms opens up.


I am aware of the trouble allowing multiple encodings of a given character can 
cause.  And I'm not suggesting that at all.  If you're referring to anything 
other than that, please expand on that a little.

My assertion here is basically this: ASCII text (defined here as characters 
1-127) encodes into UTF-8 as-is.  Anything else in the 0-255 set is considered 
binary, and should be encoded in its shortest multi-byte UTF-8 form.  No more, 
and no less.  Call it Glib encoding.
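
To make that rule concrete, here is a minimal sketch in plain C.  It is not an 
existing Glib function; the name glib8_encode_bytes is made up for 
illustration.  It assumes bytes 128-255 are treated as code points 
U+0080-U+00FF, which is one reading of "shortest multi-byte form".

```c
#include <stddef.h>

/* Hypothetical helper, not part of Glib.
 * Bytes 1-127 are copied as-is; byte 0 and bytes 128-255 are written as
 * two-byte sequences (0 -> C0 80, 0x80 -> C2 80, 0xFF -> C3 BF).
 * out needs room for 2*len + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t glib8_encode_bytes(const unsigned char *in, size_t len,
                          unsigned char *out)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char b = in[i];

        if (b >= 1 && b <= 127) {
            out[n++] = b;                            /* plain ASCII */
        } else {
            out[n++] = (unsigned char)(0xC0 | (b >> 6));
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    out[n] = 0;   /* safe: no encoded byte is ever 0 */
    return n;
}
```

Since no zero byte can appear in the encoded data, the result can be handed to 
strlen() and friends like any other C string.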

I believe that differs from the UTF-8 specification ONLY in the handling of 
the NULL byte, but then I've been avoiding dealing with UTF-8 for the most part 
for exactly this reason.  When UTF-8 is a strict issue, I've been using 
higher-level scripting languages instead, which already deal with it natively.  
(And I'm not 100% certain, but I think that's essentially what they all do.)

A "convert to UTF-8" function given a UTF-8 input with a 6-byte representation 
of the character 'A' would store the regular single-byte representation.  
Likewise, given a 1 or 4-byte representation of NULL, it would store the 2-byte 
C080 representation.  A generic "convert input to Glib" function which takes 
the input data and its encoding, and produces "UTF-8 for internal use only" 
(aka Glib encoding here), would assert that rule even for UTF-8 input.  
Likewise a "convert Glib to output" function, asked to produce UTF-8 output, 
would convert whatever it's given into STRICT UTF-8 (i.e. restore C080's 
to their one-byte \0 representation).  So the rule of thumb would be, "ALWAYS 
convert EVERYTHING entering or leaving the application".  And that's a Good 
Thing that should be encouraged regardless of this issue.
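
The two conversions described above could be sketched roughly as follows.  
This is plain C with made-up names (decode_lenient, encode_shortest, 
glib8_reencode), not anything Glib provides.  It assumes the input side 
accepts the old 1 to 6 byte forms, including overlong ones, and it skips 
details such as rejecting surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one sequence, accepting 1..6-byte forms and overlong encodings.
 * Returns bytes consumed, or 0 on malformed or truncated input. */
static size_t decode_lenient(const unsigned char *s, size_t avail,
                             uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    if (avail == 0) return 0;
    if (s[0] < 0x80)      { len = 1; c = s[0]; }
    else if (s[0] < 0xC0) { return 0; }          /* stray continuation byte */
    else if (s[0] < 0xE0) { len = 2; c = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { len = 3; c = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { len = 4; c = s[0] & 0x07; }
    else if (s[0] < 0xFC) { len = 5; c = s[0] & 0x03; }
    else if (s[0] < 0xFE) { len = 6; c = s[0] & 0x01; }
    else                  { return 0; }          /* FE and FF are never valid */
    if (len > avail) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Write cp in its shortest form.  With internal != 0, U+0000 becomes C0 80.
 * Returns bytes written, or 0 if cp is beyond U+10FFFF. */
static size_t encode_shortest(uint32_t cp, int internal, unsigned char *out)
{
    if (cp == 0 && internal) {
        out[0] = 0xC0;
        out[1] = 0x80;
        return 2;
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Re-encode len bytes of input.  internal != 0 gives the internal form
 * (NUL stored as C0 80); internal == 0 gives strict UTF-8 (NUL as \0).
 * out needs room for 2*len + 1 bytes.  Returns bytes written (a trailing
 * 0 is appended but not counted), or (size_t)-1 on malformed input. */
size_t glib8_reencode(const unsigned char *in, size_t len, int internal,
                      unsigned char *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t cp;
        size_t used = decode_lenient(in + i, len - i, &cp);
        size_t wrote = used ? encode_shortest(cp, internal, out + n) : 0;

        if (wrote == 0) return (size_t)-1;
        i += used;
        n += wrote;
    }
    out[n] = 0;
    return n;
}
```

In the internal direction the output never contains a zero byte before the 
terminator, so the usual NULL-terminated string functions keep working.  In 
the strict direction the output may contain embedded \0's, so the caller has 
to rely on the returned length.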

I know it's a bit of a mind-bend from where Glib/GTK is right now with 
encodings, Glib/GTK developers don't like hearing from us lowly humans, and 
there's always resistance to change.  But specifications often change when 
needed to meet practical requirements (no one has ever written a 100% perfect 
specification).  Personally, changing the platform and established behaviour 
(much harder and more dangerous to attempt) to suit the UTF-8 specification 
on this rather trivial issue seems far more wrong than breaking the UTF-8 
specification slightly for internal use only.  (The key being the "for 
internal use only": all "convert to UTF-8" functions would still produce the 
strict interpretation with \0's.)  Furthermore, it seems more correct in this 
day and age to bend a rule like this when doing so makes things SAFER, by 
allowing the old NULL-terminated string handling to keep working rather than 
forcing programmers to deal specially with length specifiers, which are all 
too frequently a great source of coding mistakes.  This would also make it 
easier to migrate, for example, to UTF-16 at some point in time - everything 
will already be converting between UTF-8 and Glib-8, so transitioning to 
Glib-16 would be an entirely internal affair.


Fredderic


_______________________________________________
gtk-devel-list mailing list
gtk-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-devel-list
