Re: [coders] Converting text to UTF-8

Andrew Bennetts Sat, 04 Jun 2011 07:14:28 -0700

Erik de Castro Lopo wrote:
> Erik de Castro Lopo wrote:
> 
> > Anyone care to clue me into the best way to deal with this?
> 
> I've tried something like this:
> 
>     static void
>     convert_string (const char * in, gchar * out, size_t maxlen)


So you just have a char* that's supposed to contain text, but you don't
know the encoding?  In that case, in a sense all you have is just bytes,
and without knowing the encoding you lack a way to turn that into text.
And so you don't have a way to produce a UTF-8 representation of
that text, because you don't have the text.

You really only have three options:

 1) Find out what the encoding is.  In another message you say it's from
    an ID3 tag… a quick glance at the Wikipedia article suggests ID3v1
    doesn't specify an encoding.  ID3v2 apparently does, but I wouldn't
    be shocked to find bad data in them anyway.
 2) Guess by inspecting the bytes.  Algorithms for this can be fairly
    complicated and will still be wrong in many cases, so probably not
    worth the effort.
 3) Give up.  If the bytes aren't valid UTF-8 pretend they are latin-1
    (iso-8859-1).  It'll probably be wrong, but decoding as latin-1 will
    always produce something, even if it is mojibake.  Or just tell the
    user that their data is bad and don't try to display it.
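
For what it's worth, option 3 can be sketched in plain C roughly as
follows.  This is only a sketch under some assumptions: it keeps the
convert_string name and argument order from your snippet but uses char
rather than gchar (which GLib typedefs to char) so that it builds
without GLib, and the is_valid_utf8 helper and the 0 / -1 return
convention are my own inventions, not anything from your code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if the NUL-terminated string is
   well-formed UTF-8, else 0.  Rejects overlong forms, UTF-16
   surrogates, and code points above U+10FFFF. */
static int
is_valid_utf8 (const char *s)
{
    /* Smallest code point allowed for each number of continuation bytes. */
    static const unsigned int min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *) s;

    while (*p) {
        unsigned int cp;
        int extra, i;

        if (*p < 0x80) { p++; continue; }            /* plain ASCII */
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;                               /* bad lead byte */
        p++;

        for (i = 0; i < extra; i++, p++) {
            /* A NUL fails this test too, so we never run off the end. */
            if ((*p & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < min_cp[extra]) return 0;            /* overlong form */
        if (cp > 0x10FFFF) return 0;                 /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
    }
    return 1;
}

/* Write a UTF-8 version of `in` to `out` (capacity maxlen bytes,
   including the terminating NUL).  Valid UTF-8 is copied unchanged;
   anything else is assumed to be latin-1.  Returns 0 on success, or
   -1 if the result did not fit (out then holds a truncated but still
   NUL-terminated string, possibly empty). */
static int
convert_string (const char *in, char *out, size_t maxlen)
{
    const unsigned char *p = (const unsigned char *) in;
    size_t n = 0;

    if (maxlen == 0) return -1;

    if (is_valid_utf8 (in)) {
        size_t len = strlen (in);
        if (len >= maxlen) { out[0] = '\0'; return -1; }
        memcpy (out, in, len + 1);
        return 0;
    }

    /* Latin-1 fallback: byte value == code point, so bytes >= 0x80
       become a two-byte UTF-8 sequence. */
    for (; *p; p++) {
        size_t need = (*p < 0x80) ? 1 : 2;
        if (n + need >= maxlen) { out[n] = '\0'; return -1; }
        if (*p < 0x80) {
            out[n++] = (char) *p;
        } else {
            out[n++] = (char) (0xC0 | (*p >> 6));
            out[n++] = (char) (0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return 0;
}
```

With a large enough buffer, convert_string ("caf\xE9", buf, sizeof buf)
should leave the bytes "caf\xC3\xA9" in buf, and input that is already
valid UTF-8 should come back unchanged.  If you are linking against
GLib anyway, g_utf8_validate and g_convert cover the same ground.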

This is a decent introduction to handling non-ASCII text:
<http://www.joelonsoftware.com/articles/Unicode.html>.

-Andrew.

_______________________________________________
coders mailing list
coders@slug.org.au
http://lists.slug.org.au/listinfo/coders
