Hi,

I want to write serialization functions for encoding and decoding
strings in binary file formats and network messages. This requires
converting the internal representations (wchar_t and multibyte
strings) to various character encodings such as UCS-2LE, ISO-8859-1,
UTF-8, ASCII, etc. But it's not clear to me how this should be done
correctly and portably (or at least portably enough that when it
comes time to port I don't smack myself in the forehead).

I gather that the most I can assume is that a wchar_t string is a
sequence of UCS code points, each sizeof(wchar_t) bytes wide. But is
the in-memory representation of a multibyte string equivalent to the
UTF-8 encoding, such that I can simply write it to a stream and read
it back as UTF-8?
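
To make the question concrete, here's the check I have in mind (a
minimal sketch; my understanding is that nl_langinfo(CODESET) just
reports whatever the current locale uses, which need not be UTF-8):

    /* Check whether the locale's multibyte codeset is UTF-8 before
     * treating mbs bytes as UTF-8 on the wire. */
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");  /* take the codeset from the environment */
        const char *codeset = nl_langinfo(CODESET);
        printf("multibyte codeset: %s\n", codeset);
        if (strcmp(codeset, "UTF-8") == 0)
            puts("mbs can be written out as UTF-8 as-is");
        else
            puts("mbs is NOT UTF-8 here; it needs converting first");
        return 0;
    }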

So as an example case: to encode wchar_t to UTF-16LE, must I convert
each character to a definitive encoding such as UCS-4 and then use
iconv to get to UTF-16LE? For a multibyte string, can I use iconv on
the mbs directly to get UTF-16LE? The sketch below is roughly what I
have in mind for the wchar_t leg.
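
(A sketch only; as far as I know the "WCHAR_T" encoding name is a
glibc/GNU libiconv extension, so on other platforms one would go via
UCS-4 by hand as described above.)

    /* Sketch: convert a wchar_t string to UTF-16LE with iconv,
     * using glibc's "WCHAR_T" pseudo-encoding as the source. */
    #include <iconv.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        const wchar_t src[] = L"caf\u00e9";
        iconv_t cd = iconv_open("UTF-16LE", "WCHAR_T");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        char *in = (char *)src;
        size_t inleft = wcslen(src) * sizeof(wchar_t);
        char out[64], *outp = out;
        size_t outleft = sizeof out;

        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            iconv_close(cd);
            return 1;
        }
        printf("wrote %zu UTF-16LE bytes\n", sizeof out - outleft);
        iconv_close(cd);
        return 0;
    }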

Another example: to decode CP1252, could I use iconv to get to, say,
UCS-4, then decode each 4-byte sequence into a uint32_t and reassign
it in place to correct for the host's byte order?
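
Roughly like this (again just a sketch; I pick the fixed "UCS-4BE"
target so the byte order is known, and then assemble each code point
by hand instead of swapping in place):

    /* Sketch: decode CP1252 bytes to Unicode code points via iconv,
     * reading the UCS-4BE output back as host uint32_t values. */
    #include <iconv.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const char cp1252[] = "\x93quoted\x94";  /* CP1252 curly quotes */
        iconv_t cd = iconv_open("UCS-4BE", "CP1252");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        char *in = (char *)cp1252;
        size_t inleft = sizeof cp1252 - 1;
        unsigned char out[256];
        char *outp = (char *)out;
        size_t outleft = sizeof out;

        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            iconv_close(cd);
            return 1;
        }
        size_t nbytes = sizeof out - outleft;
        for (size_t i = 0; i < nbytes; i += 4) {
            uint32_t cp = ((uint32_t)out[i]     << 24)
                        | ((uint32_t)out[i + 1] << 16)
                        | ((uint32_t)out[i + 2] <<  8)
                        |  (uint32_t)out[i + 3];
            printf("U+%04X\n", (unsigned)cp);
        }
        iconv_close(cd);
        return 0;
    }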

Thanks,
Mike

PS: When encoding to ASCII, do I want to shave off the 8th bit?
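
By which I mean the first of these two, as opposed to substituting
for out-of-range bytes (both just sketches):

    #include <stdio.h>

    /* "Shaving the 8th bit": force every byte into 0x00-0x7F. */
    static void ascii_mask(const unsigned char *s, FILE *out)
    {
        for (; *s; s++)
            fputc(*s & 0x7F, out);
    }

    /* Alternative: keep 7-bit bytes, substitute for the rest. */
    static void ascii_replace(const unsigned char *s, FILE *out)
    {
        for (; *s; s++)
            fputc(*s <= 0x7F ? *s : '?', out);
    }

    int main(void)
    {
        ascii_mask((const unsigned char *)"na\xc3\xafve", stdout);
        fputc('\n', stdout);
        ascii_replace((const unsigned char *)"na\xc3\xafve", stdout);
        fputc('\n', stdout);
        return 0;
    }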

-- 
Wow a memory-mapped fork bomb! Now what on earth did you expect? - lkml