Re: [Oorexx-devel] How to portably convert between 8-bit and UTF-8 and vice versa

Mike Cowlishaw Tue, 05 Jul 2011 06:28:39 -0700

Hi, Rony, trying to see what you are really looking for.
 
UTF-8 -> 8-bit cannot be done in general, because UTF-8 encodes the whole
Unicode range (code points up to U+10FFFF), far more than the 256 values an
8-bit character set can hold.
 
However, for Western European characters the Latin-1 character set
(http://en.wikipedia.org/wiki/ISO/IEC_8859-1) is probably what you need: that
is, the 8-bit codes from '20'x -> '7e'x and 'a0'x -> 'ff'x?
 
These codes map correctly to just about any modern code page (including
Windows). Transforming between UTF-8 and 8-bit for this set of characters is
straightforward; here's some ancient sample code for 8-bit -> UTF-8 [by the
looks of it, it probably gives wrong results for non-Latin-1 characters].
 
Mike
 
/* --------------------------------------------------------------- */
/* UTF-8 encoder (for 00-FF only)                                  */
/* --------------------------------------------------------------- */
utf8: procedure
  parse arg data
  out=''
  do while data\==''                    -- generate escapes
    parse var data char +1 data
    d=c2d(char)
    if d>=128 then do                    -- 80-FF: two-byte sequence
      bits=x2b(c2x(char))
      c1=x2c(b2x('110000'left(bits, 2))) -- lead byte:  110000xx
      c2=x2c(b2x('10'substr(bits, 3)))   -- trail byte: 10xxxxxx
      char=c1||c2
      end
    out=out||char
    end
  return out
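
Since the library in question is written in C++, the same Latin-1 mapping could
be sketched there in both directions. This is only an illustration under the
assumption that the 8-bit side really is ISO 8859-1; the function names
latin1_to_utf8 and utf8_to_latin1 are made up for this example and are not part
of ooRexx or any library.

```cpp
#include <cstddef>
#include <string>

// Encode a Latin-1 (ISO 8859-1) byte string as UTF-8.
// Each Latin-1 byte value equals its Unicode code point, so bytes
// below 0x80 are copied and the rest become a two-byte sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // 110000xx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // 10xxxxxx
        }
    }
    return out;
}

// Decode UTF-8 into Latin-1. Returns false if the input is malformed
// or contains a code point above U+00FF; 'out' is then incomplete.
bool utf8_to_latin1(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
            continue;
        }
        // Only the lead bytes 0xC2 and 0xC3 produce U+0080..U+00FF.
        if ((c != 0xC2 && c != 0xC3) || i + 1 >= in.size()) return false;
        const unsigned char t = static_cast<unsigned char>(in[++i]);
        if ((t & 0xC0) != 0x80) return false;  // not a continuation byte
        out.push_back(static_cast<char>(((c & 0x03) << 6) | (t & 0x3F)));
    }
    return true;
}
```

The decoder reports failure instead of substituting a placeholder, so the caller
can decide what to do with characters that have no Latin-1 equivalent.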

  _____  

From: Rony G. Flatscher [mailto:rony.flatsc...@wu-wien.ac.at] 
Sent: 05 July 2011 09:44
To: Open Object Rexx Developer Mailing List
Subject: Re: [Oorexx-devel] How to portably convert between 8-bit and UTF-8 and
vice versa

Hi Jean-Louis,

2011/7/4 Rony G. Flatscher <rony.flatsc...@wu-wien.ac.at>

Hi there,

in the process of creating an external ooRexx function library, I
sometimes have to transport strings as UTF-8, even when they contain
non-7-bit-ASCII characters (i.e. non-English characters).

If you only need to transport UTF-8 strings, then strcpy and strlen should do
the work. You will work on bytes, not on characters.
If you need to work on characters and are looking for a lightweight library,
then http://utfcpp.sourceforge.net/ may help. But your request was not on that :-)

:)

The problem is as follows: the library is supposed to open the dbus world to
ooRexx programmers. dbus implementations are (for security reasons, they claim)
extremely wary of spoofing and therefore check everything thoroughly. If an
argument is wrong for whatever reason, the message call is not carried out.

The current state is that transporting strings is fine as long as they contain
only 7-bit ASCII characters/bytes, i.e. only English letters. As soon as a
string contains German umlauts, which of course is very common in a
German-speaking country (as are French characters in your country), dbus simply
disconnects when it detects that the string is not properly UTF-8-encoded! This
makes ooRexx totally incompatible with dbus (and with the rest of the world,
which has been using UTF-8 as a standard encoding).
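
One way to catch such strings before they reach dbus would be to check them for
well-formed UTF-8 in the library itself. The sketch below is a generic
well-formedness check with a made-up name, is_valid_utf8; exactly which
conditions a given dbus implementation rejects is not stated in this thread, so
treat the rule set as an assumption.

```cpp
#include <cstddef>
#include <string>

// Return true if s is well-formed UTF-8: no stray continuation bytes,
// no truncated sequences, no overlong forms, no UTF-16 surrogates,
// and nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;           // stray continuation or invalid lead byte
        if (len > n - i) return false;                   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;        // not a continuation
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

A Latin-1 string containing a bare 'fc'x (ü) fails this check, which matches the
symptom described above: the umlaut is a valid 8-bit character but not a valid
UTF-8 sequence.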

As ooRexx (inexplicably!) still does not officially support UTF-8/Unicode
(while in the meantime the entire world speaks UTF-8/Unicode: text files are
UTF-8/Unicode, arguments are UTF-8/Unicode etc.), I need some means of at least
creating proper UTF-8 encodings. Hence this request for help.

Is there a simple/easy way in C++ to create UTF-8 strings from 8-bit
strings and to convert UTF-8 to 8-bit strings, such that the code
compiles on Windows as well as with gcc on the other platforms?

That's more complicated... ICU supports plenty of character sets, but it's big.
See also the GLib library used by GTK:
http://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html.
If your 8-bit string is always encoded in the current locale encoding (C
runtime), then functions like
g_locale_to_utf8 ()
g_locale_from_utf8 ()
from GLib are what you need.

Hmm, GLib would cover at least GNOME-based Linux systems (plus systems where
GTK apps happen to be installed, but that would be merely by chance).

Would you know by any chance whether there are alternatives for Linux, Mac OS X
and Windows?

---

This would not be a problem at all if ooRexx supported UTF-8/Unicode, as every
modern scripting language does nowadays!

---rony

P.S.: I am even contemplating using JNI (the Java Native Interface), which
provides UTF-8 encoding/decoding out of the box; that would mean the dbus
library would have to become a part of BSF4ooRexx. Should ooRexx ever get
UTF-8/Unicode capabilities, I could adjust the respective code then.


