Re: [Oorexx-devel] How to portably convert between 8-bit and UTF-8 and vice versa

Rony G. Flatscher Tue, 05 Jul 2011 06:44:00 -0700

Hi Mike,

thank you very much for your information and code!


In the meantime I was able to come up with a working solution using
BSF4ooRexx. Once the library is done I will look into this again with
your code and run timings, as I would prefer not to force the use of
BSF4ooRexx on people who do not want to take advantage of it.

Currently the conversion is done with the following two routines, which
require the BSF.CLS Rexx package from BSF4ooRexx:

    /* as of 2011-07-05 ooRexx does not support UTF-8, hence taking advantage
       of BSF4ooRexx */
    ::routine string.To.utf8 public
      parse arg str
      return BsfRawBytes(.java.lang.String~new(str)~getBytes("UTF-8"))


    ::routine utf8.To.String public
      parse arg str
      return BsfRawBytes(.java.lang.String~new(BsfRawBytes(str),"UTF-8")~getBytes)
      

BsfRawBytes() is a built-in function (for speed) that turns a Rexx
string into a Java byte array without conversion. If the argument is
already a Java byte array, it gets turned into a Rexx string, again
without conversion. Everything else is done by the conversion
capabilities of Java's String class: getBytes("UTF-8") produces the
UTF-8 bytes, and getBytes without an argument produces the 8-bit bytes
in the JVM's default charset.
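
For comparison, the Latin-1 case Mike describes in his message quoted
below could presumably be handled directly in the library's C++ code,
without Java. The following is only a minimal sketch, assuming the
8-bit strings are ISO-8859-1 (not whatever the current locale's code
page happens to be); latin1_to_utf8 and utf8_to_latin1 are hypothetical
helper names, not part of ooRexx or BSF4ooRexx:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode an ISO-8859-1 (Latin-1) byte string as UTF-8.
// Each input byte is taken as the code point U+0000..U+00FF of the same value.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // 7-bit ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110000xx
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 back to Latin-1.
// Returns false if the input is malformed or contains a code point
// above U+00FF, which has no Latin-1 representation.
bool utf8_to_latin1(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);                  // 7-bit ASCII: unchanged
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size()) {
            // Only lead bytes C2 and C3 encode U+0080..U+00FF.
            unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
            if ((c2 & 0xC0) != 0x80) return false;        // not a continuation byte
            out += static_cast<char>(((c & 0x03) << 6) | (c2 & 0x3F));
            ++i;                                          // skip the continuation byte
        } else {
            return false;                                 // outside Latin-1 or malformed
        }
    }
    return true;
}
```

With this sketch the byte 'E4'x (the umlaut a in Latin-1) becomes the
two bytes 'C3A4'x and back again, while anything above U+00FF, such as
the euro sign, is rejected by utf8_to_latin1, so a caller would have to
decide how to report that.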

---rony



On 05.07.2011 15:28, Mike Cowlishaw wrote:
> Hi, Rony, trying to see what you are really looking for.
>  
> UTF8 -> 8-bit in general cannot be done because UTF-8 is an encoding
> for 16-bit characters (and some escapes for 32-bit extension).
>  
> However, for European characters the Latin-1 character set
> (http://en.wikipedia.org/wiki/ISO/IEC_8859-1) is probably what you
> need: that is the 8-bit codes from '20'x -> '7e'x and 'a0'x -> 'ff'x?
>  
> These codes map correctly to just about any modern code page
> (including Windows).   Transforming between UTF8 and 8-bit for this
> set of characters is straightforward; here's some ancient sample code
> for 8-bit -> UTF [it probably gives wrong results for non-Latin1
> characters by the looks of it].
>  
> Mike
>  
> /* --------------------------------------------------------------- */
> /* UTF-8 encoder (for 00-FF only)                                  */
> /* --------------------------------------------------------------- */
> utf8: procedure
>   parse arg data
>   out=''
>   do while data\==''                    -- generate escapes
>     parse var data char +1 data
>     d=c2d(char)
>     if d>=128 then do
>       bits=x2b(c2x(char))
>       c1=x2c(b2x('110000'left(bits, 2)))
>       c2=x2c(b2x('10'substr(bits, 3)))
>       char=c1||c2
>       end
>     out=out||char
>     end
>   return out
>  
>  
>  
>  
>  
>
>     ------------------------------------------------------------------------
>     *From:* Rony G. Flatscher [mailto:rony.flatsc...@wu-wien.ac.at]
>     *Sent:* 05 July 2011 09:44
>     *To:* Open Object Rexx Developer Mailing List
>     *Subject:* Re: [Oorexx-devel] How to portably convert between
>     8-bit and UTF-8 and vice versa
>
>     Hi Jean-Louis,
>>
>>     2011/7/4 Rony G. Flatscher <rony.flatsc...@wu-wien.ac.at
>>     <mailto:rony.flatsc...@wu-wien.ac.at>>
>>
>>         Hi there,
>>
>>         in the process of creating an external ooRexx function
>>         library, I sometimes have to transport strings as UTF-8,
>>         even when they contain non-7-bit-ASCII characters (for
>>         non-English languages).
>>
>>
>>     If you need only to transport utf-8 strings, then  strcpy and
>>     strlen should do the work. You will work on bytes, not on characters.
>>     If you need to work on characters, and search for a lightweight
>>     library, then http://utfcpp.sourceforge.net/ may help. But your
>>     request was not on that :-)
>     :)
>
>     The problem is as follows: the library is supposed to open the
>     dbus world to ooRexx programmers. dbus implementations are - they
>     claim for security reasons - extremely wary about spoofing and
>     therefore check everything thoroughly. If an argument is wrong for
>     whatever reasons the message call is not carried out.
>
>     The current state is that transporting strings is fine as long as
>     they contain only 7-bit ASCII characters/bytes, i.e. only English
>     letters. As soon as a string contains German umlauts, which of
>     course is very common in a German-speaking country (as are French
>     characters in your country), dbus simply disconnects when it
>     detects that the string is not properly UTF-8-encoded! This
>     makes ooRexx totally incompatible with dbus (and with the rest of
>     the world that has been using UTF-8 as a standard encoding).
>
>     As ooRexx (inexplicably!) still does not officially support
>     UTF-8/Unicode (in the meantime the entire world speaks
>     UTF-8/Unicode, text files are UTF-8/Unicode, arguments are
>     UTF-8/Unicode etc.) I need some means to at least cater somehow
>     for creating proper UTF-8 encodings. Hence this request for help.
>
>>         Is there a simple/easy way in C++ to create UTF-8 strings
>>         from 8-bit strings and to convert UTF-8 to 8-bit strings,
>>         such that the code compiles for Windows as well as for gcc
>>         on the other platforms?
>>
>>     That's more complicated... ICU supports plenty of character sets,
>>     but it's big.
>>     See also the library Glib used by GTK :
>>     http://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html.
>>     If your 8-bit string is always encoded in the current locale
>>     encoding (C runtime), then functions like
>>     g_locale_to_utf8 ()
>>     g_locale_from_utf8 ()
>>     from Glib are what you need.
>     Hmm, glib would cover at least GNOME-based Linuxes (plus systems
>     where gtk-apps got installed to, but this would be merely by chance).
>
>     Would you know by any chance whether there are alternatives for
>     Linux, MacOSX and Windows?
>
>     ---
>
>     This would not be a problem at all, if ooRexx supported
>     UTF-8/Unicode, as every modern scripting language does nowadays!
>
>     ---rony
>
>     P.S.: I am even contemplating using JNI (the Java Native
>     Interface), which provides UTF-8 encoding/decoding out of the
>     box; this would mean that the dbus library would have to become
>     a part of BSF4ooRexx. Should ooRexx ever get UTF-8/Unicode
>     capabilities, I could adjust the respective code then.
>
>
>
> ------------------------------------------------------------------------------
> All of the data generated in your IT infrastructure is seriously valuable.
> Why? It contains a definitive record of application performance, security 
> threats, fraudulent activity, and more. Splunk takes this data and makes 
> sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-d2d-c2
>
>
> _______________________________________________
> Oorexx-devel mailing list
> Oorexx-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oorexx-devel
>   

-- 
--
__________________________________________________________________________________

Prof. Dr. Rony G. Flatscher
Department Informationsverarbeitung und Prozessmanagement
Institut für Betriebswirtschaftslehre und Wirtschaftsinformatik
WU Wien
Augasse 2-6
A-1090  Wien/Vienna, Austria/Europe

http://www.wu.ac.at (English: http://www.wu.ac.at/start/en)
__________________________________________________________________________________

