--- In [email protected], "entropyreduction"
<alancampbelllists+ya...@...> wrote:
>
> These any help?
>
> Unicode UTF-8 encoding
> http://www1.tip.nl/~t876506/utf8tbl.html
>
Here it is using ands and ors instead of subtracts and adds. The subtracts
work in this case only because you know the exact bits that are being removed,
whereas &0x3f removes (e.g.) the top 2 bits regardless of what they are.
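To make the difference concrete, here is a small Python sketch (Python rather than PowerPro, only for illustration; the variable names are made up). It compares subtracting 0x80 with masking by 0x3f, once on a well-formed continuation byte and once on a byte whose top two bits are not 10:

```python
# A UTF-8 continuation byte has the form 10xxxxxx (0x80..0xBF).
good = 0xA2                     # 1010 0010, a well-formed continuation byte
# With the expected top bits, subtract and mask give the same payload.
assert (good - 0x80) == (good & 0x3F) == 0x22

# When the top two bits are something else, the two approaches disagree.
bad = 0xE2                      # 1110 0010, actually a 3-byte lead byte
by_subtract = bad - 0x80        # bit 6 survives and leaks into the payload
by_mask = bad & 0x3F            # top two bits dropped whatever they were

print(hex(by_subtract), hex(by_mask))
```

The mask gives the same 6 payload bits whatever sits above them, which is why it is the safer choice when the input has not been validated.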
Note that &, |, <<, >> are understood by PowerPro to operate on numbers. Since
PowerPro stores integers as (base 10) strings, it automatically converts the
numbers to their binary form before applying the operator, then converts back
to strings.
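As a rough picture of that round trip, the Python sketch below applies a bitwise operator to integers held as base 10 strings. It only models the behaviour described above; the function name and operator table are hypothetical, and it is not how PowerPro is implemented internally.

```python
import operator

# Hypothetical table mapping operator symbols to integer operations.
BIT_OPS = {
    "&": operator.and_,
    "|": operator.or_,
    "<<": operator.lshift,
    ">>": operator.rshift,
}

def bitop_on_decimal_strings(op, left, right):
    """Apply a bitwise operator to two integers stored as base 10 strings."""
    a, b = int(left), int(right)      # decimal string -> binary integer
    result = BIT_OPS[op](a, b)        # operate on the binary form
    return str(result)                # binary integer -> decimal string

# 226 is 0xE2; masking with 15 (0x0f) keeps the low four bits.
print(bitop_on_decimal_strings("&", "226", "15"))
print(bitop_on_decimal_strings("<<", "2", "12"))
```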
This function converts a UTF-8 string representing a single code point to that
code point in Unicode. Note that UTF-8 can be stored in normal PowerPro
strings. Unicode cannot, so the result is returned as a number.
// split lines throughout by Yahoo...
// test samples are from wikipedia article on utf8
win.debug("50 => 50 ", cvtutf8("\x50").convertbase(10,16))
win.debug("C2A2 => 00A2 ", cvtutf8("\xC2\xA2").convertbase(10,16))
win.debug("E282AC => 20AC ", cvtutf8("\xE2\x82\xAC").convertbase(10,16))
win.debug("F0A4ADA2 => 024B62 ", cvtutf8("\xF0\xA4\xAD\xA2").convertbase(10,16))
//****************************************************
function cvtutf8(u8)
// 1 byte: 0xxxxxxx (plain ASCII)
local b1=u8[0].tonum
if (b1 <= 0x7f)
  quit (b1)
// 2 bytes: 110y yyyy 10xx xxxx
local b2 = u8[1].tonum
if (b1<=0xDf)
  quit ((b1&0x1f)<<6 | (b2 & 0x3f ))
// 3 bytes: 1110zzzz 10yyyyyy 10xxxxxx
local b3 = u8[2].tonum
if (b1<=0xef)
  quit ((b1&0xf)<<12 | (b2&0x3f)<<6 | (b3&0x3f) )
// 4 bytes: 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
// lead byte carries only 3 payload bits, so mask with 0x7 rather than 0xf
local b4 = u8[3].tonum
quit ((b1&0x7)<<18 | (b2&0x3f)<<12 | (b3&0x3f)<<6 | (b4&0x3f))
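For anyone who wants to cross-check the results outside PowerPro, here is a minimal Python port of the same logic. It is a sketch, not the script above: it takes a bytes object instead of a PowerPro string, masks the 4-byte lead byte with 0x07 to match the 11110uuu layout, and like the original does no validation (no check that continuation bytes start with 10, and no rejection of overlong or truncated sequences). The loop compares each test sample against Python's own UTF-8 decoder.

```python
def cvtutf8(u8: bytes) -> int:
    """Decode one UTF-8 encoded code point; assumes well-formed input."""
    b1 = u8[0]
    if b1 <= 0x7F:                       # 0xxxxxxx
        return b1
    b2 = u8[1]
    if b1 <= 0xDF:                       # 110yyyyy 10xxxxxx
        return (b1 & 0x1F) << 6 | (b2 & 0x3F)
    b3 = u8[2]
    if b1 <= 0xEF:                       # 1110zzzz 10yyyyyy 10xxxxxx
        return (b1 & 0x0F) << 12 | (b2 & 0x3F) << 6 | (b3 & 0x3F)
    b4 = u8[3]                           # 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
    return ((b1 & 0x07) << 18 | (b2 & 0x3F) << 12
            | (b3 & 0x3F) << 6 | (b4 & 0x3F))

# Same samples as the win.debug lines, checked against the built-in decoder.
SAMPLES = (b"\x50", b"\xC2\xA2", b"\xE2\x82\xAC", b"\xF0\xA4\xAD\xA2")
for sample in SAMPLES:
    assert cvtutf8(sample) == ord(sample.decode("utf-8"))
    print(sample.hex().upper(), "=>", format(cvtutf8(sample), "04X"))
```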