--- In [email protected], "entropyreduction" 
<alancampbelllists+ya...@...> wrote:
>
> These any help?
> 
> Unicode UTF-8 encoding
> http://www1.tip.nl/~t876506/utf8tbl.html
>

Here it is using ANDs and ORs instead of subtracts and adds.  The subtracts 
work in that case only because you know the exact bits that are being removed 
(whereas &0x3f removes, e.g., the top 2 bits regardless of what they are).
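
To make that difference concrete, here is a small sketch in Python (not PowerPro, purely for illustration; the variable names are made up). Subtracting 0x80 gives the same answer as the mask only when the top two bits really are 10:

```python
# A well-formed continuation byte has the form 10xxxxxx.
good = 0xA2                # 1010 0010
assert good - 0x80 == good & 0x3F == 0x22

# If the top bits are not exactly 10, the subtraction leaves a stray bit
# behind, while the mask still keeps only the low six bits.
bad = 0xE2                 # 1110 0010, not a valid continuation byte
by_subtract = bad - 0x80   # 0x62: bit 6 survives
by_mask = bad & 0x3F       # 0x22: only the low six bits remain
```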

Note that &, |, <<, >> are understood by PowerPro to operate on numbers.  Since 
PowerPro stores integers as (base 10) strings, it automatically converts the 
numbers to their binary form before applying the operator, then converts back 
to strings.
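
As a rough analogy only (this is Python, and the helper name is hypothetical; it says nothing about PowerPro's internals beyond the description above), the round trip looks something like this:

```python
def and_on_decimal_strings(a: str, b: str) -> str:
    """Hypothetical helper: parse two base-10 strings, AND them, re-stringify."""
    return str(int(a) & int(b))

# "162" is 0xA2 and "63" is 0x3F, so the result is the string "34" (0x22).
result = and_on_decimal_strings("162", "63")
```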

This function converts a UTF8 string representing a single code point to that 
code point in Unicode.  Note that utf8 can be stored in normal PowerPro 
strings.  Unicode cannot, so the result is returned as a number.

// some long lines may have been split by Yahoo...
// test samples are from wikipedia article on utf8

win.debug("50 => 50      ",  cvtutf8("\x50").convertbase(10,16))

win.debug("C2A2 => 00A2     ", cvtutf8("\xC2\xA2").convertbase(10,16))

win.debug("E282AC => 20AC    ", cvtutf8("\xE2\x82\xAC").convertbase(10,16))

win.debug("F0A4ADA2 => 024B62    ", cvtutf8("\xF0\xA4\xAD\xA2").convertbase(10,16))

//****************************************************
function cvtutf8(u8)

local b1=u8[0].tonum
if (b1 <= 0x7f)
quit (b1)


local b2 = u8[1].tonum
if (b1<=0xDf) 
quit ((b1&0x1f)<<6 | (b2 & 0x3f ))     //   110y yyyy  10xx xxxx 

local b3 = u8[2].tonum
if (b1<=0xef)
quit ((b1&0xf)<<12 | (b2&0x3f)<<6 | (b3&0x3f) )   //1110zzzz 10yyyyyy 10xxxxxx 

local b4 = u8[3].tonum
quit  ((b1&0x7)<<18 | (b2&0x3f)<<12 | (b3&0x3f)<<6 | (b4&0x3f))    //11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
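
For anyone without PowerPro to hand, here is a rough Python sketch of the same decoding logic, useful for cross-checking the expected values. It is a translation under assumptions, not the script above: it takes a bytes object rather than a PowerPro string, assumes well-formed UTF-8 for exactly one code point, and masks the 4-byte lead byte with 0x07 to match the 11110uuu pattern.

```python
def cvtutf8(u8: bytes) -> int:
    """Decode the UTF-8 bytes of a single code point (assumes valid input)."""
    b1 = u8[0]
    if b1 <= 0x7F:                      # 0xxxxxxx
        return b1

    b2 = u8[1]
    if b1 <= 0xDF:                      # 110yyyyy 10xxxxxx
        return (b1 & 0x1F) << 6 | (b2 & 0x3F)

    b3 = u8[2]
    if b1 <= 0xEF:                      # 1110zzzz 10yyyyyy 10xxxxxx
        return (b1 & 0x0F) << 12 | (b2 & 0x3F) << 6 | (b3 & 0x3F)

    b4 = u8[3]                          # 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
    return (b1 & 0x07) << 18 | (b2 & 0x3F) << 12 | (b3 & 0x3F) << 6 | (b4 & 0x3F)


# Same samples as the win.debug lines above, printed as hex.
for sample in (b"\x50", b"\xC2\xA2", b"\xE2\x82\xAC", b"\xF0\xA4\xAD\xA2"):
    print(sample.hex().upper(), "=>", format(cvtutf8(sample), "X"))
```

The results can also be compared against Python's built-in decoder, e.g. ord(sample.decode("utf-8")).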

     

