Mano wrote:
>I am sure this is a stupid question...
>but hey! why break with tradition.
>
>why is the first byte of my Greek character 206 (hex CE)
>when on the unicode chart Greek (not polytonic.. just simple) the first
>byte is 3 (hex 03 :)
>
>I may be further than I thought.
>
>Mano

This is UTF-8; see http://en.wikipedia.org/wiki/UTF-8 or google for
"UTF-8 encoding" or similar.

UTF-8 is a variable-length encoding that allows multi-byte characters
to be embedded in 8-bit data. The 7-bit ASCII characters (where
$a(char)<128) are each represented unchanged in a single byte, and bytes
with $a(char)>127 mark either the introduction or the continuation of a
multi-byte sequence.

The 206 is 11001110 in binary. That breaks down to 110_01110 (192 + 14
in decimal), where the 110 (192 decimal) introduces a 2-byte character
sequence and the 01110 (14 decimal) represents the high bits of the
decoded Unicode character number. The second byte of each pair has
high bits 10 (128 decimal), leaving six bits to add to the five bits
from the first byte. The high bits get multiplied by 64 to effect a
6-bit shift. For your Λ the two bytes are 206 and 155, so
(206-192)*64+(155-128) = 896+27 = 923, which is hex 039B. That is where
the 03 on the chart comes from.

Here is another decoding of your previous Greek example. It reads the
string one byte at a time and treats every byte above 127 as the start
of a 2-byte sequence, which holds for this string. Remember that M
evaluates strictly left to right, so a-192*64 means (a-192)*64.

  s z="ΛΚΞΔΣΛΦΚ"
  f i=1:1:$l(z) w !,i s c=$e(z,i),a=$a(c) s:a>127 i=i+1,c=c_$e(z,i),a=a-192*64+($a(z,i)-128) w *9,c,*9,a,*9,$s(a>127:"&#"_a_";",1:a)

  1     Λ     923     &#923;
  3     Κ     922     &#922;
  5     Ξ     926     &#926;
  7     Δ     916     &#916;
  9     Σ     931     &#931;
  11    Λ     923     &#923;
  13    Φ     934     &#934;
  15    Κ     922     &#922;

---------------------------------------
Jim Self
Systems Architect, Lead Developer
VMTH Computer Services, UC Davis
(http://www.vmth.ucdavis.edu/us/jaself)

_______________________________________________
Hardhats-members mailing list
https://lists.sourceforge.net/lists/listinfo/hardhats-members
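
The decoding rule described in the reply (a lead byte whose high bits give the sequence length, followed by continuation bytes that each carry six bits) can be sketched as a small extrinsic function. This is only a sketch under stated assumptions: the label UTF8CP and its parameters are hypothetical and do not come from the post, the M implementation is assumed to run in byte mode so that $A and $C work on single bytes, and the input is assumed to be well-formed UTF-8 (nothing is validated). It also goes one step beyond the two-byte case in the post by handling 3- and 4-byte lead bytes with the same shift-and-add step.

```mumps
UTF8CP(z,i) ; hypothetical helper: code point of the UTF-8 sequence starting at byte i of z
 ; Assumes byte mode and well-formed input; performs no validation.
 ; Pass i by reference (.i); on return i points at the last byte consumed.
 n a,cnt,k
 s a=$a(z,i)
 q:a<128 a  ; 0xxxxxxx: plain ASCII, one byte
 s cnt=$s(a<224:1,a<240:2,1:3)  ; number of continuation bytes to read
 s a=a-$s(cnt=1:192,cnt=2:224,1:240)  ; strip the 110/1110/11110 lead marker
 f k=1:1:cnt s i=i+1,a=a*64+($a(z,i)-128)  ; 6-bit shift, then add the 10xxxxxx payload
 q a
```

For example, from inside the same routine, `s z=$c(206,155),i=1 w $$UTF8CP(z,.i)` writes 923 (the Λ from the question), and `s z=$c(226,130,172),i=1 w $$UTF8CP(z,.i)` writes 8364 (the euro sign, hex 20AC). Because i is advanced past the continuation bytes, the function can be called from a loop such as `f i=1:1:$l(z) w !,i,*9,$$UTF8CP(z,.i)`.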
