[power-pro] Re: unicode muti-word code points (Bruce)

silvermoonwoman2001 Mon, 31 Aug 2009 15:30:29 -0700

--- In [email protected], "brucexs" <bswit...@...> wrote:
>
> --- In [email protected], "silvermoonwoman2001" <sherip99@> wrote:
> >
> > --- In [email protected], "brucexs" <bswitzer@> wrote:
> > >
> > > If you need to do some specific mathematical or string operation,
> > > maybe I can help on that.
> > 
> > 
> > A couple of test cases:
> > if utf8charstring=="\xf0\x90\x80\x80" c s/b 0x10000
> > if utf8charstring=="\xf0\x90\x8a\xa9" c s/b 0x102A9
> > 
> > If I know all of the following, how do I solve for c ?
> > 
> > case("tonum", utf8charstring[0]) = 0xF0 | c>>18
> > case("tonum", utf8charstring[1]) = 0x80 | c>>12 & 0x3F
> > case("tonum", utf8charstring[2]) = 0x80 | c>>6 & 0x3F
> > case("tonum", utf8charstring[3]) = 0x80 | c & 0x3F
> > 
> > c = ??
> >
> 
> Had a quick look at the Wikipedia article. I think you want something
> that scans the bytes one at a time and handles the four
> cases, using shifts and masks. By four cases, I am referring to
> whether the first byte has 0, 2, 3, or 4 leading ones.
>
> I am not really sure I understand what you are asking about above
> when you mention c.


A code point is a unit of unicode. Personally I think of it as a character 
(although there are some situations where code points can be combined, 
resulting in an "extended character" -- e.g., a main character with some 
unusual diacritical).

I can encode utf-8 (which uses one to four bytes per unicode code point) from 
the code point numeric values using the previously posted script.

I haven't mastered bit operations. I just finally created a file, using my 
encoder for each code point in the unicode org's data base, with two values per 
line:

codepoint-hexvalue;codepoint-encoded-in-utf8

So, if I want the codepoint hex-value for some arbitrary utf-8 "character" 
(matches dot in utf-8 regex), I can look it up in the file's readall using 
normal (non-utf8) regex. Since the file has only valid existing codepoints, 
there are only 19336 entries. Contrast that with the 1,114,112 "possible" 
code points (U+0000 through U+10FFFF).
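
For illustration only, the file lookup described above might be sketched like this in Python rather than PowerPro script. The names build_table and lookup are hypothetical, the table is built in memory from a handful of sample code points instead of the full data base, and the line format is assumed to be exactly hexvalue;utf8bytes with one entry per line.

```python
import re

def build_table(codepoints):
    """Build lines of the form HEX;UTF8 (as bytes), one per code point."""
    lines = []
    for cp in codepoints:
        # Hex value in ASCII, a semicolon, then the UTF-8 encoding itself.
        lines.append(f"{cp:04X};".encode("ascii") + chr(cp).encode("utf-8"))
    return b"\n".join(lines) + b"\n"

def lookup(table, utf8char):
    """Find the code point for one UTF-8 character with a plain byte regex."""
    pattern = rb"^([0-9A-F]+);" + re.escape(utf8char) + rb"$"
    m = re.search(pattern, table, re.MULTILINE)
    return int(m.group(1), 16) if m else None

# Sample entries only; the real file would hold every assigned code point.
table = build_table([0x41, 0xE9, 0x20AC, 0x10000, 0x102A9])
print(hex(lookup(table, b"\xf0\x90\x8a\xa9")))
```

The regex works on raw bytes, which mirrors the "normal (non-utf8) regex" approach: the multi-byte sequence is matched literally, with no decoding involved.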

I know the conversion should be doable with much less overhead, but using file 
lookup works and was something I could manage.
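
As a rough sketch of the direct conversion (again in Python, not PowerPro script, with a hypothetical function name), the four equations from the earlier post can be inverted by masking off the marker bits and shifting each byte back into place. For the four-byte case that gives c = (b0 & 0x07) << 18 | (b1 & 0x3F) << 12 | (b2 & 0x3F) << 6 | (b3 & 0x3F). The version below generalizes this to one to four bytes. It assumes the input is exactly one well-formed character and does not reject overlong encodings or surrogates.

```python
def utf8_to_codepoint(data):
    """Decode one UTF-8 character (1 to 4 bytes) into its code point."""
    first = data[0]
    if first < 0x80:               # 0xxxxxxx: plain ASCII, no more bytes
        need, c = 0, first
    elif first >> 5 == 0b110:      # 110xxxxx: one continuation byte
        need, c = 1, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110xxxx: two continuation bytes
        need, c = 2, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110xxx: three continuation bytes
        need, c = 3, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != need + 1:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b >> 6 != 0b10:         # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        c = (c << 6) | (b & 0x3F)  # append the low six payload bits
    return c

# The two test cases from the earlier post.
print(hex(utf8_to_codepoint(b"\xf0\x90\x80\x80")))
print(hex(utf8_to_codepoint(b"\xf0\x90\x8a\xa9")))
```

Each pass through the loop shifts the running value left by six bits and ORs in the next byte's payload, which is the reverse of the >>18, >>12, >>6 shifts used when encoding.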

Regards,
Sheri
