--- In [email protected], "brucexs" <bswit...@...> wrote:
>
> --- In [email protected], "silvermoonwoman2001" <sherip99@> wrote:
> >
> > --- In [email protected], "brucexs" <bswitzer@> wrote:
> > >
> > > If you need to do some specific mathematical or string operation,
> > > maybe I can help on that.
> > >
> >
> > A couple of test cases:
> > if utf8charstring=="\xf0\x90\x80\x80" c s/b 0x10000
> > if utf8charstring=="\xf0\x90\x8a\xa9" c s/b 0x102A9
> >
> > If I know all of the following, how do I solve for c ?
> >
> > case("tonum", utf8charstring[0]) = 0xF0 | c>>18
> > case("tonum", utf8charstring[1]) = 0x80 | c>>12 & 0x3F
> > case("tonum", utf8charstring[2]) = 0x80 | c>>6 & 0x3F
> > case("tonum", utf8charstring[3]) = 0x80 | c & 0x3F
> >
> > c = ??
> >
>
> Had a quick look at the Wikipedia article. I think you want something
> that scans the bytes one at a time and does the four
> cases, using shifts and masks. By four cases, I am referring to
> whether there are 0, 1, 2, or 3 leading ones.
>
> I am not really sure I understand what you are asking about above
> when you mention c.
A code point is a unit of Unicode. Personally I think of it as a character (although there are some situations where code points can be combined, resulting in an "extended character" -- e.g., a main character with some unusual diacritical).

I can encode UTF-8 (which uses one to four bytes per Unicode code point) from the code point numeric values using the previously posted script. I haven't mastered bit operations, so going the other way is harder for me.

I just finally created a file, using my encoder for each code point in the Unicode org's database, with two values per line:

codepoint-hexvalue;codepoint-encoded-in-utf8

So, if I want the code point hex value for some arbitrary UTF-8 "character" (whatever matches dot in a UTF-8 regex), I can look it up in the file's readall using a normal (non-UTF-8) regex. Since the file has only valid existing code points, there are only 19336 entries. Contrast that with the over 2 million "possible" code points.

I know the conversion should be doable with much less overhead, but using the file lookup works and was something I could manage.

Regards,
Sheri
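
For what it's worth, the four equations quoted above can be inverted directly: mask off the marker bits of each byte and shift the payload bits back into place. For the four-byte case that gives c = (b0 & 0x07)<<18 | (b1 & 0x3F)<<12 | (b2 & 0x3F)<<6 | (b3 & 0x3F). Below is a minimal sketch of that idea, written in Python rather than PowerPro script; the function name utf8_to_codepoint is made up for illustration. It assumes the input is exactly one encoded character, picks the length from the lead byte (0, 2, 3, or 4 leading one bits; a single leading one marks a continuation byte), and does not reject overlong forms or surrogates.

```python
def utf8_to_codepoint(seq: bytes) -> int:
    """Decode one UTF-8 encoded character (1 to 4 bytes) to its code point.

    Illustrative sketch only: overlong encodings and surrogate values
    are not rejected.
    """
    if not seq:
        raise ValueError("empty input")

    lead = seq[0]
    if lead < 0x80:                # 0xxxxxxx: one byte, 7 payload bits
        length, c = 1, lead
    elif lead >> 5 == 0b110:       # 110xxxxx: two bytes, 5 payload bits
        length, c = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:      # 1110xxxx: three bytes, 4 payload bits
        length, c = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:     # 11110xxx: four bytes, 3 payload bits
        length, c = 4, lead & 0x07
    else:
        raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead)

    if len(seq) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(seq)))

    for b in seq[1:]:
        if b >> 6 != 0b10:         # continuation bytes look like 10xxxxxx
            raise ValueError("bad continuation byte: 0x%02X" % b)
        c = (c << 6) | (b & 0x3F)  # append the 6 payload bits
    return c


# The two test cases from the quoted message
print(hex(utf8_to_codepoint(b"\xf0\x90\x80\x80")))  # 0x10000
print(hex(utf8_to_codepoint(b"\xf0\x90\x8a\xa9")))  # 0x102a9
```

The same shifts and masks should carry over to any script language that has bitwise AND and left shift, which would avoid the lookup file altogether. In Python itself, ord(seq.decode("utf-8")) gives the same answer for valid input.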
