On Sat, 24 Jan 2015 13:45:37 +0100 Diederick Huijbers ☾ <[email protected]> wrote:
> Thanks so much Richard, one question though .... (see below)

Please reply to the list ( [email protected] ), not just to me.

> > The ICU positions translate to byte offsets as:
> > Position 0 = Byte offset 0
> > Position 1 = Byte offset 3
> > Position 2 = Byte offset 6
> > Position 3 = Byte offset 9
> > Position 4 = Byte offset 10 (previous character was ASCII space)
> > Position 5 = Byte offset 13
> > Position 6 = Byte offset 16
> > Position 7 = Byte offset 19
> > Position 8 = Byte offset 20
> > Position 9 = Byte offset 23
> > Position 10 = Byte offset 26
> > Position 11 = Byte offset 29 (end of string, so no cluster, no
> > glyphs)

> > The ICU positions are 16-bit word offsets in UTF-16. I don't know
> > if there is a UTF-8 interface; I believe ICU word segmentation that
> > needs dictionary lookup is broken for UTF-8.

> How did you arrive to this mapping? I'm wondering what structs hold
> these information.

If it's precomputed for you, I think that will be done by ICU rather
than by HarfBuzz.

I know the lengths of Unicode characters (by codepoint) in the UTF-8
and UTF-16 encodings. I also knew that the HarfBuzz cluster numbers
would be byte offsets, so I checked my workings that way.

I would generate such a table by stepping through the string, character
by character. Strictly, one should ensure that the UTF-8 string
consists only of UTF-8 characters, e.g. no CESU-8 or Latin-1
masquerading as UTF-8. I would treat surrogate codepoints (U+D800 to
U+DFFF) as corresponding to two UTF-8 bytes. If the string originates
as a sequence of characters in UTF-8, there will be no lone surrogates
to create trouble.

I would test the generation of this conversion table using a mixture of
1-byte, 2-byte and 4-byte characters.

Richard.
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
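
The table generation described above can be sketched roughly as
follows. This is only an illustration, not code from ICU or HarfBuzz:
the function name utf16_to_utf8_offsets and the Thai sample string are
made up for the example, and the input is assumed to be well-formed
text with no lone surrogates. It steps through the UTF-16 code units
and counts each surrogate as two UTF-8 bytes, so a surrogate pair adds
up to the four bytes of the supplementary character.

```python
import struct


def utf16_to_utf8_offsets(text):
    """Map each UTF-16 code unit offset to a UTF-8 byte offset.

    Hypothetical helper: table[i] is the byte offset in the UTF-8
    encoding that corresponds to UTF-16 offset i. The last entry is
    the total UTF-8 byte length (the end-of-string position).
    """
    data = text.encode("utf-16-le")  # assumes no lone surrogates
    units = struct.unpack("<%dH" % (len(data) // 2), data)

    table = [0]
    for unit in units:
        if unit < 0x80:
            nbytes = 1  # ASCII
        elif unit < 0x800:
            nbytes = 2
        elif 0xD800 <= unit <= 0xDFFF:
            nbytes = 2  # half of a surrogate pair: 2 + 2 = 4 bytes
        else:
            nbytes = 3  # rest of the Basic Multilingual Plane
        table.append(table[-1] + nbytes)
    return table


# Assumed sample: three groups of three 3-byte characters separated by
# ASCII spaces, which has the same shape as the table quoted above.
sample = "\u0e01\u0e02\u0e04 \u0e01\u0e02\u0e04 \u0e01\u0e02\u0e04"
offsets = utf16_to_utf8_offsets(sample)
# offsets == [0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 26, 29]
```

With a table like this, an ICU break position p would be looked up as
offsets[p] to get the byte offset to compare against the HarfBuzz
cluster values. Note that the offset for the second half of a surrogate
pair falls in the middle of the 4-byte UTF-8 sequence; break positions
should never land there in well-formed text.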
