That happens when ligatures are applied to text. In general, there is no 1-to-1 relationship between characters and glyphs. You can use the cluster values in hb_glyph_info_t to match clusters of glyphs to clusters of characters. That's the most granular mapping you can get. In this case, you will see that one glyph (LAM-ALEF) corresponds to two characters (LAM and ALEF).
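As a rough sketch of that cluster matching (not HarfBuzz code itself): after shaping, hb_buffer_get_glyph_infos() returns an array of hb_glyph_info_t, and each entry's cluster field holds the index of the first input code unit of the cluster that glyph belongs to. The helper below, cluster_char_count, is a hypothetical name and not part of HarfBuzz. It takes those cluster values as a plain array so it can be compiled without the library, and reports how many input code units a given glyph's cluster covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a HarfBuzz API.
 *
 * clusters[]  : the .cluster value of each shaped glyph, in buffer order
 *               (copied from the hb_glyph_info_t array in real code)
 * glyph_count : number of glyphs after shaping
 * text_len    : number of input code units given to the buffer
 * i           : index of the glyph of interest
 *
 * Returns the number of input code units in the cluster glyph i belongs to.
 * A cluster runs from its own start index up to the next larger cluster
 * value present in the buffer, or to the end of the text if there is none. */
unsigned int cluster_char_count(const uint32_t *clusters,
                                size_t glyph_count,
                                uint32_t text_len,
                                size_t i)
{
    assert(i < glyph_count);

    uint32_t start = clusters[i];
    uint32_t end = text_len; /* exclusive end of the cluster */

    /* Scan every glyph, so this works whether the buffer is in
     * left-to-right or right-to-left (descending cluster) order. */
    for (size_t j = 0; j < glyph_count; j++) {
        if (clusters[j] > start && clusters[j] < end)
            end = clusters[j];
    }
    return end - start;
}
```

For illustration only, assume the 12-code-unit string below shapes to 11 glyphs with cluster values 11, 10, 9, 8, 7, 6, 5, 3, 2, 1, 0 (right-to-left order; the real values depend on the font and the HarfBuzz version). The glyph with cluster 3 then covers two code units, the LAM and the ALEF, and every other glyph covers one, which accounts for all 12 characters.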
On Wed, Oct 31, 2018 at 7:28 AM Laurent CRUAU <laurent.cr...@ingenico.com> wrote:

> Hello there,
>
> I am pretty new to HarfBuzz, but I had not run into any trouble with
> Arabic shaping until recently.
>
> Now I am seeing something weird with a very few Arabic strings (the vast
> majority of them do not cause any problem).
>
> I use HB v1.0.1 on Ubuntu 16, with the regular Arial.ttf from
> msttcorefonts. I also tried HB v2.0.2 on an embedded target and got the
> same issue.
>
> Consider the following UTF-16 string:
>
> "\x8D\xFE" "\xDF\xFE" "\xB4\xFE" "\xE0\xFE" "\x8E\xFE" "\xE1\xFE"
> "\x20\x00" "\xCB\xFE" "\xE0\xFE" "\xF4\xFE" "\xDC\xFE" "\xE2\xFE"
>
> Or the following UTF-8:
>
> "\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";
>
> After shaping, this string is counted as 11 glyphs (i.e. with
> hb_buffer_get_length).
>
> The strange thing is that some Arabic-speaking people have told me that
> VISUALLY we still have 12 glyphs. I can confirm this myself if I paste
> the string into an online UTF-8/16 decoder: I can move through 12
> characters…
>
> Is there some implicit fusion at play there, or some information I should
> grab somewhere to match the visuals?
>
> I did not mention that I played with a lot of HB options to configure
> shaping (hb_buffer_set_flags, hb_buffer_set_unicode_funcs(…get_default()),
> etc…), and I hope I have not forgotten something important.
>
> Cheers,
>
> Laurent
>
> Here is my test snippet:
>
> /*----------------------------------------------------------------------------
>  *
>  * HarfBuzz arabic shaping text
>  *
>  *----------------------------------------------------------------------------*/
>
> #include <stdio.h>
> #include <stdlib.h>   /* exit() */
> #include <string.h>
> #include <wchar.h>
>
> #include <harfbuzz/hb.h>
> #include <harfbuzz/hb-ft.h>
>
> #define ARIAL_TTF ("/usr/share/fonts/truetype/msttcorefonts/Arial.ttf")
>
> #define UTF16_TEST
>
> static const char utf8_content[] =
>   "\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";
>
> /* 12 UTF-16LE code units; the last one is U+FEE2, matching the UTF-8 data */
> static const char utf16le_content[] =
>   "\x8D\xFE" "\xDF\xFE" "\xB4\xFE" "\xE0\xFE" "\x8E\xFE" "\xE1\xFE"
>   "\x20\x00" "\xCB\xFE" "\xE0\xFE" "\xF4\xFE" "\xDC\xFE" "\xE2\xFE"
>   "\x0\x0";
>
> int main( int argc, char** argv )
> {
>   /*data*/
>   hb_font_t*   font;
>   hb_buffer_t* buffer;
>   hb_script_t  script;
>   FT_Library   flib;
>   FT_Face      face;
>   int          found;
>   int          ret;
>
>   /*code*/
>   ret    = -1;
>   font   = NULL;
>   buffer = NULL;
>   found  = 0;
>   script = HB_SCRIPT_INVALID;
>
>   if( FT_Init_FreeType(&flib) )
>   { printf("unable to initialize freetype library\n");
>     goto main_exit;
>   }
>
>   if( FT_New_Face(flib, ARIAL_TTF, 0, &face) )
>   { printf("cannot create face\n");
>     goto main_exit;
>   }
>
>   font = hb_ft_font_create(face, NULL);
>   if( !font )
>   { printf("unable to create font\n");
>     goto main_exit;
>   }
>
>   buffer = hb_buffer_create();
>   if( !buffer )
>   { printf("unable to create buffer\n");
>     goto main_exit;
>   }
>
>   // Assign text segment to buffer and examine its properties
> #ifdef UTF16_TEST
>   // The cast assumes a little-endian host, since the data is UTF-16LE
>   hb_buffer_add_utf16(buffer, (const uint16_t*)utf16le_content, 12, 0, 12);
> #else
>   hb_buffer_add_utf8(buffer, utf8_content, -1, 0, -1);
> #endif
>   hb_buffer_guess_segment_properties(buffer);
>
>   // Get script type of text
>   script = hb_buffer_get_script(buffer); //Do not check here but Arabic script IS detected
>
>   hb_buffer_set_direction(buffer, HB_DIRECTION_RTL);
>   hb_buffer_set_language(buffer, hb_language_from_string("ar", -1));
>
>   hb_shape(font, buffer, NULL, 0);
>   printf("SHAPED !\n");
>
>   // hb_buffer_get_length() returns unsigned int, hence %u
>   printf("got %u characters as a result\n", hb_buffer_get_length(buffer) );
>
>   ret = 0;
>
> main_exit:
>   //test only, free another day
>   exit(ret);
> }

-- 
behdad
http://behdad.org/
_______________________________________________
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/harfbuzz