Hi Richard,

> Basic Arabic shaping, at the level of a typewriter, is straightforward
> enough to leave to a terminal emulator, as Eli has suggested.

What is "basic" Arabic shaping exactly? I can see problems with leaving
it to a terminal. It's not aware of the neighboring character if the
string is cropped. It's not able to separate different UI elements that
happen to be adjacent in the terminal, separated by a different
background color or such.

On the other hand, let's reverse the question: "Basic Arabic shaping, at
the level of a typewriter, is straightforward enough to be implemented
in the application, using presentation form characters, as I suggest".
Could you please point out the problems with this statement?

> I believe combining marks present issues even in implicit modes. In
> implicit mode, one cannot simply delegate the task to normal text
> rendering, for one has to allocate text to cells. There are a number
> of complications that spring to mind:
>
> 1) Some characters decompose to two characters that may otherwise lay
> claim to their own cells:
>
> U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to <06D2,
> 0654>. Do you intend that your scheme be usable by Unicode-compliant
> processes?

Decompose during which step? During shaping? Or do you mean they are
NFC-NFD counterparts of each other? Most terminal emulators are able to
handle combining accents, and of course implicit mode would take them
into account when rearranging the letters. Terminal emulators don't do
explicit (de)composing, a.k.a. NFC->NFD or NFD->NFC conversion (at least
I'm not aware of any that does).

> 4) Indic conjuncts.
> (i) There are some conjuncts, such as Devanagari K.SSA, where a
> display as <KA, VIRAMA>, <SSA> is simply unacceptable. In some
> closely related scripts, this conjunct has the status of a character.

We (in GNOME Terminal / VTE) do have an open bug about Devanagari
spacing marks (currently they don't show up properly), plus Virama and
friends.
I'd like to address the essentials along with the BiDi implementation,
although here we should discuss the design and not a particular
implementation thereof :)

In case you're interested, at
https://bugzilla.gnome.org/show_bug.cgi?id=584160
in comments 45-48, 95 and perhaps a few others, I wondered whether
certain joining operations should be done on the emulation layer or the
display layer. The answer is not yet clear. We can't suddenly fix
everything, but it's nice to move forward step by step. It has also been
proposed that we use HarfBuzz, but it's unclear to me at this point how
the grid alignment could be preserved in that case.

"simply unacceptable": I'm not familiar with those languages, cultures
and so on, but I'd be hesitant to go as far as calling anything
"unacceptable". E.g. there's a physical typewriter in our family; as far
as I remember it has no digits 1 or 0 (use the lowercase letter L and
the letter O in either case instead), and it doesn't contain all the
accented letters of my mother tongue, so sometimes a similar-looking one
has to be used. In today's computer world, I'd say such limitations are
"unacceptable", but at that time this was what we had to live with.

Terminal emulators, due to their strict character grid nature and their
legacy behavior of many decades, are a platform where a certain level of
compromise might be necessary for some scripts. I cannot tell where to
draw the line, cannot tell what is "extremely bad" vs. "not nice" vs.
"kind of okay but could be better", but we can't do everything in a
terminal emulator that a graphical app could do. If someone wants to
have a pixel-perfect look, terminal emulators are not for them. Maybe
looking at typewriters of those scripts could be a good starting point.

Anyway, we've drifted quite far away.

What I've already implemented in VTE (in a work-in-progress branch), and
to my eyes looks quite nice, is Arabic shaping using presentation form
characters as done by FriBidi (in implicit mode only).
According to the API of this library, this shaping process keeps a 1:1
mapping between the original and shaped letters (at least the number of
Unicode codepoints; I haven't double checked their terminal width, but I
really hope they don't mess with us here). That is, I don't have to deal
with a character cell splitting into two, or two character cells joining
into one during shaping.

Does this sound okay so far?

cheers,
egmont
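
P.S. To make the 1:1 property concrete, here is a minimal Python sketch
of typewriter-level shaping with presentation form characters. It is an
illustration under my own assumptions, not FriBidi's or VTE's code:
`build_table`, `_neighbor` and `shape` are hypothetical names, the table
is derived from the compatibility decompositions of the Arabic
Presentation Forms-B block, and a letter's joining behavior is guessed
from which forms exist in that block rather than taken from the real
Unicode joining types. Ligatures such as LAM-ALEF are deliberately
skipped, so every input codepoint yields exactly one output codepoint.

```python
import unicodedata

FORMS = ("isolated", "final", "initial", "medial")


def build_table():
    """Map each base letter to {form name: presentation form character}."""
    table = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only entries like "<initial> 0628" (tag plus ONE codepoint).
        # This skips ligatures such as LAM-ALEF and the spacing tashkeel
        # forms, so shaping never merges or splits characters.
        if len(parts) != 2:
            continue
        form = parts[0].strip("<>")
        if form in FORMS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[form] = chr(cp)
    return table


TABLE = build_table()


def _neighbor(text, i, step):
    """Nearest character in direction `step`, skipping nonspacing marks."""
    i += step
    while 0 <= i < len(text) and unicodedata.category(text[i]) == "Mn":
        i += step
    return text[i] if 0 <= i < len(text) else None


def shape(text):
    """Replace letters by presentation forms; input is in logical order."""
    out = []
    for i, ch in enumerate(text):
        forms = TABLE.get(ch)
        if forms is None:
            out.append(ch)  # marks, digits, Latin, spaces: unchanged
            continue
        prev_forms = TABLE.get(_neighbor(text, i, -1), {})
        next_forms = TABLE.get(_neighbor(text, i, +1), {})
        # Assumption: a letter connects backwards if it has a final form
        # and the previous letter can connect forwards (has an initial form).
        joins_prev = "final" in forms and "initial" in prev_forms
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        out.append(forms.get(form, ch))
    return "".join(out)


word = "\u0628\u0627\u0628"  # BEH, ALEF, BEH
shaped = shape(word)
cropped = shape(word[:1])    # only the first letter is visible
```

Here `shaped` has the same length as `word` (initial BEH, final ALEF,
isolated BEH), while `cropped` comes out as an isolated BEH instead of
the initial form. That is the cropping problem I mentioned above:
whoever does the shaping needs to see the neighboring letters.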