Re: Hyphenation Markup
On Sat, 2 Jun 2018 05:44:29 +0100 Richard Wordingham via Unicode wrote:

> In Latin text, one can indicate permissible line break opportunities
> between grapheme clusters by inserting U+00AD SOFT HYPHEN. What
> low-end schemes, if any, exist for such mark-up within grapheme
> clusters?

It didn't come into existence, but I've found a proposed HTML markup element HYPH that would almost have done the job at http://www.nada.kth.se/i18n/html/hyph.html . The one problem is the old one of displaying a left matra in isolation. Of course, if one has total font control, the PUA could have come to the rescue if HYPH had been adopted and implemented.

Richard.
Re: Hyphenation Markup
On Sun, 3 Jun 2018 04:31:32 +0100 Richard Wordingham via Unicode wrote:

> However, the text is actually in the Tham script, and without any
> line-breaking controls, the first and third examples read, marking the
> grapheme cluster boundaries with '|', as ᨾ᩠ᨿᩮ <U+1A3E TAI THAM LETTER
> MA, U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E
> TAI THAM VOWEL SIGN E> and ᩉ᩠ᩅᩱ <U+1A49 TAI THAM LETTER HIGH HA, U+1A60
> TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL
> SIGN AI>.

What I have marked is the *extended* grapheme cluster boundaries. There is a *legacy* grapheme cluster break before the vowel sign. This may make line-breaking after Indic re-ordering a bit easier.

However, in the Lao language, we have sequences in Tham such as ('|' = legacy grapheme break), and I now fully expect there to be renderings such as: , break,

There seems to be an example about the string hole in the middle line of BAD-13-1-0100 in Figure 5.4 on p222 of Bounleuth's dissertation (http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf), but I'm not confident of my reading of the split word as . Theppitak would be able to confirm or refute, but he doesn't often participate in this forum.

Richard.
Re: Hyphenation Markup
On Sat, 2 Jun 2018 14:33:01 -0600 Doug Ewell via Unicode wrote:

> Richard Wordingham wrote:
>
> >> What about U+200B ZWSP?
> >
> > Thanks for the suggestion, but it's not likely to work:
>
> Are you asking what schemes exist, or are you trying to call
> attention to some rendering engine and/or font that doesn't render a
> combination as it should?

I'm asking what exists, or is reasonably supposed to exist.

> This is too general for me to parse. Can you replace these
> hypotheticals with actual characters, using code points, or at least
> with actual General Categories? For example, an 'Mc' followed by ZWSP
> followed by an 'Lo' displays like such-and-so. The code points would
> be best.

On Sun, 3 Jun 2018 09:26:40 +0900 "Martin J. Dürst via Unicode" wrote:

> My question goes a bit further than to Doug's: Why would you want to
> do such a thing? Are there actual scripts/languages where line breaks
> within grapheme clusters occur? If yes, what are they? Can you show
> actual examples, e.g. scans of documents,...?

Three examples are given on p230 of the dissertation "Buddhist Monks and their Search for Knowledge: an examination of the personal collection of manuscripts of Phra Khamchan Virachitto (1920-2007), Abbot of Vat Saen Sukharam, Luang Prabang" by Bounleuth Sengsoulin, available at http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf .

The text is in Lao in the Tham script. The transcriptions in the text are transliterated to the Lao script. The first example, transliterated to Lao, is ເມຽ, which one could encode as , provided the soft hyphen had no visual representation beyond the line break. (Strictly, it's a break for a hole for a string.) The third example is likewise ໄຫວ . (I can't make out the second example.)

However, the text is actually in the Tham script, and without any line-breaking controls, the first and third examples read, marking the grapheme cluster boundaries with '|', as ᨾ᩠ᨿᩮ <U+1A3E TAI THAM LETTER MA, U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E TAI THAM VOWEL SIGN E> and ᩉ᩠ᩅᩱ <U+1A49 TAI THAM LETTER HIGH HA, U+1A60 TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL SIGN AI>.

The internal grapheme cluster boundaries are purely stopping points for cursor movement; they correspond to nothing graphical and to nothing in user conception. The natural internal boundaries are just before the vowels, which are written on the left, and between the base and subscript characters, i.e. before U+1A60.

There seem to be Northern Thai Pali examples in the proposal L2/2007-007 at the end of https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf Figure 9a Page 2 Line 3, and at the end of Figure 9b Page 1 Line 2, but I can't read the Pali well enough to be sure that the apparent visually line-final instances of TAI THAM VOWEL SIGN E are not just scribal blunders.

Reverting to Doug's reply:

> > Incidentally, does CLDR define the rendering of soft hyphen, or is
> > one entirely at the mercy of the application?
>
> Why would this be a CLDR thing?

Because the rendering is quite likely to depend on locale. I had always understood that Thai did not mark breaks in words - and then I discovered them in the Royal Institute Dictionary! The correct German rendering of soft hyphens has recently changed. There are also subtle effects when Dutch words are hyphenated. These rules are not the same as for English, but Unicode tends not to deal in dependencies finer than a script.

Richard.
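
Doug's request for code points and General Categories can be answered mechanically. The following is a minimal Python sketch, not part of the original message: the helper name describe and the constants FIRST and THIRD are hypothetical, and the escapes are an editorial reading of the two Tham strings quoted above. It only lists the backing-store sequence; it says nothing about how any renderer breaks or displays it.

```python
import unicodedata


def describe(text):
    """Return (code point, General Category, name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Assumed backing-store sequences for the first and third examples,
# written as escapes so the listing does not depend on font support.
FIRST = "\u1A3E\u1A60\u1A3F\u1A6E"
THIRD = "\u1A49\u1A60\u1A45\u1A71"

if __name__ == "__main__":
    for label, text in (("first example", FIRST), ("third example", THIRD)):
        print(label)
        for code_point, category, name in describe(text):
            print(f"  {code_point}  {category}  {name}")
```

The names and categories come from the Unicode database bundled with the Python build in use, so the listing reflects that Unicode version.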
Re: Hyphenation Markup
Hello Richard,

On 2018/06/02 20:37, Richard Wordingham via Unicode wrote:

> > On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:
> > > In Latin text, one can indicate permissible line break opportunities
> > > between grapheme clusters by inserting U+00AD SOFT HYPHEN. What
> > > low-end schemes, if any, exist for such mark-up within grapheme
> > > clusters?
>
> 1) In the sequence realisation of the break should definitely result
> in on one line and in on the next line, whereas in visual order,
> character-2 should precede character-1.

My question goes a bit further than to Doug's: Why would you want to do such a thing? Are there actual scripts/languages where line breaks within grapheme clusters occur? If yes, what are they? Can you show actual examples, e.g. scans of documents,...?

In writing systems, there are almost always exceptions to simple rules, but in general, breaking a line *within* a grapheme cluster seems to be a bad idea.

Regards, Martin.
Re: Hyphenation Markup
Richard Wordingham wrote:

> > What about U+200B ZWSP?
>
> Thanks for the suggestion, but it's not likely to work:

Are you asking what schemes exist, or are you trying to call attention to some rendering engine and/or font that doesn't render a combination as it should?

> 1) In the sequence realisation of the break should definitely result
> in on one line and in on the next line, whereas in visual order,
> character-2 should precede character-1.

This is too general for me to parse. Can you replace these hypotheticals with actual characters, using code points, or at least with actual General Categories? For example, an 'Mc' followed by ZWSP followed by an 'Lo' displays like such-and-so. The code points would be best.

> Incidentally, does CLDR define the rendering of soft hyphen, or is
> one entirely at the mercy of the application?

Why would this be a CLDR thing?

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Hyphenation Markup
On Sat, 2 Jun 2018 11:06:43 +0200 Otto Stolz via Unicode wrote:

> On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:
> > In Latin text, one can indicate permissible line break opportunities
> > between grapheme clusters by inserting U+00AD SOFT HYPHEN. What
> > low-end schemes, if any, exist for such mark-up within grapheme
> > clusters?
>
> What about U+200B ZWSP?
> > this character is intended for invisible word
> > separation and for line break control; it has no
> > width, but its presence between two characters
> > does not prevent increased letter spacing in
> > justification

Thanks for the suggestion, but it's not likely to work:

Within a word and with a proper layout implementation, using ZWSP would be worse than using backing store .

1) In the sequence realisation of the break should definitely result in on one line and in on the next line, whereas in visual order, character-2 should precede character-1.

2) Use of ZWSP will usually result in a dotted circle even when the break does not occur.

3) ZWSP will result in a mandatory word boundary. That will cause problems with the spell checker.

I've experimented (http://wrdingham.co.uk/lanna/renderer_test.htm#test_and_tell) with the combination where there is a default grapheme cluster boundary between the two characters. I get generally better results with SHY than ZWSP. The downside was that the rendering systems I tried seemed to insist on inserting the glyph of U+002D or U+2010, rather than the glyph of U+00AD.

Incidentally, does CLDR define the rendering of soft hyphen, or is one entirely at the mercy of the application?

Richard.
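
The spell-checker concern in point 3 can be illustrated at the plain-text level. This is a minimal Python sketch under stated assumptions, not something from the thread: the helpers mark_breaks and strip_breaks are hypothetical names, and the break offsets are supplied by hand. It shows only that SHY can be inserted at chosen code point offsets and removed again to recover the original word before dictionary lookup; it makes no claim about how a layout engine renders either SHY or ZWSP.

```python
SHY = "\u00AD"   # SOFT HYPHEN
ZWSP = "\u200B"  # ZERO WIDTH SPACE


def mark_breaks(word, offsets, mark=SHY):
    """Insert mark before each code point offset listed in offsets.

    Offset 0 is ignored, since a break opportunity before the first
    character of a word is not meaningful here.
    """
    wanted = set(offsets)
    pieces = []
    for index, ch in enumerate(word):
        if index in wanted and index > 0:
            pieces.append(mark)
        pieces.append(ch)
    return "".join(pieces)


def strip_breaks(text, mark=SHY):
    """Remove every occurrence of mark, e.g. before handing text to a spell checker."""
    return text.replace(mark, "")


if __name__ == "__main__":
    marked = mark_breaks("hyphenation", [2, 4])
    print([f"U+{ord(ch):04X}" for ch in marked])
    print(strip_breaks(marked) == "hyphenation")
```

The same helper accepts an offset that falls inside a grapheme cluster, for example mark_breaks("\u1A3E\u1A60\u1A3F\u1A6E", [3]), which is the case the thread is asking about; whether the result displays acceptably is exactly the open question.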
Re: Hyphenation Markup
On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:

> In Latin text, one can indicate permissible line break opportunities
> between grapheme clusters by inserting U+00AD SOFT HYPHEN. What
> low-end schemes, if any, exist for such mark-up within grapheme
> clusters?

What about U+200B ZWSP?

> this character is intended for invisible word
> separation and for line break control; it has no
> width, but its presence between two characters
> does not prevent increased letter spacing in
> justification

Best wishes,
Otto Stolz