Re: Hyphenation Markup

2018-06-07 Thread Richard Wordingham via Unicode
On Sat, 2 Jun 2018 05:44:29 +0100
Richard Wordingham via Unicode  wrote:

> In Latin text, one can indicate permissible line break opportunities
> between grapheme clusters by inserting U+00AD SOFT HYPHEN.  What
> low-end schemes, if any, exist for such mark-up within grapheme
> clusters?

It never came into existence, but I've found a proposal for an HTML
markup element, HYPH, that would almost have done the job; see
http://www.nada.kth.se/i18n/html/hyph.html .  The one remaining problem
is the old one of displaying a left matra in isolation.  Of course, if
one has total font control, the PUA could have come to the rescue if
HYPH had been adopted and implemented.

Richard.


Re: Hyphenation Markup

2018-06-03 Thread Richard Wordingham via Unicode
On Sun, 3 Jun 2018 04:31:32 +0100
Richard Wordingham via Unicode  wrote:

> However, the text is actually in the Tham script, and without any
> line-breaking controls, the first and third examples read, marking the
> grapheme cluster boundaries with '|', as ᨾ᩠ᨿᩮ <U+1A3E TAI THAM LETTER
> MA, U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E
> TAI THAM VOWEL SIGN E> and ᩉ᩠ᩅᩱ <U+1A49 TAI THAM LETTER HIGH HA, U+1A60
> TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL
> SIGN AI>.

What I have marked is the *extended* grapheme cluster boundaries.
There is a *legacy* grapheme cluster break before the vowel sign.  This
may make line-breaking after Indic re-ordering a bit easier.  However,
in the Lao language, we have sequences in Tham such as  ('|' = legacy grapheme break), and I now fully
expect there to be renderings such as:

, break, 

There seems to be an example about the string hole in the middle line
of BAD-13-1-0100 in Figure 5.4 on p222 of Bounleuth's dissertation
(http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf),
but I'm not confident of my reading of the split word as .

Theppitak would be able to confirm or refute, but he doesn't often
participate in this forum.

Richard.



Re: Hyphenation Markup

2018-06-02 Thread Richard Wordingham via Unicode
On Sat, 2 Jun 2018 14:33:01 -0600
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> >> What about U+200B ZWSP?  
> >
> > Thanks for the suggestion, but it's not likely to work:  
> 
> Are you asking what schemes exist, or are you trying to call
> attention to some rendering engine and/or font that doesn't render a
> combination as it should?

I'm asking what exists, or is reasonably supposed to exist. 

> This is too general for me to parse. Can you replace these
> hypotheticals with actual characters, using code points, or at least
> with actual General Categories? For example, an 'Mc' followed by ZWSP
> followed by an 'Lo' displays like such-and-so. The code points would
> be best.

On Sun, 3 Jun 2018 09:26:40 +0900
"Martin J. Dürst via Unicode"  wrote:
> My question goes a bit further than to Doug's: Why would you want to
> do such a thing? Are there actual scripts/languages where line breaks 
> within grapheme clusters occur? If yes, what are they? Can you show 
> actual examples, e.g. scans of documents,...?

Three examples are given on p230 of the dissertation "Buddhist Monks and
their Search for Knowledge: an examination of the personal collection of
manuscripts of Phra Khamchan Virachitto (1920-2007), Abbot of Vat Saen
Sukharam, Luang Prabang" by Bounleuth Sengsoulin, available at
http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf .
The text is in Lao in the Tham script.  The transcriptions in the
text are transliterated to the Lao script.

The first example, transliterated to Lao, is ເມຽ,  which one could
encode as , provided the soft hyphen had
no visual representation beyond the line break.  (Strictly, it's a
break for a hole for a string.)  The third example is likewise ໄຫວ
. (I can't make out the second example.)
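However, the text is actually in the Tham script, and without any
line-breaking controls, the first and third examples read, marking the
grapheme cluster boundaries with '|', as ᨾ᩠ᨿᩮ <U+1A3E TAI THAM LETTER
MA, U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E
TAI THAM VOWEL SIGN E> and ᩉ᩠ᩅᩱ <U+1A49 TAI THAM LETTER HIGH HA, U+1A60
TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL
SIGN AI>.  The internal grapheme cluster boundaries are purely stopping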
points for cursor movement; they correspond to nothing graphical
and to nothing in user conception.  The natural internal boundaries are
just before the vowels, which are written on the left, and between the
base and subscript characters, i.e. before U+1A60.
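
For readers who want the code points and General Categories spelled out,
the following is a minimal sketch (not part of the original message) that
lists them for the two Tham strings above using Python's standard
unicodedata module.  The helper name describe() is hypothetical, and the
sketch does not compute grapheme cluster boundaries; it only names the
characters in backing-store order.

```python
import unicodedata

# The two Tai Tham words discussed above, written as escapes so the
# backing-store order is unambiguous.
WORDS = ["\u1A3E\u1A60\u1A3F\u1A6E", "\u1A49\u1A60\u1A45\u1A71"]


def describe(text):
    """Return (code point label, General Category, name) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for word in WORDS:
    for label, category, name in describe(word):
        print(label, category, name)
    print()
```

The output depends only on the Unicode Character Database bundled with
the Python build in use.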

There seem to be Northern Thai Pali examples in the proposal
L2/07-007R (https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf):
at the end of Figure 9a Page 2 Line 3, and at the end of Figure 9b
Page 1 Line 2.  However, I can't read the Pali well enough to be sure
that the apparent visually line-final instances of TAI THAM VOWEL SIGN
E are not just scribal blunders.

Reverting to Doug's reply:
> > Incidentally, does CLDR define the rendering of soft hyphen, or is
> > one entirely at the mercy of the application?  

> Why would this be a CLDR thing?

Because the rendering is quite likely to depend on locale.  I had
always understood that Thai did not mark breaks in words - and then I
discovered them in the Royal Institute Dictionary!  The correct German
rendering of soft hyphens has recently changed.  There are also subtle
effects when Dutch words are hyphenated.  These rules are not the same
as for English, but Unicode tends not to deal in dependencies finer
than a script.

Richard.



Re: Hyphenation Markup

2018-06-02 Thread Martin J. Dürst via Unicode

Hello Richard,

On 2018/06/02 20:37, Richard Wordingham via Unicode wrote:


> > On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:
> > > In Latin text, one can indicate permissible line break opportunities
> > > between grapheme clusters by inserting U+00AD SOFT HYPHEN.  What
> > > low-end schemes, if any, exist for such mark-up within grapheme
> > > clusters?
>
> 1) In the sequence
>
> realisation of the break should definitely result in  on one line and in  on the next
> line, whereas in visual order, character-2 should precede character-1.


My question goes a bit further than to Doug's: Why would you want to do 
such a thing? Are there actual scripts/languages where line breaks 
within grapheme clusters occur? If yes, what are they? Can you show 
actual examples, e.g. scans of documents,...?


In writing systems, there are almost always exceptions to simple rules, 
but in general, breaking a line *within* a grapheme cluster seems to be 
a bad idea.


Regards,   Martin.


Re: Hyphenation Markup

2018-06-02 Thread Doug Ewell via Unicode

Richard Wordingham wrote:


> > What about U+200B ZWSP?
>
> Thanks for the suggestion, but it's not likely to work:

Are you asking what schemes exist, or are you trying to call attention 
to some rendering engine and/or font that doesn't render a combination 
as it should?

> 1) In the sequence
>
> realisation of the break should definitely result in  on one line and in  on the next
> line, whereas in visual order, character-2 should precede character-1.

This is too general for me to parse. Can you replace these hypotheticals 
with actual characters, using code points, or at least with actual 
General Categories? For example, an 'Mc' followed by ZWSP followed by an 
'Lo' displays like such-and-so. The code points would be best.

> Incidentally, does CLDR define the rendering of soft hyphen, or is one
> entirely at the mercy of the application?


Why would this be a CLDR thing?

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Hyphenation Markup

2018-06-02 Thread Richard Wordingham via Unicode
On Sat, 2 Jun 2018 11:06:43 +0200
Otto Stolz via Unicode  wrote:

> On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:
> > In Latin text, one can indicate permissible line break opportunities
> > between grapheme clusters by inserting U+00AD SOFT HYPHEN.  What
> > low-end schemes, if any, exist for such mark-up within grapheme
> > clusters?  
> 
> What about U+200B ZWSP?

> >  this character is intended for invisible word
> > separation and for line break control; it has no
> > width, but its presence between two characters
> > does not prevent increased letter spacing in
> > justification  

Thanks for the suggestion, but it's not likely to work:

Within a word and with a proper layout implementation, using ZWSP
would be worse than using backing store .

1) In the sequence



realisation of the break should definitely result in  on one line and in  on the next
line, whereas in visual order, character-2 should precede character-1. 

2) Use of ZWSP will usually result in a dotted circle even when the break does 
not occur.

3) ZWSP will result in a mandatory word boundary.  That will cause
problems with the spell checker.

I've experimented
(http://wrdingham.co.uk/lanna/renderer_test.htm#test_and_tell) with the
combination  where there is a default grapheme
cluster boundary between the two characters.  I get generally better
results with SHY than ZWSP.  The downside was that the rendering
systems I tried seemed to insist on inserting the glyph of U+002D or
U+2010, rather than the glyph of U+00AD.
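
As a rough illustration of the two kinds of mark-up being compared (a
sketch only, not the code behind the test page), the snippet below
inserts either U+00AD or U+200B at a default grapheme cluster boundary
in the backing store and strips both again, as one might before passing
a word to a spell checker.  The helper names mark_break() and
strip_break_controls() and the chosen position are assumptions made for
the example; how the marked string renders is up to the layout engine
and font.

```python
import unicodedata

SHY = "\u00AD"   # SOFT HYPHEN
ZWSP = "\u200B"  # ZERO WIDTH SPACE


def mark_break(text, index, control=SHY):
    """Insert a break-opportunity control before text[index]."""
    return text[:index] + control + text[index:]


def strip_break_controls(text):
    """Remove SHY and ZWSP so the word can be compared or spell-checked."""
    return text.replace(SHY, "").replace(ZWSP, "")


# MA, SAKOT, LOW YA, VOWEL SIGN E; index 2 is the position after SAKOT,
# chosen here purely for illustration.
word = "\u1A3E\u1A60\u1A3F\u1A6E"

for control in (SHY, ZWSP):
    marked = mark_break(word, 2, control)
    print(unicodedata.name(control), ["U+%04X" % ord(ch) for ch in marked])
    # Stripping the control recovers the original backing store.
    assert strip_break_controls(marked) == word
```

Keeping the control out of the string handed to the spell checker is one
way round point 3, but it does nothing for points 1 and 2, which concern
rendering.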

Incidentally, does CLDR define the rendering of soft hyphen, or is one
entirely at the mercy of the application?

Richard.


Re: Hyphenation Markup

2018-06-02 Thread Otto Stolz via Unicode

On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:

> In Latin text, one can indicate permissible line break opportunities
> between grapheme clusters by inserting U+00AD SOFT HYPHEN.  What
> low-end schemes, if any, exist for such mark-up within grapheme
> clusters?

What about U+200B ZWSP?

>  this character is intended for invisible word
> separation and for line break control; it has no
> width, but its presence between two characters
> does not prevent increased letter spacing in
> justification


Best wishes,
  Otto Stolz