Re: [HarfBuzz] Fwd: Harfbuzz with linebreaking

Martin Hosken Tue, 14 Jun 2016 19:30:06 -0700

Dear Kelvin,

>     → [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p] [r] [i] [m]
> [a] [r] [y] [ ] [o] [ffi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s]
> [h] [v] [i] [l] [l] [e] [ ] [(]
> 
>       [T] [X] [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←
> 
>     → [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a]
> [ ] [p] [u] [b] [l] [i] [c] [i] [st] [.]
>


[snip]

> I build my second line starting from the [fi] glyph, but I find that the
> last glyph in this run, the [(] , still leaves extra space at the end of
> the line. The line
> 
>     → [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l]
> [l] [e] [ ] [(]
> 
> is only 38 points long, leaving 12 points of space. So I go on to the next
> run, the RTL segment. I find that 12 points of space is only enough to fit
> the glyphs
> 
>       [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←
> 
> Again, clustering tells me this corresponds to index 55 in the original
> string ('Tree Paine’s primary office is in Nashville (REVERSED TE'), and
> the last breakpoint less than index 55 is the whitespace breakpoint at
> index 53. So I shape the RTL string 'REVERSED ' and add it to the 'fice is
> in Nashville (' I had from before to get line 2:
> 
>       <line 2> :  { [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s]
> [h] [v] [i] [l] [l] [e] [ ] [(] }  { [D] [E] [S] [R] [E] [V] [E] [R] }
> { [ ] }

I don't see why the final space needs to be in its own run here. It is part of
a single-direction RTL run and can stay part of it. There is no need to rerun
any bidi at this stage of the proceedings. Having said that, a space generally
needs special handling at the end of a line: in effect, cut it out, since it
is part of neither the line being broken nor the following line. This is true
whether or not you have bidi going on. And there are lots of spaces in
Unicode, as I'm sure you are aware.
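
As a rough illustration of cutting trailing spaces out of the line, here is a
minimal Python sketch. The helper names are hypothetical, and the choice of
which characters count as a trimmable space (Unicode category Zs, minus the
no-break spaces) is an assumption made for this example, not a rule taken from
HarfBuzz or ICU.

```python
import unicodedata

# Assumption: no-break spaces act as glue rather than break opportunities,
# so they are kept on the line instead of being trimmed.
NO_BREAK_SPACES = {"\u00A0", "\u2007", "\u202F"}

def is_trimmable_space(ch):
    """True for space separators (category Zs) that may be dropped at line end."""
    return unicodedata.category(ch) == "Zs" and ch not in NO_BREAK_SPACES

def visible_end(text, start, end):
    """Return the index just past the last non-space character of text[start:end].

    Indices are in logical (string) order, so the same call works for LTR and
    RTL runs; only text[start:visible_end] should be measured against the
    line width.
    """
    while end > start and is_trimmable_space(text[end - 1]):
        end -= 1
    return end

# The RTL run from the example, with an extra EM SPACE to show that more
# than U+0020 is handled.
run = "REVERSED \u2003TEXT"
line_end = visible_end(run, 0, 10)   # break opportunity after the spaces
```

The characters between `line_end` and the break opportunity belong to neither
line, so the next line simply starts at the break opportunity.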

> second string go into line 4. So the end result is:
> 
> 
>       <line 1> : { [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p]
> [r] [i] [m] [a] [r] [y] [ ] [o] [f] [-] }
> 
>       <line 2> : { [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h]
> [v] [i] [l] [l] [e] [ ] [(] }  { [D] [E] [S] [R] [E] [V] [E] [R] } { [ ] }
> 
>       <line 3> : { [T] [X] [E] [T] }  { [)] [.] [ ] [S] [h] [e] [ ] [w] [o]
> [r] [k] [s] [ ] [a] [s] [ ] [a] [ ] [p] [u] [b] [-] }
> 
>       <line 4> : { [l] [i] [c] [i] [st] [.] }
> 
> Does this make sense?

Yes. Notice that you only had to reshape twice per line. In the bad case where
inserting a hyphen makes the shaping result longer than the line, you would
need to back up and try again, which is in effect the cost of another line.
The costly bit comes with a long paragraph, where you reshape the 'rest of
the paragraph' for each line. I would suggest that you don't need to
reshape if the start of the next line is in a different cluster to the end of 
the previous line. There are cases where you may need to do some positional 
tidying (deciding where the new 0 is in the line), but you can't ligate across 
a cluster boundary (by definition in OT). Equally, you should be able to save 
reshaping for the end of a line if there is no text added and you break on a 
cluster boundary. These are important optimisations (which I will probably get 
yelled at for suggesting, but it would be interesting to hear the use cases 
where my presuppositions fall down), because you really don't want to have to 
reshape a long paragraph n times, especially when most of the time you will 
break at a space.
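
The reshaping shortcut suggested above could be sketched as follows. This is
only an illustration of the heuristic as stated in this message: the `Glyph`
record and the function names are hypothetical, the advances are made-up
numbers, and the rule itself rests on the presupposition that nothing in the
font reaches across a cluster boundary.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Glyph:
    name: str       # glyph label, for illustration only
    cluster: int    # index of the first source character in the glyph's cluster
    advance: float  # horizontal advance in points (made-up values below)

def is_cluster_boundary(glyphs: List[Glyph], index: int, text_len: int) -> bool:
    """True if a break before character `index` falls between two clusters."""
    if index <= 0 or index >= text_len:
        return True
    return any(g.cluster == index for g in glyphs)

def needs_reshape(glyphs: List[Glyph], break_index: int, text_len: int,
                  inserts_text: bool) -> bool:
    """Reshape only if text is added at the break (e.g. a hyphen) or the
    break would split a cluster; otherwise reuse the existing glyphs."""
    return inserts_text or not is_cluster_boundary(glyphs, break_index, text_len)

def width_before(glyphs: List[Glyph], break_index: int) -> float:
    """Sum the advances of glyphs whose cluster starts before the break."""
    return sum(g.advance for g in glyphs if g.cluster < break_index)

# "office is", shaped with an [ffi] ligature covering characters 1..3.
glyphs = [Glyph("o", 0, 5.0), Glyph("ffi", 1, 8.0), Glyph("c", 4, 4.0),
          Glyph("e", 5, 4.0), Glyph("space", 6, 2.5), Glyph("i", 7, 2.0),
          Glyph("s", 8, 4.0)]

split_ligature = needs_reshape(glyphs, 3, 9, inserts_text=True)   # "of-" + "fice"
break_at_space = needs_reshape(glyphs, 7, 9, inserts_text=False)  # after "office "
```

Breaking inside the [ffi] cluster forces a reshape, while the break after the
space reuses the glyphs already computed, and `width_before(glyphs, 7)` gives
the line width without another shaping call.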

Of course this all presumes you have a supporting engine that tells you line 
break opportunities for all the languages of the world, including hyphenation 
dictionaries. ICU may be sufficient for your needs, but I do encourage you, and 
everyone, to allow the addition of extra languages to your application beyond 
those you compile for.

I notice you say you want a line breaking algorithm that is very clear to the
user, and so are going purely line by line, earliest break first. I would
suggest that, for greatest clarity, you not do hyphenation. All systems try to
avoid hyphenation unless they have to (that is, they can't find a break within
a certain distance of the end of the line); otherwise you may find you are
hyphenating every line. I would give the user the option of turning
hyphenation on and off and of setting a hyphenation zone (or maximum
raggedness). This doesn't impinge on your single-line breaking algorithm; it
just tries to reduce the likelihood of hyphens turning up. And, as you have
shown, hyphenation is costly in terms of reshaping.
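
One way to read the hyphenation zone idea, sketched in Python. The `Break`
record, the `choose_break` function and its parameters are hypothetical names
for this example; it assumes the candidate breaks for the current line, and
the width each would produce, have already been worked out.

```python
from typing import NamedTuple, Optional, Sequence

class Break(NamedTuple):
    index: int     # character index of the break opportunity
    width: float   # line width if broken here (including any hyphen glyph)
    hyphen: bool   # True if breaking here inserts a hyphen

def choose_break(breaks: Sequence[Break], line_width: float,
                 hyphenate: bool = True, zone: float = 0.0) -> Optional[Break]:
    """Pick a break for one line, preferring ordinary breaks.

    A hyphenation break is used only when hyphenation is switched on and the
    best ordinary break would leave more than `zone` points unused.
    """
    fitting = [b for b in breaks if b.width <= line_width]
    plain = [b for b in fitting if not b.hyphen]
    best_plain = max(plain, key=lambda b: b.width, default=None)

    # An ordinary break close enough to the margin wins outright.
    if best_plain is not None and line_width - best_plain.width <= zone:
        return best_plain

    if hyphenate:
        hyphenated = [b for b in fitting if b.hyphen]
        best_hyph = max(hyphenated, key=lambda b: b.width, default=None)
        if best_hyph is not None and (best_plain is None
                                      or best_hyph.width > best_plain.width):
            return best_hyph
    return best_plain

# A 50 point line: a space break at 40 points, a hyphen break at 49 points.
candidates = [Break(20, 40.0, False), Break(23, 49.0, True)]
wide_zone = choose_break(candidates, 50.0, zone=12.0)    # takes the space break
narrow_zone = choose_break(candidates, 50.0, zone=5.0)   # takes the hyphen break
```

With hyphenation switched off, or a generous zone, the expensive hyphen path
(and its extra reshaping) is never taken.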

Yours,
Martin
_______________________________________________
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/harfbuzz
