Be careful with the various TRs: UTR14 does not deal with character (or rather, grapheme) or word boundaries; that is UTR29. Actually, we don't use the latter. Our line breaking should probably be done the following way (this implements the "naive" paragraph filling strategy):

  loop
    calculate line width if next character is added
    check for a line breaking opportunity before the next character
    if there is an opportunity
      if the line is not full
        discard the last saved opportunity and save this one
      else
        try hyphenation on the string accumulated since the last
          break opportunity (if enabled), save returned opportunity if any
        return saved line breaking opportunity
      end if
    end if
  end loop
Hyphenation of a string:

  loop
    skip non-word characters (for this hyphenator)
    word = continuous run of word characters (for this hyphenator)
    if the end of the word is past the end of the line
      try hyphenating the word, generate new break opportunities
      return best fitting line break opportunity or null
    end if
  end loop
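The two loops above can be sketched in Python as follows. This is only one reading of the pseudocode, under loud assumptions: every character has width 1, trailing spaces take no room, break opportunities exist only after spaces (a stand-in for a real UTR14 implementation), and `hyphen_points` is a hypothetical callback returning the offsets inside a word where a hyphen may be inserted. The emergency break in `fill` is likewise just one possible policy for a line with no opportunity at all.

```python
from typing import Callable, List, Optional, Tuple

Break = Tuple[int, bool]  # (index where the next line starts, hyphen needed?)
Hyphenator = Callable[[str], List[int]]  # hypothetical: word -> hyphen offsets


def is_break_opportunity(text: str, i: int) -> bool:
    # Stand-in for UTR14 (assumption): a break is allowed only after spaces.
    return 0 < i < len(text) and text[i - 1] == " " and text[i] != " "


def hyphenate_segment(text: str, line_start: int, seg_start: int, seg_end: int,
                      max_width: int, hyphen_points: Hyphenator) -> Optional[Break]:
    i = seg_start
    while i < seg_end:
        while i < seg_end and not text[i].isalpha():   # skip non-word characters
            i += 1
        j = i
        while j < seg_end and text[j].isalpha():       # word = text[i:j]
            j += 1
        if j > i and j - line_start > max_width:       # word ends past the line end
            # Keep the hyphen points that still fit once the hyphen is added.
            fitting = [i + k for k in hyphen_points(text[i:j])
                       if 0 < k < j - i and i + k - line_start + 1 <= max_width]
            return (max(fitting), True) if fitting else None
        i = j
    return None


def find_break(text: str, line_start: int, max_width: int,
               hyphen_points: Optional[Hyphenator] = None) -> Optional[Break]:
    saved: Optional[Break] = None
    for i in range(line_start + 1, len(text) + 1):
        at_end = i == len(text)                        # end of text ends the line too
        if not (at_end or is_break_opportunity(text, i)):
            continue
        if len(text[line_start:i].rstrip(" ")) <= max_width:
            saved = (i, False)                         # line not full: remember this one
            if at_end:
                return saved
        else:
            if hyphen_points is not None:
                seg_start = saved[0] if saved else line_start
                hyph = hyphenate_segment(text, line_start, seg_start, i,
                                         max_width, hyphen_points)
                if hyph is not None:
                    saved = hyph
            return saved                               # None is the degenerate case
    return saved


def fill(text: str, max_width: int,
         hyphen_points: Optional[Hyphenator] = None) -> List[str]:
    lines, start = [], 0
    while start < len(text):
        brk = find_break(text, start, max_width, hyphen_points)
        if brk is None:
            # No opportunity found: force a break at the line width (assumed policy).
            brk = (min(start + max_width, len(text)), False)
        end, hyphen = brk
        lines.append(text[start:end].rstrip(" ") + ("-" if hyphen else ""))
        start = end
    return lines


# Toy hyphenator with a single made-up dictionary entry.
toy = lambda w: {"hyphenation": [2, 6, 7]}.get(w, [])
print(fill("the quick brown fox", 10))        # ['the quick', 'brown fox']
print(fill("a hyphenation test", 10, toy))    # ['a hyphena-', 'tion test']
```

Note that the hyphenator is only consulted once a line has actually overflowed, so most lines never pay for it.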
There is also the degenerate case in which the line overflows and no line break opportunity is discovered at all. The TeX paragraph filling strategy has to detect line break opportunities in the same way, but it is cleverer about selecting which opportunities become actual line breaks. We could do that too.
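To illustrate what "selecting the opportunities more cleverly" can mean, here is a minimal sketch that chooses breaks for a whole paragraph at once by minimising the total squared slack of all lines but the last. It is not TeX's actual algorithm (no glue, penalties or hyphenation), only a simplified dynamic program in the same spirit. It assumes fixed-width characters and breaks only between words, and the penalty for an overfull single word is an arbitrary choice.

```python
from typing import List


def optimal_fill(words: List[str], max_width: int) -> List[str]:
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: first word of the line after words[i:...]
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        width = -1                   # cancels the space counted before the first word
        for j in range(i + 1, n + 1):
            width += 1 + len(words[j - 1])
            if width > max_width and j > i + 1:
                break                # a line of several words must fit
            if width > max_width:
                # Degenerate case: one word wider than the line (assumed penalty).
                cost = 1000.0 * (width - max_width) ** 2
            elif j == n:
                cost = 0.0           # the last line may be short for free
            else:
                cost = float((max_width - width) ** 2)
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


# A greedy filler would produce 'aaa bb' / 'cc' / 'ddddd' here.
print(optimal_fill("aaa bb cc ddddd".split(), 6))   # ['aaa', 'bb cc', 'ddddd']
```

The set of break opportunities is the same as for the naive strategy; only the choice among them differs.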
In my own thinking about the process of line-breaking, I have always assumed that a (possibly recursive) block of text is a fixed resource; a superset of the fixed resource that is a single glyph/grapheme with given font attributes. As such, it should be processed by a separate co-routine (to use the language of the Rec). All of the information about the hierarchy of potential break positions is determined by the text itself.
As a first cut, I would determine all potential breaks, along with information relevant to later line-height calculations, at the time a block is first prepared for layout. The co-routine (thread, whatever) that is grooming the text would then respond to enquiries about line-area possibilities, and eventually return contents for line-areas of particular dimensions. All of this is tentative, and all of the calculated information about the block would have to be held until the layout of the block is finalised.
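As a rough illustration of that idea, the sketch below uses a Python generator as the co-routine: the break positions are computed once when the block is prepared, and the layout code then asks for one line at a time, supplying the available width with each request. The names `prepare_breaks` and `line_server` are hypothetical. The sketch assumes fixed-width characters and breaks only after spaces, and it leaves out the line-height information and any backtracking.

```python
from bisect import bisect_right
from typing import Generator, List


def prepare_breaks(text: str) -> List[int]:
    # Computed once per block: every position where a new line may start,
    # plus the end of the text (assumption: breaks only after spaces).
    inner = [i for i in range(1, len(text)) if text[i - 1] == " " and text[i] != " "]
    return inner + [len(text)]


def line_server(text: str) -> Generator[str, int, None]:
    breaks = prepare_breaks(text)    # held for as long as the generator lives
    start = 0
    width = yield ""                 # priming yield; the first send() supplies a width
    while start < len(text):
        first = bisect_right(breaks, start)
        end = breaks[first]          # always take one segment, even if it overflows
        for b in breaks[first + 1:]:
            if len(text[start:b].rstrip(" ")) > width:
                break
            end = b
        line = text[start:end].rstrip(" ")
        start = end
        width = yield line           # hand back the line, wait for the next width


server = line_server("the quick brown fox jumps")
next(server)                         # run up to the priming yield
print(server.send(10))               # the quick
print(server.send(5))                # brown
print(server.send(20))               # fox jumps
```

Because each request carries its own width, the same prepared block can feed line-areas of different dimensions, for example when text flows around a float or into a new column.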
What "finalised" means depends on the complexity of the layout strategies employed, but at a minimum, the information must be retained until the last page containing text from the block, and the subsequent page (if any), have been laid out, to allow for backtracking during last-page processing.
Peter
--
Peter B. West <http://www.powerup.com.au/~pbwest/resume.html>