Line breaking and Hyphenation

Joerg Pietschmann Wed, 03 Jul 2002 01:37:22 -0700

Hello all,
I tracked down the bugs 10374, 2106 and 6042. The last
bug was caused by a simple, easy to fix mistake in the
hyphenation framework. The bug 10374 is unfortunately
a duplicate of 2106, not 6042, and a bit more interesting.
It is caused by the parser delivering character references
as a separate character chunk, thereby creating multiple
FOText children of the block (FObjMixed) for consecutive
text. This interferes badly with line breaking and
hyphenation. Take
  e&#x78;tensible
with room up to the "l" on the line.
This is split into three FOText objects
  e &#x78; tensible
The text is delivered separately to the line layout
algorithm. The "e" and "X" do not fill the line but
also are not words and are appended to the pendingAreas
vector. The "tensible" then overflows the line and is
passed to the hyphenation, lets say it is hyphenated
as "tensi-ble". The "tensi-" is appended without
flushing the pending areas, which are put first into the
next line.
I put a StringBuffer into FObjMixed to accumulate
consecutive addCharacters() events. This fixes the problem
with character references, but not
 e<fo:inline>X</fo:inline>tensible
(also noted somewhere in bugzilla as problem)
The second is to flush pendig areas in addWord(). This
fixes the lost characters problem but *still* does not
correctly hyphenate words split into inline FOs, only
the chunk actually overflowing the line is considered
for hyphenation.


More problems I noted:
- white space is handled inconsistently
- line break detection relies on white space only
- word detection for hyphenation relies on white space
  and wrongly assumes there is a white space before the
  word passed to doHyphenation()
- the LinkSet is not considered for hyphenated word parts
  in addWord, and neither for page-number-citation nor
  fo:character
- same for most of overlining, line through and vertical
  alignment
- characters are copied to FOText, and then copied *twice*
  in LineArea.layout(), one purely for hyphenation. During
  Layout, character data is at least three times, possibly
  four times (parser buffer) in memory

Questions:
- Is it still worth to do major hacks in LineArea.java?
- Should we consider using Unicode break properties for
  line break opportunity detection?
- How should words for hyphenation be detected?
- What happens to line breaks and word detection in case of
  * inline graphics and other definitely non-text inlines
  * inline foreign elements, like formulas
  * inline-containers containing blocks, especially blocks
    with text only
- Are there script or language dependencies to consider for
  line break and word detection?
- At which point should collapse-whitespace, linefeed-treatment
  etc. considered? Possibilities:
  * while creating FOText
  * while feeding it into the line area
  * during line area layout

Considering white-space-collapse during FOText creation has some
problems in case of successive spaces in different inline FO.

There are additional issues with consecutive spaces which had
been discussed here already, in particular how
  foo <fo:inline text-decoration="underline"> bar</fo:inline>
should be handled. Will this result in two consecutive spaces,
one of them underlined? Has this issue been resolved meanwhile?

J.Pietschmann

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]

Line breaking and Hyphenation

Reply via email to