On Wed, 26 Oct 2005 03:15 am, J.Pietschmann wrote:
> Manuel Mall wrote:
> > While investigating if we could use the standard
> > java.text.BreakIterator to determine line break points I noticed
> > that FOP uses in addition to space, zero width space, hyphen also
> > the forward slash as a valid line breaking character. The Java
> > BreakIterator does not recognize slash as a line breaking char (nor
> > FWIW does MS Word).
> >
> > What is the background to FOP allowing this? Is this consistent
> > with normal user expectations or is this specific to type setting
> > environments / Tex / Knuth?
>
> The BreakIterator class is supposed to implement the Unicode TR14
> standard annex
>   http://www.unicode.org/reports/tr14/
> The slash U+002F aka SOLIDUS is assigned a line breaking property
> value SY (Symbols Allowing Breaks)
>   http://www.unicode.org/Public/UNIDATA/LineBreak.txt
> which means "prevent a break before, and allow a break after". I
> suspect this is a recent change in Unicode, not implemented yet by
> your JDK release.
>
> BTW first breaking the text using whitespace, then applying the
> BreakIterator is unwise, because white space is significant for TR14
> line breaking. Unfortunately, combining whitespace normalization,
> line break detection and word parsing (for hyphenation) in a single
> pass is unwieldy if BreakIterator is used, that's why I tried to
> implement it differently some time ago
>   http://people.apache.org/~pietsch/linebreak.tar.gz

Joerg,
great stuff. I like the idea of having a Unicode conformant/compliant/based
line breaking algorithm in FOP. Note that this has nothing to do with the
Knuth algorithm used in FOP: I am talking about using the Unicode algorithm
to determine line break opportunities. It is then up to the Knuth algorithm
to convert the Knuth element lists generated from those line break
opportunities into an optimal set of line breaks.

But how can we move forward? The current FOP code to determine line break
opportunities looks like a quick solution that works well for simple texts
using only space, nbsp and zero width space, but not for anything that uses
more sophisticated Unicode break characters. You have code which does a
better job, but it's not in FOP. Shall we use your work in FOP, and if so,
how can we best integrate it?

BTW, looking at http://www.unicode.org/reports/tr14/ with respect to the
SOLIDUS, i.e. line breaking property SY, the rules are actually quite
complex: they do not allow a break within a sequence of digits, e.g.
26/10/2005, and they discourage breaking things like "w/o" or "A/S".

> J.Pietschmann

Manuel
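
For reference, a minimal sketch of how the line break opportunities discussed above can be listed with the standard java.text.BreakIterator. The class name BreakOpportunities, the helper breakOffsets and the sample strings are made up for illustration and are not FOP code. Which offsets are reported around the slash depends on the rule set shipped with the JDK in use, so no expected output is given.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of FOP.
class BreakOpportunities {

    // Returns the offsets at which the JDK line BreakIterator reports a
    // boundary, skipping offset 0. For non-empty text the last entry is
    // the end of the text.
    static List<Integer> breakOffsets(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        it.first(); // position at the start boundary (offset 0), which we skip
        for (int pos = it.next(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Sample strings taken from the cases mentioned in this thread.
        String[] samples = {"hello world", "26/10/2005", "w/o", "A/S"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + breakOffsets(s));
        }
    }
}
```

The offsets are only candidate break points. Choosing among them would remain the job of the Knuth algorithm, as described above.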
