On Sat, 5 Nov 2005 12:05 am, Luca Furini wrote:
> Manuel Mall wrote:
> > Here are some of the combinations I have identified:
> >
> > 1. Non breaking / non elastic space => probably just a normal
> > character, i.e. part of a word.
> >
> > 2. Non breaking / elastic space - eg. U+00A0 Non breaking space
> >     => Must prevent break
> >     => Must handle text-align
> >
> > 3. Break / non elastic - eg. U+200B ZWSP, any other break between
> > two characters not involving adding or removing space/characters
> >     => Must handle border/padding
> >     => Must handle text-align
> >
> > 4. Break / non elastic / remove if not break - eg. U+00AD Soft
> > hyphen
> >     => Must remove if not at break
> >     => Must handle border/padding
> >     => Must handle text-align
> >
> > 5. Break / non elastic / add character if break - eg. hyphenation
> >     => Must add space for hyphen if at break
> >     => Must handle border/padding
> >     => Must handle text-align
> >
> > 6. Breaking / elastic / non removable - eg. U+3000 Ideographic
> > space
> >     => Must handle border/padding
> >     => Must handle text-align
> >     Question: XSL-FO does not define U+3000 as removable white space
> > but would under common CJK typesetting conventions this be removed
> > at a line break?
> >
> > 7. Breaking / elastic / removable - eg. U+0020 Space
> >     => Can occur in runs which must be wholly removed
> >     => Must handle border/padding
> >     => Must handle text-align
> >
> > Any combinations I have missed, e.g. is there a "break / non
> > elastic / remove at break" case?
>

I moved all this to a Wiki page with the actual Knuth sequences 
(http://wiki.apache.org/xmlgraphics-fop/LineBreaking). Please review / 
check!

> Maybe the fixed width spaces?
>
Yes, maybe.

> Anyway, it seems an exhaustive analysis of the problem!
>
> Just a few comments / thoughts:
>
> - non breaking, non elastic: the simple solution would be to handle
> these characters as normal "letters", so the text "before_after"
> (where _ is zwnbsp) would create a single AreaInfo object in the
> TextLM; but this would create problems during hyphenation, as
> non-letter characters in the middle of a word ATM prevents
> hyphenation
I think word breaking, i.e. determining the word boundaries for the
purpose of hyphenation, and line breaking are not 100% coupled. There
are actually different Unicode documents describing each. Therefore,
down the track, I don't see treating these as normal characters for the
purpose of line breaking as being a problem, as the word breaking would
be done maybe in parallel but logically separately. We also most likely
want Knuth box elements covering the largest possible extent of
consecutive characters because a) it saves resources and b) as the
width of Knuth elements is the basis for determining what fits on a
line, if we ever look into kerning, the calculations would need to be
done on a per Knuth box element basis.
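
To illustrate the point about box granularity, here is a minimal Python
sketch. It is not FOP code: the names Box, Glue, char_width and
to_elements, the flat 500-unit character width and the glue values are
all assumptions made up for the example. It treats a non-breaking,
non-elastic character (zwnbsp, U+FEFF) like any other letter, so
"before_after" ends up in a single box, and only U+0020 produces glue.

```python
from dataclasses import dataclass

ZWNBSP = "\ufeff"  # zero width no-break space: non breaking, non elastic


@dataclass
class Box:
    width: int
    text: str


@dataclass
class Glue:
    width: int
    stretch: int
    shrink: int


def char_width(ch):
    # Hypothetical metrics: zwnbsp has no width, every other char is 500 units.
    return 0 if ch == ZWNBSP else 500


def to_elements(text):
    """Merge each run of non-space characters into one box; spaces become glue."""
    elements = []
    run = ""
    for ch in text:
        if ch == " ":
            if run:
                elements.append(Box(sum(char_width(c) for c in run), run))
                run = ""
            # Assumed glue values for an ordinary elastic space.
            elements.append(Glue(250, 125, 83))
        else:
            run += ch
    if run:
        elements.append(Box(sum(char_width(c) for c in run), run))
    return elements


elements = to_elements("before" + ZWNBSP + "after now")
```

Here "before_after" yields one box whose width is the sum of its
characters, which is also the unit any later kerning adjustment would
have to be computed over.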

>
> - soft hyphen: at the moment it is not properly handled, but it won't
> be difficult to fix the implementation; it could create the same
> elements used for an hyphenation point, but the penalty could have a
> negative value (as probably users would use it to "suggest" a desired
> line break); note that a word with a soft hyphen in its middle would
> not be hyphenated, unless we ignore this character when collecting
> word fragments

I thought we would simply delete the soft-hyphen character and generate
a normal break-with-hyphen Knuth sequence at that point.
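
As a rough sketch of that idea (again not FOP code; Box, Penalty,
word_to_elements and all the numeric values are hypothetical), the
soft hyphen is dropped from the text and replaced by a flagged penalty
whose width is that of the hyphen glyph, so the hyphen only takes up
space if the break is actually chosen. The penalty value is left as a
parameter, so the negative value Luca suggests could be passed in.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"


@dataclass
class Box:
    width: int
    text: str


@dataclass
class Penalty:
    width: int     # extra width added only if the line breaks here
    value: int     # cost of breaking here; negative encourages the break
    flagged: bool  # marks a hyphenated break


def word_to_elements(word, char_width=500, hyphen_width=300, hyphen_penalty=50):
    """Split a word at soft hyphens: box, flagged penalty, box, ..."""
    elements = []
    fragments = word.split(SOFT_HYPHEN)
    for i, fragment in enumerate(fragments):
        if fragment:
            # The soft hyphen itself never ends up in a box.
            elements.append(Box(len(fragment) * char_width, fragment))
        if i < len(fragments) - 1:
            elements.append(Penalty(hyphen_width, hyphen_penalty, True))
    return elements


elements = word_to_elements("hyphen" + SOFT_HYPHEN + "ation")
```

If the break is not taken, the two boxes simply sit next to each other
and the penalty contributes no width, which covers the "must remove if
not at break" requirement.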

>
> Regards
>      Luca

Manuel
