Manuel Mall wrote:
Here are some of the combinations I have identified:
1. Non breaking / non elastic space => probably just a normal character,
i.e. part of a word.
2. Non breaking / elastic space - e.g. U+00A0 Non breaking space
=> Must prevent break
=> Must handle text-align
3. Break / non elastic - e.g. U+200B ZWSP, or any other break between two
characters that does not add or remove space/characters
=> Must handle border/padding
=> Must handle text-align
4. Break / non elastic / remove if not break - e.g. U+00AD Soft hyphen
=> Must remove if not at break
=> Must handle border/padding
=> Must handle text-align
5. Break / non elastic / add character if break - e.g. hyphenation
=> Must add space for hyphen if at break
=> Must handle border/padding
=> Must handle text-align
6. Breaking / elastic / non removable - e.g. U+3000 Ideographic space
=> Must handle border/padding
=> Must handle text-align
Question: XSL-FO does not define U+3000 as removable white space, but
would it be removed at a line break under common CJK typesetting
conventions?
7. Breaking / elastic / removable - e.g. U+0020 Space
=> Can occur in runs which must be wholly removed
=> Must handle border/padding
=> Must handle text-align
Any combinations I have missed, e.g. is there a "break / non elastic /
remove at break" case?
Maybe the fixed width spaces?
Anyway, this seems to be an exhaustive analysis of the problem!
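To keep the seven combinations apart, they can be written down as a small
table of properties. This is only a sketch: BreakClass, Treatment and
classify() are hypothetical names, not existing FOP classes, and the
mapping covers only the example characters mentioned above (with ZWNBSP
taken as U+FEFF for case 1).

```java
/** Hypothetical classification of the seven cases above; not a FOP type. */
enum BreakClass {
    NON_BREAKING_FIXED(false, false, Treatment.KEEP),                  // 1. like a letter
    NON_BREAKING_ELASTIC(false, true, Treatment.KEEP),                 // 2. e.g. U+00A0
    BREAK_FIXED(true, false, Treatment.KEEP),                          // 3. e.g. U+200B
    BREAK_SOFT_HYPHEN(true, false, Treatment.REMOVE_IF_NOT_AT_BREAK),  // 4. e.g. U+00AD
    BREAK_HYPHENATION(true, false, Treatment.ADD_CHAR_AT_BREAK),       // 5. hyphenation
    BREAK_ELASTIC_KEPT(true, true, Treatment.KEEP),                    // 6. e.g. U+3000
    BREAK_ELASTIC_REMOVABLE(true, true, Treatment.REMOVE_AT_BREAK);    // 7. e.g. U+0020

    /** What happens to the character depending on whether the break is taken. */
    enum Treatment { KEEP, REMOVE_IF_NOT_AT_BREAK, ADD_CHAR_AT_BREAK, REMOVE_AT_BREAK }

    final boolean canBreak;
    final boolean elastic;
    final Treatment treatment;

    BreakClass(boolean canBreak, boolean elastic, Treatment treatment) {
        this.canBreak = canBreak;
        this.elastic = elastic;
        this.treatment = treatment;
    }

    /** In the list above, every case that allows a break must handle border/padding. */
    boolean handlesBorderPadding() {
        return canBreak;
    }

    /** In the list above, every case except the first must handle text-align. */
    boolean handlesTextAlign() {
        return this != NON_BREAKING_FIXED;
    }

    /** Maps the example code points from the list; returns null for anything else. */
    static BreakClass classify(int codePoint) {
        switch (codePoint) {
            case 0xFEFF: return NON_BREAKING_FIXED;      // ZWNBSP
            case 0x00A0: return NON_BREAKING_ELASTIC;    // non breaking space
            case 0x200B: return BREAK_FIXED;             // ZWSP
            case 0x00AD: return BREAK_SOFT_HYPHEN;       // soft hyphen
            case 0x3000: return BREAK_ELASTIC_KEPT;      // ideographic space
            case 0x0020: return BREAK_ELASTIC_REMOVABLE; // space
            default:     return null;                    // not one of the examples
        }
    }
}
```

Case 5 has no code point of its own, since the hyphenation point is found
by the hyphenator rather than read from the text, so classify() never
returns it.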
Just a few comments / thoughts:
- non breaking, non elastic: the simple solution would be to handle these
characters as normal "letters", so the text "before_after" (where _ is
ZWNBSP) would create a single AreaInfo object in the TextLM; but this
would cause problems during hyphenation, since at the moment a non-letter
character in the middle of a word prevents hyphenation
- soft hyphen: at the moment it is not properly handled, but the
implementation should not be difficult to fix; it could create the same
elements used for a hyphenation point, but the penalty could have a
negative value (as users would probably use it to "suggest" a desired line
break); note that a word with a soft hyphen in its middle would not be
hyphenated, unless we ignore this character when collecting word fragments
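A rough sketch of the soft hyphen idea, to make it concrete. Element and
SoftHyphenSketch are simplified, hypothetical classes (the real element
classes in FOP carry more information), the penalty value of -50 is an
arbitrary assumption, and widths are faked as a fixed width per character.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified box/penalty element; not FOP's own element classes. */
final class Element {
    enum Kind { BOX, PENALTY }

    final Kind kind;
    final int width;       // box: text width; penalty: width added if the break is taken
    final int penalty;     // only meaningful for PENALTY elements
    final boolean flagged; // true for hyphen-like breaks
    final String text;     // only meaningful for BOX elements

    private Element(Kind kind, int width, int penalty, boolean flagged, String text) {
        this.kind = kind;
        this.width = width;
        this.penalty = penalty;
        this.flagged = flagged;
        this.text = text;
    }

    static Element box(String text, int width) {
        return new Element(Kind.BOX, width, 0, false, text);
    }

    static Element penalty(int width, int penalty, boolean flagged) {
        return new Element(Kind.PENALTY, width, penalty, flagged, null);
    }
}

final class SoftHyphenSketch {
    static final char SOFT_HYPHEN = '\u00AD';
    static final int SOFT_HYPHEN_PENALTY = -50; // assumed value: negative, so the break is favoured

    /** One box per fragment, with a flagged negative penalty at each soft hyphen. */
    static List<Element> elementsFor(String word, int charWidth, int hyphenWidth) {
        List<Element> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= word.length(); i++) {
            boolean atEnd = (i == word.length());
            if (atEnd || word.charAt(i) == SOFT_HYPHEN) {
                if (i > start) {
                    String fragment = word.substring(start, i);
                    out.add(Element.box(fragment, fragment.length() * charWidth));
                }
                if (!atEnd) {
                    // The hyphen only takes up space if the line is broken here.
                    out.add(Element.penalty(hyphenWidth, SOFT_HYPHEN_PENALTY, true));
                }
                start = i + 1;
            }
        }
        return out;
    }

    /** The word as the hyphenator would need to see it: soft hyphens ignored. */
    static String withoutSoftHyphens(String word) {
        return word.replace(String.valueOf(SOFT_HYPHEN), "");
    }
}
```

With this, elementsFor("hy\u00ADphen", 10, 5) gives a box for "hy", a
flagged penalty of width 5, and a box for "phen", while
withoutSoftHyphens() returns "hyphen" so the word could still be passed to
the hyphenator as a whole.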
Regards
Luca