Manuel Mall wrote:
That seems to be the consensus, that is, consider ZWS for line breaking but then discard it and don't give it to the renderers.


Renderers could deal with ZWS if the font had a glyph for
this character; unfortunately, that's not the case for the PDF
standard fonts  :-)  Some fonts *do* have glyphs for various Unicode
space characters, notably the fixed width spaces.

This leads to the question: Is a space a character? What *is* a
character? The Unicode people had endless discussions about this.
Spaces sit exactly in the gray area between "real characters",
which leave marks, and layout control.

Handling space characters in layout and discarding them before
rendering has the distinct advantage that they work for
any font in any renderer (provided it can handle variable space
areas properly, of course). OTOH, renderers whose output format
can handle the spaces itself, like a hypothetical HTML renderer,
would be better off getting the original character.
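
A minimal sketch of that split, in Python for brevity rather than FOP's
Java. Layout consults the break-opportunity characters, and the text handed
to a renderer has them stripped unless the renderer says it can handle them
itself. The names `LAYOUT_ONLY`, `break_opportunities`, `text_for_renderer`
and the `handles_spaces` flag are made up for illustration and are not FOP API.

```python
# Characters that layout consumes and that most renderers should never see.
# Hypothetical set for illustration; only U+200B is taken from the discussion.
LAYOUT_ONLY = {"\u200b"}  # zero width space


def break_opportunities(text: str) -> list:
    """Return the indices of characters at which a line may be broken."""
    return [i for i, ch in enumerate(text) if ch == " " or ch in LAYOUT_ONLY]


def text_for_renderer(text: str, handles_spaces: bool = False) -> str:
    """Strip layout-only characters unless the renderer handles them itself."""
    if handles_spaces:
        # e.g. a hypothetical HTML renderer gets the original characters
        return text
    return "".join(ch for ch in text if ch not in LAYOUT_ONLY)
```

With this, layout would see the ZWS in `"ab\u200bcd"` as a break opportunity,
while a PDF-style renderer would only ever receive `"abcd"`.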

Are there any other (unusual Unicode) characters which fall in the same category, that is, they influence layout decisions but should not be seen by the renderers?

* Unicode spaces
 + variable width spaces
   - ordinary space U+0020
   - ordinary non-breaking space U+00A0
 + fixed width spaces; potentially available in fonts and *may*
   be passed to renderers, *except* for U+200B
   - zero width space U+200B, may expand in justification (not
     implemented this way in FOP 0.20.5, which will haunt us)
   - zero width non breaking space, aka byte order mark U+FEFF,
     should now only be used as BOM (as the BOM is eaten by the
     XML parser, FOP could emit a "deprecated" warning)
   - en quad U+2000, according to my Unicode book *identical* to
     U+2002, *not* a 4en space (strange)
   - em quad U+2001, likewise identical to U+2003
   - en space aka nut U+2002
   - em space aka mutton U+2003
   - three-per-em space aka thick space (1/3 em width) U+2004
   - four-per-em space aka mid space (1/4 em width) U+2005
   - six-per-em space (generally 1/6 em width) U+2006
   - figure space (font dependent) U+2007
   - punctuation space (as wide as a dot or comma) U+2008
   - thin space (1/5..1/8 em width) U+2009
   - hair space (1/10..1/16 em width) U+200A
   - narrow no-break space (probably 1/6 em width) U+202F
   - mathematical space U+205F
   - non breaking word joiner U+2060, replaces U+FEFF in text
   - ideographic space U+3000
   - OGHAM SPACE MARK U+1680 (odd stuff)
   - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore
     not a space. At least I hope so.
 + see also
    http://en.wikipedia.org/wiki/Space_character
    http://www.alistapart.com/stories/emen/
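
For checking a list like this against a Unicode database, a short script
along the following lines may help. It is only a sketch using Python's
standard `unicodedata` module. The grouping into `VARIABLE_WIDTH` and
`ZERO_WIDTH` is an assumption that loosely follows the list above (the
Unicode data itself does not encode it), and the helper names `classify`
and `canonical` are invented.

```python
import unicodedata

# Assumed grouping, loosely following the list above.
VARIABLE_WIDTH = {0x0020, 0x00A0}       # stretch in justification
ZERO_WIDTH = {0x200B, 0xFEFF, 0x2060}   # no advance width of their own


def classify(ch: str) -> str:
    """Sort a character into one of the groups used in the list above."""
    cp = ord(ch)
    if cp in VARIABLE_WIDTH:
        return "variable"
    if cp in ZERO_WIDTH:
        return "zero-width"
    # General category Zs ("space separator") covers the remaining spaces.
    if unicodedata.category(ch) == "Zs":
        return "fixed"
    return "not-a-space"


def canonical(ch: str) -> str:
    """Apply canonical normalization, which folds the quads into the spaces."""
    return unicodedata.normalize("NFC", ch)
```

Here `canonical("\u2000")` yields U+2002, which matches the remark that the
en quad is identical to the en space, and `classify("\u1361")` reports the
Ethiopic wordspace as not a space.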

* Other characters
 + Character shaping hints; they do not cause line breaks.
   - zero width joiner U+200D
   - zero width non-joiner U+200C (can probably also serve as a
     hint to prevent ligatures)
   - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
 + Soft hyphen U+00AD. Must be hidden if no line break follows.
 + Formatting characters. I'd say these characters should not occur
   in XSLFO source, because there are FOs which provide the same
   functionality.
   - line separator U+2028, FOP 0.20.5 creates an unconditional line
     break regardless of any FO properties
   - paragraph separator U+2029
   - bidi control characters U+200E-U+200F, U+202A-U+202E
   - deprecated controls U+206A-U+206F
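
The soft hyphen rule can be sketched as a clean-up step applied to each
line once breaking is done. This is a Python illustration under stated
assumptions: the function name `finish_line` is invented, a line is taken
to be a plain string, and the visible hyphen defaults to U+002D, which a
real formatter would probably make configurable.

```python
SOFT_HYPHEN = "\u00ad"


def finish_line(line: str, visible_hyphen: str = "-") -> str:
    """Hide soft hyphens inside a line; show one only if the line ends on it."""
    ends_on_shy = line.endswith(SOFT_HYPHEN)
    cleaned = line.replace(SOFT_HYPHEN, "")
    return cleaned + visible_hyphen if ends_on_shy else cleaned
```

A line that was broken at a soft hyphen keeps a visible hyphen at its end;
every other soft hyphen disappears before the text reaches the renderer.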


J.Pietschmann
