Manuel Mall wrote:
> That seems to be the consensus, that is consider ZWS for line breaking but then discard and don't give it to the renderers.
Renderers could deal with ZWS if the font had a glyph for this character; unfortunately, that's not the case for the PDF standard fonts :-) Some fonts *do* have glyphs for various Unicode space characters, notably the fixed width spaces.

This leads to the question: Is a space a character? What *is* a character? The Unicode people had endless discussions about this. Spaces sit exactly in the gray area between "real characters", which leave marks, and layout control.

Handling space characters in layout and discarding them before rendering has the distinct advantage that they work for any font in any renderer (provided the renderer can handle variable space areas properly, of course). OTOH, renderers whose output format can handle the spaces itself, like a hypothetical HTML renderer, would be better off getting the original character.
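To make the "use it in layout, then discard it" idea concrete, here is a minimal sketch in Python. It is not FOP code: the function name `layout`, the greedy breaking strategy, and the one-unit-per-character width model are all assumptions made for illustration. It treats U+0020 and U+200B as break opportunities and never passes U+200B on to the output lines.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def layout(text, max_width):
    """Greedy line breaking; returns the lines a renderer would receive.

    Assumption: every visible character is one unit wide. A real layout
    engine would use font metrics instead of len().
    """
    # Split into words and the runs of break characters between them.
    tokens = re.split(f"([ {ZWSP}]+)", text)
    lines, current, gap = [], "", ""
    for token in tokens:
        if not token:
            continue
        if token[0] in (" ", ZWSP):
            # A run containing an ordinary space renders as one space;
            # a run of only ZWSP contributes no width and no character.
            gap = " " if " " in token else ""
            continue
        if current and len(current) + len(gap) + len(token) > max_width:
            # Break at the opportunity; the break character is dropped.
            lines.append(current)
            current = token
        else:
            current += (gap if current else "") + token
        gap = ""
    if current:
        lines.append(current)
    return lines
```

For example, `layout("ab\u200bcd ef", 10)` keeps everything on one line as `"abcd ef"`, while a narrower width lets the line break at the ZWS position. In both cases the renderer never sees U+200B.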
> Are there any other (unusual Unicode) characters which fall in the same category that is they influence layout decisions but should not be seen by the renderers?
* Unicode spaces
  + variable width spaces
    - ordinary space U+0020
    - ordinary non-breaking space U+00A0
  + fixed width spaces; potentially available in fonts and *may* be passed to renderers, *except* for U+200B
    - zero width space U+200B, may expand in justification (not implemented this way in FOP 0.20.5, which will haunt us)
    - zero width non breaking space, aka byte order mark U+FEFF, should now only be used as BOM (as the BOM is eaten by the XML parser, FOP could emit a "deprecated" warning)
    - en quad U+2000, according to my Unicode book *identical* to U+2002, *not* a 4en space (strange)
    - em quad U+2001, similar to U+2000
    - en space aka nut U+2002
    - em space aka mutton U+2003
    - three-per-em space aka thick space (1/3 em width) U+2004
    - four-per-em space aka mid space (1/4 em width) U+2005
    - six-per-em space (generally 1/6 em width) U+2006
    - figure space (font dependent) U+2007
    - punctuation space (as wide as a dot or comma) U+2008
    - thin space (1/5..1/8 em width) U+2009
    - hair space (1/10..1/16 em width) U+200A
    - narrow no-break space (probably 1/6 em width) U+202F
    - mathematical space U+205F
    - non breaking word joiner U+2060, replaces U+FEFF in text
    - ideographic space U+3000
    - OGHAM SPACE MARK U+1680 (odd stuff)
    - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore not a space. At least I hope so.
  + see also
    http://en.wikipedia.org/wiki/Space_character
    http://www.alistapart.com/stories/emen/
* Other characters
  + Character shaping hints; they do not cause line breaks.
    - zero width joiner U+200D
    - zero width non-joiner U+200C (may probably also hint at preventing ligatures)
    - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
  + Soft hyphen U+00AD. Must be hidden if no line break follows.
  + Formatting characters. I'd say these characters should not occur in XSLFO source, because there are FOs which provide the same functionality.
    - line separator U+2028, FOP 0.20.5 creates an unconditional line break regardless of any FO properties
    - paragraph separator U+2029
    - bidi control characters U+200E-U+200F, U+202A-U+202E
    - deprecated controls U+206A-U+206F

J.Pietschmann
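
The character groups listed above could be captured in a small lookup table, so that layout code can decide per character how to treat it. The following Python sketch is only an illustration: the category names, the helper `classify`, and the decision to put U+200B, U+FEFF and U+2060 into a separate zero-width group are assumptions, not anything FOP defines.

```python
def _build_table():
    """Map code points from the list above to an (assumed) category name."""
    table = {}

    def put(category, *ranges):
        for first, last in ranges:
            for cp in range(first, last + 1):
                table[cp] = category

    put("variable-width space", (0x0020, 0x0020), (0x00A0, 0x00A0))
    put("fixed-width space",
        (0x2000, 0x200A),   # en quad .. hair space
        (0x202F, 0x202F),   # narrow no-break space
        (0x205F, 0x205F),   # mathematical space
        (0x3000, 0x3000),   # ideographic space
        (0x1680, 0x1680))   # OGHAM SPACE MARK
    put("zero-width space",
        (0x200B, 0x200B),   # zero width space
        (0x2060, 0x2060),   # word joiner
        (0xFEFF, 0xFEFF))   # zero width no-break space / BOM
    put("shaping hint", (0x200C, 0x200D))
    put("soft hyphen", (0x00AD, 0x00AD))
    put("formatting control",
        (0x2028, 0x2029),   # line and paragraph separator
        (0x200E, 0x200F),   # bidi marks
        (0x202A, 0x202E),   # bidi embeddings and overrides
        (0x206A, 0x206F))   # deprecated controls
    return table


CATEGORY = _build_table()


def classify(ch):
    """Return the category of a single character, or "ordinary"."""
    return CATEGORY.get(ord(ch), "ordinary")
```

With this table, `classify("\u200b")` yields `"zero-width space"`, while ETHIOPIC WORDSPACE U+1361 is deliberately absent and comes back as `"ordinary"`, matching the note that it leaves marks.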