On Thu, 3 Nov 2005 05:57 am, J.Pietschmann wrote:
> Manuel Mall wrote:
> > That seems to be the consensus, that is consider ZWS for line
> > breaking but then discard and don't give it to the renderers.
>
> Renderers could deal with ZWS if the font would have a glyph for
> this character; unfortunately, that's not the case for the PDF
> standard fonts  :-)  Some fonts *do* have glyphs for various Unicode
> space characters, notably the fixed width spaces.
>
> This leads to the question: Is a space a character? What *is* a
> character? The Unicode people had endless discussions about this.
> Spaces are exactly in the gray area between "real characters"
> which leave marks and layout control.
>
> Handling space characters in layout and discarding them before
> rendering has the distinctive advantage that they work for
> any font in any renderer (which can handle variable space areas
> properly, of course). OTOH, renderers which output a format which
> can handle the spaces itself, like a hypothetical HTML renderer,
> would better get the original character.
>
Exactly this was actually discussed recently in an exchange between 
myself and Luca. Luca pointed out that leaving space characters out of 
a PDF would lead to copy/paste behaviour most likely contrary to user 
expectations. I thought that was a very important point.

> > Are there any other (unusual Unicode) characters which fall in the
> > same category that is they influence layout decisions but should
> > not be seen by the renderers?
>
> * Unicode spaces
>   + variable width spaces
>     - ordinary space U+0020
>     - ordinary non-breaking space U+00A0
>   + fixed width spaces; potentially available in fonts and *may*
>     be passed to renderers, *except* for U+200B
>     - zero width space U+200B, may expand in justification (not
>       implemented this way in FOP 0.20.5, which will haunt us)
>     - zero width non breaking space, aka byte order mark U+FEFF,
>       should now only be used as BOM (as the BOM is eaten by the
>       XML parser, FOP could emit a "deprecated" warning)
With respect to U+200B, 
http://www.unicode.org/Public/UNIDATA/UCD.html says:
<quote>
White_Space: Those separator characters and control characters which 
should be treated by programming languages as "white space" for the 
purpose of parsing elements.

Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, 
since their functions are restricted to line-break control. Their names 
are unfortunately misleading in this respect.
</quote>
Also in UAX#14 it says:
<quote>
When expanding or compressing inter-word space according to common 
typographical practice, only the spaces marked by U+0020  SPACE, U+00A0  
NO-BREAK SPACE, and U+3000  IDEOGRAPHIC SPACE are subject to 
compression, and only spaces marked by U+0020 SPACE, U+00A0  NO-BREAK 
SPACE, and occasionally spaces marked by U+2009  THIN SPACE are subject 
to expansion. All other space characters normally have fixed width. 
When expanding or compressing inter-character space the presence of 
U+200B ZERO WIDTH SPACE or U+2060 WORD JOINER is always ignored.
</quote>

It therefore surprises me that you imply U+200B may expand in 
justification. However, I don't have the Unicode book (pretty 
expensive) and rely on the Internet for this sort of information. But I 
noticed that http://en.wikipedia.org/wiki/Space_character indicates 
U+200B can be used for justification. 
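
To make the UAX#14 wording quoted above concrete, here is a minimal sketch in Java. It only restates the quoted rule; the class and method names are hypothetical and are not existing FOP code, and the allowThinSpace flag is my own way of modelling the "occasionally" that the quote attaches to U+2009.

```java
// Hypothetical helper, not part of FOP: encodes the UAX#14 passage quoted above.
final class SpaceAdjustment {
    private SpaceAdjustment() { }

    /** Spaces subject to compression: U+0020, U+00A0, U+3000. */
    static boolean isCompressible(int cp) {
        return cp == 0x0020 || cp == 0x00A0 || cp == 0x3000;
    }

    /**
     * Spaces subject to expansion: U+0020, U+00A0, and U+2009 only
     * when the caller opts in (assumption: models "occasionally").
     */
    static boolean isExpandable(int cp, boolean allowThinSpace) {
        return cp == 0x0020 || cp == 0x00A0
                || (allowThinSpace && cp == 0x2009);
    }

    /** U+200B and U+2060 are ignored when adjusting inter-character space. */
    static boolean isIgnoredForLetterSpacing(int cp) {
        return cp == 0x200B || cp == 0x2060;
    }
}
```

Read this way, U+200B is never a candidate for expansion in justification, which is why the statement in the list surprised me.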

>     - en quad U+2000, according to my Unicode book *identical* to
>       U+2002, *not* a 4en space (strange)
>     - em quad U+2001, similar to U+2000
>     - en space aka nut U+2002,
>     - em space aka mutton U+2003
>     - three-per-em space aka thick space (1/3 em width) U+2004
>     - four-per-em space aka mid space (1/4 em width) U+2005
>     - six-per-em space (generally 1/6 em width) U+2006
>     - figure space (font dependent) U+2007
>     - punctuation space (as wide as a dot or comma) U+2008
>     - thin space (1/5..1/8 em width) U+2009
>     - hair space (1/10..1/16 em width) U+200A
>     - narrow no-break space (probably 1/6 em width) U+202F
>     - mathematical space U+205F
>     - non breaking word joiner U+2060 replaces U+FEFF in text
>     - ideographic space U+3000
>     - OGHAM SPACE MARK U+1680 (odd stuff)
>     - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore
>       not a space. At least I hope so.
>   + see also
>      http://en.wikipedia.org/wiki/Space_character
>      http://www.alistapart.com/stories/emen/
>
> * Other characters
>   + Character shaping hints; they do not cause line breaks.
>     - zero width joiner U+200D
>     - zero width non-joiner U+200C (may probably also hint at
>       preventing ligatures)
>     - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
>   + Soft hyphen U+00AD. Must be hidden if no line break follows.
>   + Formatting characters. I'd say these characters should not occur
>     in XSLFO source, because there are FO which represent the same
>     functionality.
>     - line separator U+2028, FOP 0.20.5 creates an unconditional line
>       break regardless of any FO properties
>     - paragraph separator U+2029
>     - bidi control characters 200E-200F, 202A-202E
>     - deprecated controls 206A-206F
>

Thanks for that list. With respect to the issue at hand, that is, which 
codepoints should be given to the renderers, it seems there are 3 types: 

1. Those we always give to the renderers even if they are not in the 
font (this is the default and applies to the vast majority)

2. Those we never give to the renderers, e.g. Soft Hyphen (it's either 
suppressed or replaced by the proper hyphen), zero-width joiners, ...

3. Those we replace (by another character or layout positioning) only if 
they are not in the font, e.g. fixed width spaces

Is that a sensible grouping?
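
To illustrate the grouping above, here is a rough sketch in Java. All names are hypothetical and none of this is existing FOP code. The codepoint sets are taken from the list earlier in this thread and are not meant to be complete. Putting U+200B into type 2 is an assumption based on the "discard ZWS" consensus mentioned at the top.

```java
// Hypothetical sketch of the three-way grouping; not FOP API.
final class CodepointPolicy {
    enum Policy { ALWAYS_PASS, NEVER_PASS, REPLACE_IF_NO_GLYPH }

    private CodepointPolicy() { }

    static Policy classify(int cp) {
        switch (cp) {
            case 0x00AD: // soft hyphen: suppressed or replaced by a real hyphen
            case 0x200B: // zero width space (assumption: always discarded)
            case 0x200C: // zero width non-joiner
            case 0x200D: // zero width joiner
                return Policy.NEVER_PASS;
            default:
                break;
        }
        // Fixed width spaces U+2000..U+200A plus U+202F, U+205F, U+3000.
        if ((cp >= 0x2000 && cp <= 0x200A)
                || cp == 0x202F || cp == 0x205F || cp == 0x3000) {
            return Policy.REPLACE_IF_NO_GLYPH;
        }
        // Type 1 is the default for everything else.
        return Policy.ALWAYS_PASS;
    }

    /** True if the codepoint itself should reach the renderer. */
    static boolean passToRenderer(int cp, boolean fontHasGlyph) {
        switch (classify(cp)) {
            case NEVER_PASS:
                return false;
            case REPLACE_IF_NO_GLYPH:
                return fontHasGlyph;
            default:
                return true;
        }
    }
}
```

With this, passToRenderer(0x2003, false) would tell layout to substitute an em space by positioning, while an ordinary letter goes through even if the font lacks it.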

Of course there are other modifications to codepoints not mentioned here 
like combining into ligatures, hyphenation combined with spelling 
changes, ....

>
> J.Pietschmann

Manuel
