Re: White space handling Wiki page

Simon Pepping Wed, 09 Nov 2005 13:33:53 -0800

On Tue, Nov 08, 2005 at 11:19:15AM +0800, Manuel Mall wrote:
> On Tue, 8 Nov 2005 04:40 am, Simon Pepping wrote:
> >
> > Step 2. Refinement: white-space-collapse
> > ========================================
> >
> > Issue 1. The spec intentionally addresses only XML white space,
> > because only such white space is manipulated by editors to obtain
> > pretty printing.
> 
> Point taken, although I have no experience with non western editors. Do 
> they all use 0x20 for 'pretty printing'?


The XML spec does not allow characters other than XML white space to be
used for pretty printing, at least not in element content. Doing so
would make the XML file invalid, because PCDATA would be present where
the DTD or schema does not allow it. That is true even for the
non-breaking space, U+00A0.

   <ul>
       <li>Some text.</li>
   </ul>

is valid XHTML, but

   <ul>
&#xA0;&#xA0;&#xA0;&#xA0;<li>Some text.</li>
   </ul>

is not.
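
The distinction can be checked mechanically. The following is only a
minimal sketch in Python: the helper names are made up for this
illustration, and ElementTree does not validate against a DTD, so the
code only tests whether the text before the first child element
consists purely of XML white space (space, tab, CR, LF).

```python
import xml.etree.ElementTree as ET

# The four characters that XML 1.0 classifies as white space.
XML_WHITESPACE = frozenset("\x20\x09\x0D\x0A")


def is_xml_whitespace_only(text: str) -> bool:
    """Return True if every character in text is XML white space."""
    return all(ch in XML_WHITESPACE for ch in text)


def leading_text_is_ignorable(xml_source: str) -> bool:
    """Hypothetical helper: inspect the text before the first child element.

    In element content only XML white space may appear there; anything
    else counts as PCDATA, which the DTD or schema does not allow.
    """
    root = ET.fromstring(xml_source)
    return is_xml_whitespace_only(root.text or "")


pretty = "<ul>\n       <li>Some text.</li>\n   </ul>"
with_nbsp = "<ul>\n&#xA0;&#xA0;&#xA0;&#xA0;<li>Some text.</li>\n   </ul>"

print(leading_text_is_ignorable(pretty))
print(leading_text_is_ignorable(with_nbsp))
```

Both inputs are well-formed, so the parser accepts them; only a
validating parser would reject the second one.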

In PCDATA it is slightly different.

   <p>
      This is some content.
      We wrap the lines at
      a narrow width.
   </p>

Formally these data differ from the case where the text of the
paragraph is written on one line: spaces have been converted to
linefeeds, and sequences of spaces have been inserted. The XML parser
reports all linefeeds and spaces to the application as character
data. But almost all applications treat the two cases as equivalent,
certainly when the data are considered textual data. It is exactly
this convention that the FO spec tries to formalize.

   <fo:block>
      This is some content.
      We wrap the lines at
      a narrow width.
   </fo:block>

_is_ equivalent to the case where the text of the block is written on
one line, due to the linefeed-treatment and white-space-collapse
properties (at their default values).

Such a convention is not usually applied to non-XML-whitespace
characters, and the FO spec shows no intention to do so.

A side effect is that 'This is some content' is equivalent to
'This  is   some  content', but that is not the case with any
other character, even one that is considered white space in
some script.
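
As a rough illustration of that equivalence, here is a simplified
Python sketch. It is not FOP code and not the full refinement
algorithm of the spec: the function name is invented, only U+0020 and
U+000A are handled, and the removal of spaces at the start and end of
the block is approximated by a plain strip.

```python
import re


def refine_with_defaults(text: str) -> str:
    """Simplified model of white space refinement at default values."""
    # linefeed-treatment="treat-as-space": each linefeed becomes a space.
    text = text.replace("\n", " ")
    # white-space-collapse="true": a run of spaces becomes a single space.
    text = re.sub(" {2,}", " ", text)
    # Spaces at the start and end of the block disappear in line building.
    return text.strip(" ")


wrapped = (
    "\n      This is some content.\n"
    "      We wrap the lines at\n"
    "      a narrow width.\n   "
)
one_line = "This is some content. We wrap the lines at a narrow width."

print(refine_with_defaults(wrapped) == one_line)
# U+00A0 is not XML white space, so a run of it is left untouched.
print(repr(refine_with_defaults("This\u00A0\u00A0is")))
```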

> > Example 2
> > =========
> >
> > The space in "<fo:block>.<fo:block>" is suppressed because it is at
> > the start of the block. 
> Interesting - I agree that this is the intention but you don't find that 
> sentence in the spec. In 1.1 this is covered by the "deleting spaces at 
> the beginning of a line" under white-space-treatment / line building. 
> Again the discussion is probably academic - we all agree what the 
> expected outcome is. If we can derive that outcome from the spec or not 
> is a very interesting discussion but won't change what we will do.

This is covered under the notion that the start and end of an fo:block are
equivalent to line breaks.

> > And "<fo:block><fo:block>" does not generate 
> > an empty line. <fo:block> starts a new line, but that is not
> > equivalent to a linefeed. When at the start of the nested fo:block
> > there is no content in the line yet, it starts the same line. A
> > similar thing happens in the case of "</fo:block>&#x0A;</fo:block>",
> > which was discussed in an email thread.
> I assume you mean the discussion under linefeed-treatment="preserve". I 
> am still confused about that because
> </fo:block>&#x0A;&#x0A;</fo:block> 
> will generate one linefeed or should this create also none?

Yes, I am referring to that discussion, and I quoted it
incorrectly. The case is "&#x0A;</fo:block>". The linefeed creates a
linebreak, and </fo:block> does not add another one since the line has
already been ended. </fo:block>&#x0A;</fo:block> should create one
empty line, and </fo:block>&#x0A;&#x0A;</fo:block> two empty lines, I
suppose.
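
The reading above can be written down as a toy model, assuming
linefeed-treatment="preserve". This Python sketch only mirrors the
interpretation given in this message, not the spec and not the FOP
layout code; the token names and the function are hypothetical.

```python
def build_lines(tokens):
    """Toy line builder.

    tokens is a list of (kind, value) pairs, where kind is one of
    "start" (<fo:block>), "end" (</fo:block>), "lf" (preserved
    linefeed) or "text".
    """
    lines = []
    current = ""
    for kind, value in tokens:
        if kind == "text":
            current += value
        elif kind == "lf":
            # A preserved linefeed always ends the line, even an empty one.
            lines.append(current)
            current = ""
        elif kind in ("start", "end"):
            # A block boundary ends the line only if it has content.
            if current:
                lines.append(current)
                current = ""
        else:
            raise ValueError(f"unknown token kind: {kind}")
    if current:
        lines.append(current)
    return lines


inner = [("start", ""), ("start", ""), ("text", "a"), ("end", "")]
print(build_lines(inner + [("end", "")]))
print(build_lines(inner + [("lf", ""), ("end", "")]))
print(build_lines(inner + [("lf", ""), ("lf", ""), ("end", "")]))
```

In this model "<fo:block><fo:block>" produces no empty line, and each
linefeed between two end tags produces exactly one.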

> > Nowhere in the spec is a conversion of tabs and CRs to spaces
> > specified.
> Under 7.15.8 it says:
> 
> preserve
> 
>     Specifies that any character flow object whose character is 
> classified, before any linefeed-treatment handling is considered, as 
> white space in XML, except for U+000A (linefeed) characters, shall be 
> converted during the refinement process into a character flow object 
> whose Unicode code point is U+0020 (space).

But that conversion was removed from 7.16.8 in the 1.1 draft.

Regards, Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl
