Re: Leading/trailing space removal in LineLM

Manuel Mall Wed, 02 Nov 2005 04:36:58 -0800

On Wed, 2 Nov 2005 01:59 pm, Manuel Mall wrote:
> On Wed, 2 Nov 2005 04:18 am, Simon Pepping wrote:
> > On Tue, Nov 01, 2005 at 11:40:42PM +0800, Manuel Mall wrote:
> > > This is probably a question for Luca or Simon.
>
> <snip/>
>
> > Glue and penalty items are removed at the start of a line. This is
> > part of the Knuth algorithm. It does not touch the matter of
> > white-space-collapse. If there is whitespace that may not be
> > removed/collapsed at the start of the line, it must be protected by
> > a preceding zero-width box. I.o.w., the value of
> > white-space-collapse needs to be taken into account at the phase of
> > getNextKnuthElements.
>
> Fair enough - I need some help with the Knuth elements then.
>
> During getNextKnuth we need to only consider white-space-treatment as
> white-space-collapse can be handled completely during refinement,
> that is consecutive sequences of white space are either collapsed or
> not during refinement.
>
> We also can limit white-space-treatment during getNextKnuth to any
> line breaks generated by the line breaking algorithm (Knuth
> algorithm). white-space-treatment around hard line breaks (linefeeds,
> start/end of a block) are handled during refinement.
>
> We can also limit white-space-treatment during getNextKnuth to the
> values "preserve" vs "ignore-if...". Other values are handled during
> refinement. We also can treat the three different "ignore-if..."
> values, that is the values: ignore-if-before-linefeed,
> ignore-if-after-linefeed, ignore-if-surrounding-linefeed, as just one
> case: 'delete all white space around a formatter generated break'.
>
> So we end up with only two cases to consider: preserve white space
> and remove white space around a line break created by the Knuth
> algorithm.
>
> 1. Preserve white space: IMO in this case the space itself is
> actually not a break opportunity but there are now two break
> opportunities: one before the space and one after the space. That is
> a sequence like 'abc&#x20;def' is more like
> 'abc&#x200b;&#xa0;&#x200b;def' or in a more readable notation
> 'abc<zwsp><nbsp><zwsp>def'. That is our normal space becomes a
> non-breakable space flanked by zero-width spaces which represent the
> break opportunities. If this is correct the Knuth elements would look
> like:
> glue w=0
> box w=0
> pen +INFINITE
> glue w=<space>
> pen
> glue w=0
> Is this sequence correct? The first and last glue represent the
> <zwsp> and are break opportunities. The box prevents the removal of
> the space if a break is created before the space. The penalty
> prevents the space from being considered as a break opportunity.
> Of course as usual these sequences are further complicated in the
> absence of justification and in the presence of border/padding.
>
> 2. Removal of white space: This is the current behaviour but it works
> only for a single space and not for a sequence of spaces. Actually
> because the algorithm removes leading glues/penalties it is mainly a
> problem for trailing white space. I am not sure how to best tackle
> this. What comes to mind is:
>
> a) Do the same as for leading glues/penalties at the end of the line.
> However I am not sure how tricky it would be to determine the
> boundary because any 'blocking boxes' (see 1. above) are only placed
> before but not after elements. This option suffers from the problem
> that it will not remove leading/trailing white space across inline
> boundaries with border/padding as these generate zero width boxes to
> block removal of the glue elements for the border/padding.
>
> b) Do not generate individual Knuth sequences for each white space
> character but instead collect all consecutive white space and create
> one glue-penalty sequence for it. Again I am uncertain of the
> consequences of doing that. To do that correctly we would need to
> collect white space across inline boundaries. This firstly breaks the
> current getNextKnuth approach which assumes each LM can generate its
> sequences without knowledge of its neighbours. It would also break
> the current area info structures as a single Knuth element could now
> refer to text snippets from different LMs.
>
> Comments please.
>
> > Simon
>
> Thanks
>
Luca wrote a longer response to this but my mail reader doesn't like the
character set (is that topical or what?). Anyway, at the end Luca asks
a question about the UAX#14 line breaking algorithm and its handling
of spaces. My answer to that is:
a) Yes, UAX#14 always breaks at the end of a sequence of spaces
b) But it also says that it assumes any trailing spaces in a line are
being removed
This "conflicts" with XSL-FO, which can force spaces to be retained;
therefore adjustments to the algorithm are necessary to cater for that.
One possible adjustment is simply changing what is given to the
algorithm as indicated above, i.e. <sp> becomes <zwsp><nbsp><zwsp>.
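
To make the adjustment concrete, here is a rough sketch in Python (not FOP code). The names expand_preserved_spaces, Element and drop_leading_discardables are made up for illustration, and the element model is much simpler than the real Knuth element classes. It only shows two things: the <sp> to <zwsp><nbsp><zwsp> substitution, and why a zero-width box in front of a glue keeps that glue from being discarded at the start of a line.

```python
from dataclasses import dataclass

ZWSP = "\u200b"  # zero-width space: a break opportunity with no width
NBSP = "\u00a0"  # no-break space: has width, offers no break opportunity


def expand_preserved_spaces(text):
    """Replace each ordinary space with <zwsp><nbsp><zwsp>."""
    return text.replace(" ", ZWSP + NBSP + ZWSP)


@dataclass(frozen=True)
class Element:
    # Hypothetical, simplified stand-in for a Knuth element.
    kind: str       # "box", "glue" or "penalty"
    width: int = 0


def drop_leading_discardables(line):
    """Drop glue and penalty items at the start of a line.

    Removal stops at the first box, so anything after a box survives,
    even if that box has zero width.
    """
    for i, element in enumerate(line):
        if element.kind == "box":
            return line[i:]
    return []


# An unprotected space glue at the start of a line is removed ...
unprotected = [Element("glue", 3), Element("box", 10)]
# ... while a preceding zero-width box keeps the space glue in place.
protected = [Element("box", 0), Element("penalty"),
             Element("glue", 3), Element("box", 10)]
```

With these definitions, drop_leading_discardables(unprotected) leaves only the final box, while drop_leading_discardables(protected) returns the list unchanged.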


Manuel

> Manuel

In case other people have the same problem with Luca's post here is the 
content:
++++++++ Start Luca's e-mail +++++++++
I like your idea of "expanding" a preserved space into zwsps and nbsp;
this allows us to forget alignments and borders / padding as we just
have to insert the appropriate elements for the non breaking space.

The sequence is very good, as it has a couple of interesting properties:

- it interacts with the surrounding elements via just a single glue element

- if there are two (or more) consecutive, non-collapsed spaces the
sequence has just 3 feasible breaks, not 4

However, I have a doubt: reading the Unicode document about line
breaking, it seems to me that, regardless of the quantity of consecutive
spaces, there is only *one* feasible break, after the last one (Unicode
Standard Annex #14, section 2 "Definitions", in particular the
definitions of "direct break" and "indirect break")

--- begin quoted text ---

Direct Break - a line break opportunity exists between two adjacent
characters of the given line breaking classes. This is indicated in the
rules below as B ÷ A, where B is the character class of the character
before and A is the character class of the character after the break. If
they are separated by one or more space characters, a break opportunity
also exists after the last space. In the pair table, the optional space
characters are not shown.

Indirect Break - a line break opportunity exists between two characters
of the given line breaking classes only if they are separated by one or
more spaces. In this case, a break opportunity exists after the last
space. No break opportunity exists if the characters are immediately
adjacent. This is indicated in the pair table below as B % A, where B is
the character class of the character before and A is the character class
of the character after the break. Even though space characters are not
shown in the pair table, an indirect break can only occur if one or more
spaces follow B. In the notation of the rules in Section 6, Line Breaking
Algorithm this would be represented as two rules: B × A and B SP+ ÷ A.

--- end quoted text ---

I still have not read the document from top to bottom, and I could have
misunderstood even the sections I read :-), but I think this point must 
be clarified before we continue.

Regards
     Luca

++++++++ End Luca's e-mail +++++++++
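
As a rough illustration of the point in the quoted definitions, the following Python sketch lists the break opportunities that spaces alone would produce: exactly one per run of spaces, placed after the last space of the run. The function name breaks_after_space_runs is made up for this example, and the sketch deliberately ignores the pair table and every other line breaking class, so it is a simplification and not an implementation of UAX#14.

```python
def breaks_after_space_runs(text):
    """Return the indices i where a break is allowed just before text[i].

    Only one rule is modelled: a run of one or more spaces offers a
    single break opportunity, after the last space of the run. Trailing
    spaces at the very end of the text produce no break.
    """
    breaks = []
    for i in range(1, len(text)):
        if text[i - 1] == " " and text[i] != " ":
            breaks.append(i)
    return breaks
```

For 'abc' followed by three spaces and 'def', this gives a single break opportunity in front of the 'd', not one per space. That differs from the preserved-space expansion, where every space gets its own pair of break opportunities.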
