Manuel Mall wrote:

So we end up with only two cases to consider: preserve white space and remove white space around a line break created by the Knuth algorithm.

1. Preserve white space: IMO in this case the space itself is actually not a break opportunity; instead there are now two break opportunities, one before the space and one after the space. That is, a sequence like 'abc&#x20;def' is more like 'abc&#x200b;&#xa0;&#x200b;def' or, in a more readable notation, 'abc<zwsp><nbsp><zwsp>def'. In other words, our normal space becomes a non-breakable space flanked by zero-width spaces which represent the break opportunities. If this is correct, the Knuth elements would look like:
 glue w=0
 box w=0
 pen +INFINITE
 glue w=<space>
 pen
 glue w=0
Is this sequence correct? The first and last glue represent the <zwsp> and are break opportunities. The box prevents the removal of the space if a break is created before the space. The penalty prevents the space from being considered as a break opportunity. Of course, as usual, these sequences are further complicated in the absence of justification and in the presence of border/padding.

I like your idea of "expanding" a preserved space into zwsps and an nbsp; this lets us forget about alignments and borders / padding, as we just have to insert the appropriate elements for the non-breaking space.

The sequence is very good, as it has a couple of interesting properties:

- it interacts with the surrounding elements just like a single glue element

- if there are two (or more) consecutive, non-collapsed spaces the sequence has just 3 feasible breaks, not 4
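The count of feasible breaks can be checked with a small sketch. It is only an illustration, not FOP code: the names `Element`, `preserved_space`, `legal_breaks` and `paragraph` are hypothetical, the break rules are the usual Knuth-Plass ones (a glue is a legal break only if it directly follows a box, a penalty only if its value is less than infinite), and the value of the second penalty, which is not given in the sequence above, is assumed to be 0.

```python
from dataclasses import dataclass

INFINITE = 1000  # assumed sentinel standing for an infinite penalty


@dataclass
class Element:
    kind: str         # "box", "glue" or "penalty"
    width: int = 0
    penalty: int = 0  # only meaningful when kind == "penalty"


def preserved_space(space_width):
    """The six-element sequence proposed for one preserved space."""
    return [
        Element("glue", 0),                # <zwsp>: break before the space
        Element("box", 0),                 # keeps the space from being discarded
        Element("penalty", 0, INFINITE),   # forbids a break at the space glue
        Element("glue", space_width),      # the space itself
        Element("penalty", 0, 0),          # ASSUMPTION: value not given in the mail
        Element("glue", 0),                # <zwsp>: break after the space
    ]


def legal_breaks(elements):
    """Indices of legal breakpoints under the classic Knuth-Plass rules."""
    breaks = []
    for i, el in enumerate(elements):
        if el.kind == "glue" and i > 0 and elements[i - 1].kind == "box":
            breaks.append(i)
        elif el.kind == "penalty" and el.penalty < INFINITE:
            breaks.append(i)
    return breaks


def paragraph(n_spaces, space_width=3):
    """A word box, n preserved spaces, then another word box."""
    elements = [Element("box", 10)]
    for _ in range(n_spaces):
        elements += preserved_space(space_width)
    elements.append(Element("box", 10))
    return elements


one_space = legal_breaks(paragraph(1))
two_spaces = legal_breaks(paragraph(2))
print(len(one_space), len(two_spaces))
```

Under these assumptions a single space yields two legal breaks and two consecutive spaces yield three, because the trailing zero-width glue of the first sequence and the leading one of the second do not each produce a break of their own.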

However, I have a doubt: reading the Unicode document about line breaking, it seems to me that, regardless of the number of consecutive spaces, there is only *one* feasible break, after the last one (Unicode Standard Annex #14, section 2 "Definitions", in particular the definitions of "direct break" and "indirect break"):

--- begin quoted text ---

Direct Break - a line break opportunity exists between two adjacent characters of the given line breaking classes. This is indicated in the rules below as B ÷ A, where B is the character class of the character before and A is the character class of the character after the break. If they are separated by one or more space characters, a break opportunity also exists after the last space. In the pair table, the optional space characters are not shown.

Indirect Break - a line break opportunity exists between two characters of the given line breaking classes only if they are separated by one or more spaces. In this case, a break opportunity exists after the last space. No break opportunity exists if the characters are immediately adjacent. This is indicated in the pair table below as B % A, where B is the character class of the character before and A is the character class of the character after the break. Even though space characters are not shown in the pair table, an indirect break can only occur if one or more spaces follow B. In the notation of the rules in Section 6, Line Breaking Algorithm, this would be represented as two rules: B × A and B SP+ ÷ A.

--- end quoted text ---
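To make my reading of these definitions concrete, here is a deliberately simplified sketch. It assumes that every non-space character belongs to a class pair that allows only an indirect break (as between two ordinary letters), and it ignores all the other line breaking classes and rules; the function name `break_opportunities` is hypothetical.

```python
def break_opportunities(text):
    """Return each index i such that a line break is allowed before text[i].

    Simplified reading of the "indirect break" definition: a break exists
    only after the last space of a run of spaces, never inside the run
    and never between two adjacent non-space characters.
    """
    breaks = []
    for i in range(1, len(text)):
        if text[i - 1] == " " and text[i] != " ":
            breaks.append(i)
    return breaks


single = break_opportunities("abc def")
double = break_opportunities("abc  def")
print(single, double)
```

With this reading, 'abc def' and 'abc  def' (two spaces) each have exactly one break opportunity, placed just before 'd'. The sketch only restates the quoted definitions; it does not settle how preserved white space should be handled.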

I still have not read the document from top to bottom, and I could have misunderstood even the sections I read :-), but I think this point must be clarified before we continue.

Regards
    Luca
