Manuel Mall wrote:
So we end up with only two cases to consider: preserve white space and
remove white space around a line break created by the Knuth algorithm.
1. Preserve white space: IMO in this case the space itself is actually
not a break opportunity but there are now two break opportunities: one
before the space and one after the space. That is a sequence like
'abc def' is more like 'abc​ ​def' or in a more
readable notation 'abc<zwsp><nbsp><zwsp>def'. That is our normal space
becomes a non-breakable space flanked by zero-width spaces which
represent the break opportunities. If this is correct the Knuth
elements would look like:
glue w=0
box w=0
pen +INFINITE
glue w=<space>
pen
glue w=0
Is this sequence correct? The first and last glue represent the <zwsp>
and are break opportunities. The box prevents the removal of the space
if a break is created before the space. The penalty prevents the space
from being considered as a break opportunity.
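To make the role of the zero-width box concrete, here is a minimal sketch in Python. It assumes the usual Knuth-Plass behaviour that, after a break, glues and penalties at the start of the next line are discarded up to the first box. The Element type, the widths, and the value 0 for the bare "pen" are illustrative assumptions, not taken from the actual code.

```python
from collections import namedtuple

# kind is "box", "glue" or "penalty"; value is a width or a penalty value.
Element = namedtuple("Element", "kind value")

INFINITE = float("inf")
SPACE_WIDTH = 3  # hypothetical width of one space
PENALTY = 0      # assumption: the value of the bare "pen" is not stated


def preserved_space(with_box=True):
    """Element sequence for one preserved space, optionally without the box."""
    seq = [Element("glue", 0)]
    if with_box:
        seq.append(Element("box", 0))
    seq += [
        Element("penalty", INFINITE),
        Element("glue", SPACE_WIDTH),
        Element("penalty", PENALTY),
        Element("glue", 0),
    ]
    return seq


def line_start_after_break(elements, break_index):
    """Elements that open the next line after a break at break_index.

    The break element itself is consumed, then glues and penalties are
    skipped until the first box is found.
    """
    i = break_index + 1
    while i < len(elements) and elements[i].kind != "box":
        i += 1
    return elements[i:]


def kept_space_width(with_box):
    """Total glue width left at the start of the line after breaking
    before the space (at the first zero-width glue, index 1)."""
    elements = ([Element("box", 10)] + preserved_space(with_box)
                + [Element("box", 10)])
    rest = line_start_after_break(elements, 1)
    return sum(e.value for e in rest if e.kind == "glue")
```

Under these assumptions kept_space_width(True) returns the full space width, while kept_space_width(False) returns 0: without the box the space is discarded together with the other leading glues and penalties.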
Of course as usual these sequences are further complicated in the
absence of justification and in the presence of border/padding.
I like your idea of "expanding" a preserved space into zwsps and nbsp;
this allows us to forget about alignment and border / padding, as we just
have to insert the appropriate elements for the non-breaking space.
The sequence is very good, as it has a couple of interesting properties:
- it interacts with the surrounding elements just as a single glue element
- if there are two (or more) consecutive, non-collapsed spaces the
sequence has just 3 feasible breaks, not 4
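The count of feasible breaks can be checked with a small Python sketch. It assumes the classic Knuth-Plass legality rule (a glue is a legal break only when directly preceded by a box, a penalty only when its value is less than infinite) and a value of 0 for the bare "pen"; both are assumptions for illustration, and the real implementation may apply different rules.

```python
INFINITE = float("inf")


def preserved_space(penalty=0, space_width=3):
    """Elements for one preserved space as (kind, value) pairs.

    penalty=0 and space_width=3 are hypothetical values.
    """
    return [
        ("glue", 0),
        ("box", 0),
        ("penalty", INFINITE),
        ("glue", space_width),
        ("penalty", penalty),
        ("glue", 0),
    ]


def feasible_breaks(elements):
    """Indices of legal breaks under the assumed Knuth-Plass rule."""
    breaks = []
    for i, (kind, value) in enumerate(elements):
        if kind == "glue" and i > 0 and elements[i - 1][0] == "box":
            breaks.append(i)
        elif kind == "penalty" and value < INFINITE:
            breaks.append(i)
    return breaks


def words_separated_by(n_spaces):
    """Two word boxes separated by n_spaces preserved spaces."""
    seq = [("box", 10)]
    for _ in range(n_spaces):
        seq += preserved_space()
    seq.append(("box", 10))
    return seq
```

With these assumptions, len(feasible_breaks(words_separated_by(1))) is 2 and len(feasible_breaks(words_separated_by(2))) is 3: the two adjacent zero-width glues between the spaces do not add a break of their own. Note that under this rule the break after each space is carried by the finite penalty rather than by the trailing glue.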
However, I have a doubt: reading the Unicode document about line breaking,
it seems to me that, regardless of the quantity of consecutive spaces,
there is only *one* feasible break, after the last one (Unicode Standard
Annex #14, section 2 "Definitions", in particular the definitions of
"direct break" and "indirect break"):
--- begin quoted text ---
Direct Break - a line break opportunity exists between two adjacent
characters of the given line breaking classes. This is indicated in the
rules below as B ÷ A, where B is the character class of the character
before and A is the character class of the character after the break. If
they are separated by one or more space characters, a break opportunity
also exists after the last space. In the pair table, the optional space
characters are not shown.
Indirect Break - a line break opportunity exists between two characters of
the given line breaking classes only if they are separated by one or more
spaces. In this case, a break opportunity exists after the last space. No
break opportunity exists if the characters are immediately adjacent. This
is indicated in the pair table below as B % A, where B is the character
class of the character before and A is the character class of the
character after the break. Even though space characters are not shown in
the pair table, an indirect break can only occur if one or more spaces
follow B. In the notation of the rules in Section 6, Line Breaking
Algorithm this would be represented as two rules: B × A and B SP+ ÷ A.
--- end quoted text ---
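As I read these definitions, the effect on runs of spaces could be sketched in Python as follows. This is a deliberately simplified illustration: it ignores the line breaking classes and the pair table completely, and treats every pair of non-space characters as an indirect break (B % A). The function name is hypothetical.

```python
def break_opportunities(text):
    """Offsets at which a line break is allowed, looking only at spaces.

    Each run of one or more spaces yields exactly one opportunity,
    located after the last space of the run.
    """
    offsets = []
    for i in range(1, len(text)):
        if text[i - 1] == " " and text[i] != " ":
            offsets.append(i)
    return offsets
```

For example, break_opportunities("abc def") gives [4] and break_opportunities("abc   def") gives [6]: three consecutive spaces still produce a single break opportunity, after the last one.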
I have not yet read the document from top to bottom, and I may have
misunderstood even the sections I did read :-), but I think this point must
be clarified before we continue.
Regards
Luca