Luca Furini wrote:
> Vincent wrote:
>
>>> The LineLM tries, in the first instance, to avoid using hyphenation
>>> points, so the penalty is not taken into account. But this has the
>>> side effect of using the first glue element as a feasible break (if
>>> the penalty were a feasible break too, it would surely be a better
>>> one, thus preventing the glue from being effectively chosen).
>>
>> I don't follow you: IIUC the glue-penalty-glue triplet is generated
>> only the second time, when the first breaking doesn't give acceptable
>> results? What do you mean by "the penalty is not taken into account"?
>
> No, the sequence is always the same: from the beginning it represents
> the hyphenation points too, but at the first call to
> findBreakingPoints() there is a parameter saying that only
> non-hyphenated breaks should be looked at.
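To make that two-pass behaviour concrete, here is a minimal sketch (hypothetical classes, not FOP's actual element model) of how a pass flag can hide hyphenation penalties from the first call to the breaking algorithm while the element sequence itself stays unchanged:

```java
// Sketch only: a simplified element, not FOP's KnuthElement hierarchy.
class Element {
    final boolean isPenalty;
    final boolean isHyphenated; // true for penalties at hyphenation points

    Element(boolean isPenalty, boolean isHyphenated) {
        this.isPenalty = isPenalty;
        this.isHyphenated = isHyphenated;
    }
}

class BreakFilter {
    /**
     * Returns true if the element may be used as a break in this pass.
     * The sequence is built once; only the filter changes between passes.
     */
    static boolean isFeasible(Element e, boolean allowHyphenation) {
        if (e.isPenalty && e.isHyphenated && !allowHyphenation) {
            return false; // first pass: skip hyphenation points entirely
        }
        return true;
    }
}
```

The point is that the second pass costs nothing extra to set up: the hyphenation points are already in the sequence, they were merely filtered out.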
Doesn't that have an impact on performance? I know we are no longer at
the time when TeX was created but, still, shouldn't it be more
efficient to perform hyphenation only when the first pass has failed?

>> Also, I don't see why the penalty would be preferred over the glue,
>> as it has a positive penalty value.
>
> Choosing the glue as a break has the effect of losing its stretch and
> shrink, so the adjustment ratio and the demerits would be higher. Now
> that you make me think of it, this is surely true when the penalty
> value is 0, but could be false otherwise ... so we could check the
> penalty value and add the additional penalty only if it's > 0.

No, I would say always add the penalty, because the first glue should
never be chosen as a break point. Or?

>>> want to prevent this break from happening, as we can now use
>>> zero-width spaces to explicitly insert breaking positions?
>>
>> Good point. I'd say yes for '/'. This would add a burden to the
>> user, who would have to modify the FO generation step to add ZWSP
>> for URLs or filenames; but we must also take into account cases
>> where the user does /not/ want the word to be split at '/'
>> characters.
>
> Ok
>
>> For hyphens I would keep the current behavior, as this is the most
>> expected one IMO. And it can also be prevented by adding a
>> zero-width non-breaking space.
>
> I'm afraid that, at the moment, a zero-width non-breaking space after
> a "-" would not prevent the break from happening ... and it would not
> be completely trivial to handle (as the hyphen could be the last
> character of an inline, and the zwnbsp the first of another one).
>
> Maybe we could move the handling of hyphens to the hyphenation phase,
> when the text is collected from all the inline LMs.

Oh, I didn't know that the handling of zwnbsp wasn't that simple.
Well, still, I can't think of a case where the user would want text
not to be broken after a hyphen.
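On the glue-vs-penalty point above: a small sketch, using the standard Knuth-Plass adjustment-ratio and demerits formulas in simplified form (not FOP's actual code), of why breaking at the penalty usually beats breaking at the preceding glue even when the penalty value is positive. Breaking at the glue ends the line before it, so the glue's stretch is lost to that line; breaking at the penalty keeps it.

```java
// Simplified Knuth-Plass cost functions; "sketch" values, not FOP's.
class LineCost {
    /** Adjustment ratio for a loose line: (target - natural) / stretch. */
    static double adjustmentRatio(double natural, double stretch,
                                  double target) {
        return (target - natural) / stretch;
    }

    /** Simplified demerits: (1 + 100*|r|^3 + p)^2, p = penalty value. */
    static double demerits(double r, double penalty) {
        double d = 1 + 100 * Math.pow(Math.abs(r), 3) + penalty;
        return d * d;
    }
}
```

For example, with natural width 95, target 100, and 10 units of stretch of which 5 belong to the glue: breaking at the penalty gives r = 0.5 and demerits(0.5, 50) = 4032.25, while breaking at the glue gives r = 1.0 and demerits(1.0, 0) = 10201. So even a penalty value of 50 leaves the penalty break cheaper, which supports "always add the penalty".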
> Another question: should the hyphen characters in the text be
> feasible breaks even if hyphenation is disabled?

I'd say yes. In French, at least, there are lots of compound words
that are perfectly acceptable to break even when hyphenation is
disabled. Such breaks should simply be discouraged by setting a
positive penalty value (the same as for other hyphens). When we
hyphenate words it's as if we were adding soft hyphens at some places
in the input text -- these wouldn't be the same hyphens.

> At the moment, hyphen characters and hyphenation points are handled
> in the same way, so a hyphen in the text can be a break only if
> hyphenate="true", and only from the second call to
> findBreakingPoints() onwards.
>
>> Yes, 20 is probably too much. Perhaps we also need to differentiate
>> the case where no acceptable line-breaking can be found because a
>> box is too long to even fit alone on one line. In such a case even
>> a very high max ratio won't help.
>
> I agree.
>
> I'm thinking about how this could be done in an efficient way:
> lowering the threshold during the execution of the breaking
> algorithm is not that simple (pruning the list of active nodes may
> not be enough, as it could become empty).

That's where an improved restarting algorithm would do the trick, I
guess.

>> Yes, the current mechanism doesn't seem to be good enough, but I'm
>> wondering if we can find a better one. Currently a
>> too-short/too-long node replaces another one if it has fewer
>> demerits. The number of lines/pages handled so far isn't taken into
>> account. So it is likely that a too-short/too-long node ending an
>> earlier line/page will be preferred over a node going further in
>> the Knuth sequence. Why should that be the case?
>> In fact the main problem, I think, is to find the right heuristic
>> for selecting too-short/too-long nodes, in order to end up with the
>> most acceptable result. Easy to say...
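On the explicit-hyphen question above: a hypothetical sketch of the proposed behaviour (the classes and the penalty value 50 are assumptions, not FOP's actual code or constants), where an explicit '-' always yields a discouraged break opportunity while soft hyphenation points depend on the hyphenate setting:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposal: explicit hyphens break regardless of
// hyphenation, soft hyphens only when hyphenation is enabled.
class PenaltyGen {
    static final int HYPHEN_PENALTY = 50; // assumed value, not FOP's

    /** Returns the penalty values of the break opportunities in 'text'. */
    static List<Integer> breakPenalties(String text, boolean hyphenate) {
        List<Integer> penalties = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (c == '-') {
                // explicit hyphen: always a (discouraged) feasible break
                penalties.add(HYPHEN_PENALTY);
            } else if (c == '\u00AD' && hyphenate) {
                // soft hyphen: a break only when hyphenation is enabled
                penalties.add(HYPHEN_PENALTY);
            }
        }
        return penalties;
    }
}
```

With this, a compound like "porte-monnaie" keeps its break opportunity even with hyphenate="false", which is the behaviour argued for above.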
> The use of lastDeactivated should lead to some improvements:
> lastDeactivated is (already) updated using the compareNodes()
> method, which compares the node positions first, and then the
> demerits.
>
> At the moment, my understanding of this matter is this: lastTooShort
> and lastTooLong should be used only when the algorithm couldn't find
> any good break since the last restart; otherwise lastDeactivated is
> probably the best restarting point, as it allows the creation of a
> few good lines/pages.

I'd have to think more about it, but:
- perhaps the compareNodes() method should compare the line/page
  numbers of the nodes rather than their indices in the Knuth
  sequence, or some mix of the two;
- if you restart from the last deactivated node, you can be sure that
  immediately afterwards you'll have to restart from the last
  too-short/too-long node, because no feasible break will be found
  (otherwise the list of active nodes wouldn't have been emptied).

>> Also, may I suggest that you look at the Temp_Floats branch, and
>> perhaps even work on it instead of trunk? I've made quite heavy
>> changes to the breaking code that might be difficult to merge back
>> into the trunk if there are also changes there.
>
> Oops, you are right!
> I'm going to look at it and work on it.

Thanks,
Vincent
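The suggested comparison could look like the sketch below; everything here (fields, ordering) is an illustration of the idea, not FOP's actual compareNodes(). It prefers the node that has produced more lines/pages, then the one farther along the Knuth sequence, and only then the one with fewer demerits:

```java
// Sketch: a restart candidate in the breaking algorithm.
class Node {
    final int lineNumber;   // lines/pages laid out so far
    final int position;     // index in the Knuth element sequence
    final double demerits;

    Node(int lineNumber, int position, double demerits) {
        this.lineNumber = lineNumber;
        this.position = position;
        this.demerits = demerits;
    }
}

class NodeComparator {
    /** Returns the better restart candidate of the two. */
    static Node better(Node a, Node b) {
        if (a.lineNumber != b.lineNumber) {
            return a.lineNumber > b.lineNumber ? a : b; // more good lines
        }
        if (a.position != b.position) {
            return a.position > b.position ? a : b; // farther in sequence
        }
        return a.demerits <= b.demerits ? a : b; // finally, cheaper
    }
}
```

Under this ordering a node that has completed three lines wins over one that has completed two, even if the latter has fewer demerits, which matches the intuition that restarting should preserve as many good lines/pages as possible.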