Hi Dario,

This is an interesting study. There is probably room for improvement, as an adjustment ratio of 20, even as a last resort, is really high. More below.

Dario Laera wrote:
> Hi all,
>
> I found the reason why breaking paragraphs into short lines is really
> slow and memory hungry: the threshold of the adjustment ratio, set to 20
> at the last try, is too high and makes EVERY legal breakpoint a feasible
> breakpoint too. A check should be performed to avoid such situation and
> to choose then a better threshold.
>
> For example, if I have a 2 columns in A4 page layout the line width is
> ~140000. The glue stretchability, as far as I can see in TextLM class,
> is often set to "3 * LineLayoutManager.DEFAULT_SPACE_WIDTH" that is
> equal to ~10000. When you compute the adj ratio for a line that have
> just one glue you get r = 140000/10000 = 14, that is lower than the
> threshold = 20.0, so an active node is added.
>
> A better threshold can be chosen as follow: let idealDifference be a
> reasonable size we choose as good threshold. We can assume "3 *
> LineLayoutManager.DEFAULT_SPACE_WIDTH" as default stretchability a
> compute a better threshold in that way:
>
> idealRatio = idealDifference / (3 *
> LineLayoutManager.DEFAULT_SPACE_WIDTH);
>
> and bound that value:
>
> 1.0 <= idealRatio <= 20.
>
> How to choose idealDifference? A naive solution, but probably not so
> bad, can be:
>
> idealDifference = iLineWidth / 2;
>
> A more sophisticated, maybe too much sophisticated, solution can choose
> it by looking at the average box length: we can see how many average box
> can fit a line (wordsPerLine) and execute:
>
> avgWord = avgBox + LineLayoutManager.DEFAULT_SPACE_WIDTH;
> idealDifference = iLineWidth - (avgWord * (wordsPerLine / 2));

I’m not sure I’m following you here. What’s the value of wordsPerLine? Is it set manually to a value that’s considered to be a reasonable one?
Because if it’s computed automatically, the formula can be simplified:

    wordsPerLine = lineWidth / avgWord, so
    idealDifference = lineWidth - lineWidth / 2 = lineWidth / 2

Anyway, the adjustment ratio is already a notion that is independent of the line width; that’s precisely the purpose of a ratio. In the case of left-justified mode, the only available stretchability is due to the space at the end of the line; the question is to determine how large we allow that space to be... Ok, by writing that I think I know what you mean now :-)

But the issue should probably be considered the other way around: the problem is not so much the adjustment ratio as the amount of space allowed at the end of the line. In the case of narrow columns, that “3 times the width of a space character” is too big with respect to the line width. Instead of having a fixed value, it should be changed into a small proportion of the line width.

Originally, that 3 * space-width value was probably chosen for “normal” line widths, that is, lines containing an optimal number of words. I’ve read somewhere that the optimal number of letters per line is 60. Taking the Times font, the average width of lowercase letters is 459, so the optimal line width is roughly 459*60 = 27540. The width of the space character is 250, so 3 space characters (750) at the end of a line make about 2.7% of that line. So let’s go for an elastic space of 3% of the line width, and then we can always choose the same adjustment ratio; the number of active nodes would be “automatically” limited, whatever the line width.

> > Do you mean that this last try is /always/ performed (even when we
> > already have a set of feasible breaks)?
>
> It's not always performed (so it's formally correct), but in my tests
> it's rarely avoided, more precisely just once, with the file
> "my_franklin_rep-jus.fo" that is composed of many paragraph in 1 column
> with justified text.
> What I think (obviously, I may be wrong, as it has
> been proved in other mails ;) is that another intermediate try, with
> a judicious threshold, can be performed, leading to the same final
> result but with much better performance, if this intermediate try
> doesn't fail like the previous.
> Anyway I always run my tests with hyphenation enabled, I should try
> disabling to see if the second try is run with threshold=5 and if this
> doesn't fails.

The two-column case is not surprising: the columns are too narrow, which makes line-breaking particularly challenging. The one-column left-justified case surprises me a bit, however. I would have expected that the text could be broken without even needing hyphenation. I find it a bit ironic that justifying text is actually easier for the line-breaking algorithm...

At any rate, that adjustment ratio of 20 for the last run is surely too much. It can probably be reduced to 5. Actually, I’m not even sure a third run with a high adjustment ratio is desirable. Maybe we should simply re-run the algorithm in forcing mode, and accept the underfull lines that will be introduced.

If you could run statistics on more real-life documents (how often the first run without hyphenation is sufficient, how often the third run is required, for justified and left-aligned text, single- and two-column layouts on A4 paper, etc.), that would be fantastic.

Thanks,
Vincent
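
PS: To make the proportional-stretch idea a bit more concrete, here is a minimal Java sketch. It is not FOP code: the class EndOfLineStretch and its methods are hypothetical names made up for illustration, and the widths are only the approximate figures quoted in this thread (line width ~140000, fixed stretch ~10000). It also assumes the worst case of an almost empty line, where the shortfall approaches the full line width.

```java
/** Hypothetical helper for illustration only; not part of FOP. */
final class EndOfLineStretch {

    /** Largest shortfall (line width minus natural width) accepted at a given threshold. */
    static double maxShortfall(double stretch, double threshold) {
        return stretch * threshold;
    }

    /**
     * True if even an (almost) empty line stays under the threshold,
     * which means every legal breakpoint becomes a feasible one.
     */
    static boolean everyBreakFeasible(double lineWidth, double stretch, double threshold) {
        return maxShortfall(stretch, threshold) >= lineWidth;
    }

    /** End-of-line stretch defined as a fraction of the line width (the 3% idea). */
    static double proportionalStretch(double lineWidth, double fraction) {
        return lineWidth * fraction;
    }

    public static void main(String[] args) {
        double lineWidth = 140000;    // approximate two-column A4 line width from the thread
        double fixedStretch = 10000;  // approximate 3 * DEFAULT_SPACE_WIDTH from the thread
        double threshold = 20.0;      // adjustment ratio used for the last try

        // Fixed stretch: 10000 * 20 = 200000 >= 140000, so nothing is filtered out.
        System.out.println(everyBreakFeasible(lineWidth, fixedStretch, threshold));

        // Proportional stretch: the accepted shortfall is 0.03 * 20 = 60% of the line,
        // whatever the line width, so nearly empty lines are rejected.
        double stretch = proportionalStretch(lineWidth, 0.03);
        System.out.println(everyBreakFeasible(lineWidth, stretch, threshold));
    }
}
```

With a threshold of 5 instead of 20, the same 3% stretch would accept a shortfall of at most 15% of the line width, again independently of the column width.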