In a previous post Joerg pointed to the Unicode Standard Annex #14 on 
Line Breaking (http://www.unicode.org/reports/tr14/) and his initial 
implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.

I have since had a closer look at both UAX#14 and Joerg's code. Because I
liked what I saw, I went about adapting Joerg's code to Unicode 4.1
and added fairly extensive JUnit test cases to it, mainly because it
really helps to go through the various cases mentioned in the
spec in a structured fashion.

The results are now available for public inspection: 
http://people.apache.org/~manuel/fop/linebreak.tar.gz

1. I would like to propose that Unicode conformant line breaking be 
integrated into FOP trunk because it:
a) Moves FOP more towards being a universal formatter and not just a 
formatter for western languages
b) Moves FOP more towards becoming a high quality typesetting system 
(something that was really started by integrating Knuth style breaking)
The reason I think this needs to be voted on is that Unicode line
breaking will in subtle ways change the current line breaking behaviour
and therefore constitutes a (significant) change in FOP's overall
rendering.

2. I would also like to propose that the Unicode conformant line 
breaking be implemented using our own pair-table based implementation 
and not using Java's line breaker, because:
a) It gives us full control and allows FOP to follow the Unicode
standard (and its updates and errata) closely and therefore keep FOP's
Unicode compliance level independent of the Java version.
b) It allows us to tailor the algorithm to match the needs of XSL-FO and 
FOP.
c) It allows us to provide user customisation features (down the track) 
not available through using the Java APIs.
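
To illustrate what "pair-table based" means, here is a minimal sketch. It
is not Joerg's code and not the UAX#14 table: it assumes a toy subset of
four line breaking classes (OP, CL, AL, NU) plus spaces, the class and
method names are made up for this mail, and the table entries are my own
reading of the spec and would need checking against it.

```java
import java.util.ArrayList;
import java.util.List;

public class PairTableSketch {
    // Toy subset of line breaking classes; SP is handled outside the table.
    enum LbClass { OP, CL, AL, NU, SP }

    static final byte DIRECT = 0;     // break allowed between the pair
    static final byte INDIRECT = 1;   // break allowed only if spaces intervene
    static final byte PROHIBITED = 2; // no break, even across spaces

    // Rows: class before the position; columns: class after (OP, CL, AL, NU).
    // Illustrative entries only, not a copy of the UAX#14 pair table.
    static final byte[][] PAIRS = {
        /* OP */ {PROHIBITED, PROHIBITED, PROHIBITED, PROHIBITED},
        /* CL */ {DIRECT,     PROHIBITED, INDIRECT,   INDIRECT},
        /* AL */ {INDIRECT,   PROHIBITED, INDIRECT,   INDIRECT},
        /* NU */ {INDIRECT,   PROHIBITED, INDIRECT,   INDIRECT},
    };

    // Hypothetical classifier; a real one would look up Unicode data.
    static LbClass classify(char c) {
        if (c == ' ') return LbClass.SP;
        if (c == '(') return LbClass.OP;
        if (c == ')') return LbClass.CL;
        if (Character.isDigit(c)) return LbClass.NU;
        return LbClass.AL;
    }

    // Returns the indices before which a line break is permitted.
    static List<Integer> breakOpportunities(String text) {
        List<Integer> breaks = new ArrayList<>();
        LbClass before = null;   // class of the last non-space character
        boolean spaces = false;  // were spaces seen since that character?
        for (int i = 0; i < text.length(); i++) {
            LbClass cur = classify(text.charAt(i));
            if (cur == LbClass.SP) {
                spaces = true;
                continue;
            }
            if (before != null) {
                byte action = PAIRS[before.ordinal()][cur.ordinal()];
                if (action == DIRECT || (action == INDIRECT && spaces)) {
                    breaks.add(i);
                }
            }
            before = cur;
            spaces = false;
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Breaks are allowed before "(cd)" and before "12".
        System.out.println(breakOpportunities("ab (cd) 12")); // prints [3, 8]
    }
}
```

The point is that the whole behaviour sits in one small table plus a
classifier, which is what makes tailoring and user customisation
straightforward.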

Of course there are downsides, like:
a) Are we falling for the 'not invented here' syndrome?
b) Duplicating code which is already in the Java base system
c) Increasing the memory footprint of FOP
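
For comparison, this is roughly what using the Java line breaker looks
like. The class and method names below are made up for this mail; only
java.text.BreakIterator is the real JDK API. Note that the results depend
on the Unicode data shipped with the Java version in use, which is the
dependency point 2a) is about.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JdkLineBreakSketch {
    // Collects all line break boundaries reported by the JDK for the text.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // Boundaries include the start and end of the text.
        System.out.println(boundaries("Hello world"));
    }
}
```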

3. Assuming we get enough +1s for the above proposals, the first item to
decide after that would be: Where should the code live?
a) Joerg would like to see it in Jakarta Commons but hasn't got the time 
to start the project. 
b) Jeremias suggested XMLGraphics Commons. 
c) Personally I think it is too early to factor it out. More experience
with its design and use cases should be gathered before making it
standalone, and at this point in time it really is only 2 core Java
classes. I would like to suggest that it initially lives under FOP in
something like org.apache.fop.text. Should the need and energy levels
(= developer enthusiasm) become available later to make this into a
Jakarta Commons or XMLGraphics Commons project, so be it.

Assuming that this is agreed as well, the next step would be the
more detailed design of the integration. But this is well beyond the 
scope of this e-mail as there are some tricky issues involved and they 
probably need to be tackled in conjunction with the white space 
handling issues. Many of the problems are related to our LayoutManager 
structures which create barriers when it comes to the need to process 
character sequences across those boundaries as is the case for both 
line breaking and white space handling. Add to that the design of the 
different Knuth sequences required to model the different break cases 
in conjunction with conditional border/padding and white space removal 
around line breaking and different types of line justifications and 
there is some real work ahead.

Cheers

Manuel

Should add my votes:

1.) +1
2.) +1
3.c) +1
