Re: UAX#14 implementation
Nice work, Manuel! That will be a great addition to FOP.

I have never studied the problem in detail, so I can only give a general opinion. But I think we should follow the Unicode standard as closely as possible, even if that leads to behavior incompatible with the current one. The Unicode standard seems to be designed to handle all sorts of high-level typographical issues nicely. It would be great to be able to say that FOP is Unicode compliant, and users could refer to a well-known, well-defined standard if they want to understand FOP's behavior or achieve special effects.

So, by all means, go for it!

Vincent

Manuel Mall wrote:
On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
It's looking OK so far and most of the layout engine tests pass.

Just discovered the first instance of an existing testcase which gives a different result.

Here is another one: The current FOP implementation treats spaces other than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space), as suppressible around line breaks. I believe that is incorrect, as the spec explicitly limits whitespace handling to the normal space U+0020. The test case which shows this is block_white-space_4.xml. It tests for specific Knuth element sequences, which are now different because these spaces are now treated as not suppressible.

After making the appropriate adjustment to the checks in that testcase ALL testcases are now passing!

<snip/>

Manuel
Re: UAX#14 implementation
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
It's looking OK so far and most of the layout engine tests pass.

Just discovered the first instance of an existing testcase which gives a different result. Under UAX#14 the following text (note this is plain text, not FO markup!):

text-align=center .conditionality=retain linefeed-treatment=preserve.

which appears in inline_border_padding_conditionality_2.xml, has only a single break opportunity, which is before the word linefeed-treatment. The space between center and .conditionality is not a break opportunity because it precedes a punctuation character (rule LB13). In our existing code this space is a valid break opportunity, and under the specific circumstances of this test that gives a different layout result.

I don't think this is actually a problem, but it is a noticeable difference. It just shows that UAX#14 is designed to break typical written text and not programming language code, which this text snippet resembles.

<snip/>

Manuel
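To make the LB13 effect concrete, here is a minimal sketch of the space handling only. It is not the FOP code: the class name, the method names and the reduced punctuation set are assumptions for illustration, and hyphens and all other UAX#14 rules are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the space handling discussed above; not FOP's actual code. */
class SpaceBreakSketch {

    // Hypothetical, heavily reduced stand-in for the UAX#14 classes that
    // forbid a break before them even after spaces (CL, EX, IS, SY in LB13).
    static boolean forbidsBreakBefore(char c) {
        return ".,:;!?)]}/".indexOf(c) >= 0;
    }

    /** Returns the indices at which a new line may start after a space. */
    static List<Integer> breaksAfterSpaces(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            char prev = text.charAt(i - 1);
            char cur = text.charAt(i);
            // A space normally allows a break, unless the next visible
            // character belongs to a class that forbids a break before it.
            if (prev == ' ' && cur != ' ' && !forbidsBreakBefore(cur)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "text-align=center .conditionality=retain linefeed-treatment=preserve.";
        for (int i : breaksAfterSpaces(s)) {
            System.out.println("break before: " + s.substring(i));
        }
    }
}
```

With the sample text, the only space that survives as a break opportunity is the one before linefeed-treatment; the space before .conditionality is rejected because the next character is a full stop.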
Re: UAX#14 implementation
Manuel Mall wrote:
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
It's looking OK so far and most of the layout engine tests pass.

Just discovered the first instance of an existing testcase which gives a different result. Under UAX#14 the following text (note this is plain text, not FO markup!):

text-align=center .conditionality=retain linefeed-treatment=preserve.

which appears in inline_border_padding_conditionality_2.xml, has only a single break opportunity, which is before the word linefeed-treatment. The space between center and .conditionality is not a break

Interesting. Just to clarify: are you saying that in the previous release, 0.92beta, the line breaking code identified 2 BP, but in 0.93 just the one BP is identified?

<snip/>

Chris
Re: UAX#14 implementation
On Wednesday 20 December 2006 23:22, Chris Bowditch wrote:
Manuel Mall wrote:
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
It's looking OK so far and most of the layout engine tests pass.

Just discovered the first instance of an existing testcase which gives a different result. Under UAX#14 the following text (note this is plain text, not FO markup!):

text-align=center .conditionality=retain linefeed-treatment=preserve.

which appears in inline_border_padding_conditionality_2.xml, has only a single break opportunity, which is before the word linefeed-treatment. The space between center and .conditionality is not a break

Interesting. Just to clarify: are you saying that in the previous release, 0.92beta, the line breaking code identified 2 BP, but in 0.93 just the one BP is identified?

Not quite. What I am saying is that in the current FOP trunk version 2 break points are identified, but in my local UAX#14 version of FOP only one break point is identified. After looking through the UAX#14 specification, the behaviour of my implementation appears to be correct.

<snip/>

Chris

Manuel
Re: UAX#14 implementation
On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
It's looking OK so far and most of the layout engine tests pass.

Just discovered the first instance of an existing testcase which gives a different result.

Here is another one: The current FOP implementation treats spaces other than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space), as suppressible around line breaks. I believe that is incorrect, as the spec explicitly limits whitespace handling to the normal space U+0020. The test case which shows this is block_white-space_4.xml. It tests for specific Knuth element sequences, which are now different because these spaces are now treated as not suppressible.

After making the appropriate adjustment to the checks in that testcase ALL testcases are now passing!

<snip/>

Manuel
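The changed behaviour can be pictured with a small sketch. This is only an illustration of the rule as described in this message, assuming that nothing but U+0020 is suppressible; the class and method names are hypothetical and do not correspond to anything in the TextLayoutManager.

```java
/** Illustration of "only U+0020 is suppressible around a line break". */
class SuppressibleSpaceSketch {

    /** Assumption taken from the message above: only the normal space qualifies. */
    static boolean isSuppressibleAtLineBreak(char c) {
        return c == '\u0020';
    }

    /** Hypothetical helper: drops suppressible spaces at both ends of a line. */
    static String trimLine(String line) {
        int start = 0;
        int end = line.length();
        while (start < end && isSuppressibleAtLineBreak(line.charAt(start))) {
            start++;
        }
        while (end > start && isSuppressibleAtLineBreak(line.charAt(end - 1))) {
            end--;
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // The trailing normal space is dropped, the trailing thin space is kept.
        System.out.println(trimLine("word ").length());       // normal space
        System.out.println(trimLine("word\u2009").length());  // U+2009 THIN SPACE
    }
}
```

Under the old behaviour described above, U+2009 and U+200A would have been removed at the line end as well, which is why the expected Knuth element sequences in block_white-space_4.xml had to be adjusted.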
Re: UAX#14 implementation
Manuel Mall wrote:
After making the appropriate adjustment to the checks in that testcase ALL testcases are now passing!

Wonderful! I'm really looking forward to seeing this great new feature!

Just a couple of questions concerning the differences with respect to the old implementation (I must confess I read the Unicode Annex quite quickly ...):

Just discovered the first instance of an existing testcase which gives a different result. Under UAX#14 the following text (note this is plain text, not FO markup!):

text-align=center .conditionality=retain linefeed-treatment=preserve.

which appears in inline_border_padding_conditionality_2.xml, has only a single break opportunity, which is before the word linefeed-treatment. The space between center and .conditionality is not a break

Does this happen because that space is just before a "."?

Another question: why aren't the "-" signs in text-align and linefeed-treatment possible breaks?

Regards
Luca
Re: UAX#14 implementation
Luca Furini wrote:
After making the appropriate adjustment to the checks in that testcase ALL testcases are now passing!

Wonderful!

Me too!

text-align=center .conditionality=retain ...

Does this happen because that space is just before a "."?

The dot (FULL STOP) has property IS, which prevents a break before it after any character, even after a space. Interesting, I didn't remember this.

Another question: why aren't the "-" signs in text-align and linefeed-treatment possible breaks?

They should be: the dash in Unicode 5.0 has the property HY, which allows for a break after it. The tables I generated were for 4.1 (or even 4.0) and might have to be updated; I haven't checked. UAX#14 has been updated too, which might have changed the pair table (chap. 7.3), which is, oddly enough, part of the report instead of a data file.

Links:
http://www.unicode.org/reports/tr14/
http://www.unicode.org/Public/UNIDATA/LineBreak.txt

J.Pietschmann
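For readers who have not looked at the pair table, a rough sketch of how such a lookup works follows. It covers only three classes (AL, HY, IS), and the cell values are my own reading of the published table, so please treat them as assumptions and check them against the report. The names are hypothetical and unrelated to the classes in the FOP patch.

```java
/** Tiny excerpt of a UAX#14 style pair table lookup; illustrative only. */
class PairTableSketch {

    enum Action { DIRECT, INDIRECT, PROHIBITED }

    enum Cls { AL, HY, IS }

    // TABLE[before][after]; values are assumptions, see the note above.
    static final Action[][] TABLE = {
        /* AL */ { Action.INDIRECT, Action.INDIRECT, Action.PROHIBITED },
        /* HY */ { Action.DIRECT,   Action.INDIRECT, Action.PROHIBITED },
        /* IS */ { Action.INDIRECT, Action.INDIRECT, Action.PROHIBITED },
    };

    /**
     * DIRECT: break allowed with or without spaces in between.
     * INDIRECT: break allowed only if spaces separate the two characters.
     * PROHIBITED: no break, even across spaces.
     */
    static boolean canBreak(Cls before, Cls after, boolean spacesBetween) {
        switch (TABLE[before.ordinal()][after.ordinal()]) {
            case DIRECT:
                return true;
            case INDIRECT:
                return spacesBetween;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // "center .conditionality": letter, space, full stop
        System.out.println(canBreak(Cls.AL, Cls.IS, true));
        // "retain linefeed": letter, space, letter
        System.out.println(canBreak(Cls.AL, Cls.AL, true));
        // "text-align": break after the hyphen, not before it
        System.out.println(canBreak(Cls.HY, Cls.AL, false));
        System.out.println(canBreak(Cls.AL, Cls.HY, false));
    }
}
```

Read this way, both answers above fall out of the same lookup: the IS column prohibits the break before the dot even across a space, and the HY row allows a direct break after the hyphen.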
Re: UAX#14 implementation
On Thursday 21 December 2006 06:08, J.Pietschmann wrote:
Luca Furini wrote:
<snip/>
Another question: why aren't the "-" signs in text-align and linefeed-treatment possible breaks?

They should be: the dash in Unicode 5.0 has the property HY, which allows for a break after it. The tables I generated were for 4.1 (or even 4.0) and might have to be updated; I haven't checked. UAX#14 has been updated too, which might have changed the pair table (chap. 7.3), which is, oddly enough, part of the report instead of a data file.

My mistake: the code correctly generates break opportunities after the HYPHEN-MINUS U+002D. I didn't notice because the breaker didn't choose them, probably because of their higher penalty value. I tested again with words like:

Rindfleisch-etikettierungs-überwachungs-aufgaben-übertragungs-gesetz
Donau-dampf-schiffahrts-elektrizitaeten-hauptbetriebswerk-bauunterbeamten-gesellschaft

and it breaks correctly after the hyphens.

J.Pietschmann

Manuel
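As a rough illustration of what "break opportunities after the hyphen" means, here is a small sketch. It is not the FOP breaker: the names are hypothetical, penalties are not modelled, and the only refinement is the assumption (rule LB25, HY before NU) that a hyphen directly followed by a digit does not offer a break.

```java
import java.util.ArrayList;
import java.util.List;

/** Lists candidate break positions after HYPHEN-MINUS; illustrative only. */
class HyphenBreakSketch {

    /** Returns the indices at which a new line could start after a hyphen. */
    static List<Integer> breaksAfterHyphens(String word) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i + 1 < word.length(); i++) {
            boolean hyphen = word.charAt(i) == '-';
            // Assumption: no break between a hyphen and a following digit.
            boolean digitFollows = Character.isDigit(word.charAt(i + 1));
            if (hyphen && !digitFollows) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String word = "Donau-dampf-schiffahrts-elektrizitaeten-"
                + "hauptbetriebswerk-bauunterbeamten-gesellschaft";
        for (int i : breaksAfterHyphens(word)) {
            System.out.println(word.substring(0, i) + " | " + word.substring(i));
        }
    }
}
```

Whether the line breaker actually takes one of these opportunities still depends on the penalties, which is consistent with the hyphen breaks not showing up in the first test.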
Re: UAX#14 implementation
Attached is a sample from the test case I am developing. The ..._old.pdf file shows the current fop-trunk behaviour, while the ..._new.pdf file shows what happens in the FOP UAX#14 version. There are quite a few subtle differences (mostly for the better, I hope).

I also attach the test case file (it has no checks yet) in case someone would like to study the FO or, even better, would like to add to it, comment on it or improve it.

Manuel

Attachment: block_linebreaking_old.pdf (Adobe PDF document)

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!-- $Id: inline_border_padding_hyphenate_de.xml 426576 2006-07-28 15:44:37Z jeremias $ -->
<testcase>
  <info>
    <p>
      This test checks some of the UAX#14 breaking rules.
    </p>
  </info>
  <fo>
    <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:svg="http://www.w3.org/2000/svg">
      <fo:layout-master-set>
        <fo:simple-page-master master-name="normal" page-width="2.5in" page-height="10in" margin="5pt">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="normal" white-space-collapse="true">
        <fo:flow flow-name="xsl-region-body" font-size="10pt">
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            BA -- Break Opportunity After (A)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            VeryLongWordWithAThinSpace&#x2009;PutInTheMiddleOfIt
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            BB -- Break Opportunity Before (B)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            VeryLongWordWithAnAcuteAccent&#x00B4;PutInTheMiddleOfIt
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            B2 -- Break Opportunity Before and After (B/A)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            &#x2014;Very&#x2014;Long&#x2014;Word&#x2014;With&#x2014;LotsOf&#x2014;Em&#x2014;Dashes&#x2014;Put&#x2014;InBetween&#x2014;And&#x2014;Around&#x2014;
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            B2 -- Break Opportunity Before and After (B/A) - as before but don't break between consecutive Em Dashes
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            AVeryLongWordWithSomeDouble&#x2014; &#x2014;Dashes&#x2014; &#x2014;Put&#x2014; &#x2014;In
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            AVeryLongWordWithSomeDouble&#x2014;&#x2014;Dashes&#x2014;&#x2014;Put&#x2014;&#x2014;In
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            CL -- Closing Punctuation (XB)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Closing )brackets )even )if )preceeded )by )a )space )are )not )a )break )point
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            EX -- Exclamation / interrogation (XB)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Question ?marks ?and exclamation !marks !even ?if !preceeded ?by !a ?space !are ?not !a ?break !point
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            HY -- Hyphen Minus (XA)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Very-Long-Word-With-Lots-Of-Hyphens-Put-In-Between
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Hyphens-in-numbers-do-not-123-567-890-break
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            ID -- Ideographic (B/A)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 0pt 0pt">
            Need&#x3000;A&#x3000;Proper&#x3000;Test&#x3000;Case&#x3000;For&#x3000;This&#x3000;As&#x3000;Here&#x3000;Only&#x3000;The&#x3000;Ideographic&#x3000;Space&#x3000;Is&#x3000;Used
          </fo:block>
UAX#14 implementation
Just a quick heads up that I finally took the plunge to add UAX#14 line breaking to FOP. This is based on code donated by Joerg quite some time ago, on which I did some work in October 2005. This was documented on the list at the time.

One of the major stumbling blocks in progressing this was the conflict between the recursive / nested getNextKnuthElement calls and the need to do the UAX#14 line breaking processing across inline boundaries. In the end I decided, in the interest of making at least some progress in this area, not to attempt the 'all singing all dancing solution', but to simply apply this to the TextLayoutManager only. Yes, that gives us only limited new functionality, but hopefully it's still an improvement. Also, the code is based on the Unicode 4.1 standard and not 5.0, but that can be fixed later.

It's looking OK so far and most of the layout engine tests pass.

The change consists of a new package, org.apache.fop.text.linebreak, containing two classes, and changes to the TextLayoutManager. Nothing else has been touched so far. It's not ready for a commit yet, but hopefully will be in a few days. The question that arises is whether this should go into the planned release, or whether that is too risky and I should wait with the commit until the release is out, or do it in a branch.

Another issue is that one of the two new files is actually generated by a little Java program (also from Joerg) from Unicode data files. While it would be a 'nice to have' for this generation to be integrated into the FOP build, I would initially commit the generated file into the repository. To integrate the generation into the build we would either need to have the Unicode data files in the Apache repository (not sure about licensing issues here), or the build would need to fetch those files, causing an external dependency which usually is a hassle for people behind corporate firewalls etc. That's why I propose to apply the KISS principle initially.

Manuel
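To give an idea of what the generation step has to read, here is a minimal sketch of parsing entries in the LineBreak.txt format (a code point or a start..end range, a semicolon, the class, and an optional # comment). This is not Joerg's generator: the class and method names are hypothetical, and the sample lines are written in the style of the data file rather than copied from a specific version.

```java
import java.util.ArrayList;
import java.util.List;

/** Parses LineBreak.txt style entries; a sketch, not the actual generator. */
class LineBreakDataSketch {

    /** One entry: an inclusive code point range and its line break class. */
    static final class Range {
        final int start;
        final int end;
        final String lbClass;

        Range(int start, int end, String lbClass) {
            this.start = start;
            this.end = end;
            this.lbClass = lbClass;
        }
    }

    /** Returns the parsed entry, or null for blank and comment-only lines. */
    static Range parseLine(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return null;
        }
        String[] fields = data.split(";");
        if (fields.length < 2) {
            return null;
        }
        // The first field is either "XXXX" or "XXXX..YYYY" in hexadecimal.
        String[] bounds = fields[0].trim().split("\\.\\.");
        int start = Integer.parseInt(bounds[0].trim(), 16);
        int end = bounds.length > 1 ? Integer.parseInt(bounds[1].trim(), 16) : start;
        return new Range(start, end, fields[1].trim());
    }

    public static void main(String[] args) {
        String[] sample = {
            "# comment line",
            "0020;SP # SPACE",
            "002D;HY # HYPHEN-MINUS",
            "0030..0039;NU # DIGIT ZERO..DIGIT NINE",
        };
        List<Range> ranges = new ArrayList<>();
        for (String line : sample) {
            Range r = parseLine(line);
            if (r != null) {
                ranges.add(r);
            }
        }
        for (Range r : ranges) {
            System.out.println(Integer.toHexString(r.start) + ".."
                    + Integer.toHexString(r.end) + " -> " + r.lbClass);
        }
    }
}
```

Committing the generated file first keeps a parser like this out of the regular build, so nobody needs the Unicode data files just to compile FOP.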