Re: UAX#14 implementation

2006-12-21 Thread Vincent Hennebert
Nice work, Manuel! That will be a great addition to Fop.

I have never studied the problem in detail, so I can only give a general
opinion. But I think we should follow the Unicode standard as closely as
possible, even if that leads to behaviors incompatible with the current
one. It seems the Unicode standard is designed to nicely handle all
sorts of high-level typographical issues. It would be great to be able
to say Fop is Unicode compliant. And users can refer to a well-known,
well-defined standard if they want to understand Fop's behavior or
achieve special effects.

So, by all means, go for it!

Vincent


Manuel Mall wrote:
 On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
 On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
 <snip/>

 It's looking OK so far and most of the layout engine tests pass.
 Just discovered the first instance of an existing testcase which
 gives a different result.
 
 Here is another one: The current FOP implementation treats spaces other 
 than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space) as 
 suppressible around line breaks. I believe that is incorrect as the 
 spec explicitly limits whitespace handling to the normal space U+0020. 
 The test case which shows that is block_white-space_4.xml. It tests for 
 specific Knuth element sequences which are now different because these 
 spaces are now treated as not suppressible.
 
 After making the appropriate adjustment to the checks in that testcase 
 ALL testcases are now passing!
 
 <snip/>

 Manuel


Re: UAX#14 implementation

2006-12-20 Thread Manuel Mall
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>

 It's looking OK so far and most of the layout engine tests pass. 

Just discovered the first instance of an existing testcase which gives a 
different result. Under UAX#14 the following text (Note this is plain 
text not FO markup!):

text-align=center .conditionality=retain 
linefeed-treatment=preserve.

which appears in inline_border_padding_conditionality_2.xml has only a 
single break opportunity, which is before the word linefeed-treatment. 
The space between center and .conditionality is not a break 
opportunity because it precedes a punctuation character (rule LB13). In 
our existing code this space is a valid break opportunity, and under the 
specific circumstances this gives a different layout result.

I don't think this is actually a problem but it is a noticeable 
difference. It just shows that UAX#14 is designed to break typical 
written text and not programming language code which this text snippet 
resembles.
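
To make the difference concrete, here is a minimal sketch in Python (not FOP code) that compares "every space is a break opportunity" with a space rule that respects LB13. The character set NO_BREAK_BEFORE and both function names are assumptions made up for this illustration; the real line break classes come from the Unicode data files, and hyphen handling is deliberately left out here.

```python
# Hand-picked stand-ins for the classes CL, EX, IS and SY (assumption for
# illustration only; the real classes are defined in LineBreak.txt).
NO_BREAK_BEFORE = set(")]}!?.,;:/")


def space_breaks_naive(text):
    """Offsets at which a new line may start if every space allows a break."""
    return [
        i + 1
        for i, ch in enumerate(text[:-1])
        if ch == " " and text[i + 1] != " "
    ]


def space_breaks_lb13(text):
    """Same as above, but no break before punctuation, even after a space."""
    return [i for i in space_breaks_naive(text) if text[i] not in NO_BREAK_BEFORE]


text = "text-align=center .conditionality=retain linefeed-treatment=preserve."
old = space_breaks_naive(text)   # two opportunities, one before ".conditionality"
new = space_breaks_lb13(text)    # only the one before "linefeed-treatment"
```

With the sample text, the naive version offers a break before .conditionality and before linefeed-treatment, while the LB13-aware version only keeps the second one.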

<snip/>
 Manuel

Manuel


Re: UAX#14 implementation

2006-12-20 Thread Chris Bowditch

Manuel Mall wrote:


On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>

It's looking OK so far and most of the layout engine tests pass. 



Just discovered the first instance of an existing testcase which gives a 
different result. Under UAX#14 the following text (Note this is plain 
text not FO markup!):


text-align=center .conditionality=retain 
linefeed-treatment=preserve.


which appears in inline_border_padding_conditionality_2.xml has only a 
single break opportunity which is before the word linefeed-treatment. 
The space between center and .conditionality is not a break 


Interesting. Just to clarify: are you saying that in the previous 
release 0.92beta the line breaking code identified 2 BP but in 0.93 just 
the one BP is identified?


<snip/>

Chris





Re: UAX#14 implementation

2006-12-20 Thread Manuel Mall
On Wednesday 20 December 2006 23:22, Chris Bowditch wrote:
 Manuel Mall wrote:
  On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
  <snip/>
 
 It's looking OK so far and most of the layout engine tests pass.
 
  Just discovered the first instance of an existing testcase which
  gives a different result. Under UAX#14 the following text (Note
  this is plain text not FO markup!):
 
  text-align=center .conditionality=retain
  linefeed-treatment=preserve.
 
  which appears in inline_border_padding_conditionality_2.xml has
  only a single break opportunity which is before the word
  linefeed-treatment. The space between center and .conditionality
  is not a break

 Interesting. Just to clarify: are you saying that in the previous
 release 0.92beta the line breaking code identified 2 BP but in 0.93
 just the one BP is identified?

Not quite. What I am saying is that in the current FOP trunk two 
break points are identified, but in my local UAX#14 version of FOP only 
one break point is identified. After looking through the UAX#14 
specification, the behaviour of my implementation appears to be correct.

 <snip/>

 Chris

Manuel


Re: UAX#14 implementation

2006-12-20 Thread Manuel Mall
On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
 On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
 <snip/>

  It's looking OK so far and most of the layout engine tests pass.

 Just discovered the first instance of an existing testcase which
 gives a different result.

Here is another one: The current FOP implementation treats spaces other 
than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space) as 
suppressible around line breaks. I believe that is incorrect as the 
spec explicitly limits whitespace handling to the normal space U+0020. 
The test case which shows that is block_white-space_4.xml. It tests for 
specific Knuth element sequences which are now different because these 
spaces are now treated as not suppressible.

After making the appropriate adjustment to the checks in that testcase 
ALL testcases are now passing!
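
As a rough illustration of what changes (a sketch in Python, not the FOP code), the snippet below trims suppressible characters at the edges of a broken line. The two sets and the function name are assumptions for this example: OLD_SUPPRESSIBLE only lists the characters mentioned above, not everything the old code may have treated as suppressible.

```python
# Characters dropped around a line break. Both sets are illustrative
# assumptions based on the description above, not copied from FOP.
OLD_SUPPRESSIBLE = {"\u0020", "\u2009", "\u200A"}  # space, thin space, hair space
NEW_SUPPRESSIBLE = {"\u0020"}                      # normal space only


def trim_at_line_break(line, suppressible):
    """Remove suppressible characters from both ends of a broken line."""
    start, end = 0, len(line)
    while start < end and line[start] in suppressible:
        start += 1
    while end > start and line[end - 1] in suppressible:
        end -= 1
    return line[start:end]


line = "VeryLongWordWithAThinSpace\u2009"
kept_old = trim_at_line_break(line, OLD_SUPPRESSIBLE)  # thin space removed
kept_new = trim_at_line_break(line, NEW_SUPPRESSIBLE)  # thin space kept
```

Under the new rule the thin space survives at the end of the line, which is why the expected Knuth element sequence in the test case had to be adjusted.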

 <snip/>

Manuel


Re: UAX#14 implementation

2006-12-20 Thread Luca Furini

Manuel Mall wrote:

After making the appropriate adjustment to the checks in that testcase 
ALL testcases are now passing!


Wonderful!

I'm really looking forward to seeing this great new feature!

Just a couple of doubts concerning the differences with respect to the old 
implementation (I must confess I read the Unicode Annex quite quickly 
...):



Just discovered the first instance of an existing testcase which
gives a different result. Under UAX#14 the following text (Note
this is plain text not FO markup!):
text-align=center .conditionality=retain linefeed-treatment=preserve.
which appears in inline_border_padding_conditionality_2.xml has
only a single break opportunity which is before the word
linefeed-treatment. The space between center and .conditionality
is not a break


Does this happen because that space is just before a .?

Another doubt: why aren't the - signs in text-align and 
linefeed-treatment possible breaks?



Regards
Luca


Re: UAX#14 implementation

2006-12-20 Thread J.Pietschmann

Luca Furini wrote:
After making the appropriate adjustment to the checks in that testcase 
ALL testcases are now passing!


Wonderful!


Me too!
text-align=center .conditionality=retain 

...

Does this happen because that space is just before a .?


The dot (FULL STOP) has property IS, which prevents a break before it
after any character, even after a space. Interesting, I didn't
remember this.

Another doubt: why aren't the - signs in text-align and 
linefeed-treatment possible breaks?


They should be, the dash in Unicode 5.0 has the property HY, which
allows for a break after. The tables I generated were for 4.1 (or
even 4.0) and might have to be updated, I haven't checked.
UAX#14 has been updated too, which might have changed the pair
table (chap. 7.3), which is, oddly enough, part of the report instead
of a data file.

Links:
 http://www.unicode.org/reports/tr14/
 http://www.unicode.org/Public/UNIDATA/LineBreak.txt
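
For readers who have not looked at the pair table approach, here is a small sketch in Python of how such a table can drive the decision. Everything in it is an assumption for illustration: line_break_class is a hand-made stand-in for the LineBreak.txt data, and PAIRS is a four-class excerpt written from memory, so the entries should be checked against chapter 7.3 of the report before relying on them.

```python
DIRECT, INDIRECT, PROHIBITED = "_", "%", "^"


def line_break_class(ch):
    """Hand-made stand-in for the LineBreak.txt lookup (illustration only)."""
    if ch == " ":
        return "SP"
    if ch == ".":
        return "IS"
    if ch == "-":
        return "HY"
    if "0" <= ch <= "9":
        return "NU"
    return "AL"


# PAIRS[before][after]: excerpt only, to be verified against the report.
PAIRS = {
    "AL": {"AL": "%", "NU": "%", "HY": "%", "IS": "^"},
    "NU": {"AL": "%", "NU": "%", "HY": "%", "IS": "^"},
    "HY": {"AL": "_", "NU": "%", "HY": "%", "IS": "^"},
    "IS": {"AL": "%", "NU": "%", "HY": "%", "IS": "^"},
}


def break_opportunities(text):
    """Offsets at which a new line may start, according to the excerpt."""
    breaks = []
    prev = None      # class of the last non-space character seen
    spaces = False   # whether spaces occurred since that character
    for i, ch in enumerate(text):
        cls = line_break_class(ch)
        if cls == "SP":
            spaces = True
            continue
        if prev is not None:
            action = PAIRS[prev][cls]
            # Direct break: always allowed. Indirect: only across spaces.
            # Prohibited: never, not even across spaces.
            if action == DIRECT or (action == INDIRECT and spaces):
                breaks.append(i)
        prev, spaces = cls, False
    return breaks
```

With this excerpt the sample text gets opportunities after each hyphen and before linefeed-treatment, but none at the space before the dot, because every entry with IS as the second class is a prohibited break.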

J.Pietschmann


Re: UAX#14 implementation

2006-12-20 Thread Manuel Mall
On Thursday 21 December 2006 06:08, J.Pietschmann wrote:
 Luca Furini wrote:
<snip/>
  Another doubt: why aren't the - signs in text-align and
  linefeed-treatment possible breaks?

 They should be, the dash in Unicode 5.0 has the property HY, which
 allows for a break after. The tables I generated were for 4.1 (or
 even 4.0) and might have to be updated, I haven't checked.
 UAX#14 has been updated too, which might have changed the pair
 table (chap. 7.3), which is, oddly enough, part of the report instead
 of a data file.

My mistake, the code correctly generates break opportunities after the 
HYPHEN-MINUS U+002D. I didn't notice because the breaker didn't choose 
them, probably because of their higher penalty value. I tested again 
with words like:
Rindfleisch-etikettierungs-überwachungs-aufgaben-übertragungs-gesetz
Donau-dampf-schiffahrts-elektrizitaeten-hauptbetriebswerk-bauunterbeamten-gesellschaft
and it breaks correctly after the hyphens.
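
The effect of the penalty can be sketched as follows (Python, purely illustrative). The cost function is only loosely modelled on Knuth-style demerits and is not FOP's actual one; widths are counted in characters, and the penalty value 200 is made up for the example.

```python
def choose_break(opportunities, width):
    """Pick the break offset for the first line.

    opportunities: list of (offset, penalty) pairs, offsets in characters.
    Returns the feasible offset with the lowest cost, or None if nothing fits.
    """
    feasible = [(off, pen) for off, pen in opportunities if off <= width]
    if not feasible:
        return None

    def cost(opportunity):
        off, pen = opportunity
        # Unused room on the line, squared, plus the penalty of the break.
        return (width - off) ** 2 + pen

    return min(feasible, key=cost)[0]


text = "text-align=center .conditionality=retain linefeed-treatment=preserve."
at_space = text.index("linefeed")           # opportunity after the space
after_hyphen = at_space + len("linefeed-")  # opportunity after the hyphen

# Without a penalty the hyphen break fills the line better and wins.
unpenalised = choose_break([(at_space, 0), (after_hyphen, 0)], width=52)
# With a (hypothetical) penalty on the hyphen break, the space break wins.
penalised = choose_break([(at_space, 0), (after_hyphen, 200)], width=52)
```

So a break opportunity can exist and still never show up in the output as long as a cheaper break is available nearby, which is consistent with what I saw in the first test.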


 J.Pietschmann
Manuel


Re: UAX#14 implementation

2006-12-20 Thread Manuel Mall
Attached is a sample from the test case I am developing.

The ..._old.pdf file shows the current fop-trunk behaviour while 
the ..._new.pdf file shows what happens in the FOP UAX#14 version.

There are quite a few subtle differences (mostly for the better I hope).

I also attach the test case file (it has no checks yet) in case someone 
would like to study the FO or, even better, add to it, comment on it, or 
improve it.

Manuel


block_linebreaking_old.pdf
Description: Adobe PDF document
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!-- $Id: inline_border_padding_hyphenate_de.xml 426576 2006-07-28 15:44:37Z jeremias $ -->
<testcase>
  <info>
    <p>
      This test checks some of the UAX#14 breaking rules.
    </p>
  </info>
  <fo>
    <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:svg="http://www.w3.org/2000/svg">
      <fo:layout-master-set>
        <fo:simple-page-master master-name="normal" page-width="2.5in" page-height="10in" margin="5pt">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="normal" white-space-collapse="true">
        <fo:flow flow-name="xsl-region-body" font-size="10pt">
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            BA -- Break Opportunity After (A)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            VeryLongWordWithAThinSpace&#x2009;PutInTheMiddleOfIt
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            BB -- Break Opportunity Before (B)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            VeryLongWordWithAnAcuteAccent&#x00B4;PutInTheMiddleOfIt
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            B2 -- Break Opportunity Before and After (B/A)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            &#x2014;Very&#x2014;Long&#x2014;Word&#x2014;With&#x2014;LotsOf&#x2014;Em&#x2014;Dashes&#x2014;Put&#x2014;InBetween&#x2014;And&#x2014;Around&#x2014;
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            B2 -- Break Opportunity Before and After (B/A) - as before but don't break between consecutive Em Dashes
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            AVeryLongWordWithSomeDouble&#x2014; &#x2014;Dashes&#x2014; &#x2014;Put&#x2014; &#x2014;In
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            AVeryLongWordWithSomeDouble&#x2014;&#x2014;Dashes&#x2014;&#x2014;Put&#x2014;&#x2014;In
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            CL -- Closing Punctuation (XB)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Closing )brackets )even )if )preceeded )by )a )space )are )not )a )break )point
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            EX -- Exclamation / interrogation (XB)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Question ?marks ?and exclamation !marks !even ?if !preceeded ?by !a ?space !are ?not !a ?break !point
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            HY -- Hyphen Minus (XA)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Very-Long-Word-With-Lots-Of-Hyphens-Put-In-Between
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 3pt 0pt">
            Hyphens-in-numbers-do-not-123-567-890-break
          </fo:block>
          <fo:block background-color="silver" font-size="8pt" margin="3pt 0pt 0pt 0pt">
            ID -- Ideographic (B/A)
          </fo:block>
          <fo:block background-color="yellow" margin="0pt 0pt 0pt 0pt">
            Need&#x3000;A&#x3000;Proper&#x3000;Test&#x3000;Case&#x3000;For&#x3000;This&#x3000;As&#x3000;Here&#x3000;Only&#x3000;The&#x3000;Ideographic&#x3000;Space&#x3000;Is&#x3000;Used
          </fo:block>
  

UAX#14 implementation

2006-12-19 Thread Manuel Mall
Just a quick heads up that I finally took the plunge to add UAX#14 line 
breaking to FOP. This is based on code donated by Joerg quite some time 
ago on which I did some work in October 2005. This was documented 
on the list at the time.

One of the major stumbling blocks in progressing this was the conflict 
between the recursive / nested getNextKnuthElement calls and the need 
to do the UAX#14 line breaking processing across inline boundaries.

In the end I decided, in the interest of making at least some progress 
in this area, to not attempt the 'all singing all dancing solution', 
but to simply apply this to the TextLayoutManager only. Yes, that gives 
us only limited new functionality, but hopefully it's still an 
improvement. Also, the code is based on the Unicode 4.1 standard and 
not 5.0 but that can be fixed later.

It's looking OK so far and most of the layout engine tests pass. The 
change consists of a new package org.apache.fop.text.linebreak 
containing two classes and changes to the TextLayoutManager. Nothing 
else has been touched so far.

It's not ready for a commit yet, but hopefully will be in a few days.

The question that arises is whether this should go into the planned 
release, or whether that is too risky and I should either wait with the 
commit until the release is out or do it in a branch.

Another issue is that one of the two new files is actually generated by 
a little Java program (also from Joerg) from Unicode data files. While 
it would be a 'nice to have' for this generation to be integrated into 
the FOP build I would initially commit the generated file into the 
repository. To integrate the generation into the build we would either 
need to have the Unicode data files in the Apache repository (not sure 
about licensing issues here) or the build would need to fetch those 
files, causing an external dependency which usually is a hassle for 
people behind corporate firewalls etc. That's why I propose to apply 
the KISS principle initially.
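
To give an idea of what the generation step involves, here is a minimal sketch in Python (the actual generator is Joerg's Java program, which this does not reproduce). SAMPLE is a hand-written excerpt in the layout of LineBreak.txt, and the function names and the shape of the generated table are assumptions for this example.

```python
# Hand-written excerpt in the LineBreak.txt layout:
#   code point or range ; class # comment
SAMPLE = """\
# excerpt for illustration only
0020;SP # SPACE
002D;HY # HYPHEN-MINUS
002E;IS # FULL STOP
0030..0039;NU # DIGIT ZERO..DIGIT NINE
"""


def parse_line_break_data(lines):
    """Map each code point to its line break class."""
    table = {}
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code_points, line_class = (part.strip() for part in data.split(";"))
        first, _, last = code_points.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = line_class
    return table


def render_table_source(table):
    """Render the table as source text, i.e. the 'generated file'."""
    rows = ",\n".join(
        f'    0x{cp:04X}: "{cls}"' for cp, cls in sorted(table.items())
    )
    return "LINE_BREAK_CLASS = {\n" + rows + ",\n}\n"


table = parse_line_break_data(SAMPLE.splitlines())
generated = render_table_source(table)
```

Committing the rendered output, rather than running this at build time, is what keeps the build free of the external download.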

Manuel