Re: Initial soft hyphen support

2007-01-13 Thread Andreas L Delmelle

On Jan 13, 2007, at 10:31, Manuel Mall wrote:

Hi Manuel,


> Just committed the initial support for the soft hyphen.


Nice job, thanks!


> As we had two in favour of having the SHY always produce a break
> opportunity and only one against that's the route I took.
>
> I had no luck with giving the SHY a reduced penalty and have the Knuth
> algorithm favour them before normal hyphenation breaks. Even with a
> penalty value of 1 fop still chooses the hyphenation break with a
> penalty of 50. Either I do something wrong or I misunderstand how the
> Knuth breaking calculation is supposed to work. Maybe one of the Knuth
> experts can have a look at this PLEASE.


Well, I'm still not really an expert, but as I'm beginning to  
understand more and more, what you altered was the base Knuth element  
generation, right?


IIUC, a possible solution may be to treat SHY as special *only* if  
hyphenation is turned off.
The reasoning being that, if hyphenate is true, then handling the SHY  
becomes the hyphenator's job. The SHY character will be presented to  
the hyphenator simply as a character of the word it appears in. The  
hyphenator should then be smart enough to recognize this as a special  
character, and do something like: create a hyphenation point for the  
SHY, and try to hyphenate the parts before and after the SHY as  
separate words...
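
A minimal sketch of the idea above, assuming a hypothetical `autoHyphenator` callback that maps a SHY-free fragment to break offsets within that fragment. None of these names are FOP's actual hyphenation API; the sketch only illustrates "one hyphenation point per SHY, plus separate hyphenation of the parts before and after it".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper, not part of FOP. */
final class ShyAwareHyphenator {

    /**
     * Returns break offsets relative to the word with all SHYs removed.
     * Each SHY contributes one break; each fragment between SHYs is
     * hyphenated separately by the supplied (assumed) autoHyphenator.
     */
    static List<Integer> breakPoints(String word,
                                     Function<String, List<Integer>> autoHyphenator) {
        String[] parts = word.split("\u00AD", -1);
        int total = 0;
        for (String part : parts) {
            total += part.length();
        }
        List<Integer> result = new ArrayList<>();
        int offset = 0; // start of the current fragment in the SHY-free word
        for (int i = 0; i < parts.length; i++) {
            for (int p : autoHyphenator.apply(parts[i])) {
                result.add(offset + p);
            }
            offset += parts[i].length();
            boolean shyFollows = i < parts.length - 1;
            boolean inside = offset > 0 && offset < total;
            boolean duplicate = !result.isEmpty()
                    && result.get(result.size() - 1) == offset;
            if (shyFollows && inside && !duplicate) {
                result.add(offset); // the break contributed by the SHY itself
            }
        }
        return result;
    }
}
```

With a callback that returns no breaks at all, this degenerates to "SHY positions only", which is the behaviour one would want when hyphenation is turned off.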



HTH!

Andreas


Re: Unicode soft hyphen and hyphenation

2007-01-13 Thread Simon Pepping
On Sat, Jan 13, 2007 at 08:27:20PM +0900, Manuel Mall wrote:
> On Saturday 13 January 2007 19:57, Vincent Hennebert wrote:

> > Well, again, the description of the "hyphenate" property (§7.9.4)
> > sounds clear to me: when false, "Hyphenation may not be used in the
> > line-breaking algorithm".
> >
> I still think this can be interpreted both ways. It clearly forbids 
> formatter generated hyphenation but does it also suppress user 
> specified hyphenation?
> 
> In HTML there is no hyphenation, but browsers are expected to honor the
> SHY, that is, treat it as a possible line break and, if chosen, put a
> hyphen there; otherwise the SHY is discarded. Given that XSL-FO is derived
> from the HTML/CSS rendering model, one could argue that this is the
> default behaviour the XSL-FO authors most likely intended. If not, it
> would be difficult to construct an FO document that behaves similarly to
> HTML with respect to hyphenation and the SHY.

I agree with Manuel here: SHY should always be taken into account, and
always represents a linebreak opportunity.

> > 
> >
> > To summarize, my opinion is that:
> > - if "hyphenate" = false, no automatic hyphenation is performed, and
> >   soft hyphens are discarded
> > - if "hyphenate" = true, automatic hyphenation is performed, except
> > for any word that contains soft hyphens, in which case the soft
> > hyphens are used to create legal breakpoints.

I am not sure about this one.

Note that there is another way to let users override the automatic
hyphenation results. It is the equivalent of TeX's \hyphenation
command, which contains a list of fully hyphenated words which are
effectively added to the list of exceptions in the hyphenation
patterns. Every renderer has the freedom to provide a way for users to
specify such a list. This has nothing to do with the spec. It is part
of the hyphenation services of the renderer.
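
As a rough illustration of such an exception list, here is a sketch assuming a hypothetical `HyphenationExceptions` class that is consulted before the patterns. The class and its methods are made up for this example and do not describe FOP's hyphenation services.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical user-supplied exception list, in the spirit of TeX's \hyphenation. */
final class HyphenationExceptions {
    private final Map<String, List<Integer>> exceptions = new HashMap<>();

    /** Registers a fully hyphenated word such as "man-u-script". */
    void add(String hyphenated) {
        List<Integer> breaks = new ArrayList<>();
        StringBuilder plain = new StringBuilder();
        for (char c : hyphenated.toCharArray()) {
            if (c == '-') {
                breaks.add(plain.length()); // break before the next letter
            } else {
                plain.append(c);
            }
        }
        exceptions.put(plain.toString().toLowerCase(Locale.ROOT), breaks);
    }

    /**
     * Returns the registered break offsets, or null if the word is not
     * listed and the pattern-based hyphenator should be used instead.
     */
    List<Integer> lookup(String word) {
        return exceptions.get(word.toLowerCase(Locale.ROOT));
    }
}
```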

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: svn commit: r494416 - /xmlgraphics/fop/trunk/src/java/org/apache/fop/fo/flow/Table.java

2007-01-13 Thread Andreas L Delmelle

On Jan 9, 2007, at 15:22, [EMAIL PROTECTED] wrote:


> Author: vhennebert
> Date: Tue Jan  9 06:21:59 2007
> New Revision: 494416
>
> URL: http://svn.apache.org/viewvc?view=rev&rev=494416
> Log:
> In relaxed validation mode, it should be acceptable to have
> fo:table-footer /after/ fo:table-body


Just a little note about this. I'm not vetoing this change but I  
would certainly not recommend it. Allowing a table-footer as last  
element in the table drastically increases the complexity of layout  
of multi-page tables.


True, in the current design, the whole table is present when layout  
begins (so that means flow.Table.tableFooter will be non-null if  
there is a footer), but consider the difficulties if we were to alter  
this interaction, and begin layout for the Table before the end of  
the table is reached. At that point, the footer might not yet be  
present, and we would again be stuck in a situation where we need to  
have the entire table in memory...


Just a thought.


Cheers,

Andreas



Unicode issues

2007-01-13 Thread Manuel Mall
After having delved into the UAX#14 and SHY issues I am interested in 
compiling a FOP UNICODE issues list. That is a list of things that 
still require work to make FOP Unicode compliant. Obviously the best 
place for such a list is the wiki. But before doing this I am 
interested in some even more informal collection of items on this 
mailing list. Here is a start:

1. The biggest ticket item is probably writing modes and bidi. This is 
clearly a big subproject on its own and outside of the scope I am 
attempting to cover.

2. Unicode text boundaries (UAX#29), especially word boundaries. Do we 
need this? It does not determine the word breaks to which the word 
spacing property is applied, as this is determined by the 
treat-as-word-space property. It could be used to determine the words 
for hyphenation.

3. Normalisation (UAX#15): Do we need this? Do we need to feed words in 
some normalised form to the hyphenation? Other uses for this?

4. Treatment of combining forms: What should / must we do with those 
character combinations?

5. Formatting control: Word joiners etc. These need at least to be 
discarded and not given to the renderers. Obviously proper handling 
when it comes to breaking and similar decisions is required for full 
conformance.
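
For items 3 and 5, a minimal sketch using only standard JDK classes (`java.text.Normalizer` and `java.lang.Character`). The class and method names are hypothetical, and whether NFC is the right form to feed to the hyphenation patterns is an assumption, not something settled in this thread.

```java
import java.text.Normalizer;

/** Hypothetical helpers; not existing FOP code. */
final class UnicodeCleanup {

    /** Item 3: compose the word (NFC) before handing it to the hyphenator. */
    static String normalizeForHyphenation(String word) {
        return Normalizer.normalize(word, Normalizer.Form.NFC);
    }

    /**
     * Item 5: drop characters of general category Cf (format controls such
     * as U+2060 WORD JOINER) before text is passed to a renderer.
     * Note that U+00AD SOFT HYPHEN is also Cf, so this would have to run
     * only after line breaking has made use of it.
     */
    static String stripFormatControls(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.getType(cp) != Character.FORMAT)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```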

Anything else that comes to mind? Please add / comment. If it is enough 
material I'll put it on the wiki.

Thanks

Manuel


Re: FOP Memory issues (fwd from fop-users)

2007-01-13 Thread Andreas L Delmelle

On Jan 12, 2007, at 13:00, [EMAIL PROTECTED] wrote:



> Just a quick comment on this, given the above stats you might
> consider having a single Object field and switch the type of object
> from the actual child (when there is a single child) to a List of
> children when there are multiple children:


Also an interesting suggestion, thanks!
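
For reference, the quoted suggestion could look roughly like the sketch below. `CompactParent` and `ChildList` are hypothetical names, not FOP classes; the private list subclass is only there so that a child which happens to be a List itself is not mistaken for the child container, and null children are assumed not to occur.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical node that allocates a list only for two or more children. */
final class CompactParent {

    /** Marker type so a child that is itself a List is not misread. */
    private static final class ChildList extends ArrayList<Object> { }

    /** null (no children), the single child itself, or a ChildList. */
    private Object children;

    void addChild(Object child) {
        if (children == null) {
            children = child;                      // first child: no list yet
        } else if (children instanceof ChildList) {
            ((ChildList) children).add(child);     // already a list
        } else {
            ChildList list = new ChildList();      // second child: switch to a list
            list.add(children);
            list.add(child);
            children = list;
        }
    }

    List<Object> getChildren() {
        if (children == null) {
            return Collections.emptyList();
        }
        if (children instanceof ChildList) {
            return Collections.unmodifiableList((ChildList) children);
        }
        return Collections.singletonList(children);
    }
}
```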

I'm still wondering whether the use of separate lists could not be  
avoided altogether.
Based on Jörg's statistics, I'd say that the number of children will  
most likely never reach the level where using direct index-based  
access (ArrayList) has its benefits over traversing a tree of  
references (LinkedList).


On top of that, all we really seem to need further in the code, is an  
iterator over that list, not the list itself...


I'm thinking, roughly all we need is something like:
...
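// needs: import java.util.Iterator; import java.util.NoSuchElementException;
class Node {
  Node parent;
  Node firstChild;
  Node[] siblings; // if there are siblings, always length=2; [1] is the next sibling
  // ...
}

class ChildIterator implements Iterator<Node> {
  private Node nextNode; // next child to return, or null when exhausted

  ChildIterator(Node n) {
    nextNode = n.firstChild;
  }

  public boolean hasNext() {
    return nextNode != null;
  }

  public Node next() {
    if (nextNode == null) {
      throw new NoSuchElementException();
    }
    Node current = nextNode;
    // advance along the "next sibling" reference, so the first child is
    // returned too and the iterator actually moves forward
    nextNode = (current.siblings != null) ? current.siblings[1] : null;
    return current;
  }
}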

etc.

The backing list would be defined by the pointers between the  
objects, and not exist as a separate object (list) itself.


I'm still not completely sure about my estimation, but in the picture  
painted earlier (instance count of ArrayList and Object[]), those  
extra reference(s) for the siblings could turn out to be well worth  
it, since they'd only slightly increase the instance size of already  
existing objects, but they do avoid the creation of so many new  
ArrayList instances.



Cheers,

Andreas

Re: Unicode soft hyphen and hyphenation

2007-01-13 Thread Manuel Mall
On Saturday 13 January 2007 19:57, Vincent Hennebert wrote:
> Jeremias Maerki a écrit :
> > On 12.01.2007 09:25:59 Vincent Hennebert wrote:
> >> Jeremias Maerki a écrit :
> >>> Good to see that happen! Here's my take:
> >>>
> >>> On 11.01.2007 13:24:16 Manuel Mall wrote:
> >>>> Hi,
> 

> Still don't agree. Overriding is not adding hyphenation points. The
> following sentence in the description of SHY is pretty clear to me:
> "The use of SHY is generally limited to situations where users need
> to override the behavior of [an automatic] hyphenator."
>
> [Manuel]
>
> > Interesting but moot point I think. FOP is the automatic hyphenator
> > in this case and the hyphenate property could be argued to control
> > which hyphenation algorithm FOP is using. If hyphenate="true" FOP
> > is allowed to add its own hyphenation breaks. If hyphenate="false"
> > it uses only user specified hyphenation breaks (= soft hyphens).
>
> Well, again, the description of the "hyphenate" property (§7.9.4)
> sounds clear to me: when false, "Hyphenation may not be used in the
> line-breaking algorithm".
>
I still think this can be interpreted both ways. It clearly forbids 
formatter generated hyphenation but does it also suppress user 
specified hyphenation?

In HTML there is no hyphenation, but browsers are expected to honor the 
SHY, that is, treat it as a possible line break and, if chosen, put a 
hyphen there; otherwise the SHY is discarded. Given that XSL-FO is derived 
from the HTML/CSS rendering model, one could argue that this is the 
default behaviour the XSL-FO authors most likely intended. If not, it 
would be difficult to construct an FO document that behaves similarly to 
HTML with respect to hyphenation and the SHY.

> 
>
> To summarize, my opinion is that:
> - if "hyphenate" = false, no automatic hyphenation is performed, and
>   soft hyphens are discarded
> - if "hyphenate" = true, automatic hyphenation is performed, except
> for any word that contains soft hyphens, in which case the soft
> hyphens are used to create legal breakpoints.
>
> Now if the majority is against me, I'll shut up right now to not
> prevent things moving on.
>

Fully agree - happy to go with the majority either way.

> Vincent

Manuel


Re: Unicode soft hyphen and hyphenation

2007-01-13 Thread Vincent Hennebert
Jeremias Maerki a écrit :
> On 12.01.2007 09:25:59 Vincent Hennebert wrote:
>> Jeremias Maerki a écrit :
>>> Good to see that happen! Here's my take:
>>>
>>> On 11.01.2007 13:24:16 Manuel Mall wrote:
>>>> Hi,
>>>>
>>>> when I implemented the UAX#14 line breaking I noticed that fop doesn't
>>>> currently support the Unicode soft hyphen (SHY).
>>>>
>>>> I am thinking of adding support for this character to the line breaking
>>>> but am unsure of its correct behaviour in an XSL-FO environment. So I
>>>> have a few questions related to treatment of the SHY:
>>>>
>>>> 1) If hyphenation is not enabled should a SHY still produce a valid
>>>> break opportunity or should it be ignored?
>>> I think it should represent a valid break opportunity.
>> Well, I don't agree. See the description of SHY in section 15.2 of the
>> Unicode standard: SHY is used as a hint for automatic hyphenators and
>> overrides their behavior. I would typically use it for nicely rendering
>> veryLongProgramVariablesLikeWeCanFindInJava in e.g. a portion of text
>> describing them in some documentation. Here I obviously want to force
>> hyphenation to occur between the words that make the variable name
>> (Long-Program-Variables instead of LongPro-gramVar-iables or whatever).
>>
>> So, as a hint for hyphenators, SHY should be ignored when hyphenation is
>> disabled, and when enabled have the priority over automatic hyphenation.
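
The use case quoted above (breaking a long Java-style identifier only between its component words) can be sketched as follows. `CamelCaseShy` is a hypothetical helper for preparing such text before it reaches the formatter; it is not something FOP provides.

```java
/** Hypothetical pre-processing helper, not part of FOP. */
final class CamelCaseShy {

    /** Inserts U+00AD wherever a lowercase letter is followed by an uppercase one. */
    static String insertSoftHyphens(String identifier) {
        StringBuilder sb = new StringBuilder(identifier.length() + 8);
        for (int i = 0; i < identifier.length(); i++) {
            char c = identifier.charAt(i);
            if (i > 0
                    && Character.isUpperCase(c)
                    && Character.isLowerCase(identifier.charAt(i - 1))) {
                sb.append('\u00AD'); // soft hyphen marks the only allowed breaks
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```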
> 
> Hmm, I'm used to different behaviour in word processors and I don't read

Except that I wouldn't trust any word processor when it comes to
high-quality typography :-P
Does anyone know what InDesign is supposed to do?


> the UCD spec like you do. Section 5.3 in UAX#14 also doesn't give me the
> impression that a SHY is only active when hyphenation is enabled. It
> says: "The action of a hyphenation algorithm is equivalent to the
> insertion of a SHY. However, when a word contains an explicit SHY, it is
> customarily treated as overriding the action of the hyphenator for that
> word." I read this as: "SHY is the basic operator to add additional
> break points and a hyphenator can be added to do that task automatically."

Still don't agree. Overriding is not adding hyphenation points. The
following sentence in the description of SHY is pretty clear to me:
"The use of SHY is generally limited to situations where users need to
override the behavior of [an automatic] hyphenator."

[Manuel]
> Interesting but moot point I think. FOP is the automatic hyphenator in
> this case and the hyphenate property could be argued to control which
> hyphenation algorithm FOP is using. If hyphenate="true" FOP is allowed
> to add its own hyphenation breaks. If hyphenate="false" it uses only
> user specified hyphenation breaks (= soft hyphens).

Well, again, the description of the "hyphenate" property (§7.9.4) sounds
clear to me: when false, "Hyphenation may not be used in the
line-breaking algorithm".



To summarize, my opinion is that:
- if "hyphenate" = false, no automatic hyphenation is performed, and
  soft hyphens are discarded
- if "hyphenate" = true, automatic hyphenation is performed, except for
  any word that contains soft hyphens, in which case the soft hyphens
  are used to create legal breakpoints.
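
The two positions in this thread can be put side by side in a small sketch. The enum and method names are hypothetical. For the "SHY always breaks" position, what happens to automatic hyphenation inside a word that contains a SHY is not settled in the discussion, so the sketch simply assumes it stays enabled.

```java
/** Hypothetical summary of the two proposals; not FOP code. */
final class ShyPolicy {

    enum Policy {
        /** Majority view: a SHY is always a break opportunity. */
        SHY_ALWAYS_BREAKS,
        /** View summarized above: a SHY counts only if hyphenate="true". */
        SHY_ONLY_WHEN_HYPHENATING
    }

    /** Does a SHY produce a legal breakpoint? */
    static boolean shyIsBreakOpportunity(Policy policy, boolean hyphenate) {
        return policy == Policy.SHY_ALWAYS_BREAKS || hyphenate;
    }

    /** Is the automatic hyphenator applied to this word? */
    static boolean useAutomaticHyphenation(Policy policy, boolean hyphenate, String word) {
        if (!hyphenate) {
            return false;
        }
        if (policy == Policy.SHY_ONLY_WHEN_HYPHENATING) {
            // words containing soft hyphens are left to the user's breakpoints
            return word.indexOf('\u00AD') < 0;
        }
        return true; // assumption: not specified for the majority view
    }
}
```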

Now if the majority is against me, I'll shut up right now to not prevent
things moving on.

Vincent


Initial soft hyphen support

2007-01-13 Thread Manuel Mall
Just committed the initial support for the soft hyphen.

As we had two in favour of having the SHY always produce a break 
opportunity and only one against that's the route I took.

I had no luck with giving the SHY a reduced penalty to have the Knuth 
algorithm favour it over normal hyphenation breaks. Even with a 
penalty value of 1, fop still chooses the hyphenation break with a 
penalty of 50. Either I am doing something wrong or I misunderstand how 
the Knuth breaking calculation is supposed to work. Maybe one of the 
Knuth experts can have a look at this PLEASE.
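
One possible explanation, offered only as a sketch: in the Knuth-Plass paper the demerits of a line ending at a penalty p >= 0 are (1 + badness + p)^2, with badness approximated as 100 * |r|^3 for adjustment ratio r, and the algorithm minimizes the sum of demerits over the whole paragraph. A low penalty therefore does not by itself win against a break that gives a better-fitting line. The class below is hypothetical, leaves out the flagged-break and fitness-class terms, and FOP's own calculation may differ in detail.

```java
/** Hypothetical illustration of the paper's demerits formula; not FOP code. */
final class KnuthDemerits {

    /** Badness approximation: 100 * |r|^3. */
    static double badness(double adjustmentRatio) {
        double r = Math.abs(adjustmentRatio);
        return 100 * r * r * r;
    }

    /** Demerits for a line ending at a penalty with value p >= 0. */
    static double demerits(double adjustmentRatio, double penalty) {
        double d = 1 + badness(adjustmentRatio) + penalty;
        return d * d;
    }

    public static void main(String[] args) {
        // Break at a SHY with penalty 1, but the line must stretch a lot.
        System.out.println(demerits(0.9, 1));
        // Break at a normal hyphenation point with penalty 50, line fits well.
        System.out.println(demerits(0.3, 50));
    }
}
```

In this made-up case the penalty-50 break has roughly half the demerits of the penalty-1 break (about 2884 against about 5610), so it would be preferred for that line despite the higher penalty.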

Also not correctly working (yet) is the ipd calculation when kerning and 
a SHY break are involved. But maybe that's a more general issue.

For those looking closer at the commit the area handling within the text 
layout manager has changed a bit. Before this patch the assumption was 
made that the sequence of characters given to the LM will be fully 
output to the area tree. Now we have for the first time the case that 
characters (the SHY) can be dropped. This led to changes with respect 
to certain indexing loops.

Manuel