Re: White space handling Wiki page
On Tue, 8 Nov 2005 04:40 am, Simon Pepping wrote:
> I have taken my time, but here is my reaction to the Wiki page on
> white space handling. In addition, I have written my own view on the
> XSL-FO spec's handling of white space in a Wiki page.

Simon, your efforts are very much appreciated - at least by me. Your Wiki page presents white space handling from a different angle (paraphrased: editors can modify the XML (adding spaces and linefeeds) and white space handling is mainly for dealing with those modifications). I think that is a very good perspective to take.

> Step 2. Refinement: white-space-collapse
>
> Issue 1. The spec intentionally addresses only XML white space,
> because only such white space is manipulated by editors to obtain
> pretty printing.

Point taken, although I have no experience with non-western editors. Do they all use 0x20 for 'pretty printing'?

> Issue 2. The spec intentionally addresses only the collapse of white
> space around linefeed characters, because only such white space is
> manipulated by editors to obtain pretty printing. Even if linefeed
> characters indicate real line breaks and are preserved, it is
> possible that the editor has introduced sequences of XML white space
> characters for pretty printing.

OK.

> Issue 3. White-space-collapse is formulated in terms of space
> characters which do not generate an area. That is similar to the
> space resolution rules, where space specifiers get a zero width.
> Since there is no merging of white space glyph areas into a single
> area, there is no contradiction with the condition for glyph merging
> in section 4.7.2. The space glyph area that does generate an area
> determines the traits of that area.

Yes - but my point was: if someone writes: and if ሴ and 䌡 are mergeable according to the rules of the script, then we are not allowed to do so because they don't have matching traits. But if someone writes: these would be removed / collapsed / deleted under the white space rules.
Here is a more extreme example: under white space collapse the whole fo:character with the border disappears. If you write: at least the border is retained, and whether the space survives depends on whether the sequence is at the beginning or end of a line or not. Anyway, it is a bit academic as the spec is quite clear: if the Unicode value is U+0020, be it in an fo:character (during refinement) or a glyph area (during line building), it is subject to white space handling independent of any other properties / traits defined on it.

> Step 3. Line building: white-space-treatment and
> suppress-at-linebreak
>
> I agree that the references to the refinement stage are probably
> editorial mistakes.
>
> Issue 1. As for white-space-collapse, the glyph areas are deleted,
> and glyph merging is not applicable.

I agree with that interpretation - just not sure it really captures well what a user may expect - see examples above.

> Issue 2. Here is a difference between FO 1.0 and 1.1. In 1.0 the flow
> objects were deleted at the refinement stage. Therefore they cannot
> contribute to line breaking. In 1.1 the glyph areas are deleted at
> the line building stage. Therefore they could contribute to line
> breaking. I do not think that this is intended, and they should not
> contribute to line breaking. This is in line with my opinion that the
> values preserve and ignore should not really be in the same property
> as suppression around linebreaks, and should be taken care of in the
> refinement stage.

Again I agree fully with you, and the current implementation shows that issue. We deal with white-space-treatment twice: once during refinement and once again during line building. Andreas commented on that as well. But I think that is how it has to be for the time being.

> Example 2
>
> The space in "." is suppressed because it is at
> the start of the block.

Interesting - I agree that this is the intention but you don't find that sentence in the spec.
In 1.1 this is covered by the "deleting spaces at the beginning of a line" under white-space-treatment / line building. Again the discussion is probably academic - we all agree what the expected outcome is. Whether we can derive that outcome from the spec or not is a very interesting discussion, but it won't change what we will do.

> And "" does not generate
> an empty line. starts a new line, but that is not
> equivalent to a linefeed. When at the start of the nested fo:block
> there is no content in the line yet, it starts the same line. A
> similar thing happens in the case of " ",
> which was discussed in an email thread.

I assume you mean the discussion under linefeed-treatment="preserve". I am still confused about that, because will generate one linefeed - or should this also create none?

> Example 3
>
> Jörg asked the same in this email thr
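As a toy illustration of the refinement-stage reading argued for above - a U+0020 is subject to white space handling regardless of any other properties or traits on it - a collapse step might look like the sketch below. The class and method names are invented for this example, and the rule is simplified to collapsing runs of XML white space into a single space; FOP's real refinement code is more involved.

```java
// Hypothetical sketch of the refinement-stage rule discussed above:
// any XML white space character is eligible, regardless of the
// properties/traits defined on the fo:character that carries it.
public class WhiteSpaceCollapser {

    private static boolean isXmlWhiteSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Collapses runs of XML white space to a single U+0020 and
     *  drops leading/trailing white space. */
    public static String collapse(String text) {
        StringBuilder sb = new StringBuilder();
        boolean inWhiteSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isXmlWhiteSpace(c)) {
                inWhiteSpace = true;   // defer; emit one space at run end
            } else {
                if (inWhiteSpace && sb.length() > 0) {
                    sb.append(' ');    // a run inside the text becomes one space
                }
                inWhiteSpace = false;
                sb.append(c);
            }
        }
        return sb.toString();          // a trailing run is never emitted
    }
}
```

Editor-introduced pretty printing ("  foo \n  bar ") then collapses to the intended "foo bar", which is exactly the scenario Simon's Wiki page centres on.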
Re: text-decoration problem?
Although it may be stating the obvious: our integrated tests don't catch regressions with respect to the renderers yet. As our layout engine test suite is getting more comprehensive, this may instil a false sense of comfort that everything is still fine when one makes a change. Need to look into integrating Jeremias' bitmap verifications into the test suite.

Manuel

On Tue, 8 Nov 2005 09:05 am, Manuel Mall wrote:
> On Tue, 8 Nov 2005 08:25 am, Sven wrote:
> > Good evening,
> >
> > I will try to make it as short as I can: Using the latest trunk
> > (revision 331647) I am getting a non-interpretable error from my
> > Acrobat Reader (version 7.0.5) when compiling the attached fo
> > fragment. The error box tells me something about "too many
> > arguments available". I have pinpointed the error by using the
> > text-decoration attribute with a value other than "none". Funny
> > thing is that this worked some revisions before, but I am sorry not
> > being able to tell you when things broke down.
> >
> > Thanks for your good work
> >
> > Sven
>
> Sven,
>
> thanks for your problem report. You are correct that this is a
> regression caused by some changes made recently. The problem should
> be fixed now (revision 331655).
>
> Regards
>
> Manuel

> > http://www.w3.org/1999/XSL/Format";>
> > page-height="29.7cm" master-name="default">
> > text-align="end" font-size="10pt" font-family="Helvetica">
> > Einleitung
> > font-family="Times">
> > Obwohl die agentenorientierte Softwareentwicklung im vergangenen
> > Jahrzehnt immer größeren Zuspruch gefunden hat, findet sie bisher nur
> > überwiegend im universitären Umfeld Anwendung. Damit sie Bestandteil
> > der Entwicklung von Unternehmensanwendungen werden kann, müssten die
> > Defizite bestehender Agentenplattformen beseitigt werden. Diese
> > Defizite bestehen in der schlechten Administrierbarkeit und der häufig
> > nur unzureichend unterstützten Interoperabilität mit bestehenden
> > Unternehmensanwendungen [CoGrBKR02].

> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
Re: White space handling Wiki page
I have taken my time, but here is my reaction to the Wiki page on white space handling. In addition, I have written my own view on the XSL-FO spec's handling of white space in a Wiki page.

Step 2. Refinement: white-space-collapse

Issue 1. The spec intentionally addresses only XML white space, because only such white space is manipulated by editors to obtain pretty printing.

Issue 2. The spec intentionally addresses only the collapse of white space around linefeed characters, because only such white space is manipulated by editors to obtain pretty printing. Even if linefeed characters indicate real line breaks and are preserved, it is possible that the editor has introduced sequences of XML white space characters for pretty printing.

Issue 3. White-space-collapse is formulated in terms of space characters which do not generate an area. That is similar to the space resolution rules, where space specifiers get a zero width. Since there is no merging of white space glyph areas into a single area, there is no contradiction with the condition for glyph merging in section 4.7.2. The space glyph area that does generate an area determines the traits of that area.

Step 3. Line building: white-space-treatment and suppress-at-linebreak

I agree that the references to the refinement stage are probably editorial mistakes.

Issue 1. As for white-space-collapse, the glyph areas are deleted, and glyph merging is not applicable.

Issue 2. Here is a difference between FO 1.0 and 1.1. In 1.0 the flow objects were deleted at the refinement stage. Therefore they cannot contribute to line breaking. In 1.1 the glyph areas are deleted at the line building stage. Therefore they could contribute to line breaking. I do not think that this is intended, and they should not contribute to line breaking.
This is in line with my opinion that the values preserve and ignore should not really be in the same property as suppression around linebreaks, and should be taken care of in the refinement stage.

Example 2

The space in "." is suppressed because it is at the start of the block. And "" does not generate an empty line. starts a new line, but that is not equivalent to a linefeed. When at the start of the nested fo:block there is no content in the line yet, it starts the same line. A similar thing happens in the case of " ", which was discussed in an email thread.

Example 3

Jörg asked the same in this email thread: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&by=thread&from=561781, entitled "Suppression of leading space". foo bar . ..foo. ...bar . foo. .bar and also believes that two spaces remain. As to the border of the inline on the next line, I think indeed that a formatter should avoid it, as it may be considered a bad layout choice.

Processing Model 2

In steps 2 and 3 you apply the conditions of glyph area merging. I do not agree with that, as I explained above. In step 3 the eligible characters are all characters with suppress-at-line-break="true", by default only the space character. Nowhere in the spec is a conversion of tabs and CRs to spaces specified.

In example 3, why is the space before 'Green' not deleted? It directly follows a line break (step 4b).

Regards, Simon

On Tue, Oct 25, 2005 at 04:57:41PM +0800, Manuel Mall wrote:
> Hi,
>
> I haven't got any technical comments to the issues raised on the Wiki
> page. Is this 'too hard' or 'too boring' or 'too messy' or what? The
> problem is not going away. We currently don't do it right in some parts
> (that is established) but I don't know overall what is right or wrong.
> Maybe if I ask for comments on an issue by issue basis we get
> somewhere?

--
Simon Pepping
home page: http://www.leverkruid.nl
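The rule from Example 2 - spaces at the start of a line are suppressed during line building - can be sketched as below. This is a hypothetical simplification: the names are invented, only the default eligible character U+0020 is handled, and real line building works on glyph areas rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of line-building suppression: glyph areas for
// eligible characters (by default only U+0020, i.e. characters with
// suppress-at-line-break="true") are deleted when they occur
// immediately after a line break, i.e. at the start of a line.
public class LineStartSuppressor {

    /** Deletes leading spaces from each broken line. */
    public static List<String> suppress(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            int i = 0;
            while (i < line.length() && line.charAt(i) == ' ') {
                i++;                 // space directly after a break: delete
            }
            result.add(line.substring(i));
        }
        return result;
    }
}
```

Note that, exactly as discussed above, tabs and CRs would survive this step unless an earlier stage converted them, since only U+0020 is eligible by default.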
Re: Is getNextKnuthElements the right interface for inline LMs?
On Mon, 7 Nov 2005 10:39 pm, Luca Furini wrote:
> Manuel Mall wrote:
> > What I observed is that most of these issues cannot be solved by
> > looking at a single character at a time. They need context, very
> > often only one character, sometimes more (e.g. a sequence of white
> > space). More importantly, the context needed is not limited to the
> > fo they occur in. They all span across fos. This is where the
> > current LM structures and especially the getNextKnuthElement
> > interface really get in the way of things. Basically one cannot
> > create the correct Knuth sequences without the context, but the
> > context can come from everywhere (superior fo, subordinate fo, or
> > neighboring fo). So one needs look-ahead and backtrack features
> > across all these boundaries and it feels extremely messy.
> >
> > It appears conceptually so much simpler to have only a single loop
> > iterating over all the characters in a paragraph, doing all the
> > character/glyph manipulation, word breaking (hyphenation), and line
> > breaking analysis and generation of the Knuth sequences in one
> > place. An example where this is currently done is the white space
> > handling during refinement. One loop at block level based on a
> > recursive char iterator that supports deletion and character
> > replacement does the job. Very simple and easy to understand. I
> > have something similar in mind for inline Knuth sequence
> > generation. Of course the iterator would not only return the
> > character but relevant formatting information for it as well, e.g.
> > the font so the width etc. can be calculated. The iterator may also
> > have to indicate start/end border/padding and conditional
> > border/padding elements.
>
> I think that there are two different "layers" that affect the
> generation of the elements: one is the "text layer" (or maybe
> semantic level), where we have the text and we can easily handle
> whitespace, recognize word boundaries, find hyphenation points,
> regardless of the actual fo (and its depth) where the text lives, and
> the other is the "formatting layer", where we have the resolved values
> for the properties like font, size, borders, etc. These layers speak
> different languages, as one knows words and spaces and the other
> elements and attributes.
>
> At the moment, the getNextKnuthElements() method works at the
> formatting level: each LM knows the relevant properties but has a
> limited view of the text, whence the current difficulties.
>
> Your proposal is to work at the text level (correct me if I'm wrong),
> with the LineLM centralizing the handling of the text for a whole
> block. I wonder if, doing so, we would not find it difficult to know
> the resolved property values applying to each piece of text.
>
> I'm not saying that we don't need changes in the LM interactions;
> I'm just asking myself (and asking all of you, of course :-)) if it
> is really possible to have both breaking and element generation *in
> one place*.
>
> What if we had first a centralized control at the text level (the
> LineLM putting together all the text, finding words, normalizing
> spaces, performing hyphenation ...) and then a localized element
> generation (each LM, based on what the LineLM did and using the
> local properties)?
>
> Something somewhat similar (but limited to single words) happens at
> the moment with the getChangedKnuthElements() method, which is called
> only after the LineLM has reconstructed a word, found its breaking
> points and told the inline LMs where the breaks are.
>
> Don't know if what I just wrote makes any sense; so, as I never tried
> to do what you suggest or what I just attempted to describe, I really
> look forward to seeing your code in action!

Luca, yes, what you wrote makes sense, and I am not at the coding stage yet. So don't hold your breath with respect to seeing new code from me - you may get blue in the face. Still trying to get my head around all the possible issues.

I think your suggestion has quite a few merits. To rephrase it in my words: we do a text processing stage which precedes getNextKnuthElements and (among other things) determines all the break possibilities. This list is then given to the LMs as part of the getNextKnuthElements call, and the LMs can build the Knuth elements based on their local knowledge (properties) plus the already calculated break possibilities. We may even be able to do that during the refinement (white space handling) loop, thereby keeping repeated iterations over the text to a minimum. I like the sound of this as it retains a lot of what we have while addressing the need to analyse text across fo boundaries.

> Regards
> Luca

Thanks

Manuel
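The two-stage idea rephrased above - a text analysis pass over the whole paragraph first, element generation from local knowledge afterwards - could be sketched roughly like this. All names are invented for illustration and are not part of FOP's actual LayoutManager API; the break rule is a trivial stand-in for the real analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed first stage: the paragraph text is analysed
// once, across all fo boundaries, and the break possibilities are
// recorded. The LMs would later receive this list and generate their
// Knuth elements using only local properties plus these precomputed
// break points.
public class ParagraphAnalysis {

    /** Returns the indices at which the paragraph may be broken
     *  (here simply after each space, as a stand-in for real rules). */
    public static List<Integer> findBreakOpportunities(String paragraph) {
        List<Integer> breaks = new ArrayList<>();
        for (int i = 0; i < paragraph.length(); i++) {
            if (paragraph.charAt(i) == ' ') {
                breaks.add(i + 1);   // break allowed after the space
            }
        }
        return breaks;
    }
}
```

The point of the design is that the analysis needs the full character sequence (context across fos), while element generation only needs the result of the analysis plus each LM's own properties.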
Re: A few new features
On 07.11.2005 15:30:08 Chris Bowditch wrote:
> > - The command-line gets a new option: -out application/pdf myfile.pdf is
> > the generic way to create an output file. If someone created a WordXML
> > output handler and provided the right service resource file he could
> > specify "-out text/xml+msword out.xml". "-out list" lists all MIME types
> > that are available for output.
>
> Are you saying that the -ps, -pdf, etc. options are to be replaced by
> -out application/pdf, etc.? If so, then I don't like that at all. It's
> much more convenient to just type -ps or -pdf.

I don't intend to remove any of the existing options. It's just an addition.

Jeremias Maerki
Re: Is getNextKnuthElements the right interface for inline LMs?
Manuel Mall wrote:

What I observed is that most of these issues cannot be solved by looking at a single character at a time. They need context, very often only one character, sometimes more (e.g. a sequence of white space). More importantly, the context needed is not limited to the fo they occur in. They all span across fos. This is where the current LM structures and especially the getNextKnuthElement interface really get in the way of things. Basically one cannot create the correct Knuth sequences without the context, but the context can come from everywhere (superior fo, subordinate fo, or neighboring fo). So one needs look-ahead and backtrack features across all these boundaries and it feels extremely messy.

It appears conceptually so much simpler to have only a single loop iterating over all the characters in a paragraph, doing all the character/glyph manipulation, word breaking (hyphenation), and line breaking analysis and generation of the Knuth sequences in one place. An example where this is currently done is the white space handling during refinement. One loop at block level based on a recursive char iterator that supports deletion and character replacement does the job. Very simple and easy to understand. I have something similar in mind for inline Knuth sequence generation. Of course the iterator would not only return the character but relevant formatting information for it as well, e.g. the font so the width etc. can be calculated. The iterator may also have to indicate start/end border/padding and conditional border/padding elements.

I think that there are two different "layers" that affect the generation of the elements: one is the "text layer" (or maybe semantic level), where we have the text and we can easily handle whitespace, recognize word boundaries, find hyphenation points, regardless of the actual fo (and its depth) where the text lives, and the other is the "formatting layer", where we have the resolved values for the properties like font, size, borders, etc. These layers speak different languages, as one knows words and spaces and the other elements and attributes.

At the moment, the getNextKnuthElements() method works at the formatting level: each LM knows the relevant properties but has a limited view of the text, whence the current difficulties.

Your proposal is to work at the text level (correct me if I'm wrong), with the LineLM centralizing the handling of the text for a whole block. I wonder if, doing so, we would not find it difficult to know the resolved property values applying to each piece of text.

I'm not saying that we don't need changes in the LM interactions; I'm just asking myself (and asking all of you, of course :-)) if it is really possible to have both breaking and element generation *in one place*.

What if we had first a centralized control at the text level (the LineLM putting together all the text, finding words, normalizing spaces, performing hyphenation ...) and then a localized element generation (each LM, based on what the LineLM did and using the local properties)?

Something somewhat similar (but limited to single words) happens at the moment with the getChangedKnuthElements() method, which is called only after the LineLM has reconstructed a word, found its breaking points and told the inline LMs where the breaks are.

Don't know if what I just wrote makes any sense; so, as I never tried to do what you suggest or what I just attempted to describe, I really look forward to seeing your code in action!

Regards
Luca
Re: A few new features
Jeremias Maerki wrote:

> Last week, I've had some time to hack on my notebook. Fun stuff only.
> I've finished a few things I started earlier and did some other things.
> Here's a list of what I've done. Since some of them might be
> controversial I want to give you a chance to object, just in case. So
> here are the changes (almost) ready for committing on my notebook:

Hi Jeremias,

I have no comment for most of the items you mentioned. Generally they all sound good, except for one :)

> - The command-line gets a new option: -out application/pdf myfile.pdf is
> the generic way to create an output file. If someone created a WordXML
> output handler and provided the right service resource file he could
> specify "-out text/xml+msword out.xml". "-out list" lists all MIME types
> that are available for output.

Are you saying that the -ps, -pdf, etc. options are to be replaced by -out application/pdf, etc.? If so, then I don't like that at all. It's much more convenient to just type -ps or -pdf.

Chris
Re: Is getNextKnuthElements the right interface for inline LMs?
On 07.11.2005 08:24:14 Manuel Mall wrote:
> Of course that would be quite a change internally, although limited to
> inline LMs and not affecting any block level operations. The way to do
> this would be a branch in svn. But before I embark on such an endeavour
> I'd like to seek some feedback on the list. Anyone aware of serious
> problems with such an approach?

No.

> Has it been tried before and failed, for example?

We had to change a few things during the transition to the Knuth approach. Sometimes changes are necessary and it makes no sense to stubbornly stick to what is already there.

> Those who designed the current getNextKnuth approach may have
> arguments why changing it for inline LMs is a bad idea?

I have none. You seem to have good arguments for changing the interface. Still, care should be taken that the LMs stay as uniform as possible, so it remains possible to add layout managers for custom elements, and that non-character content is handled well and without too much custom logic, because the changed approach focuses strongly on text.

> Any other views / concerns?

The above said, it should be noted that I haven't dived yet into the Unicode stuff you've been discussing lately. I'm very happy about the flurry of activity in this area. It looked like a good discussion. I hope you will excuse me if I don't participate too much there right now.

Jeremias Maerki
A few new features
Last week, I've had some time to hack on my notebook. Fun stuff only. I've finished a few things I started earlier and did some other things. Here's a list of what I've done. Since some of them might be controversial I want to give you a chance to object, just in case. So here are the changes (almost) ready for committing on my notebook:

- Two new constructors for Fop.java: Fop(String) and Fop(String, FOUserAgent), where String is a MIME type.

- org.apache.fop.apps.MimeConstants with a comprehensive list of MIME types used in FOP.

- Non-standard, FOP-specific MIME types changed to a uniform pattern: application/X-fop-awt-preview, application/X-fop-print, application/X-fop-areatree

- RendererFactory now supports manual registration and dynamic discovery of Renderers and FOEventHandlers by their MIME types. Instantiation is done using MIME types everywhere.

- The RENDER_* constants are mapped to MIME types in Fop.java. I'd like to remove them but left them where they are for the moment. I'd also like to remove the "implements Constants" from Fop.java. But that's nothing new. :-)

- RendererFactory is now an instantiable class whose reference is held by FOUserAgent, just like it is done for the XMLHandlers.

- Renderers and FOEventHandlers now each have a *Maker class, which is a kind of factory class used to register a Renderer/FOEventHandler and which additionally serves to provide information about the thing, such as which MIME types it supports and whether the implementation requires an OutputStream.

- The command-line gets a new option: -out application/pdf myfile.pdf is the generic way to create an output file. If someone created a WordXML output handler and provided the right service resource file he could specify "-out text/xml+msword out.xml". "-out list" lists all MIME types that are available for output.
- To make things a little more consistent and error reporting easier, I've changed FONode so each FONode can return the namespace URI it belongs to (getNamespaceURI()). Furthermore, it can return the normally used namespace prefix (getNormalNamespacePrefix(), "fo" for XSL-FO), and I've added methods to build the fully qualified name from a local name. For this I've changed getName() to getLocalName() for all descendants of FONode, so it is defined to return only the local name and not (in some cases) the fully qualified name. The whole thing now feels a lot cleaner.

- I've started extending support for alternative ways of painting graphics. For example, Barcode4J supports SVG, EPS, bitmaps and Java2D as output targets. For PostScript, it's better to use EPS directly. In PDF, Java2D could be used instead of SVG to avoid the slow round-trip through Batik. RTF could use bitmaps. I've started a Graphics2DAdapter which renderers provide if they can supply a Graphics2D instance for painting. An XMLHandler can query the Graphics2DAdapter and use it to paint the graphic. The image is passed to the Graphics2DAdapter as a Graphics2DImagePainter instance which essentially provides a paint(Graphics2D, Rectangle) method and a getImageSize() method.

While implementing the last point I've had to realize that while this is a step forward, it is not enough in the long run. I believe we also need to refactor the whole image package again to handle a few additional problems. An example: our JPEG support is currently restricted to the PDF and PS renderers, where the JPEG can be embedded in undecoded form. The Java2D renderer descendants currently don't support JPEG at all because they can't get a decoded JPEG image. What I would like to do is introduce something similar to a concept already used by the Java Printing System (JPS): the DocFlavor, an object to describe the format (SVG, JPEG, MathML etc.) and manifestation (byte[], DOM etc.).
The renderers say what formats they support (with desirability indicators) and the image classes will support providing the images in different flavors, as needed. In between, special converters can convert from one format to another, like converting a MathML DOM to a bitmap image. I'm not going into more detail right now. I'll document everything on the Wiki. I can only say that I had to realize that I will basically need to recreate Nicola Ken Barozzi's Morphos idea again. That's going to be interesting and a lot of fun. :-)

Ok, so if anybody is against any of the above points or needs additional information, please tell me.

Jeremias Maerki
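The painter callback described above could look roughly like the following sketch. The two interface methods follow the names given in the mail (paint(Graphics2D, Rectangle) and getImageSize()); the surrounding demo class and its render() helper are invented here purely for illustration and are not FOP's actual API.

```java
import java.awt.Dimension;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Sketch of the callback design: the renderer owns the Graphics2D
// (obtained through its Graphics2DAdapter), while the image source
// only knows how to paint itself into a given region.
public class PainterDemo {

    interface Graphics2DImagePainter {
        void paint(Graphics2D g2d, Rectangle area);
        Dimension getImageSize();
    }

    /** Renders a painter into a bitmap of its own preferred size,
     *  standing in for what a bitmap-producing renderer might do. */
    public static BufferedImage render(Graphics2DImagePainter painter) {
        Dimension size = painter.getImageSize();
        BufferedImage img = new BufferedImage(
                size.width, size.height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g2d = img.createGraphics();
        painter.paint(g2d, new Rectangle(0, 0, size.width, size.height));
        g2d.dispose();
        return img;
    }
}
```

The inversion is the point: a PDF renderer could hand the painter a Graphics2D backed by a PDF content stream instead of a bitmap, without the image source (e.g. a Barcode4J barcode) knowing the difference.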
Test post from Gmane
This is a test post with an unsubscribed email address from the web interface of http://www.gmane.org. If it works we should publish some information about this on our website so people who hate mailing lists still have a way to post messages on our lists without having to subscribe. Cheers, Jeremias Maerki
Re: Unicode compliant Line Breaking
1. +1
2. +1
3.b) +1 for the separable parts, although c) is also ok for now. +1 to try to find synergies with the code in Batik.

If I were you I'd create a branch and put your stuff in there. It's easier for everyone to follow and to help (wishful thinking).

On 31.10.2005 08:25:12 Manuel Mall wrote:
> In a previous post Joerg pointed to the Unicode Standard Annex #14 on
> Line Breaking (http://www.unicode.org/reports/tr14/) and his initial
> implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.
>
> I have since had a closer look at both UAX#14 and Joerg's code. Because I
> liked what I saw, I went about adapting Joerg's code to Unicode 4.1
> and added fairly extensive JUnit test cases to it, mainly because it
> really helps to go through the various different cases mentioned in the
> spec in some structured fashion.
>
> The results are now available for public inspection:
> http://people.apache.org/~manuel/fop/linebreak.tar.gz
>
> 1. I would like to propose that Unicode conformant line breaking be
> integrated into FOP trunk because it:
> a) Moves FOP more towards being a universal formatter and not just a
> formatter for western languages
> b) Moves FOP more towards becoming a high quality typesetting system
> (something that was really started by integrating Knuth style breaking)
> The reason I think this needs to be voted on is that Unicode line
> breaking will in subtle ways change the current line breaking behaviour
> and therefore constitutes a (significant) change in FOP's overall
> rendering.
>
> 2. I would also like to propose that Unicode conformant line
> breaking be implemented using our own pair-table based implementation
> and not using Java's line breaker, because:
> a) It gives us full control and allows FOP to follow the Unicode
> standard (and its updates and errata) closely and therefore keep FOP's
> Unicode compliance level independent of the Java version.
> b) It allows us to tailor the algorithm to match the needs of XSL-FO and
> FOP.
> c) It allows us to provide user customisation features (down the track)
> not available through the Java APIs.
>
> Of course there are downsides, like:
> a) Are we falling for the 'not invented here' syndrome?
> b) Duplicating code which is already in the Java base system
> c) Increasing the memory footprint of FOP
>
> 3. Assuming we get enough +1s for the above proposals, the first item to
> decide after that would be: where should the code live?
> a) Joerg would like to see it in Jakarta Commons but hasn't got the time
> to start the project.
> b) Jeremias suggested XMLGraphics Commons.
> c) Personally I think it is too early to factor it out. More experience
> with its design and use cases should be gathered before making it
> standalone, and at this point in time it really is only 2 core Java
> classes. I would like to suggest that it initially lives under FOP in
> something like org.apache.fop.text. Should the need and energy levels
> (= developer enthusiasm) become available later to make this into a
> Jakarta Commons or XMLGraphics Commons project, so be it.
>
> Assuming now that this will be agreed as well, the next step would be the
> more detailed design of the integration. But this is well beyond the
> scope of this e-mail, as there are some tricky issues involved and they
> probably need to be tackled in conjunction with the white space
> handling issues. Many of the problems are related to our LayoutManager
> structures, which create barriers when it comes to the need to process
> character sequences across those boundaries, as is the case for both
> line breaking and white space handling. Add to that the design of the
> different Knuth sequences required to model the different break cases
> in conjunction with conditional border/padding and white space removal
> around line breaking and different types of line justification, and
> there is some real work ahead.
>
> Cheers
>
> Manuel
>
> Should add my votes:
> 1.) +1
> 2.) +1
> 3.c) +1

Jeremias Maerki
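For readers not familiar with UAX#14, the pair-table approach Manuel proposes can be sketched roughly as follows. This is a toy illustration, not Joerg's or Manuel's actual code: it models only three invented line breaking classes, and the table entries are simplified (the real algorithm distinguishes direct, indirect and prohibited breaks over dozens of classes).

```java
// Minimal illustration of the pair-table idea behind UAX#14: map each
// character to a line breaking class, then look up the pair (class of
// the character before, class of the character after) in a table to
// decide whether a break opportunity exists between the two.
public class PairTableDemo {

    static final int AL = 0;  // alphabetic (toy class)
    static final int SP = 1;  // space
    static final int HY = 2;  // hyphen

    // true = break opportunity between a char of the row class
    // and a following char of the column class
    static final boolean[][] PAIR_TABLE = {
        //            AL     SP     HY
        /* AL */ { false, false, false },
        /* SP */ { true,  false, true  },
        /* HY */ { true,  false, false },
    };

    static int classOf(char c) {
        if (c == ' ') return SP;
        if (c == '-') return HY;
        return AL;
    }

    /** Returns true if a line break is allowed before index i. */
    public static boolean breakAllowed(String text, int i) {
        if (i <= 0 || i >= text.length()) return false;
        return PAIR_TABLE[classOf(text.charAt(i - 1))][classOf(text.charAt(i))];
    }
}
```

Because the behaviour lives entirely in the class mapping and the table, tracking a new Unicode version or adding user customisation (points 2a and 2c above) amounts to updating data, not rewriting the algorithm.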