Peter B. West wrote:

> <quote>
> ...
> Echoing sentiments recently expressed in this publication, Clark said
> that SAX, though efficient, was very hard to use, and that DOM had
> obvious limitations due to the requirement that the document being
> processed be in memory. He suggested that what was needed was a standard
> "pull API," one that efficiently allowed random access to XML documents.

First, thanks for the update on your work -- I understand what you are doing a little better.

Second, the statement above about random access almost jumped out at me, because I had exactly the same thought earlier today while contemplating a thread on the XSL-FO list which discussed processing of long documents and the memory constraints related to them. The closest thing to a perfect document processing system that I have come across is FrameMaker, which seems to be able to handle pretty large documents with a pretty small footprint. I don't know for sure, but it seems to me that the "area tree" (if you will) is written to disk, and pages can be efficiently jumped to in an arbitrary manner. The WYSIWYG editor is essentially a viewport on the portion of the document in memory, which is itself a subset of the disk document. As you edit the document, I presume that events are sent to something akin to a layout manager, which has to do something with them. Now, in our case, we need random access not only to the area tree, but also to the fo tree. What follows is my feeble attempt to reconcile some of these issues.

The issue with SAX, as I see it, is that because it is one-way, and our processing is not (I think the standard calls it "non-linear"), we presumably have to build our own DOM-ish (random access) things in order to get the job done. I wonder if we don't end up reinventing the wheel in frustration with that approach. From a cleanliness-of-design standpoint at least, it seems much more straightforward to use a DOM-based approach instead and write chunks of the two DOMs to disk where necessary. I haven't thought through whether java.io.RandomAccessFile or a regular database or some other alternative would be the way to go. The LMs can be totally protected from all of this by abstracting both the FO and Area Documents -- in other words, they work with abstract nodes on trees and don't care what was required to make them available.
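
To make the "abstract nodes, possibly on disk" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: NodeStore and DiskNodeStore are hypothetical names, not FOP classes, and a node is reduced to a string payload. The only real API used is java.io.RandomAccessFile. The point is just that the caller asks for a node by id and never sees whether it came from memory or from the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction: layout code would see only this interface. */
interface NodeStore extends AutoCloseable {
    int append(String payload) throws IOException;  // returns the new node's id
    String fetch(int nodeId) throws IOException;    // random access by id
    @Override
    void close() throws IOException;
}

/** One possible backing (an assumption): records spilled to a RandomAccessFile. */
final class DiskNodeStore implements NodeStore {
    private final RandomAccessFile file;
    // Small in-memory index: node id -> byte offset of its record in the file.
    private final List<Long> offsets = new ArrayList<>();

    DiskNodeStore(File backing) throws IOException {
        this.file = new RandomAccessFile(backing, "rw");
        this.file.setLength(0); // start from an empty file
    }

    @Override
    public int append(String payload) throws IOException {
        long offset = file.length();
        file.seek(offset);       // always write at the end of the file
        file.writeUTF(payload);  // note: writeUTF caps a record at 65535 encoded bytes
        offsets.add(offset);
        return offsets.size() - 1;
    }

    @Override
    public String fetch(int nodeId) throws IOException {
        file.seek(offsets.get(nodeId)); // jump straight to the record
        return file.readUTF();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A database-backed or purely in-memory NodeStore could be swapped in behind the same interface, which is the property I am after; whether a flat file is the right backing is exactly the question I have not thought through.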

Oddly enough, once I have the stability of the DOMs to work from (perhaps this is more felt than real), an event-based approach seems much more natural -- like imitating a word processor. In fact, if done properly, another project could conceivably use FOP as the layout engine for a WYSIWYG editor.

Actually, I have been trying to quantify & grasp two processing models that come to mind: 1) the word-processing model, an event-based model, and 2) an 18th-century typesetter manually laying out pages, which is much more of a look-ahead, measure-it-to-see-how-it-fits-before-placing-it model. These two models roughly correspond to the two processing models I mentioned the other day ("I am text, place me somewhere" vs. "I am a page with room, place something on me"). The second model requires the 2-pass approach. The first fits either a push or a pull approach (since we can manufacture events if we need to); the second is definitely pull.

When I wrote about those two models, I was frankly leaning heavily toward the 2nd approach, but I think I am changing my mind. To explain why, I need to have you forget for a moment about our SAX-based input (I'll come back to that). Forget also about performance for a moment, and picture the typesetter setting type one character at a time, with no thought of what the next character or image is -- in other words, setting type just like a user sitting at Microsoft Word does. If the typesetter comes to something that messes his previous work up, he has to yank a line of type out, or perhaps an entire page out, and replace it. However (and this is the key point), he will eventually get the job done. In other words, when abstracted this way, the only benefit to a look-ahead /should be/ performance.

Consider our auto table layout problem.
If on the 350th page of the table, I find an item that requires me to change the width of the columns, which in turn changes the layout of all 350 pages, yes, I am going to burn up a few cycles to accomplish that, but I /should/ be able to get it done.

So far all I have done is loosely reconcile these two processing models. The next thing I want to do is to compare these two models with FOP's layout process. If I like the event-based model, then maybe I ought to like FOP's approach.

Let me go first to my 18th-century typesetter. Each time he has to tear out a line or page of type, he can go back to his manuscript (his FO document, if you will) to rebuild it. Similarly in a word processor, I presume that Microsoft Word must have some concept that the 2 lines at the top of page 84 are in the same paragraph as the 3 lines at the bottom of page 83. Do I have something similar in FOP? What the designer in me wants is a link from every area in my area tree back to its parent fo object. Then I know in pretty simple fashion how every item in both trees relates to any other object in either tree.

Since we are using SAX (I promised I would come back to SAX), I conclude that by definition, we don't have this. When I get to page 350 of that auto-table layout, I either can't see the beginning of that table, or I have to store that information some other way. I then presume that, since we need similar functionality, we have some surrogate that probably 1) is a real pain to manage, 2) uses just as much memory as a DOM would, and 3) can't feasibly be segregated and written to disk if needed (because it is in the area tree??).

Now I need to reconcile 1) liking an event-based approach, and 2) disliking SAX. This is actually pretty easy. I can read through a DOM and create events or something similar. My layout engine doesn't have to know whether a given event comes from a user sitting at a keyboard or some TreeWalker stepping through an fo DOM.
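
As a rough illustration of "read through a DOM and create events", here is a small sketch using only the standard org.w3c.dom and javax.xml.parsers APIs. The names LayoutEvents, DomEventSource and RecordingSink are hypothetical, invented for this example, and a plain recursive walk stands in for a real TreeWalker. The sink only records what it is told; an actual layout engine would build areas instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical layout-side API: it never learns where a call came from. */
interface LayoutEvents {
    void startObject(String name);
    void text(String characters);
    void endObject(String name);
}

/** One possible event source (an assumption): a depth-first walk over a DOM. */
final class DomEventSource {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    static void replay(Node node, LayoutEvents sink) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                sink.startObject(node.getNodeName());
                replayChildren(node, sink);
                sink.endObject(node.getNodeName());
                break;
            case Node.TEXT_NODE:
                sink.text(node.getNodeValue());
                break;
            default:
                // The Document node and anything else: just descend.
                replayChildren(node, sink);
        }
    }

    private static void replayChildren(Node parent, LayoutEvents sink) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            replay(c, sink);
        }
    }
}

/** Records the calls it receives; stands in for a real layout engine. */
final class RecordingSink implements LayoutEvents {
    final List<String> log = new ArrayList<>();
    @Override public void startObject(String name) { log.add("start " + name); }
    @Override public void text(String characters) { log.add("text " + characters); }
    @Override public void endObject(String name) { log.add("end " + name); }
}
```

An editor front end could call the same three methods directly for each keystroke (for example sink.text("a")), and the sink would have no way to tell the difference. Because the DOM is still there, the walk can also be restarted from any node, which is what the page-350 table case needs.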

Finally, let's come back to performance. So far, I have been talking about single-character events ("he just typed an 'a' here, lay my line out again"). But we can also have more efficient, bigger events, analogous to pasting something into a document. Now my TreeWalker says "I have a 35,000 row auto-layout table for you that needs to span the available width of the page". My Page LM says "Cool, the available width of the page is 4.5 inches". The TreeWalker crunches some numbers and says, "OK, column 1 needs to be 1 inch wide, column 2 is 1.5 inches, etc. Now, here is the first row." I realize that this is over-simplified, but I am trying to describe a system that has its input abstracted.

I question whether SAX is good for FOP's performance. In fact, if it gives us a clunkier structure, then it almost certainly slows us down. All of the logic still has to be performed, regardless of the input method, but it must be performed more slowly if the data is not convenient. I rather think that the affection for SAX must be because it saves the memory used by the DOM. It seems to me that writing the DOM to a random-access file when necessary (which is what I think Peter was suggesting) would be a much better solution.

To conclude, if I were designing this system from scratch, based on what I know right now, I would:

1. Use DOM for both the fo tree & the area tree.
2. Write them to disk when necessary, hiding all of this from the layout managers.
3. Use an event-based layout mechanism so that the fo tree doesn't even have to be there to get layout work done.

I am sure I can be talked out of this by someone smarter, but I wanted to lay out the whole line of reasoning. My apologies to Peter and anyone else who may have been working on these points before. I am just now getting around to them.

After further consideration, my use of "event-based" above may be too strong. Probably what I mean is more along the lines of API-based.
In a WYSIWYG environment, the event would probably trigger an API action, but that action could be invoked another way as well. I am too tired to rewrite it -- I hope you know what I mean.

This final thought is really a question which was briefly addressed during our recent weekend clarification about the role of the maintenance branch, and which I wish to apply specifically to the above thoughts. Does or could the new design give us the ability to (with, say, a configuration option) choose between Layout Philosophy A and B? By this I mean 2 (or more) layout packages coexisting in the same code base, selectable (perhaps by configuration) and sharing common resources. If so, then we can play with these ideas at our leisure, compare them in various ways, transition between them if necessary, and maybe even keep both to be used in various circumstances. I think someone (Jeremias perhaps) had indicated that something along these lines would be possible, but that may have been at a lower level than what I am discussing here.

I don't mean to rock the boat. I guess I am kind of like a three-year-old who asks "why" and "why not" all of the time to the annoyance of all around him -- I am still trying to learn the basics. Thanks for your patience.

Victor Mote