Good day,

Thought this might be of interest: a Java XML API (currently early beta) designed by James Clark, Technical Lead for XML 1.0 and author of XP and TREX (now merged into ISO RELAX NG), as well as being a very decent chap. There is a DDJ article about him at http://www.ddj.com/documents/s=862/ddj0107e/0107e.htm

There are some interesting ideas below. I haven't yet worked out whether "pullax" would need an adapter for dom4j, or vice versa.

Thoughts, anyone?

Regards,
Thomas.
>>From: "James Clark" <[EMAIL PROTECTED]>
>>To: "John Cowan" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
>>Date: Tue, 18 Dec 2001 12:34:10 +0700
>>Subject: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator
>>
>>> This is a first design for XMLIterator, a third base-level API
>>> which allows an application to pull content from XML. This
>>> avoids the memory demand and navigation issues of DOM, and
>>> is a more straightforward programming model than SAX, which
>>> requires magic data connections between the event handlers in
>>> order to maintain application state. XMLIterator extends
>>> the familiar Iterator interface, so it models an XML document
>>> as a linear collection of partially specified nodes.
>>
>>I very much agree that we need such an API. SAX works great for some
>>kinds of application. In particular, it works well for generic XML
>>applications which do not have to parse a particular XML vocabulary.
>>However, SAX is really awkward for some applications, particularly
>>applications that parse a particular XML vocabulary with a complex,
>>highly nested structure.
>>
>>As it happens, I have been working on a similar API for the last few
>>months. One impetus for doing this was my experience in implementing
>>Jing. I was struck by how painful it was to parse a RELAX NG schema
>>into an internal form using SAX. The equivalent non-XML syntax was
>>easily parsed using a straightforward recursive descent parser. By
>>contrast, the parser for the XML syntax was a warped and twisted mess.
>>
>>My API is currently called "pullax" (pull API for XML). This is still
>>very much work in progress. I hadn't been planning to release for a
>>month or two yet. But since you have started this discussion, I think
>>the most constructive thing I can do is to release what I have now. I
>>do have quite a comprehensive API and I do have a fairly complete
>>sample implementation. I have made this available at
>>
>> http://www.thaiopensource.com/pullax/
>>
>>I chose to do my initial sample implementation on top of Xerces 2
>>because it provides a native interface (XNI) with a "pull" parser
>>API. (I would call it a "controlled push" rather than a "pull"
>>API. Roughly, it has a variant of XMLReader.parse which you call
>>multiple times; on each call, it parses some portion of the document
>>making SAX-like callbacks on handlers.) This allows an implementation
>>that neither requires the whole document in memory (as would an
>>implementation on top of DOM), nor the use of threads (as would an
>>implementation on top of SAX). XNI also provides a very rich set of
>>information. You'll need Xerces 2 Beta 3 if you want to play with my
>>implementation. See
>>
>> http://xml.apache.org/xerces2-j/index.html
>>
>>Obviously, SAX and DOM adapters are on my list of things to do.
>>
>>The bad news is that the API documentation is pretty pathetic at the
>>moment and still needs a lot of work. This message will have to serve
>>as an overview of the API for now.
>>
>>In designing pullax, I have tried to follow modern Java best
>>practices, for example, in favoring immutability and using classes for
>>type-safe enumerations. One of my main guides here has been Joshua
>>Bloch's book "Effective Java"
>>(http://java.sun.com/docs/books/effective/). This is a truly
>>excellent book done by the guy who designed several of the better
>>recent Java platform APIs (including the Collections API).
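For anyone who hasn't met the two practices James mentions, here is a rough sketch of what "classes for type-safe enumerations" and "favoring immutability" usually look like in Java. The class names NodeKind and ImmutableItem are my own inventions for illustration and are not taken from pullax; the constant names are borrowed from the item types that come up later in his message.

```java
// Hypothetical illustration only; none of these class names come from pullax.

// Type-safe enumeration written as a class: the constructor is private,
// so the three constants below are the only instances that can ever exist.
final class NodeKind {
    static final NodeKind START_ELEMENT = new NodeKind("START_ELEMENT");
    static final NodeKind END_ELEMENT = new NodeKind("END_ELEMENT");
    static final NodeKind TEXT = new NodeKind("TEXT");

    private final String name;

    private NodeKind(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name;
    }
}

// Immutable value object: all fields are final and there are no setters,
// so an instance can be shared or kept around without defensive copying.
final class ImmutableItem {
    private final NodeKind kind;
    private final String value;

    ImmutableItem(NodeKind kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    NodeKind getKind() {
        return kind;
    }

    String getValue() {
        return value;
    }
}
```

Because only the predefined instances exist, comparing kinds with == is safe, and a method that takes a NodeKind cannot be handed an arbitrary int by mistake. Later Java versions added a built-in enum keyword that covers the same ground.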
>>
>>Perhaps the most fundamental decision in designing a pull API is
>>whether the properties for each node are provided
>>
>>(a) by methods on some sort of node object returned by the
>>scanner/parser/iterator object
>>
>>(b) by methods on the scanner/parser object itself; the scanner/parser
>>object has methods to move to the next node
>>
>>You've chosen (a). A couple of notable pull APIs use (b):
>>
>>- the XmlReader API in .NET; this is the principal XML parser API for
>>.NET (see
>>http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemxmlxmlreaderclasstopic.asp)
>>
>>- XML Pull Parser (http://www.extreme.indiana.edu/soap/xpp/)
>>
>>I tried it both ways in pullax. I ended up, like you, with (a), for
>>the following reasons:
>>
>>1. Handling attributes in (b) is messy
>>
>>2. (a) works more like the java.util.Iterator and
>>java.util.Enumeration that are familiar to every Java programmer
>>
>>3. (a) makes it much easier to construct filters/processing pipelines;
>>for example, writing a RELAX NG validator that wraps around a
>>non-validating parser.
>>
>>The main argument against (a) is that it involves more object
>>creation, which, according to Java folklore, is a performance killer.
>>
>>Now, you've minimized object creation by having next() implicitly
>>invalidate any previously returned nodes. I don't think this is an
>>acceptable design for an API intended for widespread public use:
>>
>>1. It's a common requirement to need to look ahead in the document when
>>deciding how to process the current node. Your design makes this
>>awkward. It also makes it very awkward to write a filter that needs
>>lookahead in doing its filtering (imagine a filter that merges
>>adjacent text nodes).
>>
>>2. This behavior would be a big surprise to the average Java user.
>>The Iterators and Enumerations which a typical Java user will be
>>familiar with just don't work like this.
>>
>>3. It's the kind of API that leads to "Write Once, Debug Everywhere"
>>rather than "Write Once, Run Everywhere". A typical scenario is that
>>a user writes an application that needs lookahead; they incorrectly
>>access an XMLNode object after another call to next(); they test their
>>application with an implementation that allocates a new XMLNode object
>>for each next() call; their application appears to work fine. Then
>>somebody else tries to use the application with a parser
>>implementation that reuses XMLNode objects and the application
>>mysteriously and silently gives the wrong results.
>>
>>In summary, this design does not promote reliability. I believe
>>priority should be given to reliability over performance.
>>
>>My "solution" is simply to accept the object creation. Modern Java
>>VMs (like Hotspot) do a fantastic job of efficient allocation of
>>short-lived objects; object creation has much less performance
>>overhead with modern VMs than it used to with classic VMs. In any
>>case, a user that is prepared to sacrifice programming convenience for
>>an extra ounce of performance can use SAX. (Also, since the objects
>>returned are immutable, there is an opportunity for reducing object
>>creation by sharing.)
>>
>>The central interface in my API is XmlScanner. (I'm planning a
>>companion XmlPrinter interface for writing XML.) This corresponds to
>>your XMLIterator interface. This interface is similar to
>>java.util.Iterator but I chose not to derive XmlScanner from Iterator,
>>for two reasons:
>>
>>1. the equivalents of the next() and hasNext() methods need to be
>>able to throw a java.io.IOException
>>
>>2. it's awkward and inefficient to have always to cast the return
>>value of next()
>>
>>My XmlScanner object returns XmlItem objects. I call these objects
>>"items" rather than "nodes" because "node" to me suggests a tree view
>>where elements have children rather than a flat view with
>>start-element and end-element objects.
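To get a feel for design (a), here is a minimal sketch of an Iterator-like scanner whose next() is typed and may throw IOException, plus the kind of filter James mentions: one that merges adjacent text items using a single item of lookahead. Every name below (SketchItem, SketchScanner, ListScanner, TextMergingScanner) is hypothetical and the method signatures are my guesses at the general shape. This is not the pullax API, and ListScanner merely stands in for a real parser.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

// All names below are hypothetical; this is not the pullax API.

// Immutable item in a flat (start/end/text) view of a document.
final class SketchItem {
    static final int START_ELEMENT = 0;
    static final int END_ELEMENT = 1;
    static final int TEXT = 2;

    private final int type;
    private final String value; // element name, or character data for TEXT

    SketchItem(int type, String value) {
        this.type = type;
        this.value = value;
    }

    int getType() { return type; }
    String getValue() { return value; }
}

// Iterator-like, but typed (no cast on next()) and allowed to fail with I/O errors.
interface SketchScanner {
    boolean hasNext() throws IOException;
    SketchItem next() throws IOException;
}

// Stand-in for a real parser: replays a fixed list of items.
final class ListScanner implements SketchScanner {
    private final Iterator<SketchItem> items;

    ListScanner(List<SketchItem> items) {
        this.items = items.iterator();
    }

    public boolean hasNext() { return items.hasNext(); }
    public SketchItem next() { return items.next(); }
}

// Filter that wraps another scanner and merges runs of adjacent TEXT items.
final class TextMergingScanner implements SketchScanner {
    private final SketchScanner source;
    private SketchItem pending; // item read ahead but not yet returned

    TextMergingScanner(SketchScanner source) {
        this.source = source;
    }

    public boolean hasNext() throws IOException {
        return pending != null || source.hasNext();
    }

    public SketchItem next() throws IOException {
        SketchItem current = (pending != null) ? pending : source.next();
        pending = null;
        if (current.getType() != SketchItem.TEXT) {
            return current;
        }
        StringBuilder text = new StringBuilder(current.getValue());
        while (source.hasNext()) {
            SketchItem ahead = source.next();
            if (ahead.getType() != SketchItem.TEXT) {
                // Holding on to this item is safe only because the source
                // does not invalidate items it has already returned.
                pending = ahead;
                break;
            }
            text.append(ahead.getValue());
        }
        return new SketchItem(SketchItem.TEXT, text.toString());
    }
}
```

Wrapping a ListScanner that yields START_ELEMENT doc, TEXT "4", TEXT "2", END_ELEMENT doc in a TextMergingScanner gives back three items, with the two text items merged into a single TEXT item "42". The filter only works because the item it reads ahead stays valid after further calls to the source's next(), which is the reliability point James is arguing.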
>>
>>My XmlItem object has similar methods to your XMLNode object to return
>>the item type, the local name, namespace URI, QName, prefix, value
>>etc. The method names are chosen based on the Infoset and XPath.
>>
>>I toyed with the approach to attributes that you took, that is, having
>>ATTRIBUTE items following the START_ELEMENT item. This has the
>>advantage of being simple. However, I found it inconvenient to work
>>with and felt it would seem rather strange to anybody with exposure to
>>SAX or DOM. So instead an XmlItem of type START_ELEMENT has
>>getAttribute() methods that return an XmlItem for an attribute
>>identified by name or index.
>>
>>XmlItem has a getContext() method returning an XmlContext object.
>>This provides information about the context of the item, such as the
>>in-scope namespaces. Typically, many XmlItem objects can share the
>>same XmlContext object.
>>
>>A major challenge in designing a general-purpose XML API is to deal
>>with the diversity of XML applications. At one end of the spectrum
>>are simple applications that need no more than elements, attributes
>>and text (the "holy trinity of XML" as I think David Megginson once
>>called them). At the other end of the spectrum are applications such
>>as XML editors that want as much detail about the markup as they can
>>get including things like comments and entities. Just as there is a
>>diversity of XML applications, so is there a diversity of XML
>>processors/parsers. There are large, complex parsers like Xerces that
>>provide a very rich set of information but take a corresponding hit in
>>terms of size and speed. There is also a need for simpler parsers that do
>>less but can be smaller and faster.
>>
>>The solution I use in pullax is based on the "feature" concept of
>>SAX2. An implementation of the pullax API implements the
>>XmlScannerFactory interface. By default an XmlScanner created by an
>>XmlScannerFactory returns exactly three types of XmlItem:
>>START_ELEMENT, END_ELEMENT, TEXT. Also by default TEXT items are
>>maximal. So, for example, the document
>>
>> <doc>4<!-- a silly comment -->2</doc>
>>
>>will be returned as three items: a START_ELEMENT item, a TEXT item
>>with string value "42", and an END_ELEMENT item. If an application
>>wishes to see, for example, comments, it must request the SHOW_COMMENT
>>feature from the XmlScannerFactory before creating the XmlScanner. If
>>the parser cannot satisfy the request, it must throw an exception.
>>XmlScannerFactory objects are designed to be dynamically discoverable
>>using the service provider mechanism (like JAXP).
>>XmlScannerFactoryFinder is a utility class that takes a set of
>>features and dynamically finds an XmlScannerFactory implementation
>>that supports those features. This approach ensures that the support
>>for a rich information set in pullax does not get in the way of simple
>>applications or simple XML processors.
>>
>>The pullax API aims to provide a very rich information set. As far as
>>the document instance is concerned, it is intended to support the
>>union of SAX2, DOM2 core, and the XML Infoset and then some. As far
>>as the DTD is concerned, pullax currently provides approximately the
>>same information as the union of the XML Infoset and DOM Level 2 core.
>>I have opted not to provide the detailed lexical information about the
>>DTD that SAX2 provides. It seems to me that it is not much use having
>>lexical information about DTDs if you lose information about parameter
>>entities within declarations; but dealing with parameter entities
>>within declarations is just too hard for a general-purpose API,
>>especially when you consider nested parameter entity references. I believe
>>DTD editor type applications really require specialized APIs and
>>parsers (e.g. DTDinst, see http://www.thaiopensource.com/dtdinst).
>>
>>Another respect in which pullax's approach to DTDs differs from SAX is
>>that it represents the DOCTYPE declaration as a single item. There
>>does not seem much point in breaking it down into multiple items. Most
>>of the information is in the XmlDtd object which is available from the
>>XmlContext. Note that the XmlDtd object is immutable. I'm planning
>>to extend the API to allow straightforward DTD caching: the idea is
>>that a user-supplied XmlDtdResolver object will map the system id,
>>public id and internal subset to an XmlDtd object.
>>
>>I've written too much already. I'll be happy to answer any questions
>>people may have about the design and I'll try to get the API doc into
>>shape as soon as possible.
>>
>>James
>>
>>-----------------------------------------------------------------
>>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>>initiative of OASIS <http://www.oasis-open.org>
>>
>>The list archives are at http://lists.xml.org/archives/xml-dev/
>>
>>To subscribe or unsubscribe from this list use the subscription
>>manager: <http://lists.xml.org/ob/adm.pl>
>
>--
>-----------------------------------------------------------------
>Robin La Fontaine, Director, Monsell EDM Ltd
>DeltaXML: "Change control for XML in XML"
>Tel: +44 1684 592 144 Fax: +44 1684 594 504
>Email: [EMAIL PROTECTED] http://www.deltaxml.com

_______________________________________________
dom4j-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dom4j-dev