[dom4j-dev] Fwd: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator

Thomas Nichols Tue, 18 Dec 2001 12:20:50 -0800

Good Day,

Thought this might be of interest: a Java XML API (currently early beta) 
designed by James Clark, who was Technical Lead for XML 1.0 and wrote XP and 
TREX (now merged into ISO RELAX NG), as well as being a very decent chap. DDJ 
article about him at http://www.ddj.com/documents/s=862/ddj0107e/0107e.htm
There are some interesting ideas below. I haven't yet worked out whether 
"pullax" would need an adapter for dom4j, or vice versa. Thoughts, anyone?
Regards,
Thomas.





>>From: "James Clark" <[EMAIL PROTECTED]>
>>To: "John Cowan" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
>>Date: Tue, 18 Dec 2001 12:34:10 +0700
>>X-Priority: 3
>>Subject: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator
>>
>>>  This is a first design for XMLIterator, a third base-level API
>>>  which allows an application to pull content from XML.  This
>>>  avoids the memory demand and navigation issues of DOM, and
>>>  is a more straightforward programming model than SAX, which
>>>  requires magic data connections between the event handlers in
>>>  order to maintain application state.  XMLIterator extends
>>>  the familiar Iterator interface, so it models an XML document
>>>  as a linear collection of partially specified nodes.
>>
>>I very much agree that we need such an API.  SAX works great for some
>>kinds of application.  In particular, it works well for generic XML
>>applications which do not have to parse a particular XML vocabulary.
>>However, SAX is really awkward for some applications, particularly
>>applications that parse a particular XML vocabulary with a complex,
>>highly nested structure.
>>
>>As it happens, I have been working on a similar API for the last few
>>months.  One impetus for doing this was my experience in implementing
>>Jing. I was struck by how painful it was to parse a RELAX NG schema
>>into an internal form using SAX.  The equivalent non-XML syntax was
>>easily parsed using a straightforward recursive descent parser.  By
>>contrast, the parser for the XML syntax was a warped and twisted mess.
>>
>>My API is currently called "pullax" (pull API for XML). This is still
>>very much work in progress.  I hadn't been planning to release for a
>>month or two yet.  But since you have started this discussion, I think
>>the most constructive thing I can do is to release what I have now.  I
>>do have quite a comprehensive API and I do have a fairly complete
>>sample implementation.  I have made this available at
>>
>>   http://www.thaiopensource.com/pullax/
>>
>>I chose to do my initial sample implementation on top of Xerces 2
>>because it provides a native interface (XNI) with a "pull" parser
>>API. (I would call it a "controlled push" rather than a "pull"
>>API. Roughly, it has a variant of XMLReader.parse which you call
>>multiple times; on each call, it parses some portion of the document
>>making SAX-like callbacks on handlers.)  This allows an implementation
>>that neither requires the whole document in memory (as would an
>>implementation on top of DOM), nor the use of threads (as would an
>>implementation on top of SAX).  XNI also provides a very rich set of
>>information. You'll need Xerces 2 Beta 3 if you want to play with my
>>implementation.  See
>>
>>    http://xml.apache.org/xerces2-j/index.html
>>
>>Obviously, SAX and DOM adapters are on my list of things to do.
>>
>>The bad news is that the API documentation is pretty pathetic at the
>>moment and still needs a lot of work. This message will have to serve
>>as an overview of the API for now.
>>
>>In designing pullax, I have tried to follow modern Java best
>>practices, for example, in favoring immutability and using classes for
>>type-safe enumerations. One of my main guides here has been Joshua
>>Bloch's book "Effective Java"
>>(http://java.sun.com/docs/books/effective/).  This is a truly
>>excellent book done by the guy who designed several of the better
>>recent Java platform APIs (including the Collections API).
>>
>>Perhaps the most fundamental decision in designing a pull API is
>>whether the properties for each node are provided
>>
>>(a) by methods on some sort of node object returned by the
>>scanner/parser/iterator object
>>
>>(b) by methods on the scanner/parser object itself; the scanner/parser
>>object has methods to move to the next node
>>
>>You've chosen (a).  A couple of notable pull APIs use (b):
>>
>>- the XmlReader API in .NET; this is the principal XML parser API for
>>.NET (see
>>http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemxmlxmlreaderclasstopic.asp)
>>
>>- XML Pull Parser (http://www.extreme.indiana.edu/soap/xpp/)
>>
>>I tried it both ways in pullax.  I ended up, like you, with (a), for
>>the following reasons:
>>
>>1. Handling attributes in (b) is messy
>>
>>2. (a) works more like the java.util.Iterator and
>>java.util.Enumeration that are familiar to every Java programmer
>>
>>3. (a) makes it much easier to construct filters/processing pipelines;
>>for example, writing a RELAX NG validator that wraps around a
>>non-validating parser.
>>
>>The main argument against (a) is that it involves more object
>>creation, which, according to Java folklore, is a performance killer.
>>
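A minimal sketch of the two styles in modern Java, to make the (a)/(b) distinction concrete. Every name here (PullStyles, Item, ItemScanner, CursorScanner) is made up for illustration and is not taken from pullax, XMLIterator, XmlReader or XPP. The scanners simply walk a prepared list instead of parsing anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical types for illustration only; not the pullax or XMLIterator API.
final class PullStyles {
    enum Kind { START_ELEMENT, END_ELEMENT, TEXT }

    // Immutable item carrying its own properties, as in style (a).
    static final class Item {
        final Kind kind;
        final String value; // element name, or character data for TEXT
        Item(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    // Style (a): the scanner returns item objects.
    static final class ItemScanner {
        private final Iterator<Item> source;
        ItemScanner(List<Item> items) { this.source = items.iterator(); }
        boolean hasNext() { return source.hasNext(); }
        Item next() { return source.next(); }
    }

    // Style (b): the scanner is a cursor and exposes the current item's properties itself.
    static final class CursorScanner {
        private final List<Item> items;
        private int position = -1;
        CursorScanner(List<Item> items) { this.items = items; }
        boolean advance() { return ++position < items.size(); }
        Kind kind() { return items.get(position).kind; }
        String value() { return items.get(position).value; }
    }

    static List<String> elementNamesA(ItemScanner scanner) {
        List<String> names = new ArrayList<>();
        while (scanner.hasNext()) {
            Item item = scanner.next(); // may be kept, queued or handed to a filter
            if (item.kind == Kind.START_ELEMENT) names.add(item.value);
        }
        return names;
    }

    static List<String> elementNamesB(CursorScanner scanner) {
        List<String> names = new ArrayList<>();
        while (scanner.advance()) {
            // properties are only readable until the next advance()
            if (scanner.kind() == Kind.START_ELEMENT) names.add(scanner.value());
        }
        return names;
    }

    // Items for <doc><p>hi</p></doc>
    static List<Item> sample() {
        return Arrays.asList(
            new Item(Kind.START_ELEMENT, "doc"),
            new Item(Kind.START_ELEMENT, "p"),
            new Item(Kind.TEXT, "hi"),
            new Item(Kind.END_ELEMENT, "p"),
            new Item(Kind.END_ELEMENT, "doc"));
    }
}
```

Both loops collect the same element names. The difference shows up once a caller wants to hold on to an item or wrap one scanner in another: in (a) the Item can be passed along as a value, whereas in (b) the state lives in the cursor.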
>>Now, you've minimized object creation by having next() implicitly
>>invalidate any previously returned nodes. I don't think this is an
>>acceptable design for an API intended for widespread public use:
>>
>>1. It's a common requirement to need to look ahead in the document when
>>deciding how to process the current node.  Your design makes this
>>awkward.  It also makes it very awkward to write a filter that needs
>>lookahead in doing its filtering (imagine a filter that merges
>>adjacent text nodes).
>>
>>2. This behavior would be a big surprise to the average Java user.
>>The Iterators and Enumerations which a typical Java user will be
>>familiar with just don't work like this.
>>
>>3. It's the kind of API that leads to "Write Once, Debug Everywhere"
>>rather than "Write Once, Run Everywhere".  A typical scenario is that
>>a user writes an application that needs lookahead; they incorrectly
>>access an XMLNode object after another call to next(); they test their
>>application with an implementation that allocates a new XMLNode object
>>for each next() call; their application appears to work fine. Then
>>somebody else tries to use the application with a parser
>>implementation that reuses XMLNode objects and the application
>>mysteriously and silently gives the wrong results.
>>
>>In summary, this design does not promote reliability.  I believe
>>priority should be given to reliability over performance.
>>
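The scenario in point 3 can be reproduced in a few lines. This is only a sketch: Node, NodeSource and the two factory methods are hypothetical stand-ins, not XMLNode or any real parser. The same application code is run against an implementation that allocates a node per call and one that reuses a single node.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins used to illustrate the node-reuse hazard.
final class ReuseHazard {
    // Mutable node that a reusing implementation overwrites on every next().
    static final class Node {
        String text;
    }

    interface NodeSource {
        boolean hasNext();
        Node next();
    }

    // Implementation 1: allocates a fresh Node for each next() call.
    static NodeSource allocating(List<String> texts) {
        final Iterator<String> it = texts.iterator();
        return new NodeSource() {
            public boolean hasNext() { return it.hasNext(); }
            public Node next() {
                Node node = new Node();
                node.text = it.next();
                return node;
            }
        };
    }

    // Implementation 2: reuses one Node, which the "next() invalidates
    // previously returned nodes" contract would permit.
    static NodeSource reusing(List<String> texts) {
        final Iterator<String> it = texts.iterator();
        final Node shared = new Node();
        return new NodeSource() {
            public boolean hasNext() { return it.hasNext(); }
            public Node next() {
                shared.text = it.next();
                return shared;
            }
        };
    }

    // Application code with lookahead: it keeps the first node while
    // fetching the second, then reads the first one again.
    static String firstAfterLookahead(NodeSource source) {
        Node first = source.next();
        source.next();     // look ahead
        return first.text; // wrong under the invalidation contract
    }
}
```

With the input "a", "b", firstAfterLookahead returns "a" against the allocating source and "b" against the reusing one, with no exception or warning in either case.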
>>My "solution" is simply to accept the object creation.  Modern Java
>>VMs (like Hotspot) do a fantastic job of efficient allocation of
>>short-lived objects; object creation has much less performance
>>overhead with modern VMs than it used to with classic VMs.  In any
>>case, a user that is prepared to sacrifice programming convenience for
>>an extra ounce of performance can use SAX. (Also, since the objects
>>returned are immutable, there is an opportunity for reducing object
>>creation by sharing.)
>>
>>The central interface in my API is XmlScanner. (I'm planning a
>>companion XmlPrinter interface for writing XML.) This corresponds to
>>your XMLIterator interface.  This interface is similar to
>>java.util.Iterator but I chose not to derive XmlScanner from Iterator,
>>for two reasons:
>>
>>1. the equivalents of the next() and hasNext() methods need to be
>>able to throw a java.io.IOException
>>
>>2. it's awkward and inefficient to always have to cast the return
>>value of next()
>>
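A sketch of what those two reasons imply for the shape of the interface. The names and signatures below (Scanner, Item, CharScanner) are assumptions for illustration; the actual XmlScanner declarations may well differ. The toy implementation returns one item per character read from a java.io.Reader, just so that the IOException has somewhere to come from.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.NoSuchElementException;

// Hypothetical sketch; not the real XmlScanner interface.
final class TypedScannerSketch {
    static final class Item {
        final String text;
        Item(String text) { this.text = text; }
    }

    // Deliberately not a java.util.Iterator: both methods may throw
    // IOException, and next() returns Item so callers never cast.
    interface Scanner {
        boolean hasNext() throws IOException;
        Item next() throws IOException;
    }

    // Toy implementation: one Item per character of the underlying Reader.
    static final class CharScanner implements Scanner {
        private static final int NOTHING_BUFFERED = -2;
        private final Reader in;
        private int buffered = NOTHING_BUFFERED; // -1 means end of input

        CharScanner(Reader in) { this.in = in; }

        public boolean hasNext() throws IOException {
            if (buffered == NOTHING_BUFFERED) buffered = in.read();
            return buffered != -1;
        }

        public Item next() throws IOException {
            if (!hasNext()) throw new NoSuchElementException();
            Item item = new Item(String.valueOf((char) buffered));
            buffered = NOTHING_BUFFERED;
            return item;
        }
    }

    // Drains a scanner over the given input and concatenates the item text.
    static String readAll(String input) {
        Scanner scanner = new CharScanner(new StringReader(input));
        StringBuilder out = new StringBuilder();
        try {
            while (scanner.hasNext()) out.append(scanner.next().text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

An interface derived from java.util.Iterator could not declare the checked IOException on hasNext() and next(), so an implementation would have to wrap it in an unchecked exception or hide it.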
>>My XmlScanner object returns XmlItem objects.  I call these objects
>>"items" rather than "nodes" because "node" to me suggests a tree view
>>where elements have children rather than a flat view with
>>start-element and end-element objects.
>>
>>My XmlItem object has similar methods to your XMLNode object to return
>>the item type, the local name, namespace URI, QName, prefix, value
>>etc.  The method names are chosen based on the Infoset and XPath.
>>
>>I toyed with the approach to attributes that you took, that is, having
>>ATTRIBUTE items following the START_ELEMENT item. This has the
>>advantage of being simple. However, I found it inconvenient to work
>>with and felt it would seem rather strange to anybody with exposure to
>>SAX or DOM.  So instead an XmlItem of type START_ELEMENT has
>>getAttribute() methods that return an XmlItem for an attribute
>>identified by name or index.
>>
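A rough sketch of attribute access hanging off the start-element item. Only the idea of getAttribute() by name or by index comes from the message above; the class layout, the factory methods and getAttributeCount() are assumptions, and namespaces are ignored to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the real XmlItem class.
final class AttributeSketch {
    enum Kind { START_ELEMENT, ATTRIBUTE }

    static final class Item {
        private final Kind kind;
        private final String name;
        private final String value;          // null for START_ELEMENT
        private final List<Item> attributes; // empty for ATTRIBUTE

        private Item(Kind kind, String name, String value, List<Item> attributes) {
            this.kind = kind;
            this.name = name;
            this.value = value;
            // Defensive copy keeps the item immutable.
            this.attributes = Collections.unmodifiableList(new ArrayList<Item>(attributes));
        }

        static Item attribute(String name, String value) {
            return new Item(Kind.ATTRIBUTE, name, value, Collections.<Item>emptyList());
        }

        static Item startElement(String name, List<Item> attributes) {
            return new Item(Kind.START_ELEMENT, name, null, attributes);
        }

        Kind getKind() { return kind; }
        String getName() { return name; }
        String getValue() { return value; }
        int getAttributeCount() { return attributes.size(); }

        // Lookup by position, in document order.
        Item getAttribute(int index) { return attributes.get(index); }

        // Lookup by name; returns null when there is no such attribute.
        Item getAttribute(String attributeName) {
            for (Item attribute : attributes) {
                if (attribute.name.equals(attributeName)) return attribute;
            }
            return null;
        }
    }
}
```

For `<a href="x.html" id="top">`, the start-element item would answer getAttribute("id").getValue() with "top" and getAttribute(0).getName() with "href", and no ATTRIBUTE items appear in the main item stream.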
>>XmlItem has a getContext() method returning an XmlContext object.
>>This provides information about the context of the item, such as the
>>in-scope namespaces.  Typically, many XmlItem objects can share the
>>same XmlContext object.
>>
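One way to picture the sharing, as a sketch only: Context and Item below are hypothetical, and the real XmlContext presumably carries more than namespace bindings. Because the context is immutable, every item inside the same namespace scope can point at the same instance.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real XmlItem/XmlContext classes.
final class SharedContextSketch {
    // Immutable snapshot of the in-scope namespace bindings (prefix -> URI).
    // The default namespace is stored under the empty prefix "".
    static final class Context {
        private final Map<String, String> bindings;

        Context(Map<String, String> bindings) {
            this.bindings = Collections.unmodifiableMap(new HashMap<String, String>(bindings));
        }

        String getNamespaceUri(String prefix) { return bindings.get(prefix); }
    }

    // An element item that resolves its own prefix through its context.
    static final class Item {
        private final String qName;
        private final Context context;

        Item(String qName, Context context) {
            this.qName = qName;
            this.context = context;
        }

        String getQName() { return qName; }
        Context getContext() { return context; }

        String getNamespaceUri() {
            int colon = qName.indexOf(':');
            String prefix = colon < 0 ? "" : qName.substring(0, colon);
            return context.getNamespaceUri(prefix);
        }
    }
}
```

A scanner built this way would only need a new Context when an element declares namespaces; every other item reuses the one already in scope.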
>>A major challenge in designing a general-purpose XML API is to deal
>>with the diversity of XML applications.  At one end of the spectrum
>>are simple applications that need no more than elements, attributes
>>and text (the "holy trinity of XML" as I think David Megginson once
>>called them).  At the other end of the spectrum are applications such
>>as XML editors that want as much detail about the markup as they can
>>get including things like comments and entities.  Just as there is a
>>diversity of XML applications, so is there a diversity of XML
>>processors/parsers.  There are large, complex parsers like Xerces that
>>provide a very rich set of information but take a corresponding hit in terms
>>of size and speed.  There is also a need for simpler parsers that do
>>less but can be smaller and faster.
>>
>>The solution I use in pullax is based on the "feature" concept of
>>SAX2.  An implementation of the pullax API implements the
>>XmlScannerFactory interface. By default an XmlScanner created by an
>>XmlScannerFactory returns exactly three types of XmlItem:
>>START_ELEMENT, END_ELEMENT, TEXT.  Also by default TEXT items are
>>maximal.  So, for example, the document
>>
>>   <doc>4<!-- a silly comment -->2</doc>
>>
>>will be returned as three items: a START_ELEMENT item, a TEXT item
>>with string value "42", and an END_ELEMENT item. If an application
>>wishes to see, for example, comments, it must request the SHOW_COMMENT
>>feature from the XmlScannerFactory before creating the XmlScanner.  If
>>the parser cannot satisfy the request, it must throw an exception.
>>XmlScannerFactory objects are designed to be dynamically discoverable
>>using the service provider mechanism (like JAXP).
>>XmlScannerFactoryFinder is a utility class that takes a set of
>>features and dynamically finds an XmlScannerFactory implementation
>>that supports those features.  This approach ensures that the support
>>for a rich information set in pullax does not get in the way of simple
>>applications or simple XML processors.
>>
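The default view described above (three item types, maximal TEXT items) can be sketched as a small filter over a richer item stream. The names are hypothetical and the input is a prepared list, not parser output. The point is only to show how the comment in the example disappears and "4" and "2" become a single "42".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the default view; not pullax code.
final class MaximalText {
    enum Kind { START_ELEMENT, END_ELEMENT, TEXT, COMMENT }

    static final class Item {
        final Kind kind;
        final String value; // element name, text or comment content
        Item(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    // Drops COMMENT items (not requested) and merges adjacent TEXT items,
    // including TEXT items that were separated only by a dropped comment.
    static List<Item> defaultView(List<Item> raw) {
        List<Item> out = new ArrayList<Item>();
        StringBuilder pendingText = null;
        for (Item item : raw) {
            if (item.kind == Kind.COMMENT) continue;
            if (item.kind == Kind.TEXT) {
                if (pendingText == null) pendingText = new StringBuilder();
                pendingText.append(item.value);
                continue;
            }
            if (pendingText != null) {
                out.add(new Item(Kind.TEXT, pendingText.toString()));
                pendingText = null;
            }
            out.add(item);
        }
        if (pendingText != null) out.add(new Item(Kind.TEXT, pendingText.toString()));
        return out;
    }

    // Raw items for <doc>4<!-- a silly comment -->2</doc>
    static List<Item> sample() {
        List<Item> raw = new ArrayList<Item>();
        raw.add(new Item(Kind.START_ELEMENT, "doc"));
        raw.add(new Item(Kind.TEXT, "4"));
        raw.add(new Item(Kind.COMMENT, " a silly comment "));
        raw.add(new Item(Kind.TEXT, "2"));
        raw.add(new Item(Kind.END_ELEMENT, "doc"));
        return raw;
    }
}
```

Note that the filter cannot emit a TEXT item until it has seen the item after it, which is the kind of lookahead discussed earlier in the message.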
>>The pullax API aims to provide a very rich information set.  As far as
>>the document instance is concerned, it is intended to support the
>>union of SAX2, DOM2 core, and the XML infoset and then some.  As far
>>as the DTD is concerned, pullax currently provides approximately the
>>same information as the union of the XML Infoset and DOM Level 2 core.
>>I have opted not to provide the detailed lexical information about the
>>DTD that SAX2 provides. It seems to me that it is not much use having
>>lexical information about DTDs if you lose information about parameter
>>entities within declarations; but dealing with parameter entities
>>within declarations is just too hard for a general-purpose API,
>>especially when you consider nested parameter entity references. I believe
>>DTD editor type applications really require specialized APIs and
>>parsers (eg DTDinst see http://www.thaiopensource.com/dtdinst).
>>
>>Another respect in which pullax's approach to DTDs differs from SAX is
>>that it represents the DOCTYPE declaration as a single item.  There
>>does not seem much point in breaking it down into multiple items.  Most
>>of the information is in the XmlDtd object which is available from the
>>XmlContext.  Note that the XmlDtd object is immutable.  I'm planning
>>to extend the API to allow straightforward DTD caching: the idea is
>>that a user-supplied XmlDtdResolver object will map the system id,
>>public id and internal subset to an XmlDtd object.
>>
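Since the DTD caching extension is only planned, the following is a guess at how such a resolver might be used, not a description of it. Dtd, DtdResolver and CachingResolver are hypothetical names, and the "parse" step is a placeholder. The sketch keys a cache on the three values named above, which is safe to share because the DTD object is immutable.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real XmlDtdResolver is not yet specified.
final class DtdCacheSketch {
    // Stand-in for an immutable parsed DTD.
    static final class Dtd {
        final String systemId;
        Dtd(String systemId) { this.systemId = systemId; }
    }

    interface DtdResolver {
        Dtd resolve(String systemId, String publicId, String internalSubset);
    }

    // Returns the same Dtd object for the same (system id, public id,
    // internal subset) triple; any of the three may be null.
    static final class CachingResolver implements DtdResolver {
        private final Map<List<String>, Dtd> cache = new HashMap<List<String>, Dtd>();
        private int parseCount = 0;

        public Dtd resolve(String systemId, String publicId, String internalSubset) {
            List<String> key = Arrays.asList(systemId, publicId, internalSubset);
            Dtd dtd = cache.get(key);
            if (dtd == null) {
                parseCount++;
                dtd = new Dtd(systemId); // placeholder: a real resolver would parse here
                cache.put(key, dtd);
            }
            return dtd;
        }

        int getParseCount() { return parseCount; }
    }
}
```

The internal subset has to be part of the key because two documents that reference the same external DTD can still differ in their internal subsets.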
>>I've written too much already.  I'll be happy to answer any questions
>>people may have about the design and I'll try to get the API doc into
>>shape as soon as possible.
>>
>>James
>>
>>
>>
>>-----------------------------------------------------------------
>>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>>initiative of OASIS <http://www.oasis-open.org>
>>
>>The list archives are at http://lists.xml.org/archives/xml-dev/
>>
>>To subscribe or unsubscribe from this list use the subscription
>>manager: <http://lists.xml.org/ob/adm.pl>
>
>
>--
>-- -----------------------------------------------------------------
>Robin La Fontaine, Director, Monsell EDM Ltd
>DeltaXML: "Change control for XML in XML"
>Tel: +44 1684 592 144 Fax: +44 1684 594 504
>Email: [EMAIL PROTECTED]      http://www.deltaxml.com


_______________________________________________
dom4j-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dom4j-dev
