Re: [dom4j-dev] Fwd: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator

James Strachan Wed, 19 Dec 2001 11:08:52 -0800

Hi Thomas

This looks interesting. I've been working with XPP2 quite closely lately. CVS
now has a new XPPReader that uses XPP2, which is about 10-20% faster than
SAX. Hopefully we can make it lazy so that it only parses the parts
of the document that are required.


XPP2 and James Clark's pullax look like similar things.

James
----- Original Message -----
From: "Thomas Nichols" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, December 18, 2001 8:13 PM
Subject: [dom4j-dev] Fwd: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator


> Good Day,
>
> Thought this might be of interest : a Java XML API (currently early beta)
> designed by James Clark - Technical Lead for XML 1.0, wrote XP and TREX
> (now merged into ISO RELAX NG), as well as being a very decent chap. DDJ
> article about him at http://www.ddj.com/documents/s=862/ddj0107e/0107e.htm
> There are some interesting ideas below; I haven't yet worked out whether
> "pullax" would need an adapter for dom4j, or vice versa. Thoughts, anyone?
> Regards,
> Thomas.
>
>
>
>
> >>From: "James Clark" <[EMAIL PROTECTED]>
> >>To: "John Cowan" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
> >>Date: Tue, 18 Dec 2001 12:34:10 +0700
> >>X-Priority: 3
> >>Subject: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator
> >>
> >>  > This is a first design for XMLIterator, a third base-level API
> >>>  which allows an application to pull content from XML.  This
> >>>  avoids the memory demand and navigation issues of DOM, and
> >>>  is a more straightforward programming model than SAX, which
> >>>  requires magic data connections between the event handlers in
> >>>  order to maintain application state.  XMLIterator extends
> >>>  the familiar Iterator interface, so it models an XML document
> >>>  as a linear collection of partially specified nodes.
> >>
> >>I very much agree that we need such an API.  SAX works great for some
> >>kinds of application.  In particular, it works well for generic XML
> >>applications which do not have to parse a particular XML vocabulary.
> >>However, SAX is really awkward for some applications, particularly
> >>applications that parse a particular XML vocabulary with a complex,
> >>highly nested structure.
> >>
> >>As it happens, I have been working on a similar API for the last few
> >>months.  One impetus for doing this was my experience in implementing
> >>Jing. I was struck by how painful it was to parse a RELAX NG schema
> >>into an internal form using SAX.  The equivalent non-XML syntax was
> >>easily parsed using a straightforward recursive descent parser.  By
> >>contrast, the parser for the XML syntax was a warped and twisted mess.
> >>
> >>My API is currently called "pullax" (pull API for XML). This is still
> >>very much work in progress.  I hadn't been planning to release for a
> >>month or two yet.  But since you have started this discussion, I think
> >>the most constructive thing I can do is to release what I have now.  I
> >>do have quite a comprehensive API and I do have a fairly complete
> >>sample implementation.  I have made this available at
> >>
> >>   http://www.thaiopensource.com/pullax/
> >>
> >>I chose to do my initial sample implementation on top of Xerces 2
> >>because it provides a native interface (XNI) with a "pull" parser
> >>API. (I would call it a "controlled push" rather than a "pull"
> >>API. Roughly, it has a variant of XMLReader.parse which you call
> >>multiple times; on each call, it parses some portion of the document
> >>making SAX-like callbacks on handlers.)  This allows an implementation
> >>that neither requires the whole document in memory (as would an
> >>implementation on top of DOM), nor the use of threads (as would an
> >>implementation on top of SAX).  XNI also provides a very rich set of
> >>information. You'll need Xerces 2 Beta 3 if you want to play with my
> >>implementation.  See
> >>
> >>    http://xml.apache.org/xerces2-j/index.html
> >>
> >>Obviously, SAX and DOM adapters are on my list of things to do.
> >>
> >>The bad news is that the API documentation is pretty pathetic at the
> >>moment and still needs a lot of work. This message will have to serve
> >>as an overview of the API for now.
> >>
> >>In designing pullax, I have tried to follow modern Java best
> >>practices, for example, in favoring immutability and using classes for
> >>type-safe enumerations. One of my main guides here has been Joshua
> >>Bloch's book "Effective Java"
> >>(http://java.sun.com/docs/books/effective/).  This is a truly
> >>excellent book done by the guy who designed several of the better
> >>recent Java platform APIs (including the Collections API).
> >>
> >>Perhaps the most fundamental decision in designing a pull API is
> >>whether the properties for each node are provided
> >>
> >>(a) by methods on some sort of node object returned by the
> >>scanner/parser/iterator object
> >>
> >>(b) by methods on the scanner/parser object itself; the scanner/parser
> >>object has methods to move to the next node
> >>
> >>You've chosen (a).  A couple of notable pull APIs use (b):
> >>
> >>- the XmlReader API in .NET; this is the principal XML parser API for
> >>.NET (see
> >>http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemxmlxmlreaderclasstopic.asp)
> >>
> >>- XML Pull Parser (http://www.extreme.indiana.edu/soap/xpp/)
> >>
> >>I tried it both ways in pullax.  I ended up, like you, with (a), for
> >>the following reasons:
> >>
> >>1. Handling attributes in (b) is messy
> >>
> >>2. (a) works more like the java.util.Iterator and
> >>java.util.Enumeration that are familiar to every Java programmer
> >>
> >>3. (a) makes it much easier to construct filters/processing pipelines;
> >>for example, writing a RELAX NG validator that wraps around a
> >>non-validating parser.
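Point 3 above can be illustrated with a minimal sketch of design (a). All names here (Item, ItemSource, ListSource, DropWhitespaceText) are hypothetical and are not taken from pullax or XMLIterator. The only point is that, when items are objects, a filter can expose the same interface as the source it wraps, so stages can be stacked.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical names throughout; none of this is the pullax API.
final class Item {
    enum Kind { START_ELEMENT, END_ELEMENT, TEXT }
    final Kind kind;
    final String value; // element name, or text content for TEXT
    Item(Kind kind, String value) { this.kind = kind; this.value = value; }
}

// Design (a): the source hands out item objects; null signals end of input.
interface ItemSource {
    Item next();
}

// A source backed by a fixed list, standing in for a real parser.
final class ListSource implements ItemSource {
    private final Iterator<Item> it;
    ListSource(List<Item> items) { this.it = items.iterator(); }
    public Item next() { return it.hasNext() ? it.next() : null; }
}

// A filter is just another ItemSource wrapped around an upstream one.
final class DropWhitespaceText implements ItemSource {
    private final ItemSource upstream;
    DropWhitespaceText(ItemSource upstream) { this.upstream = upstream; }
    public Item next() {
        Item item;
        while ((item = upstream.next()) != null) {
            boolean blankText = item.kind == Item.Kind.TEXT
                    && item.value.trim().isEmpty();
            if (!blankText) return item; // pass everything else through unchanged
        }
        return null; // upstream exhausted
    }
}
```

A validator or any other stage could be layered the same way, e.g. new DropWhitespaceText(new ListSource(items)), because the consumer only ever sees an ItemSource.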
> >>
> >>The main argument against (a) is that it involves more object
> >>creation, which, according to Java folklore, is a performance killer.
> >>
> >>Now, you've minimized object creation by having next() implicitly
> >>invalidate any previously returned nodes. I don't think this is an
> >>acceptable design for an API intended for widespread public use:
> >>
> >>1. It's a common requirement to need to look ahead in the document when
> >>deciding how to process the current node.  Your design makes this
> >>awkward.  It also makes it very awkward to write a filter that needs
> >>lookahead in doing its filtering (imagine a filter that merges
> >>adjacent text nodes).
> >>
> >>2. This behavior would be a big surprise to the average Java user.
> >>The Iterators and Enumerations which a typical Java user will be
> >>familiar with just don't work like this.
> >>
> >>3. It's the kind of API that leads to "Write Once, Debug Everywhere"
> >>rather than "Write Once, Run Everywhere".  A typical scenario is that
> >>a user writes an application that needs lookahead; they incorrectly
> >>access an XMLNode object after another call to next(); they test their
> >>application with an implementation that allocates a new XMLNode object
> >>for each next() call; their application appears to work fine. Then
> >>somebody else tries to use the application with a parser
> >>implementation that reuses XMLNode objects and the application
> >>mysteriously and silently gives the wrong results.
> >>
> >>In summary, this design does not promote reliability.  I believe
> >>priority should be given to reliability over performance.
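The failure scenario in point 3 can be reproduced in a few lines. This is a sketch with hypothetical classes (MutableNode, AllocatingSource, ReusingSource), not the proposed XMLIterator API. The same application code holds on to a node across one lookahead call and gets a different answer depending on whether the source allocates a new node or recycles one.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical classes for illustration; not XMLIterator's or pullax's types.
final class MutableNode {
    String text;
}

// Strategy 1: allocate a fresh node per call, so earlier nodes stay valid.
final class AllocatingSource {
    private final Iterator<String> it;
    AllocatingSource(List<String> texts) { it = texts.iterator(); }
    MutableNode next() {
        if (!it.hasNext()) return null;
        MutableNode n = new MutableNode();
        n.text = it.next();
        return n;
    }
}

// Strategy 2: recycle one node, silently overwriting what the caller holds.
final class ReusingSource {
    private final Iterator<String> it;
    private final MutableNode shared = new MutableNode();
    ReusingSource(List<String> texts) { it = texts.iterator(); }
    MutableNode next() {
        if (!it.hasNext()) return null;
        shared.text = it.next();
        return shared;
    }
}

final class LookaheadDemo {
    public static void main(String[] args) {
        List<String> doc = Arrays.asList("first", "second");

        AllocatingSource a = new AllocatingSource(doc);
        MutableNode cur = a.next();
        a.next();                       // lookahead
        System.out.println(cur.text);   // prints "first"

        ReusingSource r = new ReusingSource(doc);
        MutableNode cur2 = r.next();
        r.next();                       // lookahead overwrites the shared node
        System.out.println(cur2.text);  // prints "second"
    }
}
```

No exception is thrown in the second case, which is why the bug only shows up as wrong output when the implementation is swapped.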
> >>
> >>My "solution" is simply to accept the object creation.  Modern Java
> >>VMs (like Hotspot) do a fantastic job of efficient allocation of
> >>short-lived objects; object creation has much less performance
> >>overhead with modern VMs than it used to with classic VMs.  In any
> >>case, a user that is prepared to sacrifice programming convenience for
> >>an extra ounce of performance can use SAX. (Also, since the objects
> >>returned are immutable, there is an opportunity for reducing object
> >>creation by sharing.)
> >>
> >>The central interface in my API is XmlScanner. (I'm planning a
> >>companion XmlPrinter interface for writing XML.) This corresponds to
> >>your XMLIterator interface.  This interface is similar to
> >>java.util.Iterator but I chose not to derive XmlScanner from Iterator,
> >>for two reasons:
> >>
> >>1. the equivalents of the next() and hasNext() methods need to be
> >>able to throw a java.io.IOException
> >>
> >>2. it's awkward and inefficient to have always to cast the return
> >>value of next()
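The two reasons above can be made concrete with a small sketch. The interface below is a hypothetical stand-in (ItemScanner, LineItem and LineScanner are invented names; the real XmlScanner and XmlItem signatures are not given in this message). It keeps the hasNext()/next() shape of java.util.Iterator but declares IOException and returns a specific type.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an item type.
final class LineItem {
    final String text;
    LineItem(String text) { this.text = text; }
}

// Iterator-like, but both methods may throw IOException and next() is typed,
// so callers neither wrap checked exceptions nor cast the result.
interface ItemScanner {
    boolean hasNext() throws IOException;
    LineItem next() throws IOException;
}

// Toy implementation that treats each line of a Reader as one item.
final class LineScanner implements ItemScanner {
    private final BufferedReader in;
    private String pending; // line read ahead by hasNext(), or null

    LineScanner(Reader reader) { this.in = new BufferedReader(reader); }

    public boolean hasNext() throws IOException {
        if (pending == null) pending = in.readLine();
        return pending != null;
    }

    public LineItem next() throws IOException {
        if (!hasNext()) throw new NoSuchElementException();
        LineItem item = new LineItem(pending);
        pending = null;
        return item;
    }
}
```

java.util.Iterator cannot express the first point because its methods declare no checked exceptions, so an implementation reading from a stream would have to wrap IOException in an unchecked exception. The cast in the second point was unavoidable before generics were added to the language.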
> >>
> >>My XmlScanner object returns XmlItem objects.  I call these objects
> >>"items" rather than "nodes" because "node" to me suggests a tree view
> >>where elements have children rather than a flat view with
> >>start-element and end-element objects.
> >>
> >>My XmlItem object has similar methods to your XMLNode object to return
> >>the item type, the local name, namespace URI, QName, prefix, value
> >>etc.  The method names are chosen based on the Infoset and XPath.
> >>
> >>I toyed with the approach to attributes that you took, that is, having
> >>ATTRIBUTE items following the START_ELEMENT item. This has the
> >>advantage of being simple. However, I found it inconvenient to work
> >>with and felt it would seem rather strange to anybody with exposure to
> >>SAX or DOM.  So instead an XmlItem of type START_ELEMENT has
> >>getAttribute() methods that return an XmlItem for an attribute
> >>identified by name or index.
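A rough sketch of that shape follows. SketchItem and its factory methods are hypothetical, and namespaces are ignored for brevity; the actual pullax XmlItem signatures may well differ. It shows an immutable start-element item whose attributes are themselves items, reachable by name or by index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the pullax XmlItem class.
final class SketchItem {
    enum Type { START_ELEMENT, ATTRIBUTE }

    private final Type type;
    private final String name;
    private final String value;                 // null for elements
    private final List<SketchItem> attributes;  // empty for attributes

    private SketchItem(Type type, String name, String value,
                       List<SketchItem> attributes) {
        this.type = type;
        this.name = name;
        this.value = value;
        // Defensive copy keeps the item immutable after construction.
        this.attributes = Collections.unmodifiableList(
                new ArrayList<SketchItem>(attributes));
    }

    static SketchItem attribute(String name, String value) {
        return new SketchItem(Type.ATTRIBUTE, name, value,
                Collections.<SketchItem>emptyList());
    }

    static SketchItem startElement(String name, List<SketchItem> attributes) {
        return new SketchItem(Type.START_ELEMENT, name, null, attributes);
    }

    Type getType() { return type; }
    String getName() { return name; }
    String getValue() { return value; }
    int getAttributeCount() { return attributes.size(); }

    // Lookup by index, in the order the attributes were supplied.
    SketchItem getAttribute(int index) { return attributes.get(index); }

    // Lookup by name; returns null when no such attribute exists.
    SketchItem getAttribute(String attrName) {
        for (SketchItem a : attributes) {
            if (a.name.equals(attrName)) return a;
        }
        return null;
    }
}
```

With this arrangement the scanner never emits ATTRIBUTE items in the main sequence; a caller asks the START_ELEMENT item for them, much as it would ask a SAX Attributes object or a DOM Element.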
> >>
> >>XmlItem has a getContext() method returning an XmlContext object.
> >>This provides information about the context of the item, such as the
> >>in-scope namespaces.  Typically, many XmlItem objects can share the
> >>same XmlContext object.
> >>
> >>A major challenge in designing a general-purpose XML API is to deal
> >>with the diversity of XML applications.  At one end of the spectrum
> >>are simple applications that need no more than elements, attributes
> >>and text (the "holy trinity of XML" as I think David Megginson once
> >>called them).  At the other end of the spectrum are applications such
> >>as XML editors that want as much detail about the markup as they can
> >>get including things like comments and entities.  Just as there is a
> >>diversity of XML applications, so is there a diversity of XML
> >>processors/parsers.  There are large, complex parsers like Xerces that
> >>provide a very rich set of information but take a corresponding hit in terms
> >>of size and speed.  There is also a need for simpler parsers that do
> >>less but can be smaller and faster.
> >>
> >>The solution I use in pullax is based on the "feature" concept of
> >>SAX2.  An implementation of the pullax API implements the
> >>XmlScannerFactory interface. By default an XmlScanner created by an
> >>XmlScannerFactory returns exactly three types of XmlItem:
> >>START_ELEMENT, END_ELEMENT, TEXT.  Also by default TEXT items are
> >>maximal.  So, for example, the document
> >>
> >>   <doc>4<!-- a silly comment -->2</doc>
> >>
> >>will be returned as three items: a START_ELEMENT item, a TEXT item
> >>with string value "42", and an END_ELEMENT item. If an application
> >>wishes to see, for example, comments, it must request the SHOW_COMMENT
> >>feature from the XmlScannerFactory before creating the XmlScanner.  If
> >>the parser cannot satisfy the request, it must throw an exception.
> >>XmlScannerFactory objects are designed to be dynamically discoverable
> >>using the service provider mechanism (like JAXP).
> >>XmlScannerFactoryFinder is a utility class that takes a set of
> >>features and dynamically finds an XmlScannerFactory implementation
> >>that supports those features.  This approach ensures that the support
> >>for a rich information set in pullax does not get in the way of simple
> >>applications or simple XML processors.
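The default view described above (three item types, maximal TEXT items, comments only on request) can be sketched as follows. This is not pullax code: the JDK's javax.xml.stream (StAX) reader is used as a stand-in for the underlying parser, and DefaultView, items() and the showComments flag are hypothetical names, with the flag playing the role of requesting SHOW_COMMENT.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DefaultView {
    // Returns a flat description of the items an application would see.
    static List<String> items(String xml, boolean showComments)
            throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<String>();
        StringBuilder text = new StringBuilder(); // pending, not yet emitted text
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    text.append(r.getText()); // keep accumulating
                    break;
                case XMLStreamConstants.COMMENT:
                    if (showComments) {
                        flush(text, out);
                        out.add("COMMENT:" + r.getText());
                    }
                    // Otherwise the comment is skipped and text keeps merging.
                    break;
                case XMLStreamConstants.START_ELEMENT:
                    flush(text, out);
                    out.add("START_ELEMENT:" + r.getLocalName());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    flush(text, out);
                    out.add("END_ELEMENT:" + r.getLocalName());
                    break;
                default:
                    break; // other event types are ignored in this sketch
            }
        }
        r.close();
        return out;
    }

    // Emits the accumulated text as a single TEXT item, if there is any.
    private static void flush(StringBuilder text, List<String> out) {
        if (text.length() > 0) {
            out.add("TEXT:" + text);
            text.setLength(0);
        }
    }
}
```

For the document <doc>4<!-- a silly comment -->2</doc>, items(xml, false) yields START_ELEMENT:doc, TEXT:42, END_ELEMENT:doc. With showComments set to true, the comment appears as its own item and splits the text into "4" and "2".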
> >>
> >>The pullax API aims to provide a very rich information set.  As far as
> >>the document instance is concerned, it is intended to support the
> >>union of SAX2, DOM2 core, and the XML infoset and then some.  As far
> >>as the DTD is concerned, pullax currently provides approximately the
> >>same information as the union of the XML Infoset and DOM Level 2 core.
> >>I have opted not to provide the detailed lexical information about the
> >>DTD that SAX2 provides. It seems to me that it is not much use having
> >>lexical information about DTDs if you lose information about parameter
> >>entities within declarations; but dealing with parameter entities
> >>within declarations is just too hard for a general-purpose API,
> >>especially when you consider nested parameter entity references. I believe
> >>DTD editor type applications really require specialized APIs and
> >>parsers (e.g. DTDinst; see http://www.thaiopensource.com/dtdinst).
> >>
> >>Another respect in which pullax's approach to DTDs differs from SAX is
> >>that it represents the DOCTYPE declaration as a single item.  There
> >>does not seem much point in breaking it down into multiple items.  Most
> >>of the information is in the XmlDtd object which is available from the
> >>XmlContext.  Note that the XmlDtd object is immutable.  I'm planning
> >>to extend the API to allow straightforward DTD caching: the idea is
> >>that a user-supplied XmlDtdResolver object will map the system id,
> >>public id and internal subset to an XmlDtd object.
> >>
> >>I've written too much already.  I'll be happy to answer any questions
> >>people may have about the design and I'll try to get the API doc into
> >>shape as soon as possible.
> >>
> >>James
> >>
> >>
> >>
> >>-----------------------------------------------------------------
> >>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> >>initiative of OASIS <http://www.oasis-open.org>
> >>
> >>The list archives are at http://lists.xml.org/archives/xml-dev/
> >>
> >>To subscribe or unsubscribe from this list use the subscription
> >>manager: <http://lists.xml.org/ob/adm.pl>
> >
> >
> >--
> >-- -----------------------------------------------------------------
> >Robin La Fontaine, Director, Monsell EDM Ltd
> >DeltaXML: "Change control for XML in XML"
> >Tel: +44 1684 592 144 Fax: +44 1684 594 504
> >Email: [EMAIL PROTECTED]      http://www.deltaxml.com
>
>
> _______________________________________________
> dom4j-dev mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/dom4j-dev
>




