While investigating a number of security concerns for the Abdera
project, I noticed several problems with DTD handling in the various
StAX parser implementations.  For instance, if you parse the following
XML document with Axiom using the Woodstox parser and then reserialize
it, the output is no longer well-formed (the entity declarations end up
outside any DOCTYPE).

Input:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE feed [
    <!ENTITY foo "bar">
    <!ENTITY bar "foo">
  ]>
  <feed xmlns="http://www.w3.org/2005/Atom">
  </feed>

Output using Woodstox:

  <?xml version="1.0" encoding="utf-8"?>

    <!ENTITY foo "bar">
    <!ENTITY bar "foo">

  <feed xmlns="http://www.w3.org/2005/Atom">
  </feed>

Output using the StAX reference implementation:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE feed [
    <!ENTITY foo "bar">
    <!ENTITY bar "foo">
  ]>
  <feed xmlns="http://www.w3.org/2005/Atom">
  </feed>

Comparing the two, it would appear that there is a bug in Woodstox.
Unfortunately, Woodstox is apparently behaving exactly as the StAX spec
says it should, and it is the StAX reference implementation that is
apparently getting it wrong.  So I had to dig a little deeper.

In StAXOMBuilder, the createDTD method calls parser.getText() to get the
DTD contents.  According to the StAX javadocs and spec, getText() returns
the internal subset of the DTD, not the complete doctype declaration.
So while the StAX reference implementation is doing what we want, it's
apparently not doing what the StAX spec says it should be doing.
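
To see what a given parser hands back at the DTD event, something like
the following minimal sketch should do.  This is not the Axiom code; the
class and method names are made up for illustration, and it runs against
whichever StAX implementation XMLInputFactory.newInstance() happens to
find on the classpath, so the result will differ between parsers.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only (not part of Axiom or StAX).
class DtdTextSketch {

    // Returns whatever getText() reports at the DTD event,
    // or null if the parser never reports a DTD event.
    static String dtdText(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.DTD) {
                        return reader.getText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
              "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
            + "<!DOCTYPE feed [\n"
            + "  <!ENTITY foo \"bar\">\n"
            + "  <!ENTITY bar \"foo\">\n"
            + "]>\n"
            + "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n"
            + "</feed>\n";
        // Prints either the internal subset or the whole declaration,
        // depending on the parser in use.
        System.out.println(dtdText(xml));
    }
}
```

If the printed text starts with "<!DOCTYPE", the parser is returning the
whole declaration; if it starts with the entity declarations, it is
returning only the internal subset.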

According to the Woodstox developers, there is currently no way of
getting at the complete doctype declaration using the standardized
XMLStreamReader interface.  The XMLEventReader interface, however, works
just fine.
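
For comparison, here is a rough sketch of the event-based route.  The
DTD event interface has a getDocumentTypeDeclaration() method, which the
javadocs describe as returning the entire declaration including the
internal subset.  Again, the class and method names are invented for
illustration, and I am assuming the parser on the classpath reports a
DTD event at all.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.DTD;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, for illustration only.
class DtdEventSketch {

    // Returns the doctype declaration as reported by the DTD event,
    // or null if no DTD event is seen.
    static String doctypeDeclaration(String xml) {
        try {
            XMLEventReader events = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            try {
                while (events.hasNext()) {
                    XMLEvent event = events.nextEvent();
                    if (event.getEventType() == XMLStreamConstants.DTD) {
                        return ((DTD) event).getDocumentTypeDeclaration();
                    }
                }
                return null;
            } finally {
                events.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
              "<!DOCTYPE feed [\n"
            + "  <!ENTITY foo \"bar\">\n"
            + "]>\n"
            + "<feed xmlns=\"http://www.w3.org/2005/Atom\"></feed>\n";
        System.out.println(doctypeDeclaration(xml));
    }
}
```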

So where does this leave us?  Using Axiom and Woodstox to parse
documents containing doctype decls produces XML that is not well-formed;
using Axiom and the StAX ref impl means relying on what is apparently
either a bug or a deliberate incompatibility with the spec.

Now, by this point you will have noticed that I am using the word
"apparently" a lot.  That's because I'm basing this on what one Woodstox
developer told me, and I have been unable to verify it myself.

Another problem I've noticed with the StAX DTD handling is that even
when you tell the parser not to replace entity references, it will still
replace entity references found in attribute values, which is more
than just slightly annoying.
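
A small sketch that should reproduce this, assuming the standard
IS_REPLACING_ENTITY_REFERENCES property is what "tell it not to replace
entity references" refers to.  The class and method names are invented,
and the test document is a cut-down one rather than a real Atom feed.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only.
class EntityRefSketch {

    // Parses with entity replacement switched off and returns the value
    // of the first attribute on the first element that has one.
    static String firstAttributeValue(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(
                    XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES,
                    Boolean.FALSE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getAttributeCount() > 0) {
                        return reader.getAttributeValue(0);
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
              "<!DOCTYPE feed [\n"
            + "  <!ENTITY foo \"bar\">\n"
            + "]>\n"
            + "<feed title=\"&foo;\">&foo;</feed>\n";
        // The attribute value comes back already expanded, even though
        // replacement was turned off on the factory.
        System.out.println(firstAttributeValue(xml));
    }
}
```

As far as I can tell the stream API has no way to represent an entity
reference inside an attribute value, so the expansion there may simply
be unavoidable with XMLStreamReader.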

In any case, I wanted to report these issues.  In the very near future I
will also post some feedback on various experiences we've had developing
with Axiom and suggestions on how to make things better.

- James
