Hi Scott

From: "Scott Willy" <[EMAIL PROTECTED]>

> My concern is that it is easy to create a DOM in Java which contains
> invalid XML characters (those outside the range of: #x9 | #xA | #xD |
> [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). So the question is
> when/where do you catch invalid characters.
>
> On input, on output, or when you re-parse the previous invalid output.
>
> I would suggest that when you re-parse is way too late. I would agree that
> filtering on input could be a bit painful (and not what Java guys want).

Most parsers do the validation at the parse stage which seems a good place.
In dom4j's case, putting this at the DocumentFactory level may make sense.
dom4j is developed around the idea of pluggable DocumentFactory objects
which can be used to create pluggable XML tree model implementations.

Whenever you're parsing documents, or creating documents from DOM trees or
programmatically via Java code, it might be nice to use a
ValidatingDocumentFactory rather than the default DocumentFactory. This
ValidatingDocumentFactory could throw an exception if an invalid string
is used for a name or value.
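
As a rough sketch of the check such a factory might run on each name or
value, the following tests a string against the character ranges Scott
quotes above. ValidatingDocumentFactory is only a proposal in this thread,
and the XmlChars class and its method names are made up for illustration;
they are not part of dom4j.

```java
/** Hypothetical helper, not a dom4j class. */
final class XmlChars {
    private XmlChars() {}

    /** True if the code point is in the XML 1.0 character ranges quoted above. */
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first invalid character, or -1 if there is none. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            // codePointAt returns a lone surrogate as-is, which isValid rejects
            int cp = text.codePointAt(i);
            if (!isValid(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Returns the text unchanged, or throws if it holds an invalid character. */
    static String check(String text) {
        int index = firstInvalidIndex(text);
        if (index >= 0) {
            throw new IllegalArgumentException(
                "Invalid XML character at index " + index);
        }
        return text;
    }
}
```

A validating factory could call something like XmlChars.check() on each
string before creating the corresponding node.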

So this would mean you could plug in validation when you need it, via
application code or via the System property org.dom4j.factory (*).

e.g. via code

    SAXReader reader = new SAXReader();
    reader.setDocumentFactory( ValidatingDocumentFactory.getInstance() );
    Document document = reader.read( "foo.xml" );

or via a system property...

    java -Dorg.dom4j.factory=org.dom4j.ValidatingDocumentFactory MyApp

Maybe an EncodingDocumentFactory could be an optional DocumentFactory -
rather than failing on invalid characters it would explicitly encode them?
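
To make the "encode rather than fail" idea concrete, here is a minimal
sketch that rewrites each invalid character as a visible marker. The
[U+XXXX] notation, the class name and the method name are all assumptions
for illustration; nothing like this exists in dom4j as far as this thread
goes, and the marker is not reversible unless a literal "[U+" in the input
is escaped as well.

```java
/** Hypothetical helper, not a dom4j class. */
final class InvalidCharEscaper {
    private InvalidCharEscaper() {}

    /** Replaces characters outside the XML 1.0 ranges with a [U+XXXX] marker. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            } else {
                // made-up marker format; an application would pick its own
                out.append(String.format("[U+%04X]", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```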


(*) at some point I'd like a more flexible configuration mechanism for dom4j
such as a config file on the CLASSPATH.


> Then this leaves trapping invalid characters on output. I would suggest
> that at minimum the outputter should be able to complain (maybe be able to
> turn this off) that it has been asked to generate invalid XML.


I'd rather not make all writing of XML text slow, so I'd prefer the check
to be optional. Putting it in the DocumentFactory seems the best place: a
dom4j tree can be written to SAX, DOM, text and XSLT, so checking on output
would mean four separate checks. Better to put the validation in a single
place, the ValidatingDocumentFactory.


> A hook to do something to fix the output at this point would be great. I
> think hook is needed, because, unless I have missed something in XML,
> there is no standard way to do the encoding of non-XML chars (base64 is
> recommended, but you also need to somehow flag the encoding, so this is
> user code).

Agreed.

If a developer knows that their String may contain invalid characters then
something like the following may be used...

    // encode XML sensitive characters
    String escapedString = DocumentHelper.encodeString( "some text > < & " );

    // encode binary data
    byte[] someData = ...;
    String encodedData = DocumentHelper.encodeBinary( someData );
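
The two helpers above are only a suggestion, so here is one way they might
be written as standalone code. The class name EncodingSketch is made up,
the method names are borrowed from the example above, and the binary case
assumes Base64 via java.util.Base64 (Java 8 or later). Flagging that a
value is Base64-encoded is still left to the application, as Scott notes.

```java
import java.util.Base64;

/** Hypothetical helper, not a dom4j class. */
final class EncodingSketch {
    private EncodingSketch() {}

    /** Escapes the XML-sensitive characters < > and & as entity references. */
    static String encodeString(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Encodes arbitrary bytes as Base64 text, which uses only valid XML characters. */
    static String encodeBinary(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }
}
```

The reader of the document would reverse the binary step with
Base64.getDecoder().decode(...).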

Note that if you have regular text which contains special XML characters
like < > & then you can always use a CDATA section, provided the text does
not itself contain the sequence ]]>.
For example:

    Element element = ...;
    element.addCDATA( "<this contains > & special characters !" );

James

