Re: DOMNormalizer question

Michael Glavassevich Mon, 18 Dec 2006 10:28:42 -0800

Hi Jake,

The code you found in DOMNormalizer is looping over the attributes in the 
document, not over all of the possible attributes declared in the DTD. If a 
defaulted attribute is missing from the DOM then there's probably a bug 
somewhere else in the class, which wouldn't surprise me. Around this time 
last year [1], in-memory DTD validation was completely broken. I spent a 
couple of weeks fixing most of the major issues [2][3][4][5][6][7][8], but I 
didn't get through all of them and haven't found the time to clear up the 
rest.
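
To illustrate the distinction between specified and defaulted attributes, 
here is a minimal sketch using only the standard JAXP and DOM APIs. It shows 
parse-time behaviour, not the DOMNormalizer code path itself. The class name, 
the tiny inline DTD and the describe() helper are made up for the example. A 
defaulted attribute is expected to be present in the DOM, with 
Attr.getSpecified() returning false.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Xerces.
public class DefaultedAttrDemo {

    // Illustrative document: "a" has a DTD default, "b" is written in the instance.
    static final String XML =
        "<!DOCTYPE r [<!ELEMENT r EMPTY>"
        + "<!ATTLIST r a CDATA \"x\" b CDATA #IMPLIED>]>"
        + "<r b=\"1\"/>";

    // Parses the XML and reports each attribute on the root element,
    // together with whether it was specified in the instance document.
    static String describe(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            sb.append(attr.getName()).append('=').append(attr.getValue())
              .append(" specified=").append(attr.getSpecified()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe(XML));
    }
}
```

If an attribute with a DTD default does not show up in a listing like this 
after in-memory validation, that points at the kind of bug described above 
rather than at the loop itself.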


Thanks.

[1] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=113285279019052&w=2
[2] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113333032523512&w=2
[3] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113338115425840&w=2
[4] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113402200124272&w=2
[5] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113337500722384&w=2
[6] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113389841006312&w=2
[7] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113399680924552&w=2
[8] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113330014128271&w=2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

Jacob Kjome <[EMAIL PROTECTED]> wrote on 12/15/2006 01:55:38 PM:

> Based on something Michael Glavassevich said about validating an HTML 
> document in memory using normalizeDocument() [1] (to get "id" 
> attributes registered as type "ID", for optimized getElementById() 
> lookup), I tried an experiment.  I parsed an HTML document using the 
> Xerces DOMParser, providing it with the NekoHTML 
> HTMLConfiguration.  First I tried validating against the HTML 4.01 
> DTD, but since it's totally malformed (XHTML Basic 1.0 DTD is invalid 
> and now this? who writes these flippin things????), I took the XHTML 
> 1.0 Strict DTD and changed all the elements to be declared in upper 
> case (and removed "xmlns" and "xml:space" stuff) and obtained the 
> local URL via a Catalog-based entity resolver.  I set the following 
> parameters...
> 
>      config.setParameter("validate", Boolean.TRUE);
>      config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
>      config.setParameter("schema-location", url.toExternalForm());
>      config.setParameter("namespaces", Boolean.FALSE);
>      config.setParameter("well-formed", Boolean.FALSE);
> 
> It all loads up just fine, but fails because of a 
> NullPointerException in HTMLElementImpl when calling 
> getAttributeNodeNS() inside DOMNormalizer.startElement() (see line 
> 1790)...
> 
> for (int i = 0; i < attrCount; i++) {
>      attributes.getName(i, fAttrQName);
>      Attr attr = null;
> 
>      attr = currentElement.getAttributeNodeNS(fAttrQName.uri, fAttrQName.localpart);
>      ....
>      ....
> }
> 
> This is because HTMLElementImpl, on line 158, calls toLowerCase() on 
> the localName...
> 
> return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );
> 
> 
> The reason why the localName is null in this case is that the "for" 
> loop above loops over *all* possible attributes of the element 
> without checking for attribute.isSpecified() before calling 
> getAttributeNodeNS().  If the attribute is not specified, of course 
> it is going to be null, so why bother calling it?
> 
> I worked around this by modifying 
> HTMLElementImpl.getAttributeNodeNS() to return null if the provided 
> 'localName' is null, avoiding the inevitable NullPointerException 
> upon the toLowerCase() call.  The in-memory validation works after 
> this change!  Yippie!
> 
> So, the question is, where is this properly fixed?  I suppose it 
> would be smart for HTMLElementImpl to be checking for null before 
> attempting to manipulate the string to put it in all lowercase, so, 
> maybe that should be patched regardless.  However, shouldn't the 
> first line in the "for" loop of DOMNormalizer.startElement() be....
> 
> if (!attributes.isSpecified(i)) continue;
> 
> If the attribute isn't specified, why attempt to get the attribute 
> node?  It's already known that it's going to be null, isn't 
> it?  Wouldn't this even be a minor optimization?  Is there a good 
> reason not to do this?
> 
> 
> Jake
> 
> 
> [1] http://issues.apache.org/jira/browse/XERCESJ-1200 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

