Hi Jake,

The code you found in DOMNormalizer loops over the attributes present in the document, not over all of the possible attributes declared in the DTD. If a defaulted attribute is missing from the DOM, then there's probably a bug somewhere else in the class, which wouldn't surprise me. Around this time last year [1], in-memory DTD validation was completely broken. I spent a couple of weeks fixing most of the major issues [2][3][4][5][6][7][8], but I didn't get through all of them and haven't found the time to clear up the rest.
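For reference, the null guard described in the quoted message below can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the actual Xerces code: the class name NullSafeLocalName and the helper lowerCaseOrNull are hypothetical, and the sketch only mirrors the toLowerCase(Locale.ENGLISH) call quoted from HTMLElementImpl.

```java
import java.util.Locale;

public class NullSafeLocalName {

    // Hypothetical helper (not part of Xerces): lower-cases an attribute's
    // local name, but passes null through unchanged instead of throwing a
    // NullPointerException.
    static String lowerCaseOrNull(String localName) {
        return localName == null ? null : localName.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // A non-null name is lower-cased as before.
        System.out.println(lowerCaseOrNull("CLASS"));
        // A null name no longer triggers a NullPointerException.
        System.out.println(lowerCaseOrNull(null));
    }
}
```

A caller that receives null back from such a helper would still have to decide what to do with it. In the workaround described below, getAttributeNodeNS() simply returns null in that case.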
Thanks.

[1] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=113285279019052&w=2
[2] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113333032523512&w=2
[3] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113338115425840&w=2
[4] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113402200124272&w=2
[5] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113337500722384&w=2
[6] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113389841006312&w=2
[7] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113399680924552&w=2
[8] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113330014128271&w=2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

Jacob Kjome <[EMAIL PROTECTED]> wrote on 12/15/2006 01:55:38 PM:

> Based on something Michael Glavassevich said about validating an HTML
> document in memory using normalizeDocument() [1] (to get "id"
> attributes registered as type "ID", for optimized getElementById()
> lookup), I tried an experiment. I parsed an HTML document using the
> Xerces DOMParser, providing it with the NekoHTML
> HTMLConfiguration. First I tried validating against the HTML 4.01
> DTD, but since it's totally malformed (XHTML Basic 1.0 DTD is invalid
> and now this? who writes these flippin things????), I took the XHTML
> 1.0 Strict DTD and changed all the elements to be declared in upper
> case (and removed "xmlns" and "xml:space" stuff) and obtained the
> local URL via a Catalog-based entity resolver. I set the following
> parameters...
>
> config.setParameter("validate", Boolean.TRUE);
> config.setParameter("schema-type",
>     javax.xml.XMLConstants.XML_DTD_NS_URI);
> config.setParameter("schema-location", url.toExternalForm());
> config.setParameter("namespaces", Boolean.FALSE);
> config.setParameter("well-formed", Boolean.FALSE);
>
> It all loads up just fine, but fails because of a
> NullPointerException in HTMLElementImpl when calling
> getAttributeNodeNS() inside DOMNormalizer.startElement() (see line 1790)...
>
> for (int i = 0; i < attrCount; i++) {
>     attributes.getName(i, fAttrQName);
>     Attr attr = null;
>
>     attr = currentElement.getAttributeNodeNS(fAttrQName.uri,
>         fAttrQName.localpart);
>     ....
>     ....
> }
>
> This is because HTMLElementImpl, on line 158, calls toLowerCase() on
> the localName...
>
> return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );
>
>
> The reason why the localName is null in this case is that the "for"
> loop above loops over *all* possible attributes of the element
> without checking for attribute.isSpecified() before calling
> getAttributeNodeNS(). If the attribute is not specified, of course
> it is going to be null, so why bother calling it?
>
> I worked around this by modifying
> HTMLElementImpl.getAttributeNodeNS() to return null if the provided
> 'localName' is null, avoiding the inevitable NullPointerException
> upon the toLowerCase() call. The in memory validation works after
> this change! Yippie!
>
> So, the question is, where is this properly fixed? I suppose it
> would be smart for HTMLElementImpl to be checking for null before
> attempting to manipulate the string to put it in all lowercase, so,
> maybe that should be patched regardless. However, shouldn't the
> first line in the "for" loop of DOMNormalizer.startElement() be....
>
> if (!attributes.isSpecified(i)) continue;
>
> If the attribute isn't specified, why attempt to get the attribute
> node? It's already known that it's going to be null, isn't
> it? Wouldn't this even be a minor optimization? Is there a good
> reason not to do this?
>
>
> Jake
>
>
> [1] http://issues.apache.org/jira/browse/XERCESJ-1200

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]