Re: Problem with parsing HTML

Yizhou Z. Sun, 13 May 2012 20:13:37 -0700

I just tried parsing some other HTML files and found that Xerces worked well
for the "input" tags in those files. The earlier problem seems to have
something to do with NekoHTML's parser.
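
One way to narrow this down further might be to list the runtime class behind
each tag in a parsed document, so that any tag falling back to a generic
element class stands out. The following is only a minimal sketch: it uses the
JDK's built-in JAXP parser as a stand-in so that it runs without extra jars,
and the names ElementClassReport, classesByTag and parseSample are made up for
illustration. With NekoHTML, the Document returned by parser.getDocument()
would be passed to classesByTag instead of the stand-in document.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces or NekoHTML.
class ElementClassReport {

    // Maps each distinct tag name to the runtime class of the first
    // element seen with that name, in document order.
    static Map<String, String> classesByTag(Document doc) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    // Depth-first walk over the tree, recording element classes only.
    private static void collect(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = ((Element) node).getTagName();
            out.putIfAbsent(tag, node.getClass().getName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    // Stand-in for the real parse step (assumption: well-formed markup,
    // parsed with the JDK's default DocumentBuilder rather than NekoHTML).
    static Document parseSample(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseSample(
            "<html><body><form>"
            + "<input type=\"password\" id=\"Password1\"/>"
            + "</form></body></html>");
        // Prints one "tag -> class" line per distinct tag name.
        classesByTag(doc).forEach(
            (tag, cls) -> System.out.println(tag + " -> " + cls));
    }
}
```

The class names printed depend on which DOM implementation built the document,
so the useful comparison is between the same file parsed through NekoHTML and
a file where "input" is known to come back as the expected HTML element class.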


On Sun, May 13, 2012 at 1:22 PM, Yizhou Z. <westward.zh...@gmail.com> wrote:

> NekoHTML parser uses Xerces' HTML DOM implementation. And it seems that it
> can always return the appropriate HTML DOM element objects for other types
> of element nodes.  But for <input />, I found it returns an object of type
> "org.apache.xerces.dom.ElementNSImpl". I wonder if this is a bug in the
> version of Xerces that I use.
>
> Thanks.
>
>
> On Sun, May 13, 2012 at 5:34 AM, Michael Glavassevich <mrgla...@ca.ibm.com> wrote:
>
>> Have you tried setting the 'document-class-name' property [1] so that it
>> points to Xerces' HTML DOM implementation?
>>
>> Thanks.
>>
>> [1]
>> http://xerces.apache.org/xerces2-j/properties.html#dom.document-class-name
>>
>> Michael Glavassevich
>> XML Technologies and WAS Development
>> IBM Toronto Lab
>> E-mail: mrgla...@ca.ibm.com
>> E-mail: mrgla...@apache.org
>>
>> "Yizhou Z." <westward.zh...@gmail.com> wrote on 12/05/2012 11:40:23 AM:
>>
>>
>> > Hi. I am using NekoHTML to parse a piece of HTML code which includes
>> > an input element:
>>
>> > <input type="password" name="pw" maxlength="20" class="password"
>> > id="Password1" />
>> >
>> > My program for parsing HTML is below.
>> >
>> > DOMParser parser = new DOMParser();
>> > parser.setProperty("http://cyberneko.org/html/properties/default-encoding",
>> >   "UTF-8");
>> > parser.setProperty("http://cyberneko.org/html/properties/filters",
>> >   new XMLDocumentFilter[] { new DefaultFilter() {
>> >     public void startElement(QName element, XMLAttributes attrs,
>> >         Augmentations augs) throws XNIException {
>> >       element.uri = null;
>> >       super.startElement(element, attrs, augs);
>> >     }
>> > } });
>> > BufferedReader in = new BufferedReader(new FileReader("./test.html"));
>> > parser.parse(new InputSource(in));
>> > HTMLDocument d = (HTMLDocument) parser.getDocument();
>> > System.out.println(d.getElementById("Password1").getClass());
>> >
>> > The program above prints "class
>> > org.apache.xerces.dom.ElementNSImpl" rather than "class
>> > org.apache.html.dom.HTMLInputElementImpl", which puzzles me. Is
>> > there anything I did wrong?
>> >
>> > Thanks!
>>
>
>
