Re: HTML Parsing problems...

Peter Becker Thu, 18 Sep 2003 19:40:14 -0700

Tatu Saloranta wrote:

On Thursday 18 September 2003 14:50, Michael Giles wrote:

I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but I also know that it is updated from time to time and performs much better than the other ones that I have tested. Frustratingly, the very first page I tried to parse failed (http://www.theregister.co.uk/content/54/32593.html). It seems to be choking on tags that are being written inside of JavaScript code (i.e. document.write('</scr' + 'ipt>');). Obviously, the simple solution (that I am using with another parser) is to just ignore everything inside of <script> tags. It appears that the parser is ignoring text inside script tags, but it seems like it needs to be a bit smarter (or maybe dumber) about how it deals with this (so it doesn't get

I would guess that ignoring stuff in <script> (for indexing purposes) often makes sense; the exception would be someone building an HTML site-creation IDE who specifically wants to search for stuff in JavaScript sections. Nonetheless, I think an HTML parser has to be able to handle these.

confused by such occurrences). I see a bug has been filed regarding trouble parsing JavaScript, has anyone given it thought?

I implemented a rather robust (X[HT])ML parser ("QnD") that was able to work through many such issues (<script> tags, unquoted single '&' and '<' chars in attribute values and element content, a simplistic approach to optional end tags). Since it was heavily optimized for speed (everything held fully in memory in a char array, with optimizations based on that), I thought it might be useful for indexing (even more so than for its original purpose, which was to be a very fast utility for filtering [adding and/or removing stuff from] HTML pages).

If anyone is interested, I could share the source code and/or (if I have time) implement an efficient, fault-tolerant indexer. Like I said, this works equally well for well-formed XML, but that's nothing special.

We had reasonably good experiences with this simple bit of code, using Swing's HTML parser:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/HtmlDocumentHandler.java?rev=1.4&content-type=text/vnd.viewcvs-markup

We haven't tested it much, but it does grok a local copy of the link given.
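
For readers who have not used Swing's parser directly, here is a minimal sketch of the general approach. It is not the Docco HtmlDocumentHandler linked above; the names SwingHtmlTextExtractor, TextCollector and extractText are made up for illustration. It uses only javax.swing.text.html.parser.ParserDelegator and HTMLEditorKit.ParserCallback, and it drops any text reported between <script> or <style> start and end tags, which is one way to avoid indexing JavaScript.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of Docco or Lucene.
final class SwingHtmlTextExtractor {

    /** Collects text callbacks, skipping anything inside <script> or <style>. */
    static final class TextCollector extends HTMLEditorKit.ParserCallback {
        private final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside <script> or <style>

        private static boolean isSkipped(HTML.Tag tag) {
            return tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (isSkipped(tag)) {
                skipDepth++;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (isSkipped(tag) && skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (skipDepth == 0) {
                text.append(data).append(' ');
            }
        }

        String getText() {
            return text.toString().trim();
        }
    }

    /** Parses HTML from the reader and returns the text outside script/style. */
    static String extractText(Reader html) throws IOException {
        TextCollector collector = new TextCollector();
        // Third argument: ignore charset directives, since we already have chars.
        new ParserDelegator().parse(html, collector, true);
        return collector.getText();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><head><title>Demo</title>"
                + "<script>document.write('</scr' + 'ipt>');</script></head>"
                + "<body><p>Hello world</p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```

The collector only overrides the callbacks it needs; comments and parse errors fall through to the empty defaults in ParserCallback, so malformed markup is tolerated rather than reported.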

Here is our XML parsing code:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/XmlDocumentHandler.java?rev=1.6&content-type=text/vnd.viewcvs-markup
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/SaxTextContentParser.java?rev=1.3&content-type=text/vnd.viewcvs-markup

The XML bit is not too good yet. E.g. it chokes on large XML easily since it reads all content into memory at once.
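
One possible way around the memory problem, sketched here under assumptions and not taken from the linked SaxTextContentParser, is to let the SAX handler push each chunk of character data straight to a Writer instead of collecting it first. The class name StreamingXmlText and the method copyText are invented for this example; the JAXP and SAX calls are the standard ones.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class illustrating streaming text extraction.
final class StreamingXmlText {

    /** Streams the character content of an XML document to the given writer. */
    static void copyText(InputSource in, final Writer out)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                try {
                    // Write each chunk as it arrives; nothing is buffered here.
                    out.write(ch, start, length);
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName)
                    throws SAXException {
                try {
                    // Separate the text of adjacent elements with a space.
                    out.write(' ');
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        StringWriter sink = new StringWriter();
        copyText(new InputSource(new StringReader("<a><b>one</b><c>two</c></a>")), sink);
        System.out.println(sink.toString().trim());
    }
}
```

The StringWriter in main is only for demonstration; memory use stays flat only if the Writer passed in hands the text on (to a file or an indexer, say) rather than accumulating it.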

Some of the code will require JDK 1.4, though. The XML code relies on JAXP; I don't know about the HTMLEditorKit.

JTidy seems to be another option.

Peter

