On Thursday 18 September 2003 14:50, Michael Giles wrote:We had reasonably good experiences with this simple bit of code, using Swing's HTML parser:
I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
I also know that it is updated from time to time and performs much better
than the other ones that I have tested. Frustratingly, the very first page
I tried to parse failed
(<http://www.theregister.co.uk/content/54/32593.html>http://www.theregister
.co.uk/content/54/32593.html). It seems to be choking on tags that are being
written inside of JavaScript code (i.e. document.write('</scr' + 'ipt>');. Obviously, the simple solution (that I am using with another parser) is to
just ignore everything inside of <script> tags. It appears that the parser
is ignoring text inside script tags, but it seems like it needs to be a bit
smarter (or maybe dumber) about how it deals with this (so it doesn't get
I would guess that often ignoring stuff in <script> (for indexing purposes) makes sense; exception being if someone wants to create HTML site creation IDE (like specifically wants to search for stuff in javascript sections?).
Nonetheless HTML parser has to be able to handle these I think.
confused by such occurrences). I see a bug has been filed regarding
trouble parsing JavaScript, has anyone given it thought?
I implemented a rather robust (X[HT])ML parser ("QnD") that was able to work
through many of such issues (<script> tag, unquoted single '&' and '<' chars,
in attr values and elements, simplistic approach to optional end tags). Since it was dead-optimized for speed (anything fully in memory in a char array, optimizing based on that) I thought it might be useful for indexing (even more so than for its original purpose which was to be very fast utility for filtering [adding and/or removing stuff] of HTML pages).
If anyone would be interested I could give the source code and/or (if I have time) to implement efficient fault-tolerant indexer.
Like I said this also works equally well for well-formed XML, but that's nothing special.
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/HtmlDocumentHandler.java?rev=1.4&content-type=text/vnd.viewcvs-markup
We haven't tested it much, but it does grok a local copy of the link given.
Here is our XML parsing code:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/XmlDocumentHandler.java?rev=1.6&content-type=text/vnd.viewcvs-markup http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/SaxTextContentParser.java?rev=1.3&content-type=text/vnd.viewcvs-markup
The XML bit is not too good yet. E.g. it chokes on large XML easily since it reads all content into memory at once.
Some of the code will require JDK 1.4, though. The XML relies on JAXP, I don't know about the HTMLEditorKit.
JTidy seems to be another option.
Peter
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
