For a long time, users have asked whether Xerces can parse HTML
files. Because most HTML documents are not well-formed XML
documents, it is generally not possible to read them with a
conforming XML parser.
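
A minimal sketch of the problem, assuming only the JAXP classes that
ship with the JDK (the class name WellFormedCheck and the sample
strings are made up for illustration). An unclosed <br>, which is
ordinary HTML, is a fatal error for a conforming XML parser:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String text) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // DefaultHandler ignores content events and rethrows fatal errors.
            factory.newSAXParser().parse(
                new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser rejected the input (for example, a missing end tag).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>one<br/>two</p>")); // true
        System.out.println(isWellFormed("<p>one<br>two</p>"));  // false
    }
}
```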

However, the Xerces Native Interface (XNI), which is the foundation
of the Xerces2 implementation, defines a framework that allows
different kinds of parsers to be constructed by connecting a
pipeline of parser components. As long as a component generates
the appropriate XNI "events", the rest of the pipeline can use
them to emit SAX events, build DOM trees, or do anything else
that you can think of.
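
The pipeline idea can be sketched in a few lines of plain Java. This
is not the real XNI API: the DocEvents interface and every class
below are hypothetical stand-ins, reduced to three event types. The
point is only that one event source can feed interchangeable
downstream components:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical event interface; the real XNI interfaces are richer.
interface DocEvents {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

// A middle stage: upper-cases element names, then forwards each event.
class UpperCaseNames implements DocEvents {
    private final DocEvents next;
    UpperCaseNames(DocEvents next) { this.next = next; }
    public void startElement(String name) { next.startElement(name.toUpperCase(Locale.ROOT)); }
    public void characters(String text) { next.characters(text); }
    public void endElement(String name) { next.endElement(name.toUpperCase(Locale.ROOT)); }
}

// One possible final stage: records a flat list of events.
class EventLog implements DocEvents {
    final List<String> log = new ArrayList<>();
    public void startElement(String name) { log.add("start " + name); }
    public void characters(String text) { log.add("text " + text); }
    public void endElement(String name) { log.add("end " + name); }
}

// Another final stage: writes the events back out as markup.
class MarkupWriter implements DocEvents {
    final StringBuilder out = new StringBuilder();
    public void startElement(String name) { out.append('<').append(name).append('>'); }
    public void characters(String text) { out.append(text); }
    public void endElement(String name) { out.append("</").append(name).append('>'); }
}

class PipelineDemo {
    // Stand-in for a scanner: emits a fixed event sequence for <p>hi</p>.
    static void emitSample(DocEvents target) {
        target.startElement("p");
        target.characters("hi");
        target.endElement("p");
    }

    public static void main(String[] args) {
        EventLog log = new EventLog();
        emitSample(new UpperCaseNames(log));
        System.out.println(log.log);  // [start P, text hi, end P]

        MarkupWriter writer = new MarkupWriter();
        emitSample(new UpperCaseNames(writer));
        System.out.println(writer.out);  // <P>hi</P>
    }
}
```

Swapping EventLog for MarkupWriter changes what is produced without
touching the source or the middle stage.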

So, as a fun little exercise, I have written a basic HTML parser 
using XNI. It consists of two components: an HTML scanner that
reads HTML files and generates XNI events, and a tag balancer.
The tag balancer cleans up the events produced by the scanner,
balancing mismatched tags and adding tags where necessary. It
does all of this in a streaming manner to minimize the amount
of memory required.
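
To make the balancing step concrete, here is a much simplified
sketch. It is not the NekoHTML code and does not use the XNI
interfaces: TagEvents, SimpleBalancer, and the other names are
hypothetical, and the rules are an assumption (close any elements
left open inside a closing one, drop stray end tags, close whatever
remains at the end). Names are compared exactly, so case handling is
ignored. The only state kept is the stack of open element names,
which is what makes it streaming:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical event interface, not the real XNI API.
interface TagEvents {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
    void endDocument();
}

// Forwards events downstream, repairing the end tags as it goes.
class SimpleBalancer implements TagEvents {
    private final TagEvents next;
    private final Deque<String> open = new ArrayDeque<>();  // innermost first

    SimpleBalancer(TagEvents next) { this.next = next; }

    public void startElement(String name) {
        open.push(name);
        next.startElement(name);
    }

    public void characters(String text) { next.characters(text); }

    public void endElement(String name) {
        if (!open.contains(name)) {
            return;  // stray end tag with no matching start: drop it
        }
        String top;
        do {  // close everything still open inside the named element
            top = open.pop();
            next.endElement(top);
        } while (!top.equals(name));
    }

    public void endDocument() {
        while (!open.isEmpty()) {  // add the end tags that never arrived
            next.endElement(open.pop());
        }
        next.endDocument();
    }
}

// Final stage for the demo: writes the events out as markup.
class MarkupCollector implements TagEvents {
    final StringBuilder out = new StringBuilder();
    public void startElement(String name) { out.append('<').append(name).append('>'); }
    public void characters(String text) { out.append(text); }
    public void endElement(String name) { out.append("</").append(name).append('>'); }
    public void endDocument() { }
}

class BalancerDemo {
    // Feeds the events a scanner might produce for: <p><b><i>x</b>y</i>
    static String balanceSample() {
        MarkupCollector sink = new MarkupCollector();
        TagEvents pipeline = new SimpleBalancer(sink);
        pipeline.startElement("p");
        pipeline.startElement("b");
        pipeline.startElement("i");
        pipeline.characters("x");
        pipeline.endElement("b");  // closes <i> first, then <b>
        pipeline.characters("y");
        pipeline.endElement("i");  // <i> is already closed: dropped
        pipeline.endDocument();    // closes the still-open <p>
        return sink.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balanceSample());  // <p><b><i>x</i></b>y</p>
    }
}
```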

Since I wrote the HTML parser as an example of using XNI and
because the code is considered alpha quality (but it seems to
work quite well, actually!), I am posting the code with a very
limited license. Even though the release contains the complete
source code for the HTML parser, the license allows users only
to experiment with it and grants no right to actually use the
code in a product.

If the source isn't "free" or "open", why release it at all?
I want to get an idea of what people think of the code first.
Then, if there's enough interest, I would like to either donate
the code to the Xerces-J project or make it available elsewhere
under a true open source license.

So, if you've been looking for a way to parse HTML documents,
please try out the HTML parser and let me know what you think. 
There should be enough information in the documentation to get 
you started. Check out the "NekoHTML" project listed on my
Apache web site: http://www.apache.org/~andyc/

Have fun!

-- 
Andy Clark * [EMAIL PROTECTED]
