I am using http://htmlparser.sourceforge.net for my data mining engine. Its 'lexer' package is lightweight, and I don't need ANY HTML/XML error checking etc. It is a low-level tokenizer rather than a real parser: it is not DOM, SAX, etc. We do not need to build a DOM to extract Outlink[] or to extract plain text. What about licensing?
We could also develop our own low-level HTML (InputSource) processing engine from scratch; we need only Outlink[] and the plain text.
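
To give an idea of how small such a from-scratch engine could be, here is a rough sketch in plain Java. It is only an illustration, not the htmlparser lexer and not existing Nutch code: the class name LinkTextScanner and its methods are made up, links are returned as plain strings rather than Outlink objects, and it assumes the whole page is already in a String. It does no error checking, and it does not handle comments, script/style bodies, or character entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a single pass over the markup, no DOM, no SAX.
final class LinkTextScanner {
    // href="...", href='...' or an unquoted href value inside one tag.
    private static final Pattern HREF = Pattern.compile(
        "\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    final List<String> links = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    static LinkTextScanner scan(String html) {
        LinkTextScanner out = new LinkTextScanner();
        int i = 0;
        int n = html.length();
        while (i < n) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {                      // no more tags: the rest is text
                out.text.append(html, i, n);
                break;
            }
            out.text.append(html, i, lt);      // text before the tag
            int gt = html.indexOf('>', lt + 1);
            if (gt < 0) {
                break;                         // unterminated tag: drop the rest
            }
            String tag = html.substring(lt + 1, gt);
            if (isAnchor(tag)) {
                Matcher m = HREF.matcher(tag);
                if (m.find()) {
                    String url = m.group(1) != null ? m.group(1)
                               : m.group(2) != null ? m.group(2)
                               : m.group(3);
                    out.links.add(url);
                }
            }
            out.text.append(' ');              // every tag becomes a word break
            i = gt + 1;
        }
        return out;
    }

    // True for "<a ...>" but not for "<abbr>", "<area>" or "</a>".
    private static boolean isAnchor(String tag) {
        return tag.length() > 1
            && (tag.charAt(0) == 'a' || tag.charAt(0) == 'A')
            && Character.isWhitespace(tag.charAt(1));
    }

    // Collected text with runs of whitespace collapsed to single spaces.
    String plainText() {
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Usage would be something like LinkTextScanner r = LinkTextScanner.scan(page); then r.links and r.plainText(). Because every tag is replaced by a space, inline markup can leave a stray space before punctuation; a real implementation would probably treat inline and block tags differently.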
