I am using http://htmlparser.sourceforge.net for my data mining engine.
It has a 'lexer' package that is lightweight, and I don't need to perform ANY
HTML/XML error checking etc. It is a low-level tokenizer rather than a full
parser: it is not DOM, SAX, etc. We do not need to build a DOM to
extract Outlink[] or to extract the plain text.
What about its licensing?

We could also develop our own low-level HTML (InputSource) processing engine
from scratch; we need only Outlink[] and the plain text.
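
To make the "from scratch" idea concrete, here is a rough sketch of such a low-level scanner in plain Java. It is only an illustration under my own assumptions: the class name SimpleHtmlExtractor and its methods are hypothetical, it returns link targets as plain strings instead of Nutch Outlink objects, and it reads from a String rather than an InputSource. It does no error checking, does not decode entities, and does not skip script or style content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: not part of Nutch or htmlparser. */
final class SimpleHtmlExtractor {
    // href value in double quotes, single quotes, or unquoted.
    private static final Pattern HREF = Pattern.compile(
            "\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    final List<String> links = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    /** Single pass over the input; no DOM is built. */
    static SimpleHtmlExtractor extract(String html) {
        SimpleHtmlExtractor out = new SimpleHtmlExtractor();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) {                       // no more tags: rest is text
                out.appendText(html.substring(pos));
                break;
            }
            out.appendText(html.substring(pos, lt));
            if (html.startsWith("<!--", lt)) {  // skip comments entirely
                int end = html.indexOf("-->", lt + 4);
                if (end < 0) break;             // unterminated comment: stop
                pos = end + 3;
                continue;
            }
            int gt = html.indexOf('>', lt + 1);
            if (gt < 0) break;                  // unterminated tag: stop
            out.handleTag(html.substring(lt + 1, gt));
            pos = gt + 1;
        }
        return out;
    }

    /** Records the href of anchor tags; every other tag is ignored. */
    private void handleTag(String tag) {
        String name = tag.trim().split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
        if (!name.equals("a")) return;
        Matcher m = HREF.matcher(tag);
        if (m.find()) {
            String url = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2) : m.group(3);
            links.add(url);
        }
    }

    /** Appends a text run, collapsing whitespace and separating runs by one space. */
    private void appendText(String s) {
        String t = s.trim().replaceAll("\\s+", " ");
        if (t.isEmpty()) return;
        if (text.length() > 0) text.append(' ');
        text.append(t);
    }

    String getText() {
        return text.toString();
    }
}
```

Calling SimpleHtmlExtractor.extract(html) gives the link targets in the links field and the text via getText(). Turning each target into an Outlink (resolving relative URLs, attaching anchor text) would be a separate step that this sketch leaves out.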


