I am using http://htmlparser.sourceforge.net for my data mining engine. It has a lightweight 'lexer' package, and I don't need to perform any HTML/XML error checking. It is a 'parser' in name only: a low-level tokenizer, not DOM, SAX, etc. We do not need to build a DOM to extract Outlink[] and plain text. What about licensing?
Alternatively, we can develop our own low-level HTML (InputSource) processing engine from scratch; we need only Outlink[] and PlainText.
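To make the "from scratch" option concrete, here is a rough sketch of what such a low-level scanner might look like. It is only an illustration, not Nutch code and not the htmlparser lexer: the class name LinkTextScanner, the Result holder, and the scan() method are all hypothetical, and outlinks are returned as plain strings rather than Outlink objects. It walks the input once, collects href values from <a> tags, and keeps everything outside tags as text. It deliberately ignores comments, script/style content, character entities, and '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a single-pass scanner, no DOM, no error checking.
class LinkTextScanner {

    // Hypothetical result holder: raw href strings plus collapsed plain text.
    static final class Result {
        final List<String> outlinks = new ArrayList<>();
        String text = "";
    }

    // Matches href="...", href='...' or an unquoted href value inside one tag.
    private static final Pattern HREF = Pattern.compile(
        "\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    static Result scan(String html) {
        Result result = new Result();
        StringBuilder text = new StringBuilder();
        int pos = 0;
        int len = html.length();
        while (pos < len) {
            int open = html.indexOf('<', pos);
            if (open < 0) {                 // no more tags: the rest is text
                text.append(html, pos, len);
                break;
            }
            text.append(html, pos, open);   // text before the tag
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break;                      // unterminated tag: drop the rest
            }
            String tag = html.substring(open + 1, close);
            if (tagName(tag).equals("a")) {
                Matcher m = HREF.matcher(tag);
                if (m.find()) {
                    String url = m.group(1) != null ? m.group(1)
                               : m.group(2) != null ? m.group(2)
                               : m.group(3);
                    result.outlinks.add(url);
                }
            }
            text.append(' ');               // treat every tag as a word break
            pos = close + 1;
        }
        // Collapse runs of whitespace left behind by markup and line breaks.
        result.text = text.toString().replaceAll("\\s+", " ").trim();
        return result;
    }

    // Leading letters/digits of the tag body, lowercased; "" for closing tags.
    private static String tagName(String tag) {
        int end = 0;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(0, end).toLowerCase(Locale.ROOT);
    }
}
```

Calling LinkTextScanner.scan(page) on a fetched page string would give the link list and the text in one pass. Treating every tag as a word break is a simplification: it also splits words around inline tags such as <b>, which may or may not be acceptable for indexing.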
