I have observed the same thing. On my site the HTML source is for the most part around 160 kByte per page, and the parser definitely has problems with it (whether or not JavaScript is used on a page).
Before deciding on Nutch I tested Oxyus ( http://sourceforge.net/projects/oxyus/ ), a Java/Lucene based open source solution. Its parser has no problems with the site. I am not a developer, but perhaps one of the Nutch developers could take a look at the source code of the parser used there. Perhaps it helps.

Elwin <[EMAIL PROTECTED]> wrote on 17.02.2006 08:51:06:

> It seems that the parse-html plugin may not process many pages well,
> because I have found that it cannot extract all valid links in a page
> when I test it in my code.
> I guess that this may be caused by the style of an HTML page? When I
> "view source" of an HTML page I tried to parse, I saw that some
> elements in the source are split up by unnecessary spaces. This
> situation is quite common on the pages of large portal and news sites.
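In case it helps to reproduce the problem: below is a minimal, standalone sketch (the class name and the sample markup are mine, purely for illustration) that extracts href attributes with the JDK's own lenient HTML parser (javax.swing.text.html.parser.ParserDelegator). Running something like this against a saved copy of a problem page and comparing the output with what parse-html reports should show whether the stray whitespace inside the tags is really what trips the plugin up.

    import java.io.StringReader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Quick cross-check: list all <a href> values found by the JDK parser.
    public class LinkCheck {
        public static void main(String[] args) throws Exception {
            // Sample markup with stray whitespace inside the tags,
            // as described in Elwin's mail.
            String html = "<a\n  href = \"http://example.com/a\" >one</a>"
                        + "<a href='http://example.com/b'>two</a>";

            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) System.out.println(href);
                    }
                }
            };
            // The 'true' flag tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        }
    }

Both links print here despite the extra whitespace; if parse-html drops one of them on the same input, that would narrow the bug down to its tag/attribute handling rather than the page size.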
