It seems that the parse-html plugin may not handle many pages well: when I tested it in my code, it failed to extract all of the valid links in a page. My guess is that this is caused by the way the HTML is written. When I "view source" on a page I tried to parse, I saw that some elements in the source are broken up by unnecessary whitespace. This is quite common on the pages of large portal sites and news sites.
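One way to test the whitespace guess outside Nutch is to feed a small page with extra spaces and line breaks inside its tags to a tolerant parser and compare the result with a hand count. The sketch below uses Python's standard `html.parser`, not the parse-html plugin, and the sample markup, `LinkCollector`, and `extract_links` are made up for illustration. If a tolerant parser finds every link in a real page where the plugin does not, whitespace handling is a plausible suspect, though not proof.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Trim stray whitespace inside the quoted value.
                    self.links.append(value.strip())


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample: the same kind of link written with increasingly odd spacing.
SAMPLE = """
<a href="http://example.com/one">one</a>
<a   href = "http://example.com/two" >two</a>
<A
   HREF=' http://example.com/three '>three</A>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link)
```

Running the same check on the saved source of a page where links go missing should show whether the missing links are the ones with unusual spacing.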
- [Nutch-general] extract links problem with parse-html pl... Elwin
