[Nutch-general] extract links problem with parse-html plugin

Elwin Thu, 16 Feb 2006 23:51:59 -0800

It seems that the parse-html plugin does not handle many pages well: when I
test it in my code, it fails to extract all of the valid links in a page.
My guess is that this is caused by the way the HTML of the page is written.
When I "view source" on a page I tried to parse, I saw that some elements in
the source are broken up by unnecessary spaces. However, this is quite common
on the pages of large portal sites and news sites.
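
One way to check the whitespace guess outside of Nutch is to run the same
markup, with and without extra spaces, through an independent parser and
compare the links found. The following is only a minimal sketch under
assumptions: it uses Python's standard html.parser rather than the parse-html
plugin, and the names LinkCollector and extract_links and the example.com URL
are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value.strip())


def extract_links(html):
    """Return the href values of all <a> tags found in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# The same link written compactly and with extra whitespace inside the tag.
compact = '<a href="http://example.com/a">A</a>'
spaced = '<a   href = "http://example.com/a"\n   >A</a>'

print(extract_links(compact))
print(extract_links(spaced))
```

If both forms give the same links here but the plugin still misses some on
the real page, the extra spaces alone are probably not the cause, and it would
be worth looking at the exact markup around the missing links.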
