Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Andrzej Bialecki Fri, 26 Oct 2007 09:36:16 -0700

eyal edri wrote:

Hi,


Is there a way to tell Nutch not to parse the pages it fetches, meaning just
extract the links from them?

Extracting links requires that a page is downloaded first (otherwise where would you extract the links from?) and parsed (otherwise how would you extract links from an unintelligible byte[]?).

I know there is a "-no parsing" attribute, but I still need to download some
content types using the parse-XXX plugins, so I'm not sure it will work if I
use that option.


No download -> no parsing -> no outlinks.
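
To illustrate that dependency outside of Nutch, here is a minimal Python sketch using only the standard library. It is not Nutch code and does not use any Nutch API; the names LinkExtractor and extract_outlinks and the example URLs are hypothetical. It shows that even when the page text is thrown away, the downloaded bytes still have to be decoded and run through a parser before any outlinks come out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper, not Nutch code)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are inspected; text content is never collected.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL the page came from.
                self.outlinks.append(urljoin(self.base_url, value))


def extract_outlinks(raw, base_url, encoding="utf-8"):
    """Parse already-downloaded bytes and return the outlinks found in them."""
    parser = LinkExtractor(base_url)
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.outlinks


# Stand-in for fetched content; in a real crawl these bytes come from the download.
page = (b'<html><body><p>Some text</p><a href="/a.html">A</a> '
        b'<a href="http://example.org/b">B</a></body></html>')
links = extract_outlinks(page, "http://example.com/index.html")
print(links)
```

With no downloaded bytes there is nothing to feed the parser, and without the parse step the bytes never turn into links; skipping the text only saves the work of keeping it.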


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
