You could use suffix filters to filter out any document that isn't a PDF.
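
To illustrate the idea only, here is a minimal Python sketch of what a suffix filter does: it keeps URLs whose path ends in an allowed suffix and drops the rest. This is not Nutch's own code or configuration; the function names `has_allowed_suffix` and `filter_urls` and the example URLs are made up for this sketch.

```python
from urllib.parse import urlparse


def has_allowed_suffix(url, suffixes=(".pdf",)):
    """Return True if the URL's path ends with one of the allowed suffixes.

    Hypothetical helper for illustration; not part of Nutch.
    The comparison is case-insensitive and ignores any query string.
    """
    path = urlparse(url).path.lower()
    return path.endswith(tuple(s.lower() for s in suffixes))


def filter_urls(urls, suffixes=(".pdf",)):
    """Keep only the URLs that pass the suffix check, preserving order."""
    return [u for u in urls if has_allowed_suffix(u, suffixes)]


if __name__ == "__main__":
    # Example URLs are placeholders, not real crawl data.
    candidates = [
        "http://example.com/paper.pdf",
        "http://example.com/index.html",
        "http://example.com/REPORT.PDF?download=1",
    ]
    for url in filter_urls(candidates):
        print(url)
```

Note that a filter like this rejects everything that is not a PDF, including the HTML pages that link to the PDFs, so it only makes sense once those pages are already in the db.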

Dennis

Marco Vanossi wrote:
Hi,

Do you think there is an easy way to make Nutch generate a fetch list
containing only certain document types?

For example:
If one would like to crawl only PDF docs (after some pages that link to
PDF docs have already been crawled), the command:
"bin/nutch generate db segments -topN 1000 -type:pdf" could do that.

Thanks for any help and comment,
Marco
