You could use suffix filters to filter out any document that isn't a PDF.
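
To illustrate the idea only: below is a minimal sketch of what a suffix filter does, written in plain Python rather than against Nutch's own plugin API or configuration format. The function name `keep_url`, the `allowed_suffixes` parameter, and the example URLs are all made up for the illustration.

```python
from urllib.parse import urlparse

def keep_url(url, allowed_suffixes=(".pdf",)):
    """Return True if the URL's path ends with one of the allowed suffixes.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    # Compare against the path only, so query strings do not hide the suffix.
    path = urlparse(url).path.lower()
    return path.endswith(tuple(allowed_suffixes))

# Made-up candidate URLs, standing in for links found on already crawled pages.
urls = [
    "http://example.com/index.html",
    "http://example.com/docs/report.pdf",
    "http://example.com/paper.PDF?download=1",
    "http://example.com/pdf/",
]

# Keep only the URLs that look like PDF documents.
pdf_only = [u for u in urls if keep_url(u)]
print(pdf_only)
```

Keep in mind that a filter like this also drops the HTML pages themselves, so it probably only makes sense once the pages linking to the PDFs have already been crawled, as in your case.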
Dennis

Marco Vanossi wrote:
Hi,

Do you think there is an easy way to make Nutch generate a list of only certain document types to fetch?

For example, if one would like to crawl only PDF docs (after some pages were already crawled which linked to PDF docs), a command like "bin/nutch generate db segments -topN 1000 -type:pdf" could do that.

Thanks for any help and comments,
Marco
