Just take a look at the config file "crawl-urlfilter.txt". Nutch can use regular expressions to filter URLs.

On 12/31/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I want to know whether Nutch can be set to crawl only files of a
> specified type or with a specified name.
>
> For example: if I crawl a website that contains many document files, and I
> want Nutch to crawl only pdf and doc files but not html files, how do I do that?
>
> Another question: can I make Nutch crawl only files with a specified name,
> like index.htm?
>
> Thanks in advance
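
As a rough sketch of the regular-expression filtering mentioned in the reply above, rules along these lines could go in crawl-urlfilter.txt. This assumes the usual format of that file: one rule per line, a leading "+" to accept or "-" to reject, followed by a regular expression, with the first matching rule deciding. The host example.com and the exact patterns are placeholders for illustration only, not taken from the original question.

```
# accept only PDF and Word documents on the (placeholder) host
+^http://([a-z0-9]*\.)*example\.com/.*\.(pdf|PDF|doc|DOC)$

# accept files with a specific name, e.g. index.htm
+^http://([a-z0-9]*\.)*example\.com/(.*/)?index\.htm$

# skip everything else
-.
```

One caveat, as far as I understand it: the crawler finds documents by following links in pages it has already fetched, so if every HTML page is rejected it may never reach the pdf and doc files. The rules may need to let through the pages that link to them.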
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
