I just installed Nutch and have been trying to understand its filters.

I need to crawl one fairly large site (http://www.largesite.com) that I know contains hundreds of MS Word and PDF files. I only want to index the .doc and .pdf files. I don't want to index the HTML pages that link to these two document types, or *anything* else.
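
To make the requirement concrete, here is a rough sketch of the rule I have in mind, written in plain Python rather than as Nutch configuration. The function names should_fetch and should_index are hypothetical and only illustrate the split I'm assuming: HTML pages still have to be fetched so the links to the documents can be found, but only the .doc and .pdf URLs should end up in the index.

```python
from urllib.parse import urlparse

# Hypothetical helpers for illustration only; this is not Nutch code
# and not a Nutch configuration file.
SITE_HOST = "www.largesite.com"
WANTED_EXTENSIONS = (".doc", ".pdf")


def should_fetch(url: str) -> bool:
    """Fetch any page on the site so links to documents can be discovered."""
    return urlparse(url).netloc.lower() == SITE_HOST


def should_index(url: str) -> bool:
    """Index only on-site URLs whose path ends in .doc or .pdf."""
    path = urlparse(url).path.lower()
    return should_fetch(url) and path.endswith(WANTED_EXTENSIONS)


if __name__ == "__main__":
    for candidate in (
        "http://www.largesite.com/index.html",
        "http://www.largesite.com/files/report.pdf",
        "http://www.largesite.com/files/notes.doc",
    ):
        print(candidate, should_fetch(candidate), should_index(candidate))
```

The point of the sketch is that "what gets crawled" and "what gets indexed" are two separate decisions; I'm not sure which Nutch filter controls which.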

What do I need to do? Where do I start?



--
Asim Baig
Cognizo Technologies, Inc.
10501 Wayzata Blvd., Suite 100
Minnetonka, MN 55305
p: (952) 417-0067 x101
f: (952) 417-0068
c: (612) 382-7474
e: [EMAIL PROTECTED]