I just installed Nutch and have been trying to understand the filters.
I need to crawl one fairly large site (http://www.largesite.com) that I know contains hundreds of MS Word and PDF files. All I want to index is the .doc and .pdf files. I don't want to index the HTML pages that link to these two document types, or *anything* else.
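To make the requirement concrete, here is a rough sketch in plain Python of the rule I am after. It is not Nutch configuration and does not use any Nutch API; the names should_index and should_follow are made up for illustration, and it assumes that matching on the URL's file extension is good enough to identify Word and PDF files.

```python
from urllib.parse import urlparse

# Hypothetical names, for illustration only; none of this is Nutch API.
SITE_HOST = "www.largesite.com"
WANTED_EXTENSIONS = (".doc", ".pdf")


def should_follow(url: str) -> bool:
    """Follow any link on the one site, so HTML pages can still lead to the documents."""
    return urlparse(url).hostname == SITE_HOST


def should_index(url: str) -> bool:
    """Index a URL only if it is on the site and its path ends in .doc or .pdf."""
    path = urlparse(url).path.lower()
    return should_follow(url) and path.endswith(WANTED_EXTENSIONS)
```

In other words, the HTML pages would still have to be fetched so their links can be followed, but only the .doc and .pdf files should end up in the index.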
What do I need to do? Where do I start?

--
Asim Baig
Cognizo Technologies, Inc.
10501 Wayzata Blvd., Suite 100
Minnetonka, MN 55305
p: (952) 417-0067 x101
f: (952) 417-0068
c: (612) 382-7474
e: [EMAIL PROTECTED]
