My response inline.

On 2/20/08, Mario Méndez Villegas <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I've been doing some crawling to understand how the crawl filter works, but I
> cannot figure it out. I followed the tutorial in the wiki, and I have even
> added the urls I want to crawl to a file called urls and configured
> crawl-urlfilter.txt. When I run the crawl, Nutch fetches sites that are not
> listed in any of these files. Can anybody tell me why this happens?

Did you place the urls file inside a directory? Let me give you an example of a proper way of doing it.

0. Let us assume your current directory is the Nutch project directory (i.e. the directory which contains the bin directory).
1. In the current directory, create a directory called: urls (This has to be passed as an argument to the bin/nutch crawl command.)
2. In the urls directory, create a file called: url (This can be any name though.)
3. Write the URLs with which you want to start the crawl in the urls/url file, one URL per line.
4. Start the crawl as: bin/nutch crawl urls -dir crawl -depth 3 -topN 50

You can also read the last paragraph of the comments in crawl-urlfilter.txt. It clearly explains how crawl-urlfilter.txt works.

Regards,
Susam Pal
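P.S. The setup in steps 0-4 can be sketched as a short shell session. The seed URL below is just a placeholder, not one from your crawl; the crawl command itself is left as a comment because it needs an actual Nutch checkout to run:

```shell
# Run from the Nutch project directory (the one containing bin/).
# Step 1: create the seed directory that is passed to bin/nutch crawl.
mkdir -p urls
# Steps 2-3: create the seed file (any name works) with one URL per line.
# The URL here is a hypothetical placeholder.
echo 'http://www.example.com/' > urls/url
cat urls/url
# Step 4: with a working Nutch checkout, the crawl would be started as:
# bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```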

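P.P.S. In case it helps, the part of crawl-urlfilter.txt that matters here looks roughly like the fragment below (quoting the Nutch 0.x defaults from memory, so check your own copy). Rules are tried top to bottom and the first matching +/- pattern decides, so if the domain line is left at the placeholder or made too broad, URLs outside your seed list can still pass the filter:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME -- edit this line for your own domain
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.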