My response inline.

On 2/20/08, Mario Méndez Villegas <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I've been doing some crawling to understand how the crawl filter works, but I
> cannot figure it out. I followed the tutorial in the wiki, and I have even
> added the urls I want to crawl to a file called urls and configured
> crawl-urlfilter.txt. When I run the crawl, Nutch fetches sites that are not
> listed in any of these files. Can anybody tell me why this happens?

Did you place the urls file inside a directory? Let me give you an example of a proper way of doing it.

0. Let us assume your current directory is the Nutch project directory (i.e. the directory which contains the bin directory).
1. In the current directory, create a directory called: urls (This has to be passed as an argument to the bin/nutch crawl command.)
2. In the urls directory, create a file called: url (This can be any name though.)
3. Write the URLs with which you want to start the crawl in the urls/url file, one URL per line.
4. Start the crawl as: bin/nutch crawl urls -dir crawl -depth 3 -topN 50

You can also read the last paragraph of the comments in crawl-urlfilter.txt. It clearly explains how crawl-urlfilter.txt works.

Regards,
Susam Pal
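P.S. The setup in steps 0-4 can be sketched as a short shell session. The seed URL below is just a placeholder, not one from your crawl; the crawl command itself is left as a comment because it needs an actual Nutch checkout to run:

```shell
# Run from the Nutch project directory (the one containing bin/).
# Step 1: create the seed directory that is passed to bin/nutch crawl.
mkdir -p urls
# Steps 2-3: create the seed file (any name works) with one URL per line.
# The URL here is a hypothetical placeholder.
echo 'http://www.example.com/' > urls/url
cat urls/url
# Step 4: with a working Nutch checkout, the crawl would be started as:
# bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```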

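P.P.S. In case it helps, the part of crawl-urlfilter.txt that matters here looks roughly like the fragment below (quoting the Nutch 0.x defaults from memory, so check your own copy). Rules are tried top to bottom and the first matching +/- pattern decides, so if the domain line is left at the placeholder or made too broad, URLs outside your seed list can still pass the filter:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME -- edit this line for your own domain
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.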