It's not only confusing me, it's also confusing the author, FrankMcCown, of the Nutch tutorial:
http://wiki.apache.org/nutch/NutchTutorial

Crawl Command: Configuration

To configure things for the crawl command you must:

* Create a directory with a flat file of root URLs. For example, to crawl
  the Nutch site you might start with a file named urls/nutch containing
  the URL of just the Nutch home page. All other Nutch pages should be
  reachable from this page. The urls/nutch file would thus contain:

      http://lucene.apache.org/nutch/

* Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with
  the name of the domain you wish to crawl. For example, if you wished to
  limit the crawl to the apache.org domain, the line should read:

      +^http://([a-z0-9]*\.)*apache.org/

  This will include any URL in the domain apache.org.

* Until someone can explain this: when I use the file crawl-urlfilter.txt,
  the filter doesn't work. Instead of it, use the file
  conf/regex-urlfilter.txt and change the last line from "+." to "-.".

reinhard schwab wrote:
> I have tried the recrawl script of Susam Pal and wondered why URL
> filtering no longer works.
> http://wiki.apache.org/nutch/Crawl
>
> The mystery is:
>
> Only Crawl.java adds crawl-tool.xml to the NutchConfiguration:
>
>     Configuration conf = NutchConfiguration.create();
>     conf.addResource("crawl-tool.xml");
>
> Fetcher.java and all the other tools which filter the outlinks do not
> add this. This is really confusing me, and I have spent some time
> figuring it out.
>
> regards
> reinhard
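For what it's worth, the accept rule quoted above is an ordinary Java regular expression with a leading "+" meaning "accept". A minimal sketch of how such a rule matches URLs (plain java.util.regex, not Nutch's actual filter classes; the class and method names here are made up for illustration):

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // The regex part of the tutorial's rule: +^http://([a-z0-9]*\.)*apache.org/
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");

    // An accept rule passes a URL when the anchored pattern matches its prefix.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://lucene.apache.org/nutch/")); // true
        System.out.println(accepts("http://example.com/"));             // false
    }
}
```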

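For reference, the workaround from the last bullet would make conf/regex-urlfilter.txt end roughly like this (a sketch, assuming the domain rule from the tutorial is added to that file; the final catch-all line decides the default):

```
# accept URLs inside apache.org
+^http://([a-z0-9]*\.)*apache.org/

# reject everything else (last line changed from "+." to "-.")
-.
```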