I think the regex shown below is the problem: it will ONLY accept URLs on
stoweb01.scan.bombardier.com and exclude all other links.
+^http://stoweb01.scan.bombardier.com/

To test this out, try replacing it with
+^https?://([a-z0-9]*\.)*\S*

and see what happens! That regex should allow absolutely everything.
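
To see which URLs a given pattern actually accepts before re-running the
whole crawl, a quick standalone check with plain java.util.regex is enough.
This is just a sketch, independent of Nutch itself, and the sample URLs
below are made up for illustration. (As far as I remember, the regex
urlfilter applies its rules top to bottom and the first '+' or '-' rule
that matches wins, so the order of the lines in the filter file matters too.)

import java.util.regex.Pattern;

// Standalone sketch: check which URLs each filter pattern matches.
// Uses plain java.util.regex, not Nutch's own filter code; the sample
// URLs are invented just to illustrate the point.
public class UrlFilterCheck {
    public static void main(String[] args) {
        // your current "+" rule (without the leading +)
        Pattern hostRule = Pattern.compile("^http://stoweb01.scan.bombardier.com/");
        // the catch-all rule suggested above
        Pattern allowAll = Pattern.compile("^https?://([a-z0-9]*\\.)*\\S*");

        String[] urls = {
            "http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php",
            "http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php?page=news",
            "http://someotherhost.example.com/page.html"
        };

        for (String url : urls) {
            System.out.println(url
                + "  hostRule=" + hostRule.matcher(url).find()
                + "  allowAll=" + allowAll.matcher(url).find());
        }
    }
}

The last URL is the interesting one: it matches the catch-all rule but not
your host rule, which is why I suspect links pointing anywhere else are
being dropped.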

Kevin

On Wed, Sep 24, 2008 at 5:00 AM, Henrik Jönsson <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm a complete newbie with Nutch and Lucene. I want to set up Nutch to
> crawl our company intranet. I followed the tutorial from the Wiki
> (http://peterpuwang.googlepages.com/NutchGuideForDummies.htm). I use
> Nutch on Solaris 8.
>
> I configured Nutch to crawl
> http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php and
> used the following urlfilter:
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://stoweb01.scan.bombardier.com/
>
> # skip everything else
> #-.
>
> +^http://stoweb01.scan.bombardier.com/index.php.*
>
>
> When I run Nutch I don't seem to get any results. The fetcher does not
> seem to follow the links on my pages. Here is the command and output:
> [EMAIL PROTECTED] > bin/nutch crawl urls -dir crawl -depth 10 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 10
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080924135354
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080924135354
> Fetcher: threads: 10
> fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php
> fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080924135354]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080924135628
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080924135354
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080924135354
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding crawl/indexes/part-00000
> done merging
> crawl finished: crawl
> [EMAIL PROTECTED] >
>
>
> Can someone help me with this?
>
> Best regards
>
> Henrik
>
