Hi, I'm a complete newbie with Nutch and Lucene. I want to set up Nutch to crawl our company intranet. I followed the tutorial from the Wiki (http://peterpuwang.googlepages.com/NutchGuideForDummies.htm). I am running Nutch on Solaris 8.
I set Nutch up to crawl http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php and used the following urlfilter:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://stoweb01.scan.bombardier.com/
# skip everything else
#-.
+^http://stoweb01.scan.bombardier.com/index.php.*

When I run Nutch I don't seem to get any results: the fetcher does not seem to follow the links on my pages. Here is the command and its output:

> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080924135354
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080924135354
Fetcher: threads: 10
fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php
fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080924135354]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080924135628
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080924135354
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080924135354
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl

Can someone help me with this?

Best regards,
Henrik
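P.S. In case it helps with the diagnosis: my next step was going to be to check whether any outlinks were actually parsed out of the two fetched pages, roughly like this (readdb and readseg are the tools in the bin/nutch script of my build; "segdump" is just an example output directory):

> bin/nutch readdb crawl/crawldb -stats
> bin/nutch readseg -dump crawl/segments/20080924135354 segdump

The -stats report should show how many URLs the crawl db knows about, and the segment dump should contain the parse text and the outlinks extracted from index.php, so if no outlinks show up there I assume that is where my problem lies. Does that sound like a sensible way to narrow it down?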
