I think the regex shown below is incorrect: it will ONLY crawl stoweb01.scan.bombardier.com and exclude all other links.

+^http://stoweb01.scan.bombardier.com/
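For example, a quick standalone check with Java's regex classes (just a rough sketch of how a prefix-accept rule behaves, not Nutch's actual filter code; the second URL is a made-up outlink) shows how everything off that host falls through:

import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // The accept rule from the filter file, without the leading '+'
        Pattern accept = Pattern.compile("^http://stoweb01.scan.bombardier.com/");

        String seed  = "http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php";
        String other = "http://www.example.com/some/page.html"; // hypothetical link on another host

        // A '+' rule accepts a URL when the pattern is found in it
        System.out.println(seed  + " -> " + accept.matcher(seed).find());  // true: accepted
        System.out.println(other + " -> " + accept.matcher(other).find()); // false: not accepted by any '+' rule
    }
}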
To test this out, try replacing it with +^https?://([a-z0-9]*\.)*\S* and see what happens! That regex should allow absolutely everything.

Kevin

On Wed, Sep 24, 2008 at 5:00 AM, Henrik Jönsson <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm a complete newbie on nutch and lucene. I want to setup nutch to
> crawl our company intranet. I followed the tutorial from the Wiki
> (http://peterpuwang.googlepages.com/NutchGuideForDummies.htm). I use
> nutch on Solaris 8.
>
> I specified nutch to crawl
> http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php and
> used the following urlfilter:
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> #-.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://stoweb01.scan.bombardier.com/
>
> # skip everything else
> #-.
>
> +^http://stoweb01.scan.bombardier.com/index.php.*
>
>
> When I run nutch I don't seem to get any result. The fetcher does not
> seem to follow my pages. Here is the command and output:
> [EMAIL PROTECTED]
> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 10
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080924135354
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080924135354
> Fetcher: threads: 10
> fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php
> fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080924135354]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080924135628
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080924135354
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080924135354
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding crawl/indexes/part-00000
> done merging
> crawl finished: crawl
> [EMAIL PROTECTED]
>
>
>
> Can someone help me with this?
>
> Best regards
>
> Henrik
>
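P.S. Concretely, in your crawl-urlfilter.txt that would mean replacing the two stoweb01 accept lines with the broad rule, roughly like this (just a sketch; your existing skip rules stay as they are):

# accept everything over http(s), just for this test
+^https?://([a-z0-9]*\.)*\S*

# skip everything else
#-.

If the fetcher starts following links with that in place, you know the filter was the culprit and can tighten the accept rule back down afterwards.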
