Hi, I'm a complete newbie with Nutch and Lucene. I want to set up Nutch to crawl our company intranet. I followed the tutorial from the Wiki (http://peterpuwang.googlepages.com/NutchGuideForDummies.htm). I am running Nutch on Solaris 8.
I set Nutch up to crawl http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php and used the following urlfilter:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://stoweb01.scan.bombardier.com/
# skip everything else
#-.
+^http://stoweb01.scan.bombardier.com/index.php.*

When I run Nutch I don't seem to get any results: the fetcher does not seem to follow the links on my pages. Here is the command and its output:

> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080924135354
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080924135354
Fetcher: threads: 10
fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/index.php
fetching http://stoweb01.scan.bombardier.com/~ebiconfig/intranet/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080924135354]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080924135628
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080924135354
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080924135354
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl

Can someone help me with this?

Best regards,
Henrik
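P.S. In case it helps with the diagnosis: my next step was going to be to check whether any outlinks were actually parsed out of the two fetched pages, roughly like this (readdb and readseg are the tools in the bin/nutch script of my build; "segdump" is just an example output directory):

> bin/nutch readdb crawl/crawldb -stats
> bin/nutch readseg -dump crawl/segments/20080924135354 segdump

The -stats report should show how many URLs the crawl db knows about, and the segment dump should contain the parse text and the outlinks extracted from index.php, so if no outlinks show up there I assume that is where my problem lies. Does that sound like a sensible way to narrow it down?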
