Nutch is not crawling relative URLs. I don't think Nutch is capable of crawling relative URLs at this time.
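For context: a relative outlink like <a href="about.html"> has to be resolved against the URL of the page it appears on before it can be fetched. A minimal sketch of that resolution using only the standard JDK java.net.URL class (an illustration, not Nutch's own extractor code):

    import java.net.URL;

    public class ResolveRelative {
        public static void main(String[] args) throws Exception {
            // Base URL of the fetched page, and the relative href found in its HTML.
            URL base = new URL("http://www.sf911truth.org/");
            URL resolved = new URL(base, "about.html");
            System.out.println(resolved); // prints http://www.sf911truth.org/about.html
        }
    }

The references below suggest the problem sits a step earlier: the regexp that extracts outlinks apparently never picks up the relative href in the first place.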
See this discussion: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01279.html
and this bug report, NUTCH-119, "Regexp to extract outlinks incorrect": http://issues.apache.org/jira/browse/NUTCH-119

----- Original Message ----
From: Kai_testing Middleton <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Wednesday, June 20, 2007 12:08:29 PM
Subject: not crawling relative URLs

Setup: a fresh install of Nutch 0.9 running under Cygwin (uname -a reports: CYGWIN_NT-5.1 microlith 1.5.24(0.156/4/2) 2007-01-31 10:57 i686 Cygwin) with Java 1.5.0_09-b03.

Problem: I crawl www.sf911truth.org, a very small site, and don't get www.sf911truth.org/about.html, which is linked directly from the main page via this HTML:

<a href="about.html"><strong>Mission Statement and Meetings<br></strong></a>

Here's my crawl command:

[EMAIL PROTECTED] /cygdrive/c/nutch-0.9
$ bin/nutch crawl conf/urls -dir mydir -depth 3 2>&1 | tee crawl.log

I think I should see the following (but I don't) in my log file:

fetching http://www.sf911truth.org/about.html

Here's my crawl.log:

crawl started in: mydir
rootUrlDir = conf/urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: mydir/crawldb
Injector: urlDir: conf/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115517
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070620115517
Fetcher: threads: 10
fetching http://www.sf911truth.org/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070620115517]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115527
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070620115527
Fetcher: threads: 10
fetching http://www.sf911truth.org/popWin.js
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070620115527]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115535
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=2 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: mydir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: mydir/segments/20070620115517
LinkDb: adding segment: mydir/segments/20070620115527
LinkDb: done
Indexer: starting
Indexer: linkdb: mydir/linkdb
Indexer: adding segment: mydir/segments/20070620115517
Indexer: adding segment: mydir/segments/20070620115527
Indexing [http://www.sf911truth.org/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://www.sf911truth.org/popWin.js] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: mydir/indexes
Dedup: done
merging indexes to: mydir/index
Adding mydir/indexes/part-00000
done merging
crawl finished: mydir

Here's my conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6'</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
</configuration>
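One way to see where the link gets lost is to inspect the crawl db after the run. The bin/nutch script includes a readdb command for this; the options below are from memory for 0.9, so check "bin/nutch readdb" with no arguments for the exact usage:

$ bin/nutch readdb mydir/crawldb -stats
$ bin/nutch readdb mydir/crawldb -dump mydir/crawldb-dump

If http://www.sf911truth.org/about.html never appears in the dump, the outlink was never extracted from the main page, which is consistent with the regexp bug referenced above.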