Nutch is not crawling relative URLs. I don't think Nutch is capable of crawling relative URLs at this time.

See this discussion:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01279.html

and this bug report, NUTCH-119, "Regexp to extract outlinks incorrect":
http://issues.apache.org/jira/browse/NUTCH-119
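
For what it's worth, the JDK itself resolves a relative href against a base URL just fine; here's a minimal sketch (class and method names are mine, not Nutch's) of the resolution an outlink extractor should be doing for a link like `<a href="about.html">` on that page:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeLinkDemo {
    // Resolve an href the way a correct outlink extractor should:
    // a relative URL is resolved against the URL of the page it appears on.
    public static String resolve(String base, String href)
            throws MalformedURLException {
        return new URL(new URL(base), href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // The link from the problem page: <a href="about.html">
        System.out.println(resolve("http://www.sf911truth.org/", "about.html"));
        // → http://www.sf911truth.org/about.html
    }
}
```

So the information needed to build the absolute URL is all there; per NUTCH-119 the problem is that the regexp-based extraction of outlinks misses these links in the first place.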


----- Original Message ----
From: Kai_testing Middleton <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Wednesday, June 20, 2007 12:08:29 PM
Subject: not crawling relative URLs

Setup:
A fresh install of Nutch 0.9 running under Cygwin (uname -a reports: 
CYGWIN_NT-5.1 microlith 1.5.24(0.156/4/2) 2007-01-31 10:57 i686 Cygwin) with 
Java 1.5.0_09-b03.

Problem:
I crawl www.sf911truth.org, a very small site, and don't get 
www.sf911truth.org/about.html, even though it is linked directly from the main 
page via this HTML:
<a href="about.html"><strong>Mission&nbsp;Statement and 
Meetings<br></strong></a>

Here's my crawl command:
[EMAIL PROTECTED] /cygdrive/c/nutch-0.9
$ bin/nutch crawl conf/urls -dir mydir -depth 3 2>&1 | tee crawl.log

I would expect to see the following in my log file, but I don't:
fetching http://www.sf911truth.org/about.html

Here's my crawl.log:

crawl started in: mydir
rootUrlDir = conf/urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: mydir/crawldb
Injector: urlDir: conf/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115517
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070620115517
Fetcher: threads: 10
fetching http://www.sf911truth.org/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070620115517]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115527
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070620115527
Fetcher: threads: 10
fetching http://www.sf911truth.org/popWin.js
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070620115527]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115535
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=2 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: mydir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: mydir/segments/20070620115517
LinkDb: adding segment: mydir/segments/20070620115527
LinkDb: done
Indexer: starting
Indexer: linkdb: mydir/linkdb
Indexer: adding segment: mydir/segments/20070620115517
Indexer: adding segment: mydir/segments/20070620115527
 Indexing [http://www.sf911truth.org/] with analyzer [EMAIL PROTECTED] (null)
 Indexing [http://www.sf911truth.org/popWin.js] with analyzer [EMAIL PROTECTED] 
(null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: mydir/indexes
Dedup: done
merging indexes to: mydir/index
Adding mydir/indexes/part-00000
done merging
crawl finished: mydir



Here's my conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;rv:1.4b) 
Gecko/20030516 Mozilla Firebird/0.6'</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
</configuration>





