Setup:
A fresh install of Nutch 0.9 running under Cygwin (uname -a reports:
CYGWIN_NT-5.1 microlith 1.5.24(0.156/4/2) 2007-01-31 10:57 i686 Cygwin) with
Java 1.5.0_09-b03.
Problem:
I crawl www.sf911truth.org, a very small site, but the crawl never fetches
www.sf911truth.org/about.html, even though it is linked directly from the main
page via this HTML:
<a href="about.html"><strong>Mission Statement and
Meetings<br></strong></a>
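As a sanity check on my expectation, the relative href should resolve against the page URL to exactly the absolute URL I'm looking for. A quick sketch with Python's standard library (Python here is just my scratch tool, not part of Nutch):

```python
# Resolve the relative link against the fetched page's URL,
# the same way an outlink extractor would.
from urllib.parse import urljoin

base = "http://www.sf911truth.org/"   # the page that was fetched
href = "about.html"                   # the relative link on that page

resolved = urljoin(base, href)
print(resolved)  # http://www.sf911truth.org/about.html
```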
Here's my crawl command:
[EMAIL PROTECTED] /cygdrive/c/nutch-0.9
$ bin/nutch crawl conf/urls -dir mydir -depth 3 2>&1 | tee crawl.log
I expected to see the following line in the log file, but it never appears:
fetching http://www.sf911truth.org/about.html
Here's my crawl.log:
crawl started in: mydir
rootUrlDir = conf/urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: mydir/crawldb
Injector: urlDir: conf/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115517
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070620115517
Fetcher: threads: 10
fetching http://www.sf911truth.org/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070620115517]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115527
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070620115527
Fetcher: threads: 10
fetching http://www.sf911truth.org/popWin.js
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070620115527]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070620115535
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=2 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: mydir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: mydir/segments/20070620115517
LinkDb: adding segment: mydir/segments/20070620115527
LinkDb: done
Indexer: starting
Indexer: linkdb: mydir/linkdb
Indexer: adding segment: mydir/segments/20070620115517
Indexer: adding segment: mydir/segments/20070620115527
Indexing [http://www.sf911truth.org/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://www.sf911truth.org/popWin.js] with analyzer [EMAIL PROTECTED]
(null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: mydir/indexes
Dedup: done
merging indexes to: mydir/index
Adding mydir/indexes/part-00000
done merging
crawl finished: mydir
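To make sure I wasn't misreading the log, I pulled out every "fetching" line. A small sketch of that check (the embedded log text is the relevant excerpt from the run above; against the real file you'd read crawl.log instead):

```python
# Collect every URL the fetcher actually requested from the log text.
log_text = """\
fetching http://www.sf911truth.org/
Fetcher: done
fetching http://www.sf911truth.org/popWin.js
Fetcher: done
"""

fetched = [line.split(" ", 1)[1]
           for line in log_text.splitlines()
           if line.startswith("fetching ")]
print(fetched)
# ['http://www.sf911truth.org/', 'http://www.sf911truth.org/popWin.js']

# about.html never shows up among the fetched URLs
assert "http://www.sf911truth.org/about.html" not in fetched
```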
Here's my conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;rv:1.4b)
    Gecko/20030516 Mozilla Firebird/0.6'</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
</configuration>
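One thing I notice re-reading the file: the description says the value must be "a single word uniquely related to your organization", while my value is a full quoted browser string with spaces. In case that's relevant, here's the sort of minimal override the description seems to ask for (the agent name below is just a placeholder I made up, not anything Nutch requires):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder single-word agent name; substitute your own -->
    <value>mycrawler</value>
  </property>
</configuration>
```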
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general