I'll try
First... I don't use 'crawl'; I do it the long-winded way, as I find
it works better.
From what I can guess, I'm thinking you haven't modified the
regex-urlfilter.txt file to allow Yahoo to be crawled.
You would need to add:
+^http://([a-z0-9]*\.)*yahoo.com/
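If you want to sanity-check that pattern before re-crawling, grep -E gets close enough for a quick test (a sketch only; grep's ERE is not the Java regex engine Nutch actually uses, but it agrees on this pattern):

```shell
# Quick sanity check of the urlfilter pattern via grep -E
# (an approximation of Nutch's Java regex matching).
pattern='^http://([a-z0-9]*\.)*yahoo.com/'
echo 'http://www.yahoo.com/' | grep -E "$pattern"                   # should match
echo 'http://autos.aol.com/' | grep -E "$pattern" || echo 'no match'
```

The first line should echo the Yahoo URL back; the second should print 'no match'.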
The easiest walkthrough on how to get all this working is here:
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
Page down to the section 'setting up nutch'
and follow the four-step process documented there.
Just change the last nutch command from:
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
to
bin/nutch index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
and it should do all the right things.
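For reference, one pass of that long-winded cycle looks roughly like this (a sketch only, mirroring the blog's 'setting up nutch' steps; BASEDIR and the 'urls' seed directory are assumptions, and it only runs inside a Nutch 0.9 install):

```shell
# One iteration of the step-by-step crawl (sketch; paths are assumptions).
BASEDIR=crawl
bin/nutch inject $BASEDIR/crawldb urls                  # seed the crawldb
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments   # pick URLs due for fetch
SEGMENT=$BASEDIR/segments/`ls $BASEDIR/segments | tail -1`
bin/nutch fetch $SEGMENT                                # fetch them
bin/nutch updatedb $BASEDIR/crawldb $SEGMENT            # fold results back in
bin/nutch invertlinks $BASEDIR/linkdb $SEGMENT          # build the linkdb
bin/nutch index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
```

The last line is as given in the mail; depending on your Nutch version, index may also want an index directory as its first argument, so check bin/nutch index with no arguments for the usage string.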
regards
Ian
(ps... I'm no expert, just 1-2 steps ahead of where you are)
On 18/04/2007, at 12:12 PM, Meryl Silverburgh wrote:
Ian,
Can you please help me with my problem too?
I am trying to set up Nutch 0.9 to crawl www.yahoo.com.
I am using this command: "bin/nutch crawl urls -dir crawl -depth 3".
But after the command, no links have been fetched.
The only strange thing I see in the hadoop log is this warning:
2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find
rules for scope 'outlink', using default
Is that something I need to setup before www.yahoo.com can be crawled?
Here is the output:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230326
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070416230326
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070416230326]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230338
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070416230326
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070416230326
Indexing [http://www.yahoo.com/] with analyzer
[EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
CrawlDb topN: starting (topN=25, min=0.0)
CrawlDb db: crawl/crawldb
CrawlDb topN: collecting topN scores.
CrawlDb topN: done
Match
On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
Hi Anita.
I tried crawling autos.aol.com, and I could find pages similar to
what you're looking at in 3 crawls. (I injected http://autos.aol.com/
and added autos.aol.com to my regex filter to allow it.)
e.g.
fetching http://autos.aol.com/bmw-650-2007:8774-photos
fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
fetching http://autos.aol.com/options_trimless?v=8544
fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
fetching http://autos.aol.com/bmw-m-2007:8905-overview
fetching http://autos.aol.com/getaquote?myid=8623
fetching http://autos.aol.com/options_trimless?v=8226
fetching http://autos.aol.com/options_trimless?v=7803
fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
fetching http://autos.aol.com/bmw-x3-2007:8770-specs
fetching http://autos.aol.com/saturn-vue-2007:8371-overview
fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
fetching http://autos.aol.com/options_trimless?v=8394
fetching http://autos.aol.com/jaguar-listings:JA---
fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
fetching http://autos.aol.com/bmw-x5-2007:8817-overview
fetching http://autos.aol.com/audi-a4-2007:8622-specs
fetching http://autos.aol.com/options_trimless?v=8416
fetching http://autos.aol.com/getaquote?myid=8774
The difference is that I am using the latest Nutch (SVN head), and
am just using a local store, not Hadoop.
What I would do next, if I were you, is check your regex filters to
make sure you are not blocking URLs with a colon ':' in them for
some strange reason, and possibly upgrade to the latest and greatest
version of Nutch (0.9.1).
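One concrete way to check the colon theory: the stock regex-urlfilter.txt ships a skip rule along the lines of -[?*!@=] (drop URLs containing any of those characters). If that character class has grown a ':' in your copy, the 2006:8060-style URLs would get filtered out. A quick grep sketch (the exact rule in your file is an assumption, so compare against your own copy):

```shell
# Does the problem URL contain any of the skipped characters?
url='http://autos.aol.com/acura-rl-2006:8060-review'
echo "$url" | grep -E '[?*!@=]'  || echo 'passes a stock-style skip rule'
echo "$url" | grep -E '[?*!@=:]' > /dev/null && echo 'a rule with : would drop it'
```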
regards
Ian.
On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:
> Hi
>
> I am a new Nutch user, and am using Nutch 0.8.1 with Hadoop. The
> domain I am trying to crawl is http://autos.aol.com. I am crawling
> to a depth of 10.
> There are certain pages that Nutch could not fetch. An example
> would be http://autos.aol.com/acura-rl-2006:8060-review.
>
> The referring URL to this page is
> http://autos.aol.com/acura-rl-2007:8060-review. This URL was
> in the fetch list.
>
> I did a mini crawl pointing directly at
> http://autos.aol.com/acura-rl-2007:8060-review, and then the page
> http://autos.aol.com/acura-rl-2006:8060-review did get fetched.
>
> Does anyone have any ideas on why I am seeing this behavior?
>
>
> Thanks
> Anita Bidari (X55746)
Ian Holsman
[EMAIL PROTECTED]
http://parent-chatter.com -- what do parents know?
--
Ian Holsman
[EMAIL PROTECTED]
http://zyons.com/ build a Community with Django