Re: [Nutch-general] Nutch Crawl Question

Ian Holsman Tue, 17 Apr 2007 19:39:39 -0700

I'll try.

First, I don't use 'crawl'. I do it the long-winded way; I find it
works better.
From what I can guess, you haven't modified the regex-urlfilter.txt
file to allow yahoo to be crawled.
You would need to add:
+^http://([a-z0-9]*\.)*yahoo.com/
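
To see what a rule like this does, here is a minimal Python sketch of how such a filter file is read. It is not Nutch code: it assumes that '+' accepts, '-' rejects, the first matching pattern wins, and a URL that matches nothing is dropped. The helper names (parse_rules, accepts) and the two extra rules around the yahoo line are made up for illustration.

```python
import re

# Hypothetical rule set in the style of regex-urlfilter.txt.
# Only the yahoo line comes from the message; the other two are illustrative.
RULES = r"""
# skip non-http schemes
-^(file|ftp|mailto):
+^http://([a-z0-9]*\.)*yahoo.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First rule whose pattern is found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.yahoo.com/", "http://www.example.org/"):
        print(url, "accepted" if accepts(url, rules) else "rejected")
```

Note that the '.' in yahoo.com is unescaped, so it matches any character; writing yahoo\.com would be stricter.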



The easiest documentation on how to get all this working is here:
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html

Page down to the section 'setting up nutch'
and follow the 4-step process documented there.

Just change the last nutch command from:
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT

to
bin/nutch index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT

and it should do all the right things.
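
For reference, the long-winded way mentioned above usually means running the crawl steps one at a time. The sketch below only builds and prints the commands (a dry run). The step order and arguments are my assumption based on the standard Nutch step-by-step tutorial, and the blog post's exact steps may differ. The function name crawl_cycle and the example paths are hypothetical. The index command from this message would then run last.

```python
from typing import List


def crawl_cycle(basedir: str, urls_dir: str, segment: str) -> List[List[str]]:
    """Return one round of crawl commands as argv lists, without running them.

    Assumed layout: crawldb, linkdb and segments live under basedir.
    `segment` is the directory that the generate step creates; in a real run
    you would look it up after generate (it is the newest entry in segments).
    """
    crawldb = f"{basedir}/crawldb"
    linkdb = f"{basedir}/linkdb"
    segments = f"{basedir}/segments"
    return [
        ["bin/nutch", "inject", crawldb, urls_dir],      # seed the crawl db
        ["bin/nutch", "generate", crawldb, segments],    # create a fetch list
        ["bin/nutch", "fetch", segment],                 # fetch that segment
        ["bin/nutch", "updatedb", crawldb, segment],     # fold results back in
        ["bin/nutch", "invertlinks", linkdb, segment],   # build the link db
    ]


if __name__ == "__main__":
    # Example paths only; substitute your own $BASEDIR and segment name.
    for argv in crawl_cycle("crawl", "urls", "crawl/segments/20070416230326"):
        print(" ".join(argv))
```

Repeating generate, fetch and updatedb is what gives you more depth; each round picks up the links found in the previous one.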


regards
Ian
(ps... I'm no expert, just 1-2 steps ahead of where you are)

On 18/04/2007, at 12:12 PM, Meryl Silverburgh wrote:

> Ian,
>
> can you please help me with my problem too?
>
> I am trying to set up Nutch 0.9 to crawl www.yahoo.com.
> I am using this command "bin/nutch crawl urls -dir crawl -depth 3".
>
> But after the command, no links have been fetched.
>
> the only strange thing I see in the hadoop log is this warning:
>
> 2007-04-16 23:22:48,062 WARN  regex.RegexURLNormalizer - can't find
> rules for scope 'outlink', using default
>
> Is that something I need to set up before www.yahoo.com can be crawled?
>
> Here is the output:
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070416230326
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20070416230326
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20070416230326]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070416230338
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20070416230326
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20070416230326
> Indexing [http://www.yahoo.com/] with analyzer
> [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) into _0 (1 docs)
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding crawl/indexes/part-00000
> done merging
> crawl finished: crawl
> CrawlDb topN: starting (topN=25, min=0.0)
> CrawlDb db: crawl/crawldb
> CrawlDb topN: collecting topN scores.
> CrawlDb topN: done
> Match
>
>
> On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
>> Hi Anita.
>>
>> I tried crawling autos.aol.com, and I could find pages similar to
>> what you're looking at in 3 crawls. (I injected http://autos.aol.com/
>> and added autos.aol.com to my regex filter to allow it)
>>
>>
>> eg.
>> fetching http://autos.aol.com/bmw-650-2007:8774-photos
>> fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
>> fetching http://autos.aol.com/options_trimless?v=8544
>> fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
>> fetching http://autos.aol.com/bmw-m-2007:8905-overview
>> fetching http://autos.aol.com/getaquote?myid=8623
>> fetching http://autos.aol.com/options_trimless?v=8226
>> fetching http://autos.aol.com/options_trimless?v=7803
>> fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
>> fetching http://autos.aol.com/bmw-x3-2007:8770-specs
>> fetching http://autos.aol.com/saturn-vue-2007:8371-overview
>> fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
>> fetching http://autos.aol.com/options_trimless?v=8394
>> fetching http://autos.aol.com/jaguar-listings:JA---
>> fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
>> fetching http://autos.aol.com/bmw-x5-2007:8817-overview
>> fetching http://autos.aol.com/audi-a4-2007:8622-specs
>> fetching http://autos.aol.com/options_trimless?v=8416
>> fetching http://autos.aol.com/getaquote?myid=8774
>>
>> The difference is that I am using the latest nutch (SVN head), and
>> am just using a local store, not hadoop.
>>
>> what I would do next if I were you is to check your regex filters to
>> make sure you are not blocking things with a colon ':' in them for
>> some strange reason,
>> and possibly upgrade to the latest and greatest version of nutch.
>> (0.9.1)
>>
>> regards
>> Ian.
>>
>>
>>
>> On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:
>>
>> > Hi
>> >
>> > I am a new Nutch user, and am using Nutch 0.8.1 with Hadoop. The
>> > domain I am trying to crawl is http://autos.aol.com and I am
>> > crawling to a depth of 10.
>> > There are certain pages that Nutch could not fetch. An example
>> > would be
>> > http://autos.aol.com/acura-rl-2006:8060-review
>> >
>> > The referring URL to this page is
>> > http://autos.aol.com/acura-rl-2007:8060-review
>> > This URL was in the fetch list.
>> >
>> > When I did a mini crawl pointing directly to
>> > http://autos.aol.com/acura-rl-2007:8060-review
>> > the page
>> > http://autos.aol.com/acura-rl-2006:8060-review
>> > did get fetched.
>> >
>> > Does anyone have any ideas on why I am seeing this behavior?
>> >
>> >
>> > Thanks
>> > Anita Bidari (X55746)
>> >
>> >
>> >
>> >
>>
>> Ian Holsman
>> [EMAIL PROTECTED]
>> http://parent-chatter.com -- what do parents know?
>>
>>
>>

--
Ian Holsman
[EMAIL PROTECTED]
http://zyons.com/ build a Community with Django



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
