Well, I've run into that before. Make sure you have set the "http.agent.name" property in conf/nutch-default.xml.
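For reference, the property entry might look roughly like this (a sketch only; the agent name "MyTestCrawler" is a placeholder, and in many setups the override would go in conf/nutch-site.xml rather than editing nutch-default.xml directly):

```xml
<!-- hypothetical snippet; "MyTestCrawler" is a placeholder value -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>HTTP 'User-Agent' request header sent by the fetcher.</description>
</property>
```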
songjue
2007-04-16

From: Meryl Silverburgh
Sent: 2007-04-16 12:08:26
To: [email protected]
Cc:
Subject: Re: Crawl www.yahoo.com with nutch

I am using 0.9 too. I am now getting further, but I get a bunch of NullPointerExceptions:

fetch of http://www.yahoo.com/s/557760 failed with: java.lang.NullPointerException
fetch of http://www.yahoo.com/r/hq failed with: java.lang.NullPointerException
fetch of http://www.yahoo.com/s/557762 failed with: java.lang.NullPointerException

On 4/15/07, songjue <[EMAIL PROTECTED]> wrote:
> I tried this; Nutch 0.9 works just fine. What's your Nutch version?
>
> songjue
> 2007-04-16
>
> From: Meryl Silverburgh
> Sent: 2007-04-16 11:33:05
> To: [email protected]
> Cc:
> Subject: Crawl www.yahoo.com with nutch
>
> I set up nutch to crawl; in my input file I have only one site,
> "http://www.yahoo.com"
>
> $ bin/nutch crawl urls -dir crawl -depth 3
>
> and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
>
> But no links are being fetched. When I change the link to www.cnn.com,
> it works. Can you please tell me what I need to do to make
> www.yahoo.com work?
>
> $ bin/nutch crawl urls -dir crawl -depth 3
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070415222440
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20070415222440
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20070415222440]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070415222449
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20070415222440
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20070415222440
> Indexing [http://www.yahoo.com/] with analyzer
> [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) into _0 (1 docs)
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding crawl/indexes/part-00000
> done merging
> crawl finished: crawl
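On the filter question in the quoted message: the crawl-urlfilter.txt pattern does accept http://www.yahoo.com/, which you can confirm with a quick regex check (a sketch in plain Java; the class name is mine, and this only exercises the regular expression itself, not Nutch's RegexURLFilter plugin). Note the leading '+' is stripped because that is Nutch's include marker, not part of the regex; also, the dots in cnn.com/yahoo.com are unescaped, so they match any character, which is harmless here.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    public static void main(String[] args) {
        // Pattern from crawl-urlfilter.txt, without the leading '+' include marker.
        Pattern p = Pattern.compile("^http://([a-zA-Z0-9]*\\.)*(cnn.com|yahoo.com)/");
        for (String url : new String[] {
                "http://www.yahoo.com/",
                "http://www.cnn.com/",
                "http://www.example.com/" }) {
            System.out.println(url + " -> " + p.matcher(url).find());
        }
        // prints: true, true, false
    }
}
```

Since the filter accepts the URL, the zero-fetch result points at something else, consistent with the http.agent.name advice above.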
