Really strange. Did you try Luke? It's much more convenient for debugging.
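A minimal sketch of what I mean (the Luke jar name/version below is just an example, not something from this thread; crawl/index is the default index location produced by the crawl command):

$ # open the Lucene index produced by the crawl in Luke's GUI,
$ # then point Luke at the crawl/index directory
$ java -jar lukeall-0.7.1.jar

$ # or, from the command line, print crawldb statistics with Nutch itself
$ bin/nutch readdb crawl/crawldb -stats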
songjue
2007-04-16

From: Meryl Silverburgh
Sent: 2007-04-16 12:15:39
To: [email protected]
Cc:
Subject: Re: Crawl www.yahoo.com with nutch

I have used this command to crawl, up to a depth of 6:

$ bin/nutch crawl urls -dir crawl -depth 6

and then this command to read the links:

$ bin/nutch readdb crawl/crawldb -topN 50 test

but I only get 10 links. Can you please tell me why?

$ more test/part-00000
2.1111112    http://www.yahoo.com/
0.11111111   http://srd.yahoo.com/hp5-v
0.11111111   http://www.yahoo.com/+document.cookie+
0.11111111   http://www.yahoo.com/1.0
0.11111111   http://www.yahoo.com/2.0.0
0.11111111   http://www.yahoo.com/r/hf
0.11111111   http://www.yahoo.com/r/hq
0.11111111   http://www.yahoo.com/r\/1m
0.11111111   http://www.yahoo.com/s/557762
0.11111111   http://www.yahoo.com/s/557770

On 4/15/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> I am using 0.9 too.
>
> I am now getting further, but I get a bunch of NullPointerExceptions:
>
> fetch of http://www.yahoo.com/s/557760 failed with:
> java.lang.NullPointerException
> fetch of http://www.yahoo.com/r/hq failed with: java.lang.NullPointerException
> fetch of http://www.yahoo.com/s/557762 failed with:
> java.lang.NullPointerException
>
> On 4/15/07, songjue <[EMAIL PROTECTED]> wrote:
> > I tried this, and Nutch 0.9 works just fine. What's your Nutch version?
> >
> > songjue
> > 2007-04-16
> >
> > From: Meryl Silverburgh
> > Sent: 2007-04-16 11:33:05
> > To: [email protected]
> > Cc:
> > Subject: Crawl www.yahoo.com with nutch
> >
> > I set up nutch to crawl; in my input file, I only have one site,
> > "http://www.yahoo.com"
> >
> > $ bin/nutch crawl urls -dir crawl -depth 3
> >
> > and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt:
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
> >
> > But no links are being fetched. When I change the link to www.cnn.com,
> > it works. Can you please tell me what I need to do to make
> > www.yahoo.com work?
> >
> > $ bin/nutch crawl urls -dir crawl -depth 3
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 3
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070415222440
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20070415222440
> > Fetcher: threads: 10
> > fetching http://www.yahoo.com/
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20070415222440]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070415222449
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20070415222440
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20070415222440
> > Indexing [http://www.yahoo.com/] with analyzer
> > [EMAIL PROTECTED] (null)
> > Optimizing index.
> > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> >
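A side note on the crawl-urlfilter.txt rule quoted above: the patterns in that file are applied top to bottom, and the first matching +/- rule decides whether a URL is kept. Below is a sketch of how the relevant section typically looks in a stock 0.9 install (exact defaults can differ between releases, and whether this is the cause here is only a guess):

# skip URLs containing certain characters as probable queries, etc.
# (this default rule sits above the domain rule, so URLs with ? * ! @ =
#  are dropped before the +yahoo.com/cnn.com line is ever reached)
-[?*!@=]

# accept hosts in the listed domains (the rule added in the message above)
+^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/

# skip everything else
-.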
