Really strange. Did you try Luke? It's much more convenient for debugging.
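A minimal sketch of what I mean (the Luke jar name/version below is just an example, not something from this thread; crawl/index is the default index location produced by the crawl command):

$ # open the Lucene index produced by the crawl in Luke's GUI,
$ # then point Luke at the crawl/index directory
$ java -jar lukeall-0.7.1.jar

$ # or, from the command line, print crawldb statistics with Nutch itself
$ bin/nutch readdb crawl/crawldb -stats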
songjue
2007-04-16

From: Meryl Silverburgh
Sent: 2007-04-16 12:15:39
To: [email protected]
Cc:
Subject: Re: Crawl www.yahoo.com with nutch

I have used this command to crawl, up to a depth of 6:

$ bin/nutch crawl urls -dir crawl -depth 6

and then this command to read the links:

$ bin/nutch readdb crawl/crawldb -topN 50 test

but I only get 10 links. Can you please tell me why?

$ more test/part-00000
2.1111112    http://www.yahoo.com/
0.11111111   http://srd.yahoo.com/hp5-v
0.11111111   http://www.yahoo.com/+document.cookie+
0.11111111   http://www.yahoo.com/1.0
0.11111111   http://www.yahoo.com/2.0.0
0.11111111   http://www.yahoo.com/r/hf
0.11111111   http://www.yahoo.com/r/hq
0.11111111   http://www.yahoo.com/r\/1m
0.11111111   http://www.yahoo.com/s/557762
0.11111111   http://www.yahoo.com/s/557770

On 4/15/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> I am using 0.9 too.
>
> I am now getting further, but I get a bunch of NullPointerExceptions:
>
> fetch of http://www.yahoo.com/s/557760 failed with:
> java.lang.NullPointerException
> fetch of http://www.yahoo.com/r/hq failed with: java.lang.NullPointerException
> fetch of http://www.yahoo.com/s/557762 failed with:
> java.lang.NullPointerException
>
> On 4/15/07, songjue <[EMAIL PROTECTED]> wrote:
> > I tried this, and Nutch 0.9 works just fine. What's your Nutch version?
> >
> > songjue
> > 2007-04-16
> >
> > From: Meryl Silverburgh
> > Sent: 2007-04-16 11:33:05
> > To: [email protected]
> > Cc:
> > Subject: Crawl www.yahoo.com with nutch
> >
> > I set up nutch to crawl; in my input file, I only have one site,
> > "http://www.yahoo.com"
> >
> > $ bin/nutch crawl urls -dir crawl -depth 3
> >
> > and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt:
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
> >
> > But no links are being fetched. When I change the link to www.cnn.com,
> > it works. Can you please tell me what I need to do to make
> > www.yahoo.com work?
> >
> > $ bin/nutch crawl urls -dir crawl -depth 3
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 3
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070415222440
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20070415222440
> > Fetcher: threads: 10
> > fetching http://www.yahoo.com/
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20070415222440]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070415222449
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20070415222440
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20070415222440
> > Indexing [http://www.yahoo.com/] with analyzer
> > [EMAIL PROTECTED] (null)
> > Optimizing index.
> > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> >
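A side note on the crawl-urlfilter.txt rule quoted above: the patterns in that file are applied top to bottom, and the first matching +/- rule decides whether a URL is kept. Below is a sketch of how the relevant section typically looks in a stock 0.9 install (exact defaults can differ between releases, and whether this is the cause here is only a guess):

# skip URLs containing certain characters as probable queries, etc.
# (this default rule sits above the domain rule, so URLs with ? * ! @ =
#  are dropped before the +yahoo.com/cnn.com line is ever reached)
-[?*!@=]

# accept hosts in the listed domains (the rule added in the message above)
+^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/

# skip everything else
-.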
