Well, I've run into that before. Make sure you have set the "http.agent.name" property in conf/nutch-default.xml.
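For reference, the property entry might look roughly like this (a sketch only; the agent name "MyTestCrawler" is a placeholder, and in many setups the override would go in conf/nutch-site.xml rather than editing nutch-default.xml directly):

```xml
<!-- hypothetical snippet; "MyTestCrawler" is a placeholder value -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>HTTP 'User-Agent' request header sent by the fetcher.</description>
</property>
```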
songjue
2007-04-16

From: Meryl Silverburgh
Sent: 2007-04-16 12:08:26
To: [email protected]
Cc:
Subject: Re: Crawl www.yahoo.com with nutch

I am using 0.9 too. I am now getting further, but I get a bunch of NullPointerExceptions:

fetch of http://www.yahoo.com/s/557760 failed with: java.lang.NullPointerException
fetch of http://www.yahoo.com/r/hq failed with: java.lang.NullPointerException
fetch of http://www.yahoo.com/s/557762 failed with: java.lang.NullPointerException

On 4/15/07, songjue <[EMAIL PROTECTED]> wrote:
> I tried this; Nutch 0.9 works just fine. What's your Nutch version?
>
> songjue
> 2007-04-16
>
> From: Meryl Silverburgh
> Sent: 2007-04-16 11:33:05
> To: [email protected]
> Cc:
> Subject: Crawl www.yahoo.com with nutch
>
> I set up nutch to crawl; in my input file I have only one site,
> "http://www.yahoo.com"
>
> $ bin/nutch crawl urls -dir crawl -depth 3
>
> and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
>
> But no links are being fetched. When I change the link to www.cnn.com,
> it works. Can you please tell me what I need to do to make
> www.yahoo.com work?
>
> $ bin/nutch crawl urls -dir crawl -depth 3
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070415222440
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20070415222440
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20070415222440]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070415222449
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20070415222440
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20070415222440
> Indexing [http://www.yahoo.com/] with analyzer
> [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) into _0 (1 docs)
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding crawl/indexes/part-00000
> done merging
> crawl finished: crawl
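On the filter question in the quoted message: the crawl-urlfilter.txt pattern does accept http://www.yahoo.com/, which you can confirm with a quick regex check (a sketch in plain Java; the class name is mine, and this only exercises the regular expression itself, not Nutch's RegexURLFilter plugin). Note the leading '+' is stripped because that is Nutch's include marker, not part of the regex; also, the dots in cnn.com/yahoo.com are unescaped, so they match any character, which is harmless here.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    public static void main(String[] args) {
        // Pattern from crawl-urlfilter.txt, without the leading '+' include marker.
        Pattern p = Pattern.compile("^http://([a-zA-Z0-9]*\\.)*(cnn.com|yahoo.com)/");
        for (String url : new String[] {
                "http://www.yahoo.com/",
                "http://www.cnn.com/",
                "http://www.example.com/" }) {
            System.out.println(url + " -> " + p.matcher(url).find());
        }
        // prints: true, true, false
    }
}
```

Since the filter accepts the URL, the zero-fetch result points at something else, consistent with the http.agent.name advice above.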
