Hi, I am a newbie. Please assist! I am using Cygwin (Windows XP) and Nutch 0.8.1.
In crawl-urlfilter.txt, I commented out the default domain line and added one for cnn.com:

    #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    +^http://([a-z0-9]*\.)*cnn.com/

Then I ran:

    $ mkdir urls
    $ echo 'http://www.cnn.com' > urls/seeds.txt
    $ nutch crawl urls -dir db -depth 1 -topN 10

I got the following error:

    [EMAIL PROTECTED] /cygdrive/d/corpus/data
    $ nutch crawl urls -dir db -depth 1 -threads 1 -topN 10
    crawl started in: db
    rootUrlDir = urls
    threads = 1
    depth = 1
    topN = 10
    Injector: starting
    Injector: crawlDb: db/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: Merging injected urls into crawl db.
    Injector: done
    Generator: starting
    Generator: segment: db/segments/20061026061130
    Generator: Selecting best-scoring urls due for fetch.
    Generator: Partitioning selected urls by host, for politeness.
    Generator: done.
    Fetcher: starting
    Fetcher: segment: db/segments/20061026061130
    Fetcher: threads: 1
    fetching http://www.cnn.com/
    Fetcher: done
    CrawlDb update: starting
    CrawlDb update: db: db/crawldb
    CrawlDb update: segment: db/segments/20061026061130
    CrawlDb update: Merging segment data into db.
    CrawlDb update: done
    LinkDb: starting
    LinkDb: linkdb: db/linkdb
    LinkDb: adding segment: db/segments/20061026061130
    LinkDb: done
    Indexer: starting
    Indexer: linkdb: db/linkdb
    Indexer: adding segment: db/segments/20061026061130
    Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

Help!!!

Regards,
Haward

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
