I am crawling http://lucene.apache.org. In conf/crawl-urlfilter.txt I set

    +^http://([a-z0-9]*\.)*apache.org/

and when I run

    bin/nutch crawl urls -dir crawled -depth 3

the crawl fails with the error below.
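The rest of conf/crawl-urlfilter.txt is as shipped with Nutch 0.9. From memory, the relevant part looks roughly like this (an abbreviated sketch; only the apache.org line is my change):

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):
    # skip URLs containing certain characters as probable queries
    -[?*!@=]
    # accept hosts in apache.org
    +^http://([a-z0-9]*\.)*apache.org/
    # skip everything else
    -.

Here is the full output: first the failing run with urls, then a retry with a seed directory named inputs.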
- crawl started in: crawled
- rootUrlDir = urls
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: urls
- Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /user/nutch/urls
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

-bash-3.1$ bin/nutch crawl inputs -dir crawled -depth 3
- crawl started in: crawled
- rootUrlDir = inputs
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: inputs
- Injector: Converting injected urls to crawl db entries.
- Total input paths to process : 1
- Running job: job_0001
- map 0% reduce 0%
- map 100% reduce 0%
- map 100% reduce 16%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0001
- Counters: 6
-   Map-Reduce Framework
-     Map input records=3
-     Map output records=1
-     Map input bytes=25
-     Map output bytes=55
-     Reduce input records=1
-     Reduce output records=1
- Injector: Merging injected urls into crawl db.
- Total input paths to process : 2
- Running job: job_0002
- map 0% reduce 0%
- Task Id : task_0002_m_000000_0, Status : FAILED
task_0002_m_000000_0: - Plugins: looking in: /nutch/search/build/plugins
task_0002_m_000000_0: - Plugin Auto-activation mode: [true]
task_0002_m_000000_0: - Registered Plugins:
task_0002_m_000000_0: -   the nutch core extension points (nutch-extensionpoints)
task_0002_m_000000_0: -   Basic Query Filter (query-basic)
task_0002_m_000000_0: -   Basic URL Normalizer (urlnormalizer-basic)
task_0002_m_000000_0: -   Basic Indexing Filter (index-basic)
task_0002_m_000000_0: -   Html Parse Plug-in (parse-html)
task_0002_m_000000_0: -   Basic Summarizer Plug-in (summary-basic)
task_0002_m_000000_0: -   Site Query Filter (query-site)
task_0002_m_000000_0: -   HTTP Framework (lib-http)
task_0002_m_000000_0: -   Text Parse Plug-in (parse-text)
task_0002_m_000000_0: -   Regex URL Filter (urlfilter-regex)
task_0002_m_000000_0: -   Pass-through URL Normalizer (urlnormalizer-pass)
task_0002_m_000000_0: -   Http Protocol Plug-in (protocol-http)
task_0002_m_000000_0: -   Regex URL Normalizer (urlnormalizer-regex)
task_0002_m_000000_0: -   OPIC Scoring Plug-in (scoring-opic)
task_0002_m_000000_0: -   CyberNeko HTML Parser (lib-nekohtml)
task_0002_m_000000_0: -   JavaScript Parser (parse-js)
task_0002_m_000000_0: -   URL Query Filter (query-url)
task_0002_m_000000_0: -   Regex URL Filter Framework (lib-regex-filter)
task_0002_m_000000_0: - Registered Extension-Points:
task_0002_m_000000_0: -   Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
task_0002_m_000000_0: -   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
task_0002_m_000000_0: -   Nutch Protocol (org.apache.nutch.protocol.Protocol)
task_0002_m_000000_0: -   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
task_0002_m_000000_0: -   Nutch URL Filter (org.apache.nutch.net.URLFilter)
task_0002_m_000000_0: -   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
task_0002_m_000000_0: -   Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
task_0002_m_000000_0: -   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
task_0002_m_000000_0: -   Nutch Content Parser (org.apache.nutch.parse.Parser)
task_0002_m_000000_0: -   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
task_0002_m_000000_0: -   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
task_0002_m_000000_0: -   Ontology Model Loader (org.apache.nutch.ontology.Ontology)
task_0002_m_000000_0: - found resource crawl-urlfilter.txt at file:/nutch/search/conf/crawl-urlfilter.txt
- map 50% reduce 0%
- map 100% reduce 0%
- map 100% reduce 8%
- map 100% reduce 25%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0002
- Counters: 6
-   Map-Reduce Framework
-     Map input records=3
-     Map output records=1
-     Map input bytes=63
-     Map output bytes=55
-     Reduce input records=1
-     Reduce output records=1
- Injector: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawled/segments/25510102165746
- Generator: filtering: false
- Generator: topN: 2147483647
- Total input paths to process : 2
- Running job: job_0003
- map 0% reduce 0%
- map 50% reduce 0%
- map 100% reduce 0%
- map 100% reduce 8%
- map 100% reduce 16%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0003
- Counters: 6
-   Map-Reduce Framework
-     Map input records=3
-     Map output records=1
-     Map input bytes=62
-     Map output bytes=80
-     Reduce input records=1
-     Reduce output records=1
- Generator: Partitioning selected urls by host, for politeness.
- Total input paths to process : 2
- Running job: job_0004
- map 0% reduce 0%
- map 50% reduce 0%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_0, Status : FAILED
- Task Id : task_0004_r_000001_0, Status : FAILED
- map 100% reduce 8%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_1, Status : FAILED
- Task Id : task_0004_r_000001_1, Status : FAILED
- map 100% reduce 8%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_2, Status : FAILED

I am using hadoop-0.12.2, nutch-0.9 and JDK 1.6.0. Why does this happen? I have been stuck on this for a month and cannot solve it.
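A note on the first failure: "Input path doesnt exist : /user/nutch/urls" just meant I had never copied a urls directory into DFS, which is why the retry with inputs got past the injector. For reference, this is roughly how I set up the seed directory before the second run (a sketch; the file name seed.txt is only illustrative):

    # create a local seed list containing the start URL
    mkdir inputs
    echo 'http://lucene.apache.org/' > inputs/seed.txt

    # copy it into DFS so the injector can find /user/nutch/inputs
    bin/hadoop dfs -put inputs inputs

    # confirm it is there
    bin/hadoop dfs -ls inputs

So the part I cannot get past is job_0004: the reduce tasks of the "Partitioning selected urls by host" step keep failing, and the console output above shows no error detail for them.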