I mentioned it to show that the Fetcher works correctly with the local filesystem, so the problem is not in the code.
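To be explicit about what "local filesystem" means here: the local run was simply the same crawl with Hadoop left at its stock local settings, i.e. roughly the following in core-site.xml / mapred-site.xml (these are the defaults as far as I know, shown only for comparison with the HDFS setup quoted further down):

<property>
  <name>fs.default.name</name>
  <value>file:///</value>
  <description>Default: use the local filesystem instead of HDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
  <description>Default: run map/reduce in-process instead of against a JobTracker.</description>
</property>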
The logs I got are the following ones:

hadoop-root-datanode...
2010-01-20 18:50:41,748 INFO mortbay.log - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2010-01-20 18:50:41,804 INFO mortbay.log - jetty-6.1.14
2010-01-20 18:50:42,046 INFO mortbay.log - Started selectchannelconnec...@0.0.0.0:50075

hadoop-root-jobtracker
2010-01-20 18:50:44,032 INFO mortbay.log - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2010-01-20 18:50:44,149 INFO mortbay.log - jetty-6.1.14
2010-01-20 18:50:48,910 INFO mortbay.log - Started selectchannelconnec...@0.0.0.0:50030

hadoop-root-namenode
2010-01-20 18:50:40,529 INFO mortbay.log - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2010-01-20 18:50:40,587 INFO mortbay.log - jetty-6.1.14
2010-01-20 18:50:40,866 INFO mortbay.log - Started selectchannelconnec...@0.0.0.0:50070

hadoop-root-secondarynamenode
2010-01-20 18:50:43,007 INFO mortbay.log - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2010-01-20 18:50:43,063 INFO mortbay.log - jetty-6.1.14
2010-01-20 18:50:52,091 INFO mortbay.log - Started selectchannelconnec...@0.0.0.0:50090
2010-01-20 18:50:52,092 WARN namenode.SecondaryNameNode - Checkpoint Period :3600 secs (60 min)
2010-01-20 18:50:52,092 WARN namenode.SecondaryNameNode - Log Size Trigger :67108864 bytes (65536 KB)
2010-01-20 18:55:52,499 WARN namenode.SecondaryNameNode - Checkpoint done. New Image Size: 10329

hadoop-root-tasktracker
2010-01-20 18:50:45,196 INFO mortbay.log - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2010-01-20 18:50:45,299 INFO mortbay.log - jetty-6.1.14
2010-01-20 18:50:54,703 INFO mortbay.log - Started selectchannelconnec...@0.0.0.0:50060
2010-01-20 18:50:54,817 WARN mapred.TaskTracker - TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.

hadoop (resumed)
2010-01-20 18:50:47,055 INFO crawl.Crawl - crawl started in: crawled
2010-01-20 18:50:47,055 INFO crawl.Crawl - rootUrlDir = urls
2010-01-20 18:50:47,055 INFO crawl.Crawl - threads = 10
2010-01-20 18:50:47,055 INFO crawl.Crawl - depth = 2
2010-01-20 18:51:21,767 INFO crawl.Injector - Injector: starting
2010-01-20 18:51:21,767 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
2010-01-20 18:51:21,767 INFO crawl.Injector - Injector: urlDir: urls
2010-01-20 18:51:21,794 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2010-01-20 18:51:56,353 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2010-01-20 18:52:34,852 INFO crawl.Injector - Injector: done
2010-01-20 18:52:35,888 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2010-01-20 18:52:35,888 INFO crawl.Generator - Generator: starting
2010-01-20 18:52:35,888 INFO crawl.Generator - Generator: segment: crawled/segments/20100120185235
2010-01-20 18:52:35,889 INFO crawl.Generator - Generator: filtering: true
2010-01-20 18:53:54,795 INFO http.Http - http.proxy.host = null
2010-01-20 18:53:54,795 INFO http.Http - http.proxy.port = 8080
2010-01-20 18:53:54,795 INFO http.Http - http.timeout = 10000
2010-01-20 18:53:54,795 INFO http.Http - http.content.limit = 65536
2010-01-20 18:53:54,795 INFO http.Http - http.agent = cierzo/Nutch-1.0
2010-01-20 18:53:54,795 INFO http.Http - protocol.plugin.check.blocking = false
2010-01-20 18:53:54,795 INFO http.Http - protocol.plugin.check.robots = false
2010-01-20 18:53:55,626 INFO fetcher.Fetcher - fetch of http://aldea-irreductible.blogspot.com/ failed with: java.lang.NullPointerException
2010-01-20 18:53:55,828 INFO fetcher.Fetcher - -activeThreads=0
2010-01-20 18:53:57,854 INFO fetcher.Fetcher - Fetcher: threads: 10
2010-01-20 18:53:57,875 INFO fetcher.Fetcher - QueueFeeder finished: total 0 records.
2010-01-20 18:54:11,824 INFO fetcher.Fetcher - Fetcher: done
2010-01-20 18:54:11,868 INFO crawl.CrawlDb - CrawlDb update: starting
2010-01-20 18:54:11,868 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2010-01-20 18:54:11,868 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20100120185235]
2010-01-20 18:54:11,869 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2010-01-20 18:54:11,869 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2010-01-20 18:54:11,869 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2010-01-20 18:54:11,871 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2010-01-20 18:54:48,640 INFO crawl.CrawlDb - CrawlDb update: done
2010-01-20 18:54:49,865 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2010-01-20 18:54:49,865 INFO crawl.Generator - Generator: starting
2010-01-20 18:54:49,865 INFO crawl.Generator - Generator: segment: crawled/segments/20100120185449
2010-01-20 18:54:49,865 INFO crawl.Generator - Generator: filtering: true
2010-01-20 18:55:01,361 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:01,362 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2010-01-20 18:55:01,362 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2010-01-20 18:55:01,419 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:01,420 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2010-01-20 18:55:01,420 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2010-01-20 18:55:10,036 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:10,036 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2010-01-20 18:55:10,036 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2010-01-20 18:55:13,021 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:13,022 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2010-01-20 18:55:13,022 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2010-01-20 18:55:21,335 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2010-01-20 18:55:21,340 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
2010-01-20 18:55:21,340 INFO crawl.Crawl - crawl finished: crawled
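Regarding the NullPointerException on the fetch above: my nutch-site.xml is indeed still empty. From what I understand, a minimal nutch-site.xml normally just overrides the agent properties defined in nutch-default.xml (and http.agent.name must not be left empty), roughly like the sketch below. The agent name is the one that already appears in my log; the description, url and email values are only placeholders I made up, not a confirmed fix:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Sketch of a minimal nutch-site.xml. Property names come from
     nutch-default.xml; the values are placeholders. -->
<configuration>

  <property>
    <name>http.agent.name</name>
    <value>cierzo</value>
    <description>Name used in the HTTP User-Agent header; must not be empty.</description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>test crawler</value>
    <description>Short description appended to the User-Agent header.</description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler.html</value>
    <description>Placeholder URL advertised in the User-Agent header.</description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>crawler@example.com</value>
    <description>Placeholder contact address advertised in the User-Agent header.</description>
  </property>

</configuration>

As far as I know, when the crawl runs on Hadoop this configuration is the one packed into the Nutch job file by ant, so after editing it I would rebuild before relaunching; please correct me if I got that wrong.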
MilleBii wrote:
>
> Well, it does not really work that way.
>
> If you want to use it, make it run on one node (pseudo-distributed mode)
> and then deploy.
> If you have it running in pseudo-distributed mode it won't use the local
> filesystem, which is why I don't understand the remarks in your initial
> mail.
>
> The Nutch logs are in NUTCH_HOME/logs; look for the hadoop file, it will
> tell you more or less what is happening.
>
>
> 2010/1/20 Santiago Pérez <elara...@gmail.com>
>
>>
>> I launch HDFS because I want to make it work on one computer first and,
>> when it works, launch it on several machines as a distributed version.
>>
>> Which logs do you need to check?
>>
>>
>> MilleBii wrote:
>> >
>> > Why do you launch HDFS if you don't want to use it?
>> >
>> > What are the logs saying? All fetched URLs are usually logged, but
>> > nothing is displayed.
>> >
>> > 2010/1/20, Santiago Pérez <elara...@gmail.com>:
>> >>
>> >> Hej,
>> >>
>> >> I am configuring Nutch just for crawling websites on several machines
>> >> (currently I want to test with only one).
>> >> Building Nutch with ant was successful.
>> >>
>> >> bin/hadoop namenode -format
>> >> bin/start-all.sh
>> >>
>> >> They show correct logs.
>> >>
>> >> bin/hadoop dfs -put urls urls
>> >> bin/hadoop dfs -ls
>> >>
>> >> They show the urls directory correctly.
>> >>
>> >> But when I launch the crawl, the fetcher starts but does not show any
>> >> parsing messages, and it stops at the second depth. The crawl-urlfilter
>> >> and nutch-default files are configured correctly, because they work
>> >> fine using the local filesystem (instead of HDFS). I guess it is
>> >> because nutch-site is empty.
>> >>
>> >> What should its content be?
>> >>
>> >> core-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>
>> >> <property>
>> >>   <name>fs.default.name</name>
>> >>   <value>hdfs://localhost:9000/</value>
>> >>   <description>
>> >>   The name of the default file system. Either the literal string
>> >>   "local" or a host:port for NDFS.
>> >>   </description>
>> >> </property>
>> >>
>> >> </configuration>
>> >>
>> >> ---------------------------------------
>> >>
>> >> hdfs-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>
>> >> <property>
>> >>   <name>dfs.name.dir</name>
>> >>   <value>/root/filesystem/name</value>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>dfs.data.dir</name>
>> >>   <value>/root/filesystem/data</value>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>dfs.replication</name>
>> >>   <value>1</value>
>> >> </property>
>> >>
>> >> </configuration>
>> >>
>> >> ---------------------------------------
>> >>
>> >> mapred-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>
>> >> <property>
>> >>   <name>mapred.job.tracker</name>
>> >>   <value>hdfs://localhost:9001/</value>
>> >>   <description>
>> >>   The host and port that the MapReduce job tracker runs at. If
>> >>   "local", then jobs are run in-process as a single map and
>> >>   reduce task.
>> >>   </description>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.map.tasks</name>
>> >>   <value>2</value>
>> >>   <description>
>> >>   Define mapred.map.tasks to be the number of slave hosts.
>> >>   </description>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.reduce.tasks</name>
>> >>   <value>2</value>
>> >>   <description>
>> >>   Define mapred.reduce.tasks to be the number of slave hosts.
>> >>   </description>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.system.dir</name>
>> >>   <value>/root/filesystem/mapreduce/system</value>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.local.dir</name>
>> >>   <value>/root/filesystem/mapreduce/local</value>
>> >> </property>
>> >>
>> >> </configuration>
>> >
>> >
>> > --
>> > -MilleBii-
>> >
>>
>
>
> --
> -MilleBii-
>
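One more thing I will double-check on my side: in the pseudo-distributed examples I have seen, mapred.job.tracker is usually given as a plain host:port rather than an hdfs:// URL, i.e. something like the snippet below instead of the value in my mapred-site.xml above (just a guess, I don't know yet whether it matters here):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>
    Host and port of the MapReduce job tracker, written as plain
    host:port (no hdfs:// scheme).
  </description>
</property>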