I mentioned that the Fetcher works correctly with the local fs to show that
the problem is not in the code.
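
For reference, the crawl itself was launched with the standard crawl command,
more or less like this (the exact command line is a reconstruction; the
parameters match what the Crawl log reports below: urls as the seed dir,
"crawled" as the output dir, 10 threads, depth 2):

   bin/nutch crawl urls -dir crawled -threads 10 -depth 2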

The logs I got are the following:

hadoop-root-datanode...
2010-01-20 18:50:41,748 INFO  mortbay.log - Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2010-01-20 18:50:41,804 INFO  mortbay.log - jetty-6.1.14
2010-01-20 18:50:42,046 INFO  mortbay.log - Started
SelectChannelConnector@0.0.0.0:50075

hadoop-root-jobtracker
2010-01-20 18:50:44,032 INFO  mortbay.log - Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2010-01-20 18:50:44,149 INFO  mortbay.log - jetty-6.1.14
2010-01-20 18:50:48,910 INFO  mortbay.log - Started
SelectChannelConnector@0.0.0.0:50030

hadoop-root-namenode
2010-01-20 18:50:40,529 INFO  mortbay.log - Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2010-01-20 18:50:40,587 INFO  mortbay.log - jetty-6.1.14
2010-01-20 18:50:40,866 INFO  mortbay.log - Started
SelectChannelConnector@0.0.0.0:50070

hadoop-root-secondarynamenode
2010-01-20 18:50:43,007 INFO  mortbay.log - Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2010-01-20 18:50:43,063 INFO  mortbay.log - jetty-6.1.14
2010-01-20 18:50:52,091 INFO  mortbay.log - Started
SelectChannelConnector@0.0.0.0:50090
2010-01-20 18:50:52,092 WARN  namenode.SecondaryNameNode - Checkpoint Period  
:3600 secs (60 min)
2010-01-20 18:50:52,092 WARN  namenode.SecondaryNameNode - Log Size Trigger   
:67108864 bytes (65536 KB)
2010-01-20 18:55:52,499 WARN  namenode.SecondaryNameNode - Checkpoint done.
New Image Size: 10329

hadoop-root-tasktracker
2010-01-20 18:50:45,196 INFO  mortbay.log - Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2010-01-20 18:50:45,299 INFO  mortbay.log - jetty-6.1.14
2010-01-20 18:50:54,703 INFO  mortbay.log - Started
SelectChannelConnector@0.0.0.0:50060
2010-01-20 18:50:54,817 WARN  mapred.TaskTracker - TaskTracker's
totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.


hadoop (resumed)
2010-01-20 18:50:47,055 INFO  crawl.Crawl - crawl started in: crawled
2010-01-20 18:50:47,055 INFO  crawl.Crawl - rootUrlDir = urls
2010-01-20 18:50:47,055 INFO  crawl.Crawl - threads = 10
2010-01-20 18:50:47,055 INFO  crawl.Crawl - depth = 2
2010-01-20 18:51:21,767 INFO  crawl.Injector - Injector: starting
2010-01-20 18:51:21,767 INFO  crawl.Injector - Injector: crawlDb:
crawled/crawldb
2010-01-20 18:51:21,767 INFO  crawl.Injector - Injector: urlDir: urls
2010-01-20 18:51:21,794 INFO  crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2010-01-20 18:51:56,353 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.
2010-01-20 18:52:34,852 INFO  crawl.Injector - Injector: done
2010-01-20 18:52:35,888 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2010-01-20 18:52:35,888 INFO  crawl.Generator - Generator: starting
2010-01-20 18:52:35,888 INFO  crawl.Generator - Generator: segment:
crawled/segments/20100120185235
2010-01-20 18:52:35,889 INFO  crawl.Generator - Generator: filtering: true
2010-01-20 18:53:54,795 INFO  http.Http - http.proxy.host = null
2010-01-20 18:53:54,795 INFO  http.Http - http.proxy.port = 8080
2010-01-20 18:53:54,795 INFO  http.Http - http.timeout = 10000
2010-01-20 18:53:54,795 INFO  http.Http - http.content.limit = 65536
2010-01-20 18:53:54,795 INFO  http.Http - http.agent = cierzo/Nutch-1.0
2010-01-20 18:53:54,795 INFO  http.Http - protocol.plugin.check.blocking =
false
2010-01-20 18:53:54,795 INFO  http.Http - protocol.plugin.check.robots =
false
2010-01-20 18:53:55,626 INFO  fetcher.Fetcher - fetch of
http://aldea-irreductible.blogspot.com/ failed with:
java.lang.NullPointerException
2010-01-20 18:53:55,828 INFO  fetcher.Fetcher - -activeThreads=0
2010-01-20 18:53:57,854 INFO  fetcher.Fetcher - Fetcher: threads: 10
2010-01-20 18:53:57,875 INFO  fetcher.Fetcher - QueueFeeder finished: total
0 records.
2010-01-20 18:54:11,824 INFO  fetcher.Fetcher - Fetcher: done
2010-01-20 18:54:11,868 INFO  crawl.CrawlDb - CrawlDb update: starting
2010-01-20 18:54:11,868 INFO  crawl.CrawlDb - CrawlDb update: db:
crawled/crawldb
2010-01-20 18:54:11,868 INFO  crawl.CrawlDb - CrawlDb update: segments:
[crawled/segments/20100120185235]
2010-01-20 18:54:11,869 INFO  crawl.CrawlDb - CrawlDb update: additions
allowed: true
2010-01-20 18:54:11,869 INFO  crawl.CrawlDb - CrawlDb update: URL
normalizing: true
2010-01-20 18:54:11,869 INFO  crawl.CrawlDb - CrawlDb update: URL filtering:
true
2010-01-20 18:54:11,871 INFO  crawl.CrawlDb - CrawlDb update: Merging
segment data into db.
2010-01-20 18:54:48,640 INFO  crawl.CrawlDb - CrawlDb update: done
2010-01-20 18:54:49,865 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2010-01-20 18:54:49,865 INFO  crawl.Generator - Generator: starting
2010-01-20 18:54:49,865 INFO  crawl.Generator - Generator: segment:
crawled/segments/20100120185449
2010-01-20 18:54:49,865 INFO  crawl.Generator - Generator: filtering: true
2010-01-20 18:55:01,361 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:01,362 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2010-01-20 18:55:01,362 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2010-01-20 18:55:01,419 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:01,420 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2010-01-20 18:55:01,420 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2010-01-20 18:55:10,036 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:10,036 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2010-01-20 18:55:10,036 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2010-01-20 18:55:13,021 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-01-20 18:55:13,022 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2010-01-20 18:55:13,022 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2010-01-20 18:55:21,335 WARN  crawl.Generator - Generator: 0 records
selected for fetching, exiting ...
2010-01-20 18:55:21,340 INFO  crawl.Crawl - Stopping at depth=1 - no more
URLs to fetch.
2010-01-20 18:55:21,340 INFO  crawl.Crawl - crawl finished: crawled
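
Since the Generator selects 0 records on the second pass, I can also
double-check what actually ended up in the crawldb on HDFS with the stock
Hadoop/Nutch commands (paths as in the log above), e.g.:

   bin/hadoop dfs -ls crawled/crawldb
   bin/nutch readdb crawled/crawldb -stats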




MilleBii wrote:
> 
> Well that does not work this way really.
> 
> If you want to use it, make it run on one node (pseudo-distributed mode)
> and then deploy.
> If you have it running in pseudo-distributed mode, it won't use the local
> filesystem; this is why I don't understand your remarks in the initial
> mail.
> 
> Nutch logs are in NUTCH_HOME/logs; look for the hadoop file, it will tell
> you more or less what is happening.
> 
> 
> 
> 2010/1/20 Santiago Pérez <elara...@gmail.com>
> 
>>
>> I launch HDFS because I want to make it work on one computer and, once it
>> works, launch it on several machines as a distributed version.
>>
>> Which logs do you need to check?
>>
>>
>> MilleBii wrote:
>> >
>> > Why do you launch hdfs if you don't want to use it?
>> >
>> > What are the logs saying? All fetched urls are usually logged, but
>> > nothing is displayed.
>> >
>> > 2010/1/20, Santiago Pérez <elara...@gmail.com>:
>> >>
>> >> Hej,
>> >>
>> >> I am configuring Nutch just for crawling the web on several machines
>> >> (currently I want to test with only one).
>> >> Building Nutch with ant was successful.
>> >>
>> >>    bin/hadoop namenode -format
>> >>    bin/start-all.sh
>> >>
>> >> They show correct logs
>> >>
>> >>   bin/hadoop dfs -put urls urls
>> >>   bin/hadoop dfs -ls
>> >>
>> >> They show the urls directory correctly
>> >>
>> >> But when I launch it, the fetcher starts but does not show any parsing
>> >> messages, and it stops at the second depth. The crawl-urlfilter and
>> >> nutch-default files are well configured, because they work great using
>> >> the local filesystem (instead of hdfs). I guess it is because nutch-site
>> >> is empty.
>> >>
>> >> What should be its content?
>> >>
>> >> core-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>
>> >> <property>
>> >>   <name>fs.default.name</name>
>> >>   <value>hdfs://localhost:9000/</value>
>> >>   <description>
>> >>     The name of the default file system. Either the literal string
>> >>     "local" or a host:port for NDFS.
>> >>   </description>
>> >> </property>
>> >>
>> >> </configuration>
>> >>
>> >>
>> >> ---------------------------------------
>> >>
>> >> hdfs-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>
>> >> <property>
>> >>   <name>dfs.name.dir</name>
>> >>   <value>/root/filesystem/name</value>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>dfs.data.dir</name>
>> >>   <value>/root/filesystem/data</value>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>dfs.replication</name>
>> >>   <value>1</value>
>> >> </property>
>> >>
>> >> </configuration>
>> >>
>> >>
>> >> ---------------------------------------
>> >>
>> >>
>> >> mapred-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>
>> >> <property>
>> >>   <name>mapred.job.tracker</name>
>> >>   <value>hdfs://localhost:9001/</value>
>> >>   <description>
>> >>     The host and port that the MapReduce job tracker runs at. If
>> >>     "local", then jobs are run in-process as a single map and
>> >>     reduce task.
>> >>   </description>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.map.tasks</name>
>> >>   <value>2</value>
>> >>   <description>
>> >>     define mapred.map tasks to be number of slave hosts
>> >>   </description>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.reduce.tasks</name>
>> >>   <value>2</value>
>> >>   <description>
>> >>     define mapred.reduce tasks to be number of slave hosts
>> >>   </description>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.system.dir</name>
>> >>   <value>/root/filesystem/mapreduce/system</value>
>> >> </property>
>> >>
>> >> <property>
>> >>   <name>mapred.local.dir</name>
>> >>   <value>/root/filesystem/mapreduce/local</value>
>> >> </property>
>> >>
>> >> </configuration>
>> >>
>> >>
>> >
>> >
>> > --
>> > -MilleBii-
>> >
>> >
>>
>>
>>
> 
> 
> -- 
> -MilleBii-
> 
> 

