fatal error regarding http.robots.agents

You should check or configure the following properties in nutch-site.xml properly:

  <name>http.max.delays</name>
  <name>http.robots.agents</name>
  <name>http.agent.name</name>
  <name>http.agent.description</name>
  <name>http.agent.url</name>
  <name>http.agent.email</name>
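For example, a minimal set of overrides along those lines, placed inside the <configuration> element of nutch-site.xml, might look like the sketch below. The agent name, description, URL, email and delay value are only placeholders to replace with your own; http.robots.agents is normally the value of http.agent.name followed by the default "*":

  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler based on Nutch 0.9</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[email protected]</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchCrawler,*</value>
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>

Leaving http.agent.name empty, or not listing it first in http.robots.agents, is one common cause of the kind of fatal error about http.robots.agents mentioned in the quoted messages below.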
On Tue, Aug 5, 2008 at 8:56 AM, Alexander Aristov <[EMAIL PROTECTED]> wrote:
> Do you have a proxy in your network?
>
> 2008/8/5 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> The only thing I modified in crawl-urlfilter.txt is adding the line
>>
>> +^http://([a-z0-9]*\.)*wikipedia.org/
>>
>> I also commented out the previous line, like the following:
>>
>> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>
>> I also tried many other URLs, but each time it returned the same type of
>> result.
>>
>> Another important thing: I am trying Nutch on Ubuntu now, which is showing
>> the problem, but when I used it on Fedora Core 8 it worked fine.
>>
>> I was previously trying pseudo-distributed mode, but after having the
>> problem I tried stand-alone mode yesterday and it returned the same type
>> of result.
>>
>> When I look at hadoop.log it indicates that lots of pages were being
>> fetched with lots of errors: a fatal error regarding http.robots.agents,
>> parser not found, java.net.SocketTimeoutException, etc.
>>
>> Please tell me where I am wrong.
>>
>> regards,
>> --monirul
>>
>> ----- Original Message ----
>> From: Tristan Buckner <[EMAIL PROTECTED]>
>> To: [email protected]
>> Sent: Tuesday, August 5, 2008 12:46:21 AM
>> Subject: Re: problem in crawling
>>
>> Are your URLs of the form
>> http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo
>> ? If so, the robots file excludes these.
>>
>> Also, is there a line above that line for which the URLs fail?
>>
>> On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:
>>
>> > Hi,
>> >
>> > Thanks for your reply. In my crawl-urlfilter.txt I included the
>> > following line
>> >
>> > +^http://([a-z0-9]*\.)*wikipedia.org/ as I want to crawl wiki.
>> >
>> > My urls/urllist.txt contains URLs of Wikipedia like the one below:
>> >
>> > http://en.wikipedia.org/
>> >
>> > I used Nutch 0.9 previously on Fedora 8. It worked fine.
>> >
>> > So please tell me if you have any idea.
>> >
>> > best regards,
>> >
>> > --monirul
>> >
>> > ----- Original Message ----
>> > From: Alexander Aristov <[EMAIL PROTECTED]>
>> > To: [email protected]
>> > Sent: Monday, August 4, 2008 1:28:58 PM
>> > Subject: Re: problem in crawling
>> >
>> > Hi
>> >
>> > What is in your crawl-urlfilter.txt file?
>> >
>> > Did you include your URLs in the filter? By default all URLs are
>> > excluded.
>> >
>> > Alexander
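For reference, the crawl-urlfilter.txt change described above would leave the relevant part of the file looking roughly like this; the commented line is the stock template entry and the final catch-all rule comes with the default file, so the exact surrounding rules may differ slightly by version:

  # accept hosts in wikipedia.org instead of the template's placeholder domain
  #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  +^http://([a-z0-9]*\.)*wikipedia.org/

  # skip everything else
  -.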
>> >
>> > 2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
>> >
>> >> Hi,
>> >>
>> >> I am using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed
>> >> mode. When I execute the following command
>> >>
>> >> bin/nutch crawl urls -dir crawled -depth 10
>> >>
>> >> this is what I got in the hadoop log:
>> >>
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> >> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>> >> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
>> >> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
>> >> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> >> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
>> >> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
>> >> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
>> >> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
>> >> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> >> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> >> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
>> >> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> >> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
>> >> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
>> >> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
>> >> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
>> >> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> >> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> >> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
>> >> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
>> >> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> >> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
>> >> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
>> >> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
>> >> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
>> >>
>> >> What I found executing the command:
>> >>
>> >> bin/hadoop dfs -ls
>> >> Found 2 items
>> >> /user/nutch/crawled <dir>
>> >> /user/nutch/urls <dir>
>> >>
>> >> $ bin/hadoop dfs -ls crawled
>> >> Found 2 items
>> >> /user/nutch/crawled/crawldb <dir>
>> >> /user/nutch/crawled/segments <dir>
>> >>
>> >> Where are linkdb, indexes, and index? So please tell me what the error
>> >> might be.
>> >>
>> >> Here is my hadoop-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>   <property>
>> >>     <name>fs.default.name</name>
>> >>     <value>sysmonitor:9000</value>
>> >>     <description>
>> >>       The name of the default file system. Either the literal string
>> >>       "local" or a host:port for NDFS.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.job.tracker</name>
>> >>     <value>sysmonitor:9001</value>
>> >>     <description>
>> >>       The host and port that the MapReduce job tracker runs at. If
>> >>       "local", then jobs are run in-process as a single map and
>> >>       reduce task.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.tasktracker.tasks.maximum</name>
>> >>     <value>2</value>
>> >>     <description>
>> >>       The maximum number of tasks that will be run simultaneously by
>> >>       a task tracker. This should be adjusted according to the heap size
>> >>       per task, the amount of RAM available, and CPU consumption of
>> >>       each task.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.child.java.opts</name>
>> >>     <value>-Xmx200m</value>
>> >>     <description>
>> >>       You can specify other Java options for each map or reduce task
>> >>       here, but most likely you will want to adjust the heap size.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>dfs.name.dir</name>
>> >>     <value>/nutch/filesystem/name</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>dfs.data.dir</name>
>> >>     <value>/nutch/filesystem/data</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.system.dir</name>
>> >>     <value>/nutch/filesystem/mapreduce/system</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.local.dir</name>
>> >>     <value>/nutch/filesystem/mapreduce/local</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>dfs.replication</name>
>> >>     <value>1</value>
>> >>   </property>
>> >> </configuration>
>> >>
>> >> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10,
>> >> but it seems very little crawling was done.
>> >>
>> >> regards
>> >> --monirul
>> >>
>> >
>> > --
>> > Best Regards
>> > Alexander Aristov
>> >
>>
>
> --
> Best Regards
> Alexander Aristov
>
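Regarding the question above about the missing linkdb, indexes and index directories: with the one-step crawl command in Nutch 0.9 they are normally only created after all the generate/fetch/update rounds have finished, so a crawl that is still running (or was stopped early) will show only crawldb and segments. As a rough sketch, assuming the crawled directory from the log above, the remaining steps can also be run by hand (check the usage output of bin/nutch for the exact arguments in your version; in pseudo-distributed mode the segment paths would have to be listed explicitly instead of using a shell glob):

  # build the link database from the fetched segments
  bin/nutch invertlinks crawled/linkdb -dir crawled/segments

  # index the segments, then dedup and merge into a single index
  bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/*
  bin/nutch dedup crawled/indexes
  bin/nutch merge crawled/index crawled/indexes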
