fatal error regarding http.robots.agents

You should check or configure the following properties in nutch-site.xml properly:

  <name>http.max.delays</name>
  <name>http.robots.agents</name>
  <name>http.agent.name</name>
  <name>http.agent.description</name>
  <name>http.agent.url</name>
  <name>http.agent.email</name>
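For example, a minimal set of overrides along those lines, placed inside the <configuration> element of nutch-site.xml, might look like the sketch below. The agent name, description, URL, email and delay value are only placeholders to replace with your own; http.robots.agents is normally the value of http.agent.name followed by the default "*":

  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler based on Nutch 0.9</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[email protected]</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchCrawler,*</value>
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>

Leaving http.agent.name empty, or not listing it first in http.robots.agents, is one common cause of the kind of fatal error about http.robots.agents mentioned in the quoted messages below.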
On Tue, Aug 5, 2008 at 8:56 AM, Alexander Aristov <[EMAIL PROTECTED]> wrote:
> Do you have a proxy in your network?
>
> 2008/8/5 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> The only thing I modified in crawl-urlfilter.txt is adding the line
>>
>> +^http://([a-z0-9]*\.)*wikipedia.org/
>>
>> I also commented out the previous line, like the following:
>>
>> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>
>> I also tried many other URLs, but each time it returned the same type of
>> result.
>>
>> Another important thing: I am trying Nutch on Ubuntu now, which is showing
>> the problem, but when I used it on Fedora Core 8 it worked fine.
>>
>> I was previously trying pseudo-distributed mode, but after having the
>> problem I tried stand-alone mode yesterday and it returned the same type
>> of result.
>>
>> When I look at hadoop.log it indicates that lots of pages were being
>> fetched with lots of errors: a fatal error regarding http.robots.agents,
>> parser not found, java.net.SocketTimeoutException, etc.
>>
>> Please tell me where I am wrong.
>>
>> regards,
>> --monirul
>>
>> ----- Original Message ----
>> From: Tristan Buckner <[EMAIL PROTECTED]>
>> To: [email protected]
>> Sent: Tuesday, August 5, 2008 12:46:21 AM
>> Subject: Re: problem in crawling
>>
>> Are your URLs of the form
>> http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo
>> ? If so, the robots file excludes these.
>>
>> Also, is there a line above that line for which the URLs fail?
>>
>> On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:
>>
>> > Hi,
>> >
>> > Thanks for your reply. In my crawl-urlfilter.txt I included the
>> > following line
>> >
>> > +^http://([a-z0-9]*\.)*wikipedia.org/ as I want to crawl wiki.
>> >
>> > My urls/urllist.txt contains URLs of Wikipedia like the one below:
>> >
>> > http://en.wikipedia.org/
>> >
>> > I used Nutch 0.9 previously on Fedora 8. It worked fine.
>> >
>> > So please tell me if you have any idea.
>> >
>> > best regards,
>> >
>> > --monirul
>> >
>> > ----- Original Message ----
>> > From: Alexander Aristov <[EMAIL PROTECTED]>
>> > To: [email protected]
>> > Sent: Monday, August 4, 2008 1:28:58 PM
>> > Subject: Re: problem in crawling
>> >
>> > Hi
>> >
>> > What is in your crawl-urlfilter.txt file?
>> >
>> > Did you include your URLs in the filter? By default all URLs are
>> > excluded.
>> >
>> > Alexander
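For reference, the crawl-urlfilter.txt change described above would leave the relevant part of the file looking roughly like this; the commented line is the stock template entry and the final catch-all rule comes with the default file, so the exact surrounding rules may differ slightly by version:

  # accept hosts in wikipedia.org instead of the template's placeholder domain
  #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  +^http://([a-z0-9]*\.)*wikipedia.org/

  # skip everything else
  -.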
>> >
>> > 2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
>> >
>> >> Hi,
>> >>
>> >> I am using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed
>> >> mode. When I execute the following command
>> >>
>> >> bin/nutch crawl urls -dir crawled -depth 10
>> >>
>> >> this is what I got in the hadoop log:
>> >>
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
>> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
>> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> >> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>> >> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
>> >> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
>> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
>> >> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> >> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
>> >> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
>> >> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
>> >> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
>> >> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> >> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> >> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
>> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
>> >> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> >> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
>> >> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
>> >> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
>> >> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
>> >> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> >> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> >> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
>> >> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
>> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
>> >> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> >> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
>> >> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
>> >> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
>> >> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
>> >>
>> >> What I found executing the command:
>> >>
>> >> bin/hadoop dfs -ls
>> >> Found 2 items
>> >> /user/nutch/crawled <dir>
>> >> /user/nutch/urls <dir>
>> >>
>> >> $ bin/hadoop dfs -ls crawled
>> >> Found 2 items
>> >> /user/nutch/crawled/crawldb <dir>
>> >> /user/nutch/crawled/segments <dir>
>> >>
>> >> Where are linkdb, indexes, and index? So please tell me what the error
>> >> might be.
>> >>
>> >> Here is my hadoop-site.xml:
>> >>
>> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >>
>> >> <!-- Put site-specific property overrides in this file. -->
>> >>
>> >> <configuration>
>> >>   <property>
>> >>     <name>fs.default.name</name>
>> >>     <value>sysmonitor:9000</value>
>> >>     <description>
>> >>       The name of the default file system. Either the literal string
>> >>       "local" or a host:port for NDFS.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.job.tracker</name>
>> >>     <value>sysmonitor:9001</value>
>> >>     <description>
>> >>       The host and port that the MapReduce job tracker runs at. If
>> >>       "local", then jobs are run in-process as a single map and
>> >>       reduce task.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.tasktracker.tasks.maximum</name>
>> >>     <value>2</value>
>> >>     <description>
>> >>       The maximum number of tasks that will be run simultaneously by
>> >>       a task tracker. This should be adjusted according to the heap size
>> >>       per task, the amount of RAM available, and CPU consumption of
>> >>       each task.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.child.java.opts</name>
>> >>     <value>-Xmx200m</value>
>> >>     <description>
>> >>       You can specify other Java options for each map or reduce task
>> >>       here, but most likely you will want to adjust the heap size.
>> >>     </description>
>> >>   </property>
>> >>   <property>
>> >>     <name>dfs.name.dir</name>
>> >>     <value>/nutch/filesystem/name</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>dfs.data.dir</name>
>> >>     <value>/nutch/filesystem/data</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.system.dir</name>
>> >>     <value>/nutch/filesystem/mapreduce/system</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>mapred.local.dir</name>
>> >>     <value>/nutch/filesystem/mapreduce/local</value>
>> >>   </property>
>> >>   <property>
>> >>     <name>dfs.replication</name>
>> >>     <value>1</value>
>> >>   </property>
>> >> </configuration>
>> >>
>> >> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10,
>> >> but it seems very little crawling was done.
>> >>
>> >> regards
>> >> --monirul
>> >>
>> >
>> > --
>> > Best Regards
>> > Alexander Aristov
>> >
>>
>
> --
> Best Regards
> Alexander Aristov
>
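Regarding the question above about the missing linkdb, indexes and index directories: with the one-step crawl command in Nutch 0.9 they are normally only created after all the generate/fetch/update rounds have finished, so a crawl that is still running (or was stopped early) will show only crawldb and segments. As a rough sketch, assuming the crawled directory from the log above, the remaining steps can also be run by hand (check the usage output of bin/nutch for the exact arguments in your version; in pseudo-distributed mode the segment paths would have to be listed explicitly instead of using a shell glob):

  # build the link database from the fetched segments
  bin/nutch invertlinks crawled/linkdb -dir crawled/segments

  # index the segments, then dedup and merge into a single index
  bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/*
  bin/nutch dedup crawled/indexes
  bin/nutch merge crawled/index crawled/indexes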
