Thanks "Håvard"
now its working fine
Rgds
Mohan Lal
On 9/29/06, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
see:
http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E
Before you start Tomcat, remember to change the path of your search
directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
directory.
This is an example of my configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>LSearchDev01:9000</value>
</property>
<property>
<name>searcher.dir</name>
<value>/user/root/crawld</value>
</property>
</configuration>
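If your crawl output lives somewhere else, point searcher.dir at that crawl directory instead; judging from the log below, a crawl run as root into crawl.1 would presumably need something like:
<property>
<name>searcher.dir</name>
<value>/user/root/crawl.1</value>
</property>
You can check that the directory really exists on the DFS with:
bin/hadoop dfs -ls /user/root/crawl.1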
Mohan Lal wrote:
> Hi,
>
> thanks for your valuable information, I have solved that problem. After
> that I am facing another problem:
> I have 2 slaves
> 1) MAC1
> 2) MAC2
>
> but the job was running on MAC1 itself, and it takes a long time to
> finish the crawling process.
> How can I assign the job to the distributed machines I specified in the
> slaves file?
>
> But my crawling process finished successfully. Also, how can I specify
> the searcher dir in the nutch-site.xml file?
>
> <property>
> <name>searcher.dir</name>
> <value> ? </value>
> </property>
>
> Please help me.
>
>
> I have done the following setup:
>
> [EMAIL PROTECTED] ~]# cd /home/lucene/nutch-0.8.1/
> [EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop namenode -format
> Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
> Formatted /tmp/hadoop/dfs/name
> [EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
> starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
> fpo: ssh: fpo: Name or service not known
> localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
> starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
> fpo: ssh: fpo: Name or service not known
> localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
> [EMAIL PROTECTED] nutch-0.8.1]# bin/stop-all.sh
> stopping jobtracker
> localhost: stopping tasktracker
> sonu: no tasktracker to stop
> stopping namenode
> sonu: no datanode to stop
> localhost: stopping datanode
> [EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
> starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
> sonu: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-sonu.qburst.local.out
> localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
> starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
> localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
> sonu: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-sonu.qburst.local.out
> [EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop dfs -put urls urls
> [EMAIL PROTECTED] nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2 -topN 10
> crawl started in: crawl.1
> rootUrlDir = urls
> threads = 100
> depth = 2
> topN = 10
> Injector: starting
> Injector: crawlDb: crawl.1/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: starting
> Generator: segment: crawl.1/segments/20060929120038
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl.1/segments/20060929120038
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.1/crawldb
> CrawlDb update: segment: crawl.1/segments/20060929120038
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> Generator: segment: crawl.1/segments/20060929120235
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl.1/segments/20060929120235
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.1/crawldb
> CrawlDb update: segment: crawl.1/segments/20060929120235
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.1/linkdb
> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl.1/linkdb
> Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
> Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl.1/indexes
> Dedup: done
> Adding /user/root/crawl.1/indexes/part-00000
> Adding /user/root/crawl.1/indexes/part-00001
> crawl finished: crawl.1
>
>
> Thanks and Regards
> Mohanlal
>
>
> "H?vard W. Kongsg?rd"-2 wrote:
>
>> Does /user/root/urls exist? Have you uploaded the urls folder to your DFS
>> system?
>>
>> bin/hadoop dfs -mkdir urls
>> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
>>
>> or
>>
>> bin/hadoop dfs -put <localsrc> <dst>
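>>
>> To double-check that the upload worked, you can list the folder on the DFS,
>> for example:
>>
>> bin/hadoop dfs -ls urls
>>
>> (assuming it ends up under your DFS home directory, e.g. /user/root/urls)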
>>
>>
>> Mohan Lal wrote:
>>
>>> Hi all,
>>>
>>> While I am trying to crawl using distributed machines, it throws an error:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
>>> crawl started in: crawl
>>> rootUrlDir = urls
>>> threads = 10
>>> depth = 10
>>> topN = 50
>>> Injector: starting
>>> Injector: crawlDb: crawl/crawldb
>>> Injector: urlDir: urls
>>> Injector: Converting injected urls to crawl db entries.
>>> Exception in thread "main" java.io.IOException: Input directory
>>> /user/root/urls in localhost:9000 is invalid.
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>> at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>>
>>> What's wrong with my configuration? Please help me.
>>>
>>>
>>> Regards
>>> Mohan Lal
>>>
>>>
>>
>>
>
>