On 16.03.2006 at 12:50, Grégory Debord wrote:

Hi all,

I would like to implement a distributed crawl which would be something like
this:

The Hadoop project is used for working with a DFS. In Hadoop there is one master (namenode, jobtracker) and n slaves (datanodes and tasktrackers).

start on the master:
bin/hadoop-daemon.sh start namenode

start on the slaves:
bin/hadoop-daemon.sh start datanode

start on the master:
bin/hadoop-daemon.sh start jobtracker

start on the slaves:
bin/hadoop-daemon.sh start tasktracker
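
For the daemons above to find each other, every node needs to know where the namenode and the jobtracker run. With the Hadoop versions from around this time that normally goes into conf/hadoop-site.xml on all machines; a minimal sketch (the hostname "master" and the ports are only placeholders, and the exact property names may differ slightly in your version):

<?xml version="1.0"?>
<configuration>
  <!-- address of the namenode (DFS master) -->
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <!-- address of the jobtracker (MapReduce master) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

If your version ships the bin/start-all.sh helper, it reads the slave hostnames from conf/slaves (one per line) instead of you starting each daemon by hand as above.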



- A main machine that would store the whole Nutch database (1)

The datanodes store all data.
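
To check what actually lands in the DFS you can use the dfs shell from any node; a small example, assuming the shell in your version supports -put and -ls (the paths are placeholders):

bin/hadoop dfs -put urls urls    # copy a local directory of seed URLs into the DFS
bin/hadoop dfs -ls               # list the top-level DFS directory held by the datanodes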

- n machines that would only be used for fetching (because I use specific
computations in the fetcher process which are time-consuming). (2)

The tasktrackers do that.
start on the master: bin/nutch fetch segments/SEGMENT_NAME
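
For completeness, a sketch of how a segment gets created before the fetch step, run from the master (the directory names crawldb, urls and segments are just the usual conventions, not required names, and the options depend on your Nutch version):

bin/nutch inject crawldb urls            # seed the crawl db from the url list in the DFS
bin/nutch generate crawldb segments      # write a new segment under segments/
bin/nutch fetch segments/SEGMENT_NAME    # the fetch map tasks then run on the n tasktracker machines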


After finishing fetching, each of these n machines (2) would directly update
the database on the main machine (1).


This is also done by the tasktrackers.
start on the master: bin/nutch updatedb crawldb segments/SEGMENT_NAME

Here are my questions:

1 - After reading documentation on NDFS, I'm not sure it can be used in that
way. If so, is there a better way to do it?

No. ;-)


2 - If it's possible, what happens if two machines (2) try to updatedb on the
main machine (1) at the same time?

You should not start the updatedb process twice at the same time with different segments.
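
In other words, when several of the n fetch machines have finished their segments, run the updates from the master one after the other; a sketch (segment names are placeholders):

bin/nutch updatedb crawldb segments/SEGMENT_1
# wait for this job to finish, then:
bin/nutch updatedb crawldb segments/SEGMENT_2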

hope this helps
Marko


