On 16.03.2006 at 12:50, Grégory Debord wrote:

Hi all,

I would like to implement a distributed crawl which would be something like
this:

The Hadoop project is used for working with a DFS. In Hadoop there is one master (namenode, jobtracker) and n slaves (datanodes and tasktrackers).

start on the master:
bin/hadoop-daemon.sh start namenode

start on the slaves:
bin/hadoop-daemon.sh start datanode

start on the master:
bin/hadoop-daemon.sh start jobtracker

start on the slaves:
bin/hadoop-daemon.sh start tasktracker
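
For the daemons above to find each other, every node needs to know where the namenode and the jobtracker run. With the Hadoop versions from around this time that normally goes into conf/hadoop-site.xml on all machines; a minimal sketch (the hostname "master" and the ports are only placeholders, and the exact property names may differ slightly in your version):

<?xml version="1.0"?>
<configuration>
  <!-- address of the namenode (DFS master) -->
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <!-- address of the jobtracker (MapReduce master) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

If your version ships the bin/start-all.sh helper, it reads the slave hostnames from conf/slaves (one per line) instead of you starting each daemon by hand as above.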



- A main machine that would store the whole Nutch database (1)

The datanodes store all data.
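
To check what actually lands in the DFS you can use the dfs shell from any node; a small example, assuming the shell in your version supports -put and -ls (the paths are placeholders):

bin/hadoop dfs -put urls urls    # copy a local directory of seed URLs into the DFS
bin/hadoop dfs -ls               # list the top-level DFS directory held by the datanodes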

- n machines that would only be used for fetching (because I use specific
computations in the fetcher process which are time-consuming). (2)

The tasktrackers do that.
start on the master: bin/nutch fetch segments/SEGMENT_NAME
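
For completeness, a sketch of how a segment gets created before the fetch step, run from the master (the directory names crawldb, urls and segments are just the usual conventions, not required names, and the options depend on your Nutch version):

bin/nutch inject crawldb urls            # seed the crawl db from the url list in the DFS
bin/nutch generate crawldb segments      # write a new segment under segments/
bin/nutch fetch segments/SEGMENT_NAME    # the fetch map tasks then run on the n tasktracker machines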


After finishing fetching, each of these n machines (2) would directly update
the database on the main machine (1).


This is also done by the tasktrackers.
start on the master: bin/nutch updatedb crawldb segments/SEGMENT_NAME

Here are my questions:

1 - After reading documentation on NDFS, I'm not sure it can be used in that
way. If so, is there a better way to do it?

No. ;-)


2 - If it's possible, what happens if two machines (2) try to updatedb on the
main machine (1) at the same time?

You should not start the updatedb process twice at the same time with different segments.
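
In other words, when several of the n fetch machines have finished their segments, run the updates from the master one after the other; a sketch (segment names are placeholders):

bin/nutch updatedb crawldb segments/SEGMENT_1
# wait for this job to finish, then:
bin/nutch updatedb crawldb segments/SEGMENT_2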

hope this helps
Marko


