On 16.03.2006, at 12:50, Grégory Debord wrote:
> Hi all,
> I would like to implement a distributed crawl which would be
> something like this:
The Hadoop project is used for working with a DFS. In Hadoop there is
one master (namenode, jobtracker) and n slaves (datanodes and
tasktrackers).
start on the master:
bin/hadoop-daemon.sh start namenode
start on the slaves:
bin/hadoop-daemon.sh start datanode
start on the master:
bin/hadoop-daemon.sh start jobtracker
start on the slaves:
bin/hadoop-daemon.sh start tasktracker
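For example, the configuration behind those commands might look roughly
like this; the host names master/slave1/slave2 are only placeholders,
and the property names are the Hadoop 0.x-era ones (fs.default.name,
mapred.job.tracker), so check them against your conf/hadoop-default.xml:

# on every machine: point DFS and MapReduce at the master
# ("master", "slave1", "slave2" are hypothetical host names)
cat > conf/hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
EOF

# on the master: list the slave hosts, one per line, so the helper
# scripts know where the datanodes and tasktrackers should run
cat > conf/slaves <<'EOF'
slave1
slave2
EOF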
> - A main machine that would store the whole Nutch database (1)
The datanodes store all the data.
> - n machines that would only be used for fetching (because I use
> specific computations in the fetcher process which are time
> consuming). (2)
The tasktrackers do that.
start on the master: bin/nutch fetch segments/SEGMENT_NAME
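In case it helps, SEGMENT_NAME comes out of the generate step; a single
round looks roughly like this (Nutch 0.8-style commands, with a
hypothetical urls directory and segment timestamp):

bin/nutch inject crawldb urls            # load the seed URLs into the crawldb (first round only)
bin/nutch generate crawldb segments      # writes a new segments/<timestamp> directory
bin/nutch fetch segments/20060316121500  # the fetch runs as map tasks on the tasktracker machines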
> After finishing fetching, each of these n machines (2) would
> directly update the database on the main machine (1).
This is also done by the tasktrackers.
start on the master: bin/nutch updatedb crawldb segments/SEGMENT_NAME
> Here are my questions:
> 1 - After reading the documentation on NDFS, I'm not sure it can be
> used in that way. If so, is there a better way to do it?
No. ;-)
> 2 - If it's possible, what happens if two machines (2) try to run
> updatedb on the main machine (1) at the same time?
You should not start the updatedb process twice at the same time with
different segments; run the updates one after another instead.
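As a rough sketch of what that looks like in practice (the segment
names below are hypothetical examples):

# merge each fetched segment into the crawldb sequentially, never in parallel
for s in segments/20060316121500 segments/20060316150000
do
  bin/nutch updatedb crawldb $s
done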
hope this helps
Marko