OK guys,
I once again need some advice. I have four dual-processor, quad-core
1.8 GHz Xeon servers; each has 4 GB of RAM and runs Linux. I am using Nutch
from SVN (build #334, I think) with Hadoop DFS. I need to know which
parameters I can set to get optimal performance from these servers. I have a
seed list of about 10,000 URLs (the ignore external links option will be set
to true). My goal is to finish the crawl in the shortest possible time.
Furthermore, I intend to run a single crawl (depth 5) and thus end up with
one index.
What advice would you give on this approach, and on the Nutch/Hadoop
variables/parameters and their settings?
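For reference, the one setting I have already decided on would look roughly
like this in conf/nutch-site.xml. This is only a sketch: I am assuming the
option I mean is the db.ignore.external.links property, and everything else
is left at its default.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed property name: keep the crawl on the seed hosts by
       dropping outlinks that point to other hosts. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

The depth would then be passed on the command line, something like
"bin/nutch crawl urls -dir crawl -depth 5", where "urls" and "crawl" are
just placeholder directory names.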
Regards,
Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21