Hi, I can share my times with you. I'm fetching 500,000 pages in each run:
generating: 5 hrs
fetching: 8 hrs
parsing: 2.5 hrs
updating: 3.5 hrs
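
(For a rough sense of throughput: 500,000 pages fetched in 8 hours works
out to about 17 pages per second across the cluster, or roughly 6 pages
per second per machine.)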
I have about 30 million URLs in the DB right now, and those times are for
a cluster of 3 machines. So yes, it takes a lot of time. I think that
using the native Hadoop libraries could speed it up a bit, but
unfortunately I can't get them to work on Debian. I will switch the
cluster to Fedora or some other supported Linux and check then, but that
will probably be next week. I hope you realize that you will need huge
storage for the segments of 100 million pages. I'd also suggest running a
nightly build of Nutch, as it includes Hadoop 0.15, which has been much
more stable for me.
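
In case it helps anyone debugging the same thing: a quick sanity check is
to grep the logs for the native loader message, and to point the JVM at
the prebuilt libs by hand. This is just a sketch assuming the stock
bin/hadoop scripts and the Linux-i386-32 build shipped with the release;
adjust the directory name for your platform:

  # Hadoop logs one of these lines at startup, so grepping tells you
  # whether the native (zlib) compression libs actually loaded:
  #   "Loaded the native-hadoop library"
  #   "Unable to load native-hadoop library for your platform"
  grep -r "native-hadoop" "$HADOOP_HOME"/logs/

  # If they are not being picked up automatically, pass the path
  # explicitly, e.g. in conf/hadoop-env.sh:
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native/Linux-i386-32"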

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Information
Technology and Management
+48(17)8661277
