Hi, I've been trying to set up a Nutch + Hadoop distributed environment to crawl a list of 3 million URLs.
My experience so far:

1. Nutch works fine in a single-machine environment. I wrote a script that calls the nutch crawl command to crawl the first 1000 URLs, then the next 1000. The two indexes produced by those runs are merged into a single merged index, and the script keeps crawling 1000 URLs at a time and merging each new index into the previous one. This is stable and runs smoothly. (A rough sketch of the script is at the end of this message.)

2. I then tried to set up a distributed environment with 4 machines: two master nodes with 2 GB RAM each, one for the NameNode and one for the JobTracker, plus two more machines with 1 GB RAM each. All 4 machines also act as slave nodes. I run the same script, taking 5000 URLs at a time from the 3-million-URL list, crawling them, and merging the result with the previous index. Here I found the DFS environment is not stable: after 2-3 cycles it breaks in different ways. Either the crawl fails or the merge fails. I have since tried several other configurations, such as running both masters on a single node or running only 3 slaves, but it still never gets beyond 2-3 cycles. (My configuration is also sketched below.)

Could anybody suggest where I'm going wrong, or whether there is a better alternative? I have read docs claiming Nutch runs on 100+ machines; does that mean those crawls were run only once? How long can a DFS environment be expected to stay stable? Do I have to restart DFS before every crawl/merge cycle, or is there something I should be checking between cycles (see the commands at the very end)? I see lots of errors: missing DataNodes, FileAlreadyCreatedException, failed jobs, RPC exceptions, etc.

I would appreciate any help with this, and I'm happy to share what I've learned so far. Please write! Thanks in advance!
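
In case the details matter, here is roughly what my batch script does. This is a simplified sketch: the file and directory names are placeholders, and it assumes the Nutch 0.x-style bin/nutch crawl and bin/nutch merge (IndexMerger) commands.

#!/bin/bash
# Simplified sketch of my batch crawl-and-merge loop (single-machine case).
# Placeholders: all_urls.txt is the full URL list; BATCH_SIZE is 1000 here.

URL_LIST=all_urls.txt
BATCH_SIZE=1000
MERGED_INDEX=merged-index

i=0
while true; do
  start=$((i * BATCH_SIZE + 1))
  end=$((i * BATCH_SIZE + BATCH_SIZE))

  # copy the next batch of seed URLs into its own directory
  mkdir -p seeds/batch-$i
  sed -n "${start},${end}p" "$URL_LIST" > seeds/batch-$i/urls.txt
  [ -s seeds/batch-$i/urls.txt ] || break   # stop when the list runs out

  # crawl this batch into its own crawl directory
  bin/nutch crawl seeds/batch-$i -dir crawl-$i -depth 1 -topN $BATCH_SIZE

  # merge the batch's index into the running merged index
  if [ -d "$MERGED_INDEX" ]; then
    bin/nutch merge merged-index.new "$MERGED_INDEX" crawl-$i/index
    rm -rf "$MERGED_INDEX" && mv merged-index.new "$MERGED_INDEX"
  else
    cp -r crawl-$i/index "$MERGED_INDEX"
  fi

  i=$((i + 1))
done

In the distributed runs the crawl and merged directories live in DFS instead of the local filesystem, but the cycle is the same: crawl a batch, merge it with the previous result, repeat.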
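
And this is roughly how the two masters are split in my conf/hadoop-site.xml (hostnames are placeholders; on older Hadoop versions fs.default.name takes host:port without the hdfs:// prefix):

<property>
  <name>fs.default.name</name>
  <value>hdfs://master-nn:9000</value>   <!-- NameNode host -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>master-jt:9001</value>          <!-- JobTracker host -->
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>   <!-- kept low, since there are only 4 datanodes -->
</property>

conf/slaves lists all four machines, so each runs a DataNode and TaskTracker:

master-nn
master-jt
slave-1
slave-2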
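
Finally, on the restart question: between cycles I can at least run the standard health checks below, but I don't know what to look for beyond dead DataNodes. Is a full restart really the only option?

bin/hadoop dfsadmin -report   # live/dead DataNodes and remaining capacity
bin/hadoop fsck /             # missing or under-replicated blocks

# full stop/start of DFS and MapReduce, if it comes to that
bin/stop-all.sh
bin/start-all.sh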