Hi, I've been trying to set up a Nutch + Hadoop distributed environment to crawl a list of 3 million URLs.
My experience so far:

1. Nutch works fine in a single-machine environment. I wrote a script that calls the nutch crawl command to crawl the first 1000 URLs, then the next 1000. The two indexes produced by those runs are merged into a single merged index, and the script keeps crawling 1000 URLs at a time and merging each new index into the previous one. This is stable and runs smoothly. (A rough sketch of the script is at the end of this message.)

2. I then tried to set up a distributed environment with 4 machines: two master nodes with 2 GB RAM each, one for the NameNode and one for the JobTracker, plus two more machines with 1 GB RAM each. All 4 machines also act as slave nodes. I run the same script, taking 5000 URLs at a time from the 3-million-URL list, crawling them, and merging the result with the previous index. Here I found the DFS environment is not stable: after 2-3 cycles it breaks in different ways. Either the crawl fails or the merge fails. I have since tried several other configurations, such as running both masters on a single node or running only 3 slaves, but it still never gets beyond 2-3 cycles. (My configuration is also sketched below.)

Could anybody suggest where I'm going wrong, or whether there is a better alternative? I have read docs claiming Nutch runs on 100+ machines; does that mean those crawls were run only once? How long can a DFS environment be expected to stay stable? Do I have to restart DFS before every crawl/merge cycle, or is there something I should be checking between cycles (see the commands at the very end)? I see lots of errors: missing DataNodes, FileAlreadyCreatedException, failed jobs, RPC exceptions, etc.

I would appreciate any help with this, and I'm happy to share what I've learned so far. Please write! Thanks in advance!
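
In case the details matter, here is roughly what my batch script does. This is a simplified sketch: the file and directory names are placeholders, and it assumes the Nutch 0.x-style bin/nutch crawl and bin/nutch merge (IndexMerger) commands.

#!/bin/bash
# Simplified sketch of my batch crawl-and-merge loop (single-machine case).
# Placeholders: all_urls.txt is the full URL list; BATCH_SIZE is 1000 here.

URL_LIST=all_urls.txt
BATCH_SIZE=1000
MERGED_INDEX=merged-index

i=0
while true; do
  start=$((i * BATCH_SIZE + 1))
  end=$((i * BATCH_SIZE + BATCH_SIZE))

  # copy the next batch of seed URLs into its own directory
  mkdir -p seeds/batch-$i
  sed -n "${start},${end}p" "$URL_LIST" > seeds/batch-$i/urls.txt
  [ -s seeds/batch-$i/urls.txt ] || break   # stop when the list runs out

  # crawl this batch into its own crawl directory
  bin/nutch crawl seeds/batch-$i -dir crawl-$i -depth 1 -topN $BATCH_SIZE

  # merge the batch's index into the running merged index
  if [ -d "$MERGED_INDEX" ]; then
    bin/nutch merge merged-index.new "$MERGED_INDEX" crawl-$i/index
    rm -rf "$MERGED_INDEX" && mv merged-index.new "$MERGED_INDEX"
  else
    cp -r crawl-$i/index "$MERGED_INDEX"
  fi

  i=$((i + 1))
done

In the distributed runs the crawl and merged directories live in DFS instead of the local filesystem, but the cycle is the same: crawl a batch, merge it with the previous result, repeat.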
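
And this is roughly how the two masters are split in my conf/hadoop-site.xml (hostnames are placeholders; on older Hadoop versions fs.default.name takes host:port without the hdfs:// prefix):

<property>
  <name>fs.default.name</name>
  <value>hdfs://master-nn:9000</value>   <!-- NameNode host -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>master-jt:9001</value>          <!-- JobTracker host -->
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>   <!-- kept low, since there are only 4 datanodes -->
</property>

conf/slaves lists all four machines, so each runs a DataNode and TaskTracker:

master-nn
master-jt
slave-1
slave-2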
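
Finally, on the restart question: between cycles I can at least run the standard health checks below, but I don't know what to look for beyond dead DataNodes. Is a full restart really the only option?

bin/hadoop dfsadmin -report   # live/dead DataNodes and remaining capacity
bin/hadoop fsck /             # missing or under-replicated blocks

# full stop/start of DFS and MapReduce, if it comes to that
bin/stop-all.sh
bin/start-all.sh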