To all Nutch gurus,

 

My question is regarding Nutch distributed WebDB and distributed Web crawling.

 

The Nutch website mentions that a distributed WebDB is implemented so that the WebDB database can span multiple machines. How do you configure the distributed WebDB? I looked through several email threads and could not find a satisfactory answer. All I know is how to query a distributed WebDB: you create a search-servers.txt file on the Nutch client (i.e. the Tomcat server), list each host with its port number in that file, and run bin/nutch server portnumber on the other servers where the distributed WebDB resides. Is this understanding correct? However, when you run a fetch and it updates a WebDB that is going to grow beyond one machine, do you need to configure something specific so that the WebDB knows it has to expand to another machine?

 

Also, does Nutch support distributed web crawling? For example, if I have a cluster of machines and want to use them for web crawling, can I tell Nutch in any way to spread the crawl across those machines? And if the cluster runs software that makes it behave like a single virtual machine, do you think it would then not matter to the Nutch crawl procedure that this virtual machine is internally distributing the load across multiple machines to perform distributed crawling?

 

Thanks in advance for any quick pointers on the questions above.

 

~

Ashish
