Hi, I want to know whether anyone has been able to successfully run a distributed crawl across multiple machines, crawling millions of pages, and how hard that is to do. Is it just a matter of configuration and setup, or does it also require some implementation work?
Also, can anyone tell me whether crawling around 20,000 websites (say to depth 5) in a day is feasible, and if so, roughly how many machines I would need and what configuration? Even very approximate numbers would be appreciated, since I understand this may not be trivial to work out (or maybe it is :-)). TIA Pushpesh
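P.S. To show the kind of scale I have in mind, here is my own rough back-of-envelope sketch. The branching factor of 10 new links per page and the fetch rate of 20 pages/second per machine are pure guesses on my part, and the geometric fan-out ignores duplicate and already-seen links, so treat it as a worst-case upper bound rather than a real estimate:

    # Back-of-envelope for the crawl volume in my question above.
    # All numbers are assumptions, not measurements.
    sites = 20_000
    depth = 5
    avg_outlinks = 10               # assumed new links followed per page
    pages_per_machine_per_sec = 20  # assumed sustained fetch rate per machine

    # Pages per site if every level fans out by avg_outlinks:
    # 1 + b + b^2 + ... + b^(depth-1)
    pages_per_site = sum(avg_outlinks ** d for d in range(depth))
    total_pages = sites * pages_per_site

    seconds_per_day = 24 * 60 * 60
    machines_needed = total_pages / (pages_per_machine_per_sec * seconds_per_day)

    print(f"pages per site: {pages_per_site:,}")          # 11,111
    print(f"total pages:    {total_pages:,}")             # ~222 million
    print(f"machines to finish in a day: {machines_needed:.1f}")  # ~129

With those (very rough) assumptions the numbers get large quickly, which is exactly why I'm asking for people's real-world experience.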
