It seems that one could use Amazon EC2 (see http://wiki.apache.org/lucene-hadoop/AmazonEC2) to compress many hours of spidering into a single hour by running the crawl in parallel across a number of Xen virtual machine instances. If instances are started up for the crawl and shut down afterwards, doing a lot of work concurrently on a big crawl should not be expensive. Is this practical? Why or why not?
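To make the cost question concrete, here is a rough back-of-the-envelope sketch. It assumes the advertised price of 10 cents per machine per hour, that the crawl parallelizes perfectly, and that each instance is billed for every hour it starts. The last two are my assumptions, not something I have verified, and the function name is just for illustration.

```python
import math

# Price taken from the advertised 10 cents / machine per hour.
PRICE_CENTS_PER_INSTANCE_HOUR = 10


def crawl_cost(total_machine_hours, instances):
    """Return (wall_clock_hours, cost_in_dollars) for a crawl split across instances.

    Assumptions (hypothetical, for illustration only):
      - the work divides evenly across instances with no coordination overhead
      - every started hour on an instance is billed as a full hour
    """
    hours_per_instance = math.ceil(total_machine_hours / instances)
    cost_cents = instances * hours_per_instance * PRICE_CENTS_PER_INSTANCE_HOUR
    return hours_per_instance, cost_cents / 100


# 20 machine-hours of spidering on one instance vs. twenty instances
print(crawl_cost(20, 1))
print(crawl_cost(20, 20))
# An uneven split pays for partly idle hours
print(crawl_cost(20, 8))
```

Under these assumptions the bill depends on total machine-hours rather than on how many instances run at once, except where an uneven split leaves partially used hours that still get charged. Whether a real crawl comes anywhere near perfect parallelism is exactly what I am asking about.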
Having a virtual server farm that appears just when I need it, costs me only 10 cents per machine per hour, and costs nothing once I am done spidering sounds like something I should explore, quite apart from the coolness of it.

What about actually running the web front-end search site on EC2? Would that be wise? Will I get the performance I need for the site to be seen as responsive?

When they talk about Amazon S3 (Simple Storage Service), are they selling raw disk space, or storage that is backed up and guaranteed to be there?

Does anyone on the list have actual experience with either of these offerings, or with competing offerings from anyone else? I assume that Amazon has a huge infrastructure with great connectivity. In practice, do I really get the benefit of all that if I use EC2? Will it actually help me keep my search index up to date?

Thanks!