You might consider Apache Whirr (http://whirr.apache.org/) for bringing up Hadoop clusters on EC2.
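Something along these lines usually does the trick (the property names follow the standard Whirr recipes; the instance counts, hardware ID, and region below are just illustrative, so adjust them for your workload):

  whirr.cluster-name=hadoop-ec2
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
  whirr.hardware-id=m1.large
  whirr.location-id=us-east-1

and then:

  % whirr launch-cluster --config hadoop-ec2.properties
  % whirr destroy-cluster --config hadoop-ec2.properties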
Cheers,
Tom

On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans <[email protected]> wrote:
> Dmitry,
>
> It sounds like an interesting idea, but I have not really heard of anyone
> doing it before. It would make for a good feature to have tiered file
> systems all mapped into the same namespace, but that would be a lot of work
> and complexity.
>
> The quick solution would be to know what data you want to process beforehand
> and then run distcp to copy it from S3 into HDFS before launching the other
> map/reduce jobs. I don't think there is anything automatic out there.
>
> --Bobby Evans
>
> On 8/29/11 4:56 PM, "Dmitry Pushkarev" <[email protected]> wrote:
>
> Dear hadoop users,
>
> Sorry for the off-topic post. We're slowly migrating our Hadoop cluster to
> EC2, and one thing I'm trying to explore is whether we can use alternative
> scheduling systems like SGE with a shared FS for non-data-intensive tasks,
> since they are easier for lay users to work with.
>
> One problem for now is how to create a shared cluster filesystem similar to
> HDFS: distributed, high-performance, and somewhat POSIX compliant (symlinks
> and permissions), that will use Amazon EC2 local non-persistent storage.
>
> The idea is to keep the original data on S3, then as needed fire up a bunch
> of nodes, start the shared filesystem, quickly copy data from S3 to that FS,
> run the analysis with SGE, save the results, and shut the filesystem down.
> I tried things like S3FS and similar native S3 implementations, but the
> speed is far too slow. Currently I just have a FS on my master node that is
> shared via NFS to all the rest, but I pretty much saturate the 1 Gb/s link
> as soon as I start more than 10 nodes.
>
> Thank you. I'd appreciate any suggestions and links to relevant resources!
>
>
> Dmitry
>
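(For anyone following along, the distcp step Bobby describes above would look roughly like this; the bucket name and paths are placeholders:

  % hadoop distcp s3n://my-bucket/input hdfs://namenode:8020/user/hadoop/input

with the AWS credentials set via fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in core-site.xml, or embedded in the s3n URI.)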
