You might consider Apache Whirr (http://whirr.apache.org/) for
bringing up Hadoop clusters on EC2.
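
For a small cluster, Whirr boils down to a properties file plus a couple of
CLI calls. Roughly, following the Whirr quick-start (the cluster name,
instance counts, and hardware ID below are placeholders to adapt):

  # hadoop-ec2.properties
  whirr.cluster-name=hadoop-ec2
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
  whirr.hardware-id=m1.large

  # bring the cluster up, and tear it down when you're done
  whirr launch-cluster --config hadoop-ec2.properties
  whirr destroy-cluster --config hadoop-ec2.properties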

Cheers,
Tom

On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans <[email protected]> wrote:
> Dmitry,
>
> It sounds like an interesting idea, but I have not really heard of anyone 
> doing it before.  It would make for a good feature to have tiered file 
> systems all mapped into the same namespace, but that would be a lot of work 
> and complexity.
>
> The quick solution would be to know what data you want to process beforehand 
> and then run distcp to copy it from S3 into HDFS before launching the other 
> map/reduce jobs.  I don't think there is anything automatic out there.
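>
> Something along these lines, where the bucket and paths are placeholders and
> the S3 credentials are assumed to be set via fs.s3n.awsAccessKeyId /
> fs.s3n.awsSecretAccessKey in the cluster config:
>
>   hadoop distcp s3n://my-bucket/input /data/input   # S3 -> HDFS (default FS)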
>
> --Bobby Evans
>
> On 8/29/11 4:56 PM, "Dmitry Pushkarev" <[email protected]> wrote:
>
> Dear hadoop users,
>
> Sorry for the off-topic question. We're slowly migrating our Hadoop cluster to EC2,
> and one thing I'm trying to explore is whether we can use alternative
> scheduling systems such as SGE with a shared FS for non-data-intensive tasks,
> since they are easier for lay users to work with.
>
> One problem for now is how to create a shared cluster filesystem similar to
> HDFS: distributed, high-performance, somewhat POSIX-compliant (symlinks
> and permissions), and backed by Amazon EC2's local non-persistent storage.
>
> The idea is to keep the original data on S3, then as needed fire up a bunch of
> nodes, start the shared filesystem, quickly copy the data from S3 to that FS,
> run the analysis with SGE, save the results, and shut the filesystem down.
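>
> In shell terms that's roughly the following (the bucket, mount point, and job
> script are placeholders, and the shared FS is assumed to be mounted on every node):
>
>   s3cmd sync s3://my-bucket/dataset/ /sharedfs/dataset/    # pull input from S3
>   qsub -t 1-100 run_analysis.sh                            # run the array job under SGE
>   s3cmd sync /sharedfs/results/ s3://my-bucket/results/    # push results back to S3
>   # ...then stop the shared FS and terminate the nodes
>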
> I tried things like S3FS and similar native S3 implementations, but the
> performance is too poor. Currently I just have a FS on my master node that is
> shared via NFS with all the rest, but I pretty much saturate the 1GB bandwidth
> as soon as I start more than 10 nodes.
>
> Thank you. I'd appreciate any suggestions and links to relevant resources!
>
>
> Dmitry
>
>
