Thank you for your suggestion. I have years of experience with HDFS, and if I could I'd gladly use it as the filesystem, but our developers require fast random access and symlinks, so I was wondering if there are other options. I'm not too sensitive to data locality, but we do need to eliminate network bottlenecks in the FS. Has anyone tried filesystems like GlusterFS, GFS, and so on?
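In case it helps anyone searching the archives later, this is roughly what I was planning to test on a pair of EC2 nodes (the hostnames, volume name, and brick paths below are made up for illustration). A plain distributed GlusterFS volume has clients talk to all bricks directly over the native protocol, so there's no single server to saturate the way my current NFS export does:

  # on each node, put the ephemeral disk behind a brick directory
  mkdir -p /mnt/brick

  # from one node, pool the peers and create a distributed (non-replicated) volume
  gluster peer probe node2
  gluster volume create scratch transport tcp node1:/mnt/brick node2:/mnt/brick
  gluster volume start scratch

  # on every compute node, mount it via the FUSE client
  mount -t glusterfs node1:/scratch /mnt/scratch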
Thanks!

On Wed, Aug 31, 2011 at 8:50 AM, Tom White <[email protected]> wrote:
> You might consider Apache Whirr (http://whirr.apache.org/) for
> bringing up Hadoop clusters on EC2.
>
> Cheers,
> Tom
>
> On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans <[email protected]> wrote:
> > Dmitry,
> >
> > It sounds like an interesting idea, but I have not really heard of
> > anyone doing it before. It would make for a good feature to have tiered
> > file systems all mapped into the same namespace, but that would be a lot
> > of work and complexity.
> >
> > The quick solution would be to know what data you want to process
> > beforehand and then run distcp to copy it from S3 into HDFS before
> > launching the other map/reduce jobs. I don't think there is anything
> > automatic out there.
> >
> > --Bobby Evans
> >
> > On 8/29/11 4:56 PM, "Dmitry Pushkarev" <[email protected]> wrote:
> >
> > Dear hadoop users,
> >
> > Sorry for the off-topic. We're slowly migrating our hadoop cluster to
> > EC2, and one thing I'm trying to explore is whether we can use
> > alternative scheduling systems like SGE with a shared FS for
> > non-data-intensive tasks, since they are easier for lay users to work
> > with.
> >
> > One problem for now is how to create a shared cluster filesystem similar
> > to HDFS: distributed, high-performance, and somewhat POSIX-compliant
> > (symlinks and permissions), that will use Amazon EC2 local non-persistent
> > storage.
> >
> > The idea is to keep the original data on S3, then as needed fire up a
> > bunch of nodes, start the shared filesystem, quickly copy data from S3 to
> > that FS, run the analysis with SGE, save the results, and shut the
> > filesystem down.
> >
> > I tried things like S3FS and similar native S3 implementations, but the
> > speed is too bad. Currently I just have a FS on my master node that is
> > shared via NFS with all the rest, but I pretty much saturate 1Gb/s of
> > bandwidth as soon as I start more than 10 nodes.
> >
> > Thank you. I'd appreciate any suggestions and links to relevant
> > resources!
> >
> > Dmitry
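P.S. For the archives: the distcp route Bobby describes is a one-liner once the cluster is up, something like (bucket, credentials, and paths are placeholders):

  hadoop distcp s3n://AWS_KEY:AWS_SECRET@my-bucket/input hdfs:///user/dmitry/input

It runs the copy as a map-only MapReduce job spread across the whole cluster, so it isn't throttled by a single node's NIC the way my NFS setup is. And if I understand Tom's suggestion correctly, Whirr would bring the cluster up from a properties file beforehand, roughly:

  bin/whirr launch-cluster --config hadoop.properties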
