Saikat,

As Robert pointed out, performance is a primary criterion - maybe you can come back with benchmarks? Try sorts with >100 GB of data.
Also, MRv2 makes it easy to play with these; you might want to try that.

Arun

On Sep 9, 2011, at 10:34 AM, Saikat Kanjilal wrote:

>
> How about using VirtualBox and 64-bit CentOS to serve as a Linux container
> for isolating map/reduce processes? I have set this up in the past; it's
> really easy.
>
>
>> From: ev...@yahoo-inc.com
>> To: mapreduce-dev@hadoop.apache.org
>> Date: Fri, 9 Sep 2011 10:30:37 -0700
>> Subject: Re: Research projects for hadoop
>>
>> The biggest issue with Xen and other virtualization technologies is that
>> often there is an IO penalty involved with using them. For many jobs this
>> is not an acceptable trade-off. I do know, however, that there has been
>> some discussion about using Linux Containers for isolation of Map/Reduce
>> processes. I don't know if any JIRA has been filed for it or not, but they
>> are much lighter weight than Xen and other virtualization tech, because all
>> they are really concerned with is resource isolation, and not virtualizing
>> an entire operating system.
>>
>> --Bobby Evans
>>
>> On 9/9/11 10:58 AM, "Saikat Kanjilal" <sxk1...@hotmail.com> wrote:
>>
>>
>>
>> Hi Folks, I was looking through the following wiki page:
>> http://wiki.apache.org/hadoop/HadoopResearchProjects and was wondering if
>> there's been any work done (or any interest in doing work) on the following
>> topics:
>>
>> Integration of Virtualization (such as Xen) with Hadoop tools
>>
>> How does one integrate sandboxing of arbitrary user code in C++ and other
>> languages in a VM such as Xen with the Hadoop framework? How does this
>> interact with SGE, Torque, Condor?
>>
>> As each individual machine has more and more cores/cpus, it makes sense to
>> partition each machine into multiple virtual machines.
>> That gives us a number of benefits:
>>
>> - By assigning a virtual machine to a datanode, we effectively isolate the
>>   datanode from the load on the machine caused by other processes, making
>>   the datanode more responsive/reliable.
>> - With multiple virtual machines on each machine, we can lower the
>>   granularity of HOD scheduling units, making it possible to schedule
>>   multiple tasktrackers on the same machine, improving the overall
>>   utilization of the whole cluster.
>> - With virtualization, we can easily snapshot a virtual cluster before
>>   releasing it, making it possible to re-activate the same cluster in the
>>   future and start to work from the snapshot.
>>
>> Provisioning of long running Services via HOD
>>
>> Work on a computation model for services on the grid. The model would
>> include:
>>
>> - Various tools for defining clients and servers of the service, and at
>>   the least a C++ and Java instantiation of the abstractions
>> - Logical definitions of how to partition work onto a set of servers,
>>   i.e. a generalized shard implementation
>> - A few useful abstractions like locks (exclusive and RW, fairness),
>>   leader election, transactions
>> - Various communication models for groups of servers belonging to a
>>   service, such as broadcast, unicast, etc.
>> - Tools for assuring QoS, reliability, managing pools of servers for a
>>   service with spares, etc.
>> - Integration with HDFS for persistence, as well as access to local
>>   filesystems
>> - Integration with ZooKeeper so that applications can use the namespace
>>
>> I would like to either help out with a design for the above or prototyping
>> code, please let me know if and what the process may be to move forward
>> with this.
>> Regards
>>
>