Saikat,

As Robert pointed out, performance is a primary criterion. Maybe you can come
back with benchmarks? Try sorts on more than 100 GB of data.

Also, MRv2 makes it easy to experiment with these, so you might want to try
that.

Arun

On Sep 9, 2011, at 10:34 AM, Saikat Kanjilal wrote:

> 
> How about using VirtualBox and 64-bit CentOS to serve as a Linux container
> for isolating map/reduce processes?  I have set this up in the past; it's
> really easy.
> 
> 
>> From: ev...@yahoo-inc.com
>> To: mapreduce-dev@hadoop.apache.org
>> Date: Fri, 9 Sep 2011 10:30:37 -0700
>> Subject: Re: Research projects for hadoop
>> 
>> The biggest issue with Xen and other virtualization technologies is that
>> there is often an IO penalty involved in using them.  For many jobs this
>> is not an acceptable trade-off.  I do know, however, that there has been
>> some discussion about using Linux Containers for isolation of Map/Reduce
>> processes.  I don't know whether any JIRA has been filed for it, but they
>> are much lighter weight than Xen and other virtualization tech, because all
>> they are really concerned with is resource isolation, not virtualizing an
>> entire operating system.
>> 
>> --Bobby Evans
>> 
>> On 9/9/11 10:58 AM, "Saikat Kanjilal" <sxk1...@hotmail.com> wrote:
>> 
>> 
>> 
>> Hi Folks,
>>
>> I was looking through the following wiki page:
>> http://wiki.apache.org/hadoop/HadoopResearchProjects and was wondering if
>> there's been any work done (or any interest to do work) for the following
>> topics:
>>
>> Integration of Virtualization (such as Xen) with Hadoop tools
>>
>> - How does one integrate sandboxing of arbitrary user code in C++ and
>>   other languages in a VM such as Xen with the Hadoop framework? How does
>>   this interact with SGE, Torque, Condor?
>> - As each individual machine has more and more cores/cpus, it makes sense
>>   to partition each machine into multiple virtual machines. That gives us
>>   a number of benefits:
>>   - By assigning a virtual machine to a datanode, we effectively isolate
>>     the datanode from the load on the machine caused by other processes,
>>     making the datanode more responsive/reliable.
>>   - With multiple virtual machines on each machine, we can lower the
>>     granularity of hod scheduling units, making it possible to schedule
>>     multiple tasktrackers on the same machine, improving the overall
>>     utilization of the whole clusters.
>>   - With virtualization, we can easily snapshot a virtual cluster before
>>     releasing it, making it possible to re-activate the same cluster in
>>     the future and start to work from the snapshot.
>>
>> Provisioning of long running Services via HOD
>>
>> - Work on a computation model for services on the grid. The model would
>>   include:
>>   - Various tools for defining clients and servers of the service, and at
>>     the least a C++ and Java instantiation of the abstractions
>>   - Logical definitions of how to partition work onto a set of servers,
>>     i.e. a generalized shard implementation
>>   - A few useful abstractions like locks (exclusive and RW, fairness),
>>     leader election, transactions
>>   - Various communication models for groups of servers belonging to a
>>     service, such as broadcast, unicast, etc.
>>   - Tools for assuring QoS, reliability, managing pools of servers for a
>>     service with spares, etc.
>>   - Integration with HDFS for persistence, as well as access to local
>>     filesystems
>>   - Integration with ZooKeeper so that applications can use the namespace
>>
>> I would like to help out with either a design for the above or prototype
>> code. Please let me know what the process would be for moving forward
>> with this.
>> Regards
>> 
>                                         
