Make Hadoop NetworkTopology and data locality more pluggable for other deployment topologies such as virtualization

Jun Ping Du Mon, 04 Jun 2012 08:49:05 -0700

Hello Folks,
      I filed an umbrella JIRA today to address a limitation of the current 
NetworkTopology: it is bound strictly to a three-tier network. The motivation 
is to make Hadoop more flexible about deployment topology (especially for the 
cloud/virtualization case) and more configurable in data-locality-related 
policies such as replica placement, task scheduling, block selection for 
DFSClient reads, and balancing.
      We have submitted a draft proposal to this umbrella issue along with the 
implementation code. Because the code is large (~260K), it is split into 7 
sub-JIRA issues, which should be more convenient to review. However, we split 
the code by functionality, which introduces some dependencies between patches, 
and we are not sure this is the best approach. Comments and suggestions on the 
doc and code are welcome, and we look forward to working with all of you to 
improve Hadoop for these new situations.
      Hope this is a good start.

Cheers,

Junping

----- Original Message -----
From: "Junping Du (JIRA)" <[email protected]>
To: [email protected]
Sent: Monday, June 4, 2012 12:09:22 PM
Subject: [jira] [Created] (HADOOP-8468) Umbrella of enhancements to support 
different failure and locality topologies

Junping Du created HADOOP-8468:
----------------------------------

             Summary: Umbrella of enhancements to support different failure and 
locality topologies
                 Key: HADOOP-8468
                 URL: https://issues.apache.org/jira/browse/HADOOP-8468
             Project: Hadoop Common
          Issue Type: Bug
          Components: ha, io
    Affects Versions: 2.0.0-alpha, 1.0.0
            Reporter: Junping Du
            Assignee: Junping Du
            Priority: Critical

The current Hadoop network topology (described in earlier issues such as 
Hadoop-692) worked well for the classic three-tier network when it was 
introduced. However, it does not take into account other failure models or 
changes in the infrastructure that can affect network bandwidth efficiency, 
such as virtualization.
A virtualized platform has the following characteristics, which the Hadoop 
topology should not ignore when scheduling tasks, placing replicas, balancing, 
or fetching blocks for reading:
1. VMs on the same physical host are affected by the same hardware failure. In 
order to match the reliability of a physical deployment, replication of data 
across two virtual machines on the same host should be avoided.
2. The network between VMs on the same physical host has higher throughput and 
lower latency and does not consume any physical switch bandwidth.
Thus, we propose to make the Hadoop network topology extensible and to 
introduce a new level in the hierarchical topology, a node group level, which 
maps well onto an infrastructure that is based on a virtualized environment.
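
To illustrate the two points above, here is a minimal sketch of how an extra
node group level could affect distance and replica placement. It is not the
code from the attached patches. The class name, the method names, and the
four-level path layout "/datacenter/rack/nodegroup/node" are all assumptions
made for this example, and the placement rule is deliberately simplified to
"at most one replica per node group".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Hadoop.
final class NodeGroupTopologySketch {

    // Assumed path layout: "/datacenter/rack/nodegroup/node".
    static String[] components(String path) {
        return path.substring(1).split("/");
    }

    // The node group is the parent of the leaf, e.g. "/d1/r1/ng1".
    static String nodeGroupOf(String path) {
        return path.substring(0, path.lastIndexOf('/'));
    }

    // The rack is the parent of the node group, e.g. "/d1/r1".
    static String rackOf(String path) {
        return nodeGroupOf(nodeGroupOf(path));
    }

    // Distance = hops from each node up to their closest common ancestor.
    // With a node group level: same node 0, same node group 2,
    // same rack 4, different rack 6.
    static int distance(String a, String b) {
        String[] pa = components(a);
        String[] pb = components(b);
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    // Simplified placement: walk the candidates in order and skip any node
    // whose node group (physical host) already holds a replica.
    static List<String> chooseReplicaTargets(List<String> candidates,
                                             int replicas) {
        Set<String> usedGroups = new HashSet<>();
        List<String> chosen = new ArrayList<>();
        for (String candidate : candidates) {
            if (chosen.size() == replicas) {
                break;
            }
            if (usedGroups.add(nodeGroupOf(candidate))) {
                chosen.add(candidate);
            }
        }
        return chosen;
    }
}
```

In this sketch, two VMs on the same physical host are closer (distance 2) than
two VMs on different hosts in the same rack (distance 4), so a reader or
scheduler could prefer them, while the placement helper never puts two
replicas in the same node group.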

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
