Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by ToddLipcon:
http://wiki.apache.org/hadoop/FAQ

------------------------------------------------------------------------------
  
  You can subclass the 
  [http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/OutputFormat.java?view=markup OutputFormat.java] 
  class and write your own. For reference, look at the code of 
  [http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/TextOutputFormat.java?view=markup TextOutputFormat], 
  [http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MultipleOutputFormat.java?view=markup MultipleOutputFormat.java], etc. 
  Often you only need to make minor changes to one of the existing OutputFormat 
  classes; to do that, just subclass that class and override the methods you need 
  to change.
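
As a minimal sketch of that pattern (not runnable without Hadoop's old `mapred` API on the classpath; the class name `MyTextOutputFormat` is hypothetical), here is a subclass of TextOutputFormat that overrides one method to change the key/value separator before delegating to the parent:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;

// Hypothetical example: reuse TextOutputFormat but emit comma-separated
// records instead of the default tab-separated ones.
public class MyTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                              String name, Progressable progress)
            throws IOException {
        // Change the separator, then let the parent class do all the real work.
        job.set("mapred.textoutputformat.separator", ",");
        return super.getRecordWriter(ignored, job, name, progress);
    }
}
```

Everything else (output paths, compression, record writing) is inherited unchanged from TextOutputFormat.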
  
+ [[BR]]
+ [[Anchor(28)]]
+ '''28. [#28 How much network bandwidth might I need between racks in a medium-size (40-80 node) Hadoop cluster?]'''
+ 
+ The true answer depends on the types of jobs you're running. As a back-of-the-envelope calculation, one might figure something like this:
+ 
+  * 60 nodes total on 2 racks = 30 nodes per rack.
+  * Each node might process about 100MB/sec of data.
+  * For a sort job, where the intermediate data is the same size as the input data, each node needs to shuffle 100MB/sec of data.
+  * In aggregate, each rack then produces about 3GB/sec of data.
+  * With reducers spread evenly across the racks, each rack needs to send 1.5GB/sec to reducers running on the other rack.
+  * Since the connection is full duplex, that means you need 1.5GB/sec of bisection bandwidth for this theoretical job. So that's 12Gbps.
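
The steps above can be sketched as a small calculation (a sketch of the FAQ's own figures; the method name `bisectionGbps` is made up for illustration):

```java
public class BisectionEstimate {
    // Back-of-the-envelope cross-rack bandwidth for a 2-rack cluster,
    // assuming intermediate data is the same size as the input.
    static double bisectionGbps(int nodes, int racks, double mbPerSecPerNode) {
        int nodesPerRack = nodes / racks;                 // 60 / 2 = 30 nodes per rack
        double rackMBps = nodesPerRack * mbPerSecPerNode; // 30 * 100 = 3000 MB/sec per rack
        double crossRackMBps = rackMBps / racks;          // half goes to the other rack: 1500 MB/sec
        return crossRackMBps * 8.0 / 1000.0;              // MB/sec -> Gbps
    }

    public static void main(String[] args) {
        // 60 nodes, 2 racks, 100MB/sec per node -> 12 Gbps
        System.out.println(bisectionGbps(60, 2, 100.0) + " Gbps");
    }
}
```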
+ 
+ However, the above calculation is probably somewhat of an upper bound. A large number of jobs see significant data reduction during the map phase, either through filtering/selection in the Mapper itself or through good use of Combiners. Additionally, compressing the intermediate data can cut the transfer by a significant factor. Lastly, although your disks can probably provide 100MB/sec of sustained throughput, it's rare to see an MR job that can sustain disk-speed IO through the entire pipeline. So I'd say this estimate is at least a factor of 2 too high.
+ 
+ So the simple answer is that 4-6Gbps is most likely just fine for most practical jobs. If you want to be extra safe, many inexpensive switches can operate in a "stacked" configuration where the bandwidth between them is essentially backplane speed. That should scale you to 96 nodes with plenty of headroom. Many inexpensive gigabit switches also have one or two 10GigE ports, which can be used to uplink switches to each other or to a 10GigE core.
+ 
