[Hadoop Wiki] Update of "HadoopIsNot" by SteveLoughran

Apache Wiki Wed, 22 Jul 2009 03:18:36 -0700


The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/HadoopIsNot

The comment on the change is:
formatting, wikilinks

------------------------------------------------------------------------------
  
  MapReduce is a profound idea: taking a simple functional programming 
operation and applying it, in parallel, to gigabytes or terabytes of data. But 
there is a price. For that parallelism, you need to have each MR operation 
independent from all the others. If you need to know everything that has gone 
before, you have a problem. Such problems can be eased by:
  
- * Iteration: run multiple MR jobs, with the output of one being the input to 
the next.
+  * Iteration: run multiple MR jobs, with the output of one being the input to 
the next.
- * Shared state information. ["HBase"] is one option to consider here; 
something like memcache is another.
+  * Shared state information. ["HBase"] is one option to consider here; 
something like memcache is another.
  
  Do not try to remember things in shared variables, as they are only 
remembered in a single JVM, for the life of that JVM. That is the wrong way to 
work in a massively parallel environment.
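
A minimal sketch of the iteration approach listed above, written in plain Python rather than against the Hadoop API. Each pass is a map step followed by a reduce step, and the only thing carried from one pass to the next is the previous pass's output; nothing is kept in a shared variable. The names `run_job`, `word_mapper`, `freq_mapper` and `sum_reducer` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one MR pass: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def word_mapper(line):
    # Pass 1 mapper: emit (word, 1) for every word in a line.
    for word in line.split():
        yield word, 1

def freq_mapper(pair):
    # Pass 2 mapper: its input records are the (word, count) pairs from pass 1.
    _word, count = pair
    yield count, 1

def sum_reducer(key, values):
    # Shared reducer: add up the values seen for one key.
    return key, sum(values)

lines = ["a b a", "b c a"]
word_counts = run_job(lines, word_mapper, sum_reducer)        # pass 1
freq_counts = run_job(word_counts, freq_mapper, sum_reducer)  # pass 2 reads pass 1's output
```

In a real cluster the hand-off between passes would be files in the filesystem rather than an in-memory list, but the dependency structure is the same.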
  
@@ -28, +28 @@

  
  == Hadoop is not an ideal place to learn networking error messages ==
  
- You will find things work a lot easier if you are already familiar with 
networking and the common error messages, for example what "Connection 
Refused" means, and how it differs from "No Route to Host".
+ You will find things work a lot easier if you are already familiar with 
networking and the common error messages, for example what 
[:ConnectionRefused: "Connection Refused"] means, and how it differs from "No 
Route to Host".
  
- A lot of people post to the user list with problems related to "connection 
refused", "No Route to Host" and other common TCP/IP-level errors. These are 
usually signs of an invalid cluster configuration, some parts of the cluster 
not running, or machines not being able to talk to each other on the LAN. 
People on the mailing list cannot debug your network configuration for you, as 
it is your network, not theirs. They can point you at some of the tools and 
tests to try, but since it will take a day for every email round trip, you 
won't find this a very fast way to get help.
+ A lot of people post to the user list with problems related to 
[:ConnectionRefused: "Connection Refused"], [:NoRouteNoHost: "No Route to 
Host"] and other common TCP/IP-level errors. These are usually signs of an 
invalid cluster configuration, some parts of the cluster not running, or 
machines not being able to talk to each other on the LAN. People on the mailing 
list cannot debug your network configuration for you, as it is your network, 
not theirs. They can point you at some of the tools and tests to try, but since 
it will take a day for every email round trip, you won't find this a very fast 
way to get help.
  
  Nobody in the Hadoop team is deliberately trying to make things hard; it is 
just that when things do not work in a large distributed system, you get some 
interesting error messages. If you can help improve those network messages or 
diagnostics, we would love to have that code.
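
As a rough illustration of telling these errors apart yourself, the sketch below makes one TCP connection attempt and reports what happened. It uses only the Python standard library; the function name `probe` and its return strings are made up for this example. In general, "connection refused" means the machine was reached but nothing accepted the connection on that port, while "no route to host" means the packets could not reach the machine at all.

```python
import errno
import socket

def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and return a short diagnosis string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on this port.
        return "connection refused"
    except socket.timeout:
        # No answer within the timeout; often a firewall silently dropping packets.
        return "timed out"
    except socket.gaierror:
        # The name could not be resolved: look at DNS or the hosts table.
        return "hostname did not resolve"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # The network could not deliver packets to the host at all.
            return "no route to host"
        return "other error: %s" % exc
```

Calling something like `probe("namenode.example.com", 8020)` from each machine in turn (host name and port here are placeholders, not values from this page) is one quick way to see which machines can reach which service before posting to the list.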
  
  
  == Hadoop clusters are not a place to learn Unix/Linux system administration 
==
  
- You need to know your way round a Unix/Linux system: how to install it, what 
the various files in /etc/ are for, how to set up networking, a good hosts 
table, debug DNS problems, why to keep logs on a separate disk from the root 
disk, etc. If you cannot look after a single machine, you aren't going to be 
able to handle a cluster of 80 of them. That said, don't try maintaining those 
80+ boxes using the same technique of hand-editing files like ["/etc/hosts"], 
because it doesn't scale.
+ You need to know your way round a Unix/Linux system: how to install it, what 
the various files in /etc/ are for, how to set up networking, what a good 
hosts table is, how to debug DNS problems, why to keep logs on a separate disk 
from the root disk, etc. If you cannot look after a single machine, you aren't 
going to be able to handle a cluster of 80 of them. That said, don't try 
maintaining those 80+ boxes using the same technique of hand-editing files like 
["/etc/hosts"], because it doesn't scale.
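
One way to avoid hand-editing the same file on every box is to generate it from a single list of machines and push the result out. The sketch below is an assumption about how that could look, not something this page prescribes; the function `render_hosts`, the addresses and the domain are all hypothetical. Many sites would use DNS or a configuration management tool instead.

```python
def render_hosts(inventory, domain):
    """Build hosts-file text from a list of (ip, short_name) pairs.

    Raises ValueError if an address or name appears twice, since duplicate
    entries are an easy mistake to make when a table is edited by hand.
    """
    seen_ips, seen_names = set(), set()
    lines = ["127.0.0.1 localhost"]
    for ip, name in inventory:
        if ip in seen_ips or name in seen_names:
            raise ValueError("duplicate entry: %s %s" % (ip, name))
        seen_ips.add(ip)
        seen_names.add(name)
        # Fully qualified name first, short alias second.
        lines.append("%s %s.%s %s" % (ip, name, domain, name))
    return "\n".join(lines) + "\n"

# Hypothetical two-node inventory; the same call works for 80 nodes.
hosts_text = render_hosts([("10.0.0.1", "node1"), ("10.0.0.2", "node2")], "example.com")
```

Because every machine receives the same generated text, a fix made in the inventory reaches all of them at once.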
  
  == Hadoop Filesystem is not a substitute for a High Availability SAN-hosted 
FS ==
  
@@ -48, +48 @@

  
  Because of these limitations, if you want a secure filesystem that is always 
available, HDFS is not yet there. You can run Hadoop MapReduce over other 
filesystems, however.
  
- == HDFS is not a Posix filesystem ==
+ == [HDFS] is not a Posix filesystem ==
  
  The Posix filesystem model has files that can be appended to, seek() calls 
made on them, and files locked. Hadoop is only just adding (in July 2009) 
append() operations, and seek() operations throw away a lot of performance. You 
cannot seamlessly map code that assumes that all filesystems are 
Posix-compatible to HDFS.
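
To make the point concrete, here is a small sketch of the kind of code that quietly assumes Posix behaviour. It runs against a local file on a Unix system and uses append, an advisory lock and a seek, the three operations named above. The function name `posix_style_update` is hypothetical, and the example makes no claim about which of these calls a given HDFS release supports.

```python
import fcntl
import os
import tempfile

def posix_style_update(path):
    """Append to a file, then lock it and seek to read back the last four bytes."""
    with open(path, "ab") as f:            # append to an existing file
        f.write(b"tail")
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)      # advisory file lock
        try:
            f.seek(-4, os.SEEK_END)        # random access relative to end of file
            return f.read()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# Demonstrate on a throwaway local file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"head")
result = posix_style_update(tmp.name)
os.remove(tmp.name)
```

Each of the three commented calls is a place where such code would need rethinking before it could run over HDFS.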
