Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "TestingNov2009" page has been changed by SteveLoughran.
The comment on this change is: more on testing.
http://wiki.apache.org/hadoop/TestingNov2009?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  == Benchmarking ==
  
- One use case that comes up is stress testing clusters; to see the cluster 
supports Hadoop "as well as it should", and trying to find out why it doesn't, 
if it is not adequate. What we have today is [[Terasort]], where you have to 
guess the approximate numbers then run the job. Terasort creates its own test 
data, which is good, but it doesn't stress the CPUs as realistically as many 
workloads, and it generates lots of intermediate and final data; there is no 
reduction.
+ One use case that comes up is stress testing clusters: seeing whether the cluster supports Hadoop "as well as it should" and, if it does not, finding out why. What we have today is TeraSort, where you have to guess the approximate numbers and then run the job. TeraSort creates its own test data, which is good, but it doesn't stress the CPUs as realistically as many workloads, and it generates lots of intermediate and final data; there is no reduction.
  
   * [[http://www.slideshare.net/steve_l/benchmarking-1840029 | Benchmarking 
slides]]
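
  A minimal sketch of driving TeraGen/TeraSort from a harness rather than the command line, so a benchmark run can vary the data size; the row count and HDFS paths are illustrative assumptions, and the classes are assumed to implement Tool as in the stock examples JAR:
{{{#!java
// Hedged sketch: run TeraGen then TeraSort via ToolRunner so the harness
// can pick sizes per run. Row count and paths are placeholders.
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortStress {
  public static void main(String[] args) throws Exception {
    // 10 million 100-byte rows, roughly 1GB; tune to the cluster under test
    int rc = ToolRunner.run(new TeraGen(),
        new String[] {"10000000", "/benchmarks/tera-in"});
    if (rc != 0) throw new IllegalStateException("teragen failed: " + rc);
    rc = ToolRunner.run(new TeraSort(),
        new String[] {"/benchmarks/tera-in", "/benchmarks/tera-out"});
    if (rc != 0) throw new IllegalStateException("terasort failed: " + rc);
  }
}
}}}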
  
  == Basic Cluster Health Tests ==
  
- There are currently no tests that work with Hadoop via the web pages, no job 
submission and monitoring. It is in fact possible to bring up a Hadoop cluster 
in which JSP doesn't work, but the basic tests all appear well -even including 
TeraSort, provided you use the low-level APIs
+ There are currently no tests that work with Hadoop via the web pages: no job submission and monitoring. It is in fact possible to bring up a Hadoop cluster in which JSP doesn't work, yet all the basic tests appear to pass -even TeraSort, provided you use the low-level APIs.
+ 
+ Options:
+  * Create a set of JUnit/HtmlUnit tests that exercise the GUI; design these to run against any host. Either check out the source tree and run them against a remote cluster, or package the tests in a JAR and make this a project distributable (see the sketch below).
+  * We may need separate test JARs for HDFS and MapReduce.
+ 
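+ A minimal sketch of such a test, assuming HtmlUnit and JUnit 4 on the classpath. It probes the stock NameNode status page (dfshealth.jsp on port 50070); the "test.namenode.url" system property is an assumed convention so the same JAR can be pointed at any cluster:
{{{#!java
// Hedged sketch: verify the NameNode web UI actually renders via JSP.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class TestNameNodeWebUI {
  @Test
  public void testDfsHealthPageRenders() throws Exception {
    String url = System.getProperty("test.namenode.url",
        "http://localhost:50070/dfshealth.jsp");
    WebClient client = new WebClient();
    try {
      HtmlPage page = client.getPage(url);
      // A broken JSP setup yields an error page or exception, not this text
      assertTrue("No NameNode status found at " + url,
          page.asText().contains("NameNode"));
    } finally {
      client.closeAllWindows();
    }
  }
}
}}}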
  
  == Testing underlying platforms ==
  
  We need to test the underlying platforms, from the JVM and Linux 
distributions to any Infrastructure-on-Demand APIs that provide VMs on demand, 
machines which can run Hadoop.
  
+ === JVM Testing ===
+ 
+ An IBM need; this can also be used to qualify new Sun JVM releases. Any JVM defect which stops Hadoop running at scale should be viewed as a blocking issue by all JVM suppliers.
+ 
+  * Need to be able to install the latest JVM build, then run the stress tests; every run should record exactly which JVM it used (see the sketch below).
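+ 
+ A small sketch of that record-keeping, using only standard java.* system properties; print this at the start of every stress run so failures can be tied back to a specific vendor build:
{{{#!java
// Describe the JVM under test from standard system properties.
public class JvmUnderTest {
  public static String describe() {
    return System.getProperty("java.vm.vendor") + " "
        + System.getProperty("java.vm.name") + " "
        + System.getProperty("java.runtime.version")
        + " on " + System.getProperty("os.name");
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
}}}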
+ 
+ === OS Testing ===
+ 
+ Test Hadoop working on the target OS. If Hadoop is packaged in an OS-specific format (e.g. RPM), those installations need to be tested.
+ 
+  * Need to be able to create new machine images (PXE, kickstart, etc.), then 
push out Hadoop to the nodes and test the cluster.
+ 
+ === IaaS Testing ===
+ 
+ Hadoop can be used to stress test Infrastructure as a Service platforms, and 
is offered as a service by some companies (Cloudera, EC2).
+ 
+ Hadoop can be used on Eucalyptus installations using EC2 client libraries. This can show up problems with Eucalyptus (different fault messages compared to EC2, time zone/clock differences).
+ 
+ Other infrastructures will have different APIs, with different features (private subnets, machine restart and persistence).
+ 
+  * Need to be able to work with different infrastructures and unstable APIs. 
+  * Testing on EC2 runs up rapid bills if you create/destroy machines for every JUnit test method, or even every test run. It is best to create a small pool of machines at the start of the working day and release them in the evening, and to have build file targets that destroy all of a developer's machines -and to run those at night as part of the CI build (see the teardown sketch after this list).
+  * Troubleshooting on IaaS platforms can be interesting as the VMs get destroyed -the test runner needs to capture (relevant) local log data before teardown (see the log-capture sketch after this list).
+  * SSH is the primary way to communicate with the (long-haul) cluster, even from a developer's local machine.
+  * It is important not to embed private data -keys, logins- in build files or test reports.
+  * For testing local Hadoop builds on IaaS platforms, the build process needs to scp over and install the Hadoop binaries and the configuration files. This can be done by creating a new disk image that is then used to bootstrap every node, or by starting with a clean base image and copying in Hadoop on demand. The latter is much more agile and cost effective during iterative development, but doesn't scale to very large clusters (1000s of machines) unless you delegate the copy/install work to the first few tens of allocated machines. For EC2, one tactic is to upload the binaries to S3 and have scripts on the nodes copy down and install the files (see the copy-down sketch after this list).
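+ 
+ A hedged sketch of the nightly teardown target, assuming the AWS SDK for Java and an assumed (not standard) team convention of tagging every test instance with an "owner" tag:
{{{#!java
// Destroy all EC2 instances tagged owner=<name>. Credentials come from the
// SDK's default provider chain; the "owner" tag is a made-up convention.
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.*;
import java.util.ArrayList;
import java.util.List;

public class DestroyTestInstances {
  public static void main(String[] args) {
    String owner = args[0];
    AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
    DescribeInstancesResult result = ec2.describeInstances(
        new DescribeInstancesRequest().withFilters(
            new Filter("tag:owner").withValues(owner)));
    List<String> ids = new ArrayList<String>();
    for (Reservation r : result.getReservations()) {
      for (Instance i : r.getInstances()) {
        ids.add(i.getInstanceId());
      }
    }
    if (!ids.isEmpty()) {
      ec2.terminateInstances(new TerminateInstancesRequest(ids));
      System.out.println("Terminated " + ids.size() + " instances");
    }
  }
}
}}}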
+ 
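+ A sketch of SSH-driven log capture before teardown, assuming the JSch library; the host, user, key path, and log path are all placeholders:
{{{#!java
// Pull a remote Hadoop log over SSH before the VM is destroyed.
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.FileOutputStream;
import java.io.InputStream;

public class GrabLogs {
  public static void main(String[] args) throws Exception {
    String host = args[0];                       // e.g. an EC2 public hostname
    JSch jsch = new JSch();
    jsch.addIdentity("/home/dev/.ssh/test-cluster.pem");   // placeholder key
    Session session = jsch.getSession("hadoop", host, 22); // placeholder user
    session.setConfig("StrictHostKeyChecking", "no");      // test VMs only
    session.connect();
    ChannelExec exec = (ChannelExec) session.openChannel("exec");
    exec.setCommand("cat /var/log/hadoop/hadoop-namenode.log"); // placeholder
    InputStream in = exec.getInputStream();
    exec.connect();
    FileOutputStream out = new FileOutputStream(host + "-namenode.log");
    byte[] buf = new byte[4096];
    for (int n; (n = in.read(buf)) >= 0; ) {
      out.write(buf, 0, n);
    }
    out.close();
    exec.disconnect();
    session.disconnect();
  }
}
}}}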
+ 
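+ And a sketch of the S3 copy-down, again assuming the AWS SDK; the bucket and key are made-up names, and in practice this logic would live in each node's startup script:
{{{#!java
// Fetch a Hadoop build tarball from S3 onto the local node. Unpacking and
// installing are left to the node's own scripts.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import java.io.File;

public class FetchHadoopBuild {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    s3.getObject(
        new GetObjectRequest("example-test-builds",      // placeholder bucket
            "hadoop/hadoop-current.tar.gz"),             // placeholder key
        new File("/tmp/hadoop-current.tar.gz"));
  }
}
}}}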
  == Exploring the Hadoop Configuration Space ==
  
- There are a lot of Hadoop configuration options, even ignoring those of the 
underlying machines and network.
+ There are a lot of Hadoop configuration options, even ignoring those of the underlying machines and network. For example, what impact do block size and replication factor have on your workload? The sketch below shows one way to vary those two settings per test job.
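+ 
+ A minimal sketch, using the 0.20-era property names (dfs.block.size, dfs.replication); the values shown are arbitrary points in a sweep:
{{{#!java
// Build a job configuration with a given block size and replication factor.
import org.apache.hadoop.conf.Configuration;

public class ConfigSweep {
  public static Configuration withBlockSizeAndReplication(
      long blockSize, int replication) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", blockSize);    // e.g. 64MB vs 128MB
    conf.setInt("dfs.replication", replication);  // e.g. 2 vs 3
    return conf;
  }
}
}}}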
+ 
  
  == Testing applications that run on Hadoop ==
  
@@ -43, +77 @@

  
  This is a problem which Cloudera and others who distribute/internally package 
and deploy Hadoop have: you need to know that your RPMs or other 
redistributables work.
  
- It's similar to the cluster acceptance test problem, except that you need to 
create the distribution packages and install them on the remote machines, then 
run the tests.
+ It's similar to the cluster acceptance test problem, except that you need to create the distribution packages and install them on the remote machines, then run the tests. The testing-over-IaaS use cases above are the closest fit.
  
+  * Testing RPM upgrades from many past versions is tricky.
+ 
+ == Simulating Cluster Failures ==
+ 
+ Cluster failure handling -especially the loss of large portions of a large datacenter- is something that is not currently formally tested.
+ 
+  * Network failures can be simulated on some IaaS platforms.
+  * Forcibly killing processes is a more realistic approach which works on most platforms, though it is hard to choreograph (see the sketch below).
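+ 
+ A hedged sketch of the process-killing approach, shelling out to ssh from the test runner; passwordless SSH and the pkill pattern are assumptions:
{{{#!java
// Simulate a node failure by killing the DataNode process over SSH.
import java.io.IOException;

public class KillDataNode {
  public static void kill(String host) throws IOException, InterruptedException {
    // Pattern matches the DataNode main class in the remote Java command line.
    Process p = new ProcessBuilder("ssh", host, "pkill", "-9", "-f",
        "org.apache.hadoop.hdfs.server.datanode.DataNode")
        .inheritIO()
        .start();
    int rc = p.waitFor();
    if (rc != 0 && rc != 1) {  // pkill exits 1 when no process matched
      throw new IOException("ssh/pkill on " + host + " failed with " + rc);
    }
  }
}
}}}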
+ 
