Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "TestingNov2009" page has been changed by SteveLoughran.
The comment on this change is: Extra issues and past work.
http://wiki.apache.org/hadoop/TestingNov2009?action=diff&rev1=3&rev2=4

--------------------------------------------------

  Test Hadoop working on the target OS. If Hadoop is packaged in an OS specific 
format (e.g. RPM), those installations need to be tested. 
  
   * Need to be able to create new machine images (PXE, kickstart, etc.), then 
push out Hadoop to the nodes and test the cluster.
+  * Cluster setup times can be significant if you have to reboot and re-image 
physical machines.
  
  === IaaS Testing ===
  
@@ -53, +54 @@

  Other infrastructures will have different APIs, with different features 
(private subnets, machine restart and persistence)
  
   * Need to be able to work with different infrastructures and unstable APIs. 
+  * Machine allocation/release adds a significant delay to every test case that creates new machines.
   * Testing on EC2 runs up bills rapidly if you create/destroy machines for every JUnit test method, or even every test run. It is best to create a small pool of machines at the start of the working day and release them in the evening -and to have build file targets that destroy all of a developer's machines, run nightly as part of the CI build.
   * Troubleshooting on IaaS platforms can be interesting as the VMs get 
destroyed -the test runner needs to capture (relevant) local log data.
   * SSH is the primary way to communicate with the (long-haul) cluster, even 
from a developer's local machine.
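The machine-pool idea above can be sketched as a small allocator that hands out an already-running machine when one is free and only boots a new one when the pool is empty. This is an illustrative sketch, not real IaaS API code: the `MachinePool` class and the `boot_machine`/`destroy_machine` callables are hypothetical stand-ins for whatever EC2 (or other) allocation calls a real harness would make.

```python
# Hypothetical sketch of a test-suite machine pool. acquire() reuses an
# idle machine when possible; booting a new one (an expensive IaaS call
# in real life) is the slow path, capped at max_size.
class MachinePool:
    def __init__(self, boot_machine, max_size=4):
        self.boot_machine = boot_machine  # callable that allocates a real VM
        self.max_size = max_size
        self.idle = []
        self.booted = 0

    def acquire(self):
        if self.idle:
            return self.idle.pop()        # reuse: no allocation delay
        if self.booted >= self.max_size:
            raise RuntimeError("machine pool exhausted")
        self.booted += 1
        return self.boot_machine()        # slow path: boot a new machine

    def release(self, machine):
        self.idle.append(machine)         # keep it running for the next test

    def destroy_all(self, destroy_machine):
        # End-of-day / nightly CI target: tear down every idle machine so
        # nothing is left running (and billing) overnight.
        for machine in self.idle:
            destroy_machine(machine)
        self.idle.clear()
        self.booted = 0
```

A nightly CI target would simply call `destroy_all()` so no developer machines survive the night.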
@@ -63, +65 @@

  
  == Exploring the Hadoop Configuration Space ==
  
- There are a lot of Hadoop configuration options, even ignoring those of the 
underlying machines and network. For example, what impact does blocksize and 
replication factor have on your workload.
+ There are a lot of Hadoop configuration options, even ignoring those of the 
underlying machines and network. For example, what impact does blocksize and 
replication factor have on your workload? What different network card 
configuration parameters give the best performance? Which combinations of 
options break things?
  
+ When combined with IaaS platforms, the configuration space gets even larger.
+ 
+ Manually exploring the configuration space takes too long; currently everyone tries to stick close to the Yahoo! configurations, which are believed to work -whenever someone strays from them, interesting things happen. For example, setting a replication factor of only 2 found a duplication bug; running Hadoop on a machine that isn't quite sure of its hostname shows up other assumptions as things you cannot rely on.
+ 
+  * There is existing work on automated configuration testing, notably the 
work done by Adam Porter and colleagues on 
[[http://www.youtube.com/watch?v=r0nn40O3mCY | Distributed Continuous Quality 
Assurance]]
+  * (Steve says) at HP we've used a pseudo-RNG to drive transforms to the infrastructure and deployed applications; this explores some of the space and is somewhat replicable.
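The pseudo-RNG approach above can be sketched as a seeded sampler over a few configuration options. The option names below mirror real Hadoop keys, but the value ranges are illustrative assumptions; the point is that a fixed seed makes any failing configuration reproducible.

```python
import random

# Illustrative slice of the Hadoop configuration space. The value
# ranges here are assumptions for the sketch, not recommended settings.
CONFIG_SPACE = {
    "dfs.blocksize": [32 * 1024 * 1024, 64 * 1024 * 1024, 128 * 1024 * 1024],
    "dfs.replication": [1, 2, 3],
    "io.file.buffer.size": [4096, 65536, 131072],
}

def sample_configs(seed, count):
    """Draw `count` configurations; the same seed yields the same draws,
    so a configuration that broke something can be replayed exactly."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(values) for key, values in CONFIG_SPACE.items()}
        for _ in range(count)
    ]
```

Recording only the seed (rather than every generated configuration) is enough to re-run the exact sequence of cluster setups that triggered a failure.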
  
  == Testing applications that run on Hadoop ==
  
@@ -84, +92 @@

  
  == Simulating Cluster Failures ==
  
- Cluster failure handling -especially the loss of large portions of a large 
datacenter, is something that is not currently formally tested. 
+ Cluster failure handling -especially the loss of large portions of a large datacenter- is something that is not currently formally tested. There are bug fixes that go into Hadoop to address some of this, but the loss of a quarter of the datanodes is a disaster that doesn't get tested at scale before a release is made.
  
-  * Network failures can be simulated on some IaaS platforms
+  * Network failures can be simulated on some IaaS platforms just by breaking 
a virtual link
   * Forcibly killing processes is a more realistic approach which works on 
most platforms, though it is hard to choreograph
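The process-killing approach above can be sketched as a small, reproducible failure injector: pick a seeded random fraction of the datanodes and kill each one through a supplied function. The `kill_fn` callable is a hypothetical stand-in; in a real harness it might run `kill -9` over SSH or break a virtual network link, neither of which is shown here.

```python
import random

def kill_fraction(datanodes, fraction, seed, kill_fn):
    """Kill a seeded-random `fraction` of `datanodes` via `kill_fn`.
    Using a seeded RNG makes the 'choreography' replayable: the same
    seed always selects the same victims."""
    rng = random.Random(seed)
    count = max(1, int(len(datanodes) * fraction))
    victims = rng.sample(datanodes, count)
    for node in victims:
        kill_fn(node)  # e.g. kill -9 over SSH in a real harness
    return victims
```

Killing a quarter of a 20-node cluster is then `kill_fraction(nodes, 0.25, seed, kill_fn)`, and re-running with the same seed reproduces the same disaster.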
  
