I dispatched each unit test individually to 20 EC2 c1.mediums (64-bit system, 2 VCPUs, kind of slow on purpose but still allowing some thread concurrency). On the instance, each test was run 100 times or until failure. For each iteration, after Maven exited, the process table was checked to see if any surefire processes lingered, and if so the test would also be reported as failed.
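For anyone who wants to reproduce this, the per-test loop described above could be sketched roughly as follows. This is not the actual harness: the function names (maven_test, lingering_surefire, run_until_failure) are made up for illustration, and it assumes the test is selected with surefire's -Dtest= property and that a leftover fork can be spotted by the word "surefire" in its command line.

```python
import subprocess

MAX_ITERATIONS = 100  # per the description: 100 runs or until the first failure


def maven_test(test_name):
    """Run one test class through Maven; True if Maven exited with status 0.

    Assumption: the class is selected via surefire's -Dtest= property.
    """
    result = subprocess.run(["mvn", "test", "-Dtest=" + test_name])
    return result.returncode == 0


def lingering_surefire():
    """True if any process command line in the process table mentions surefire.

    Assumption: a leftover forked test JVM is recognizable by that substring.
    """
    ps = subprocess.run(["ps", "-eo", "args"], capture_output=True, text=True)
    return any("surefire" in line for line in ps.stdout.splitlines())


def run_until_failure(test_name, run_test=maven_test,
                      has_lingering=lingering_surefire,
                      max_iterations=MAX_ITERATIONS):
    """Repeat one test; return (passed, iteration, reason)."""
    for i in range(1, max_iterations + 1):
        if not run_test(test_name):
            return (False, i, "test failed")
        # Checked only after Maven has exited, as in the description.
        if has_lingering():
            return (False, i, "surefire process lingered")
    return (True, max_iterations, "ok")
```

The two checks are passed in as parameters so the loop can be exercised with stand-ins, without Maven installed.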
OS: Amazon Linux AMI release 2012.03

uname: Linux 3.2.21-1.32.6.amzn1.x86_64 #1 SMP Sat Jun 23 02:32:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

JVM:
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.3) (amazon-52.1.11.3.45.amzn1-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Here are the tests that failed to complete successful runs in the above:

TestCatalogTracker
  hangs on a join in testServerNotRunningIOException waiting on a CT that is stuck on CatalogTracker.waitForMeta and will linger in the background

TestColumnSeeking
  testDuplicateVersions(org.apache.hadoop.hbase.regionserver.TestColumnSeeking): expected:<0> but was:<200>

TestAtomicOperation
  testMultiRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation): expected:<0> but was:<1>

TestSplitLogManager
  testOrphanTaskAcquisition(org.apache.hadoop.hbase.master.TestSplitLogManager): java.lang.AssertionError

TestRegionRebalancing
  testRebalanceOnRegionServerNumberChange(org.apache.hadoop.hbase.TestRegionRebalancing): After 5 attempts, region assignments were not balanced.

TestDrainingServer
  junit.framework.AssertionFailedError from org.apache.hadoop.hbase.TestDrainingServer.setUpBeforeClass

TestMasterObserver
  testTableOperations(org.apache.hadoop.hbase.coprocessor.TestMasterObserver): org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist
  testRegionTransitionOperations(org.apache.hadoop.hbase.coprocessor.TestMasterObserver): org.apache.hadoop.hbase.TableExistsException: observed_table

TestServerCustomProtocol
  testSingleMethod(org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol): Results should contain region test,bbb,1342509423473.2c0326188f899f3e91ec5eb623959c13. for row 'bbb'

TestFromClientSide
  testPoolBehavior(org.apache.hadoop.hbase.client.TestFromClientSide): expected:<3> but was:<4>

TestZooKeeper
  testClientSessionExpired(org.apache.hadoop.hbase.TestZooKeeper)

TestReplication
  testDisableInactivePeer(org.apache.hadoop.hbase.replication.TestReplication): Shutting down

TestMasterReplication
  testSimplePutDelete(org.apache.hadoop.hbase.replication.TestMasterReplication): Waited too much time for put replication

TestMultiSlaveReplication
  testMultiSlaveReplication(org.apache.hadoop.hbase.replication.TestMultiSlaveReplication): Unable to add peer

TestReplicationPeer
  testResetZooKeeperSession(org.apache.hadoop.hbase.replication.TestReplicationPeer): ReplicationPeer ZooKeeper session was not properly expired.

I didn't get to them all before AWS yanked back my spot instances, but I ordered the list from most likely to fail to least likely. The remaining tests were in io.hfile.*, thrift.*, and util.*. I'll circle back, confirm each individually, and open JIRAs with more detail.

The cluster of replication test failures is a concern, but I've seen in other environments such as this one that the tests are timing dependent. On a slow or busy test system they can fail with "waited too much time ...". One solution would be to stop using the system clock and instead use EnvironmentEdge, or something like it, that is incremented only when the test process has CPU time. I haven't looked into this in detail yet.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)