I dispatched each unit test individually to 20 EC2 c1.medium instances
(64-bit system, 2 VCPUs, kind of slow on purpose but still allowing
some thread concurrency). On the instance, each test was run 100 times
or until failure. After each iteration, once Maven exited, the process
table was checked for lingering surefire processes; if any were found,
the test was also reported as failed.
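
The per-test loop described above could be sketched roughly as follows. This is a minimal Java illustration, not the harness actually used for these runs: the class name RepeatUntilFailure, the "mvn test -Dtest=..." invocation, and the "ps -eo args" process-table check are all assumptions about one way to do it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Hypothetical harness: repeat one test until it fails or maxRuns is reached.
final class RepeatUntilFailure {

    /** Outcome of repeating one test. failure is null if every run was clean. */
    static final class Result {
        final int iterations;
        final String failure;

        Result(int iterations, String failure) {
            this.iterations = iterations;
            this.failure = failure;
        }
    }

    /** Runs the test up to maxRuns times, stopping at the first failure. */
    static Result repeat(int maxRuns, IntSupplier runTest,
                         Supplier<List<String>> listProcesses) {
        for (int i = 1; i <= maxRuns; i++) {
            int exitCode = runTest.getAsInt();
            if (exitCode != 0) {
                return new Result(i, "maven exited with " + exitCode);
            }
            // A leftover surefire JVM also counts as a failure.
            for (String commandLine : listProcesses.get()) {
                if (commandLine.contains("surefire")) {
                    return new Result(i, "lingering process: " + commandLine);
                }
            }
        }
        return new Result(maxRuns, null);
    }

    /** Assumed invocation: run a single test class through Maven. */
    static int runMaven(String testName) {
        try {
            return new ProcessBuilder("mvn", "test", "-Dtest=" + testName)
                    .inheritIO().start().waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    /** Assumed process-table check: one command line per process via ps. */
    static List<String> listProcesses() {
        List<String> lines = new ArrayList<>();
        try {
            Process ps = new ProcessBuilder("ps", "-eo", "args").start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(ps.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            ps.waitFor();
        } catch (IOException e) {
            // Treat an unreadable process table as empty in this sketch.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lines;
    }

    public static void main(String[] args) {
        Result r = repeat(100, () -> runMaven(args[0]),
                RepeatUntilFailure::listProcesses);
        System.out.println(args[0] + ": " + r.iterations + " runs, "
                + (r.failure == null ? "no failures" : r.failure));
    }
}
```

Passing the Maven runner and the process lister in as parameters keeps the loop itself testable without Maven or a real process table.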

OS: Amazon Linux AMI release 2012.03
uname: Linux 3.2.21-1.32.6.amzn1.x86_64 #1 SMP Sat Jun 23 02:32:15 UTC
    2012 x86_64 x86_64 x86_64 GNU/Linux
JVM: java version "1.6.0_24"
    OpenJDK Runtime Environment (IcedTea6 1.11.3)
    (amazon-52.1.11.3.45.amzn1-x86_64)
    OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Here are the tests that did not complete all of their runs successfully
in the environment above:

TestCatalogTracker
    Hangs on a join in testServerNotRunningIOException, waiting on a
    CatalogTracker that is stuck in CatalogTracker.waitForMeta and will
    linger in the background

TestColumnSeeking
    testDuplicateVersions(org.apache.hadoop.hbase.regionserver.TestColumnSeeking):
    expected:<0> but was:<200>

TestAtomicOperation
    testMultiRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation):
    expected:<0> but was:<1>

TestSplitLogManager
    testOrphanTaskAcquisition(org.apache.hadoop.hbase.master.TestSplitLogManager):
    java.lang.AssertionError

TestRegionRebalancing
    testRebalanceOnRegionServerNumberChange(org.apache.hadoop.hbase.TestRegionRebalancing):
    After 5 attempts, region assignments were not balanced.

TestDrainingServer
    junit.framework.AssertionFailedError from
    org.apache.hadoop.hbase.TestDrainingServer.setUpBeforeClass

TestMasterObserver
    testTableOperations(org.apache.hadoop.hbase.coprocessor.TestMasterObserver):
    org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family
    'fam2' does not exist
    testRegionTransitionOperations(org.apache.hadoop.hbase.coprocessor.TestMasterObserver):
    org.apache.hadoop.hbase.TableExistsException: observed_table

TestServerCustomProtocol
    testSingleMethod(org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol):
    Results should contain region
    test,bbb,1342509423473.2c0326188f899f3e91ec5eb623959c13. for row 'bbb'

TestFromClientSide
    testPoolBehavior(org.apache.hadoop.hbase.client.TestFromClientSide):
    expected:<3> but was:<4>

TestZooKeeper
    testClientSessionExpired(org.apache.hadoop.hbase.TestZooKeeper)

TestReplication
    testDisableInactivePeer(org.apache.hadoop.hbase.replication.TestReplication):
    Shutting down

TestMasterReplication
    testSimplePutDelete(org.apache.hadoop.hbase.replication.TestMasterReplication):
    Waited too much time for put replication

TestMultiSlaveReplication
    testMultiSlaveReplication(org.apache.hadoop.hbase.replication.TestMultiSlaveReplication):
    Unable to add peer

TestReplicationPeer
    testResetZooKeeperSession(org.apache.hadoop.hbase.replication.TestReplicationPeer):
    ReplicationPeer ZooKeeper session was not properly expired.

I didn't get to them all before AWS yanked back my spot instances, but
I ordered the list from most likely to fail to least likely. The
remaining tests were in io.hfile.*, thrift.*, and util.*.

I'll circle back, confirm each individually, and open JIRAs with more detail.

The cluster of replication test failures is a concern, but I've seen
in other environments like this one that those tests are timing
dependent. On a slow or busy test system they can fail with "waited
too much time ...". One possible solution is to stop using the system
clock and instead use EnvironmentEdge, or something like it, that
advances only when the test process has CPU time. I haven't looked
into this in detail yet.
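
To make the clock idea concrete, here is a rough, self-contained Java sketch. It does not use HBase's EnvironmentEdge API: TestClock, CpuTimeClock, and DeadlineWait are hypothetical names made up for illustration. It also measures CPU time of the calling thread only (via the standard ThreadMXBean), which is a simplification of "the test process has CPU time".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.function.BooleanSupplier;

/** Hypothetical injectable time source; not the HBase EnvironmentEdge interface. */
interface TestClock {
    long currentTimeMillis();
}

/** Reports CPU time consumed by the calling thread instead of wall-clock time. */
final class CpuTimeClock implements TestClock {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    @Override
    public long currentTimeMillis() {
        // Nanoseconds of CPU used by this thread, or -1 if measurement is disabled.
        long nanos = threads.getCurrentThreadCpuTime();
        if (nanos < 0) {
            throw new IllegalStateException("thread CPU time is not available");
        }
        return nanos / 1000000L;
    }
}

final class DeadlineWait {
    /** Polls the condition until it holds or the timeout elapses on the given clock. */
    static boolean waitFor(TestClock clock, long timeoutMillis,
                           BooleanSupplier condition) {
        long deadline = clock.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (clock.currentTimeMillis() >= deadline) {
                return false; // timed out according to the injected clock
            }
            Thread.yield(); // keep polling; sleeping would not advance a CPU clock
        }
        return true;
    }
}
```

A test would call something like DeadlineWait.waitFor(new CpuTimeClock(), 10000, condition) so that time the thread spends descheduled on a busy box does not count against the timeout. The trade-off is that the waiter has to poll rather than sleep, since a sleeping thread accrues no CPU time.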

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)
