Here is another, from the tail of https://issues.apache.org/jira/browse/HBASE-5995:
2013-07-23 01:23:29,574 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart Thread=39 (was 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 368), ProcessCount=144 (was 142) - ProcessCount LEAK? -, AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)

This one showed up as a zombie too; stuck.

Or here, https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/, where we'd had a nice run of passing tests when, all of a sudden, a test that I've not seen fail before fails:

https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/4282/

org.apache.hadoop.hbase.master.TestActiveMasterManager.testActiveMasterManagerFromZK

Near the end of the test, the resource checker reports:

- Thread LEAK? -, OpenFileDescriptor=100 (was 92) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=328 (was 331), ProcessCount=138 (was 138), AvailableMemoryMB=1223 (was 1246), ConnectionCount=0 (was 0)

[A sketch of this kind of before/after resource accounting appears after the quoted thread below.]

Getting tests to pass on these build boxes (other than hadoopqa, which is a different set of machines) seems unattainable. I will write infra about the 40k file descriptor limit to see if they can do something about that.

St.Ack

On Mon, Jul 22, 2013 at 9:13 PM, Stack <[email protected]> wrote:
> By way of illustration of how loaded Apache build boxes can be:
>
> Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was 892), ConnectionCount=0 (was 0)
>
> This seems to have caused a test that usually passes to fail:
> https://issues.apache.org/jira/browse/HBASE-9023
>
> St.Ack
>
> On Mon, Jul 22, 2013 at 11:49 AM, Stack <[email protected]> wrote:
>
>> Below is the state of the hbase 0.95/trunk unit tests (includes a little taxonomy of test failure type definitions).
>>
>> On Andrew's ec2 build box, 0.95 is passing most of the time:
>>
>> http://54.241.6.143/job/HBase-0.95/
>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>>
>> It is not as good on the Apache build boxes, but it is getting better:
>>
>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>>
>> On Apache, I have seen loads up in the 500s and all file descriptors used, according to the little resources report printed at the end of each test. If these numbers are to be believed (TBD), we may never achieve a 100% pass rate on Apache builds.
>>
>> Andrew's ec2 builds run the integration tests too, where the Apache builds do not -- sometimes we'll fail an integration test run, which makes the Andrew ec2 red/green ratio look worse than it actually is.
>>
>> Trunk builds lag. They are being worked on.
>>
>> We seem to be over the worst of the flakey unit tests. We have a few stragglers still, but they are being hunted down by the likes of the merciless Jimmy Xiang and Jeffrey Zhong.
>>
>> The "zombies" have mostly been nailed too (where "zombies" are tests that refuse to die, continuing after the suite has completed and causing the build to fail). The zombie trap from test-patch.sh was ported over to the Apache and ec2 builds and it caught the last of the undying. [A sketch of such a post-suite sweep appears below.]
>>
>> We are now into a new phase where "all" tests pass but the build still fails.
>> Here is an example:
>>
>> http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
>>
>> The only clue I have to go on is the fact that when we fail, the number of tests run is less than the total that shows for a successful run.
>>
>> Unless anyone has a better idea, to figure out why the hang I compare the list of tests that show in a good run against those of a bad run. Tests that are in the good run but missing from the bad run are deemed suspect. In the absence of other evidence or other ideas, I am blaming these "invisibles" for the build fail. [A sketch of this good-run/bad-run comparison appears below.]
>>
>> Here is an example:
>>
>> This is a good 0.95 hadoop2 run (notice how we are running integration tests tooooo and they succeed!! On hadoop2!!!!):
>>
>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>>
>> In the hbase-server module:
>>
>> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>>
>> This is a bad run:
>>
>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>>
>> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>>
>> If I compare tests, the successful run has:
>>
>> > Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>>
>> ... where the bad run does not show the above test. TestHLogSplitCompressed has 34 tests, one of which is disabled, so that would seem to account for the discrepancy (33 fewer tests run and one fewer skipped).
>>
>> I've started to disable tests that fail like this, putting them aside for the original authors or anyone interested to take a look at why they fail occasionally. I put them aside so we can enjoy passing builds in the meantime. I've already moved aside or disabled a few tests and test classes:
>>
>> TestMultiTableInputFormat
>> TestReplicationKillSlaveRS
>> TestHCM.testDeleteForZKConnLeak was disabled
>>
>> ... and a few others.
>>
>> Finally (if you are still reading), I would suggest that test failures in hadoopqa are now more worthy of investigation. Illustrative is what happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections", where the patch had +1s and, on its first run, a unit test failed (though it passed locally). The second run obscured the first run's failure. After digging by another contributor, it turned out the patch had actually broken the failed test (though it looked unrelated). I would suggest that, now that tests are healthier, test failures are worth paying more attention to.
>>
>> Yours,
>> St.Ack
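
For readers unfamiliar with the report format quoted at the top: it is a before/after snapshot of thread count, open file descriptors, system load, and so on, taken around each test. Below is a minimal sketch of that kind of accounting, assuming a JVM that exposes com.sun.management.UnixOperatingSystemMXBean; the ResourceSnapshot class and its method names are made up for illustration and are not HBase's actual ResourceChecker.

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import java.lang.management.ThreadMXBean;

    // Illustrative only: not HBase's ResourceChecker, just the same idea of
    // snapshotting a few JVM/OS counters before and after a test and flagging growth.
    public class ResourceSnapshot {
      final int threads;
      final long openFds;        // -1 if the platform bean does not expose the count
      final double loadAverage;

      ResourceSnapshot() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
        threads = threadBean.getThreadCount();
        loadAverage = osBean.getSystemLoadAverage();
        long fds = -1;
        if (osBean instanceof com.sun.management.UnixOperatingSystemMXBean) {
          fds = ((com.sun.management.UnixOperatingSystemMXBean) osBean).getOpenFileDescriptorCount();
        }
        openFds = fds;
      }

      // Prints a line in the same spirit as the reports quoted above.
      static void report(String test, ResourceSnapshot before, ResourceSnapshot after) {
        System.out.printf("after: %s Thread=%d (was %d)%s, OpenFileDescriptor=%d (was %d)%s, "
                + "SystemLoadAverage=%.0f (was %.0f)%n",
            test,
            after.threads, before.threads,
            after.threads > before.threads ? " - Thread LEAK? -" : "",
            after.openFds, before.openFds,
            after.openFds > before.openFds ? " - OpenFileDescriptor LEAK? -" : "",
            after.loadAverage, before.loadAverage);
      }

      public static void main(String[] args) {
        ResourceSnapshot before = new ResourceSnapshot();
        // ... the test would run here ...
        ResourceSnapshot after = new ResourceSnapshot();
        report("example.TestSomething", before, after);
      }
    }

A real checker would also track process count and available memory, and would typically hang off a JUnit rule or run listener so every test gets the same bookkeeping.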
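
The "zombie trap" mentioned in the thread is, in spirit, a post-suite sweep for test JVMs that are still alive. The sketch below is only an approximation of that idea, not the actual test-patch.sh logic; it assumes forked Surefire test JVMs can be recognized by a "surefire" booter entry in jps -l output.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Illustrative only: not the actual test-patch.sh zombie trap. After the suite
    // claims to be done, list local JVMs and flag any that look like forked
    // Surefire test JVMs (their jps entry mentions the surefire booter).
    public class ZombieCheck {
      public static void main(String[] args) throws Exception {
        Process jps = new ProcessBuilder("jps", "-l").start();  // prints "<pid> <main class or jar>"
        int zombies = 0;
        try (BufferedReader out = new BufferedReader(new InputStreamReader(jps.getInputStream()))) {
          String line;
          while ((line = out.readLine()) != null) {
            if (line.contains("surefire")) {      // assumption: forked test JVMs match this
              System.out.println("Possible zombie test JVM: " + line);
              zombies++;
            }
          }
        }
        jps.waitFor();
        System.exit(zombies == 0 ? 0 : 1);        // a non-zero exit would fail the build step
      }
    }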
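
The good-run vs. bad-run comparison described above can be mechanized by pulling the "Running <class>" lines Surefire prints out of each build's console log and diffing the two sets. A rough sketch, with placeholder file names for the two saved console logs:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative only: diff the "Running <class>" lines Surefire prints in two
    // saved console logs. The file names below are placeholders.
    public class InvisibleTests {
      private static final Pattern RUNNING = Pattern.compile("^Running (\\S+)$");

      static Set<String> testClasses(String consoleLog) throws IOException {
        Set<String> classes = new LinkedHashSet<String>();
        for (String line : Files.readAllLines(Paths.get(consoleLog))) {
          Matcher m = RUNNING.matcher(line.trim());
          if (m.matches()) {
            classes.add(m.group(1));
          }
        }
        return classes;
      }

      public static void main(String[] args) throws IOException {
        Set<String> good = testClasses("good-run-console.txt");  // e.g. the build 669 console
        Set<String> bad = testClasses("bad-run-console.txt");    // e.g. the build 668 console
        good.removeAll(bad);                                     // what's left are the "invisibles"
        for (String suspect : good) {
          System.out.println("Ran in the good build only: " + suspect);
        }
      }
    }

Whatever this prints are the "invisibles": suites that show in the good run's console but never appear in the bad one's, which is where the hunt for the hang would start.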
