Here is another, from the tail of https://issues.apache.org/jira/browse/HBASE-5995:
2013-07-23 01:23:29,574 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart Thread=39 (was 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 368), ProcessCount=144 (was 142) - ProcessCount LEAK? -, AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)

This one showed up as a zombie too; stuck.

Or here, https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/, where we'd had a nice run of passing tests when, all of a sudden, a test that I've not seen fail before fails:

https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/4282/

org.apache.hadoop.hbase.master.TestActiveMasterManager.testActiveMasterManagerFromZK

Near the end of the test, the resource checker reports:

- Thread LEAK? -, OpenFileDescriptor=100 (was 92) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=328 (was 331), ProcessCount=138 (was 138), AvailableMemoryMB=1223 (was 1246), ConnectionCount=0 (was 0)

[A sketch of this kind of before/after resource accounting appears after the quoted thread below.]

Getting tests to pass on these build boxes (other than hadoopqa, which is a different set of machines) seems unattainable. I will write infra about the 40k file descriptor limit to see if they can do something about that.

St.Ack

On Mon, Jul 22, 2013 at 9:13 PM, Stack <[email protected]> wrote:
> By way of illustration of how loaded Apache build boxes can be:
>
> Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was 892), ConnectionCount=0 (was 0)
>
> This seems to have caused a test that usually passes to fail:
> https://issues.apache.org/jira/browse/HBASE-9023
>
> St.Ack
>
> On Mon, Jul 22, 2013 at 11:49 AM, Stack <[email protected]> wrote:
>
>> Below is the state of the hbase 0.95/trunk unit tests (includes a little taxonomy of test failure type definitions).
>>
>> On Andrew's ec2 build box, 0.95 is passing most of the time:
>>
>> http://54.241.6.143/job/HBase-0.95/
>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>>
>> It is not as good on the Apache build boxes, but it is getting better:
>>
>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>>
>> On Apache, I have seen loads up in the 500s and all file descriptors used, according to the little resources report printed at the end of each test. If these numbers are to be believed (TBD), we may never achieve a 100% pass rate on Apache builds.
>>
>> Andrew's ec2 builds run the integration tests too, where the Apache builds do not -- sometimes we'll fail an integration test run, which makes the Andrew ec2 red/green ratio look worse than it actually is.
>>
>> Trunk builds lag. They are being worked on.
>>
>> We seem to be over the worst of the flakey unit tests. We have a few stragglers still, but they are being hunted down by the likes of the merciless Jimmy Xiang and Jeffrey Zhong.
>>
>> The "zombies" have mostly been nailed too (where "zombies" are tests that refuse to die, continuing after the suite has completed and causing the build to fail). The zombie trap from test-patch.sh was ported over to the Apache and ec2 builds and it caught the last of the undying. [A sketch of such a post-suite sweep appears below.]
>>
>> We are now into a new phase where "all" tests pass but the build still fails.
>> Here is an example:
>>
>> http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
>>
>> The only clue I have to go on is the fact that when we fail, the number of tests run is less than the total that shows for a successful run.
>>
>> Unless anyone has a better idea, to figure out why the hang I compare the list of tests that show in a good run against those of a bad run. Tests that are in the good run but missing from the bad run are deemed suspect. In the absence of other evidence or other ideas, I am blaming these "invisibles" for the build fail. [A sketch of this good-run/bad-run comparison appears below.]
>>
>> Here is an example:
>>
>> This is a good 0.95 hadoop2 run (notice how we are running integration tests tooooo and they succeed!! On hadoop2!!!!):
>>
>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>>
>> In the hbase-server module:
>>
>> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>>
>> This is a bad run:
>>
>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>>
>> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>>
>> If I compare tests, the successful run has:
>>
>> > Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>>
>> ... where the bad run does not show the above test. TestHLogSplitCompressed has 34 tests, one of which is disabled, so that would seem to account for the discrepancy (33 fewer tests run and one fewer skipped).
>>
>> I've started to disable tests that fail like this, putting them aside for the original authors or anyone interested to take a look at why they fail occasionally. I put them aside so we can enjoy passing builds in the meantime. I've already moved aside or disabled a few tests and test classes:
>>
>> TestMultiTableInputFormat
>> TestReplicationKillSlaveRS
>> TestHCM.testDeleteForZKConnLeak was disabled
>>
>> ... and a few others.
>>
>> Finally (if you are still reading), I would suggest that test failures in hadoopqa are now more worthy of investigation. Illustrative is what happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections", where the patch had +1s and, on its first run, a unit test failed (though it passed locally). The second run obscured the first run's failure. After digging by another contributor, it turned out the patch had actually broken the failed test (though it looked unrelated). I would suggest that, now that tests are healthier, test failures are worth paying more attention to.
>>
>> Yours,
>> St.Ack
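
For readers unfamiliar with the report format quoted at the top: it is a before/after snapshot of thread count, open file descriptors, system load, and so on, taken around each test. Below is a minimal sketch of that kind of accounting, assuming a JVM that exposes com.sun.management.UnixOperatingSystemMXBean; the ResourceSnapshot class and its method names are made up for illustration and are not HBase's actual ResourceChecker.

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import java.lang.management.ThreadMXBean;

    // Illustrative only: not HBase's ResourceChecker, just the same idea of
    // snapshotting a few JVM/OS counters before and after a test and flagging growth.
    public class ResourceSnapshot {
      final int threads;
      final long openFds;        // -1 if the platform bean does not expose the count
      final double loadAverage;

      ResourceSnapshot() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
        threads = threadBean.getThreadCount();
        loadAverage = osBean.getSystemLoadAverage();
        long fds = -1;
        if (osBean instanceof com.sun.management.UnixOperatingSystemMXBean) {
          fds = ((com.sun.management.UnixOperatingSystemMXBean) osBean).getOpenFileDescriptorCount();
        }
        openFds = fds;
      }

      // Prints a line in the same spirit as the reports quoted above.
      static void report(String test, ResourceSnapshot before, ResourceSnapshot after) {
        System.out.printf("after: %s Thread=%d (was %d)%s, OpenFileDescriptor=%d (was %d)%s, "
                + "SystemLoadAverage=%.0f (was %.0f)%n",
            test,
            after.threads, before.threads,
            after.threads > before.threads ? " - Thread LEAK? -" : "",
            after.openFds, before.openFds,
            after.openFds > before.openFds ? " - OpenFileDescriptor LEAK? -" : "",
            after.loadAverage, before.loadAverage);
      }

      public static void main(String[] args) {
        ResourceSnapshot before = new ResourceSnapshot();
        // ... the test would run here ...
        ResourceSnapshot after = new ResourceSnapshot();
        report("example.TestSomething", before, after);
      }
    }

A real checker would also track process count and available memory, and would typically hang off a JUnit rule or run listener so every test gets the same bookkeeping.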
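
The "zombie trap" mentioned in the thread is, in spirit, a post-suite sweep for test JVMs that are still alive. The sketch below is only an approximation of that idea, not the actual test-patch.sh logic; it assumes forked Surefire test JVMs can be recognized by a "surefire" booter entry in jps -l output.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Illustrative only: not the actual test-patch.sh zombie trap. After the suite
    // claims to be done, list local JVMs and flag any that look like forked
    // Surefire test JVMs (their jps entry mentions the surefire booter).
    public class ZombieCheck {
      public static void main(String[] args) throws Exception {
        Process jps = new ProcessBuilder("jps", "-l").start();  // prints "<pid> <main class or jar>"
        int zombies = 0;
        try (BufferedReader out = new BufferedReader(new InputStreamReader(jps.getInputStream()))) {
          String line;
          while ((line = out.readLine()) != null) {
            if (line.contains("surefire")) {      // assumption: forked test JVMs match this
              System.out.println("Possible zombie test JVM: " + line);
              zombies++;
            }
          }
        }
        jps.waitFor();
        System.exit(zombies == 0 ? 0 : 1);        // a non-zero exit would fail the build step
      }
    }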
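
The good-run vs. bad-run comparison described above can be mechanized by pulling the "Running <class>" lines Surefire prints out of each build's console log and diffing the two sets. A rough sketch, with placeholder file names for the two saved console logs:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative only: diff the "Running <class>" lines Surefire prints in two
    // saved console logs. The file names below are placeholders.
    public class InvisibleTests {
      private static final Pattern RUNNING = Pattern.compile("^Running (\\S+)$");

      static Set<String> testClasses(String consoleLog) throws IOException {
        Set<String> classes = new LinkedHashSet<String>();
        for (String line : Files.readAllLines(Paths.get(consoleLog))) {
          Matcher m = RUNNING.matcher(line.trim());
          if (m.matches()) {
            classes.add(m.group(1));
          }
        }
        return classes;
      }

      public static void main(String[] args) throws IOException {
        Set<String> good = testClasses("good-run-console.txt");  // e.g. the build 669 console
        Set<String> bad = testClasses("bad-run-console.txt");    // e.g. the build 668 console
        good.removeAll(bad);                                     // what's left are the "invisibles"
        for (String suspect : good) {
          System.out.println("Ran in the good build only: " + suspect);
        }
      }
    }

Whatever this prints are the "invisibles": suites that show in the good run's console but never appear in the bad one's, which is where the hunt for the hang would start.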
