Hi, Pablo!

On May 21, Pablo Estrada wrote:
> Hello Sergei and all,
> First of all, I'll explain quickly the terms that I was using:
>
>    - *test_suite, test suite, test case* - When I say test suite or test
>    case, I am referring to a single test file. For instance
>    '*pbxt.group_min_max*'. They are the ones that fail, and whose failures
>    we want to attempt to predict.

May I suggest distinguishing between a test *suite* and a test *case*? The
latter is usually a single test file, but a suite (for mtr) is a directory
with many test files, like "main", "pbxt", etc.

>    - *test_run, test run* - When I use this term, I refer to an entry in
>    the *test_run* table of the database. A test run is a set of
>    *test_suites* that run together at a certain time.
>
> I have in place now a basic script to do the simulations. I have tried to
> keep the code clear, and I will upload a repository to github soon.
> I have already run simulations on the data. The simulations used 2000
> test_runs as training data, and then attempted to predict behavior on the
> following 3000 test_runs. Of course, maybe a wider spectrum of data might
> be needed to truly assess the algorithm.
>
> I used four different ways to calculate a 'relevancy index' for a test:
>
>    1. Keep a relevancy index by test case
>    2. Keep a relevancy index by test case by platform
>    3. Keep a relevancy index by test case by branch
>    4. Keep a relevancy index by test case by branch by platform (mixed)
>
> I graphed the results. The graph is attached. As can be seen from the
> graph, the platform and the mixed model proved to be the best for recall.
> I feel the results were quite similar to what Sergei encountered.

Right.

> I have not run the tests on a larger set of data (the data dump that I have
> available contains 200,000 test_runs, so in theory I could test the
> algorithm with all this data)... I feel that I want to consider a couple
> things before going on to big testing:
>
> I feel that there is a bit of a potential fallacy in the model that I'm
> following. Here's why:
> The problem that I find in the model is that we don't know a priori when a
> test will fail for the first time. Strictly speaking, in the model, if a
> test doesn't fail for the first time, it never starts running at all.
> In the implementation that I made, I am using the first failure of each test
> to start giving it a relevancy index (so the test would have to fail before
> it even qualifies to run).
> This results in a really high recall rate because it is natural that if a
> test fails once, it might fail pretty soon after, so although we might have
> missed the first failure, we still consider that we didn't miss it, and
> based on it we will catch the two or three failures that come right after.
> This inflates the recall rate of 'subsequent' failures, but it is not very
> helpful when trying to catch failures that are not part of a trend... I
> feel this is not realistic.
>
> Here are changes that I'd like to incorporate to the model:
>
>    1. The failure rate should stay, and should still be measured with
>    exponential decay or weighted average
>    2. Include a new measure that increases relevancy: Time since last run.
>    The relevancy index should have a component that makes the test more
>    relevant the longer it spends not running
>       1. A problem with this is that *test suites* that might have stopped
>       being used will stay and compete for resources, although in reality
>       they would not be relevant anymore
>    3. Include also correlation. I still don't have a great idea of how
>    correlation will be considered, but it's something like this:
>       1. The data contains the list of test_runs where each test_suite has
>       failed. If two test suites have failed together a certain percentage
>       of times (>30%?), then when test A fails, the relevancy index of
>       test B also goes up... and when test A runs without failing, the
>       relevancy index of test B goes down too.
>       2. Using only the times that tests fail together seems like a good
>       heuristic, without having to calculate the total correlation of all
>       the history of all the combinations of tests.
>
> If these measures were to be incorporated, a couple of changes would also
> have to be considered:
>
>    1. Failures that are *not spotted* on a test_run might be *able to be
>    spotted* on the *next* two or three or *N test_runs*? What do you think?
>    2. Considering these measures, probably *recall* will be *negatively
>    affected*, but I feel that the model would be *more realistic*.

I don't think you should introduce artificial limitations that make the
recall worse just because they "look realistic".

You can make it realistic instead of making it look realistic: simply pretend
that your code is already running on buildbot and limits the number of tests
to run. So, if a test didn't run, you don't have any failure information
about it.

And then you only need to do what improves recall, nothing else :)

(Of course, to calculate the recall you need to use all failures, even for
tests that you didn't run.)

> Any input on my new suggestions? If all seems okay, I will proceed on to
> try to implement these.
> Also, I will soon upload the information so far to github. Can I also
> upload queries made to the database? Or are these private?

You mean the data tables? I think they're all public, they don't have
anything one couldn't get from http://buildbot.askmonty.org/

Regards,
Sergei

_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp
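
For illustration only, here is a minimal Python sketch of the evaluation
suggested in the reply above: pretend the selector is already live, run at
most a fixed number of tests per test_run, update the relevancy index only
from tests that actually ran, and still count every recorded failure when
computing recall. It is not Pablo's script. The function name `simulate`,
the `budget` and `decay` parameters, the input layout (one dict per
test_run mapping test name to a failed flag), and the tie-breaking by test
name are all assumptions made for this example.

```python
from collections import defaultdict


def simulate(test_runs, budget, decay=0.9):
    """Replay historical test_runs as if only `budget` tests could run each time.

    test_runs: chronological list of dicts {test_name: failed (bool)}.
               This layout is hypothetical, not the real test_run table.
    Returns recall = failures caught by selected tests / all recorded failures.
    """
    relevancy = defaultdict(float)  # per-test relevancy index, starts at 0
    caught = 0
    total_failures = 0

    for run in test_runs:
        # Pick the `budget` most relevant tests; ties are broken by name
        # (an arbitrary choice made only to keep the sketch deterministic).
        ranked = sorted(run, key=lambda t: (-relevancy[t], t))
        selected = set(ranked[:budget])

        for test, failed in run.items():
            if failed:
                # Recall is measured against ALL failures,
                # including those of tests that were not selected.
                total_failures += 1
                if test in selected:
                    caught += 1
            if test in selected:
                # Only tests that actually ran give us new information.
                # Exponentially decayed failure count as the relevancy index.
                relevancy[test] = decay * relevancy[test] + (1.0 if failed else 0.0)

    # With no failures at all there was nothing to miss (assumed convention).
    return caught / total_failures if total_failures else 1.0
```

With this setup a test that is never selected never gains relevancy, which
is the first-failure problem raised in the quoted text; a "time since last
run" component would be one way to give such tests a chance, and its
effect could be judged purely by whether recall goes up.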

