Here's the repository with the code I have in place so far. It's pretty simple stuff: https://github.com/pabloem/Kokiri
On Wed, May 21, 2014 at 1:06 PM, Pablo Estrada <[email protected]>wrote: > Hello Sergei and all, > First of all, I'll explain quickly the terms that I was using: > > - *test_suite, test suite, test case* - When I say test suite or test > case, I am referring to a single test file. For instance ' > *pbxt.group_min_max*'. They are the ones that fail, and whose failures > we want to attempt to predict. > - *test_run, test run* - When I use this term, I refer to an entry in > the *test_run* table of the database. A test run is a set of > *test_suites* that run together at a certain time. > > I have in place now a basic script to do the simulations. I have tried to > keep the code clear, and I will upload a repository to github soon. > I have already run simulations on the data. The simulations used 2000 > test_runs as training data, and then attempted to predict behavior on the > following 3000 test_runs. Of course, maybe a wider spectrum of data might > be needed to truly asses the algorithm. > > I used four different ways to calculate a 'relevancy index' for a test: > > 1. Keep a relevancy index by test case > 2. Keep a relevancy index by test case by platform > 3. Keep a relevancy index by test case by branch > 4. Keep a relevancy index by test case by branch by platform (mixed) > > I graphed the results. The graph is attached. As can be seen from the > graph, the platform and the mixed model proved to be the best for recall. > I feel the results were quite similar to what Sergei encountered. > > I have not run the tests on a larger set of data (the data dump that I > have available contains 200,000 test_runs, so in theory I could test the > algorithm with all this data)... I feel that I want to consider a couple > things before going on to big testing: > > I feel that there is a bit of a potential fallacy in the model that I'm > following. Here's why: > The problem that I find in the model is that we don't know a-priori when a > test will fail for the first time. Strictly speaking, in the model, if a > test doesn't fail for the first time, it never starts running at all. In > the implementation that I made, I am using the first failure of each test > to start giving it a relevancy test (so the test would have to fail before > it even qualifies to run). > This results in a really high recall rate because it is natural that if > a test fails once, it might fail pretty soon after, so although we might > have missed the first failure, we still consider that we didn't miss it, > and based on it we will catch the two or three failures that come right > after. > This inflates the recall rate of 'subsequent' failures, but it is not very > helpful when trying to catch failures that are not part of a trend... I > feel this is not realistic. > > Here are changes that I'd like to incorporate to the model: > > 1. The failure rate should stay, and should still be measured with > exponential decay or weighted average > 2. Include a new measure that increases relevancy: Time since last > run. The relevancy index should have a component that makes the test more > relevant the longer it spends not running > 1. A problem with this is that *test suites* that might have > stopped being used will stay and compete for resources, although in > reality > they would not be relevant anymore > 3. Include also correlation. I still don't have a great idea of how > correlation will be considered, but it's something like this: > 1. The data contains the list of test_runs where each test_suite > has failed. If two test suites have failed together a certain > percentage of > times (>30%?), then when test A fails, the relevancy test of test B also > goes up... and when test A runs without failing, the relevancy test of > test > B goes down too. > 2. Using only the times that tests fail together seems like a good > heuristic, without having to calculate the total correlation of all the > history of all the combinations of tests. > > If these measures were to be incorporated, a couple of changes would also > have to be considered: > > 1. Failures that are* not spotted* *on a test_run* might be *able to > be spotted *on the *next* two or three or *N test_runs*? What do you > think? > 2. Considering these measures, probably *recall* will be *negatively > affected*, but I feel that the model would be *more realistic*. > > Any input on my new suggestions? If all seems okay, I will proceed on to > try to implement these. > Also, I will soon upload the information so far to github. Can I also > upload queries made to the database? Or are these private? > > Regards > Pablo > > > On Wed, May 7, 2014 at 7:41 PM, Sergei Golubchik <[email protected]> wrote: > >> Hi, Pablo! >> >> On May 07, Pablo Estrada wrote: >> > >> > So here's what I'll do for the simulations: >> > >> > *1. Calculating the: "Relevancy index"* for a test, I have considered >> two >> > simple options so far: >> > >> > - *Exponential decay*: The relevancy index of a test is the *sum over >> > each failure* of( *exp((FailureTime - CurrentTime)/DecayRate))*. It >> > decreases exponentially as time passes, and increases if the test >> fails. >> > - DecayRate is >> > - i.e. If TestA failed at days 5 and 7, and now is day 9, RI will >> > be (exp(5-9)+exp(7-9)) = (exp(-4)+exp(-2)). >> > - The unit to measure time is just seconds in UNIX_TIMESTAMP >> > - *Weighted moving average*: The relevancy index of a test is: >> *R[now] = >> > R[now-1]*alpha + fail*(1-alpha)*, where fail is 1 if the test failed >> in >> > this run, and 0 if it did not fail. The value is between 1 and 0. It >> > decreases slowly if a test runs without failing, and it increases >> slowly if >> > the test fails. >> > - 0 < alpha < 1 (Initially set at 0.95 for testing). >> > - i.e. If TestB failed for the first time in the last run, and >> again >> > in this run: R[t] = 1*0.95 + 1*0.5 = 1 >> > - If test B ran once more and did not fail, then: R[t+1] = 1*0.95 >> + >> > 0*0.5 = 0.95 >> > - The *advantage of this method* is that it doesn't have to look >> at >> > the whole history every time it's calculated (unlike the >> exponential decay >> > method) >> >> you don't need to look at the whole history for the exponential decay. >> Because it is >> >> exp((FailureTime - CurrentTime)/DecayRate) >> >> You simply have >> >> R[t] = exp(FailureTime/DecayRate) / exp(t/DecayRate) >> R[t+1] = R[t] / exp(1/DecayRate) (if there was no failure) >> >> > - Much like TCP protocol ( >> http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html) >> > >> > Regarding the *Relevancy Index*, it can be calculated grouping test >> results >> > in many ways: *Roughly* using test_name+variation, or *more granularly* >> by >> > *including* *branch* and *platform*. I'll add some thoughts regarding >> these >> > options at the bottom of the email. >> >> I've tested these options earlier, you may want to try them all too and >> see which one delivers better results. >> >> > *2. *To* run the simulation*, I'll gather data from the first few >> thousands >> > of test_run entries, and then start simulating results. Here's what >> I'll do: >> > >> > 1. *Gather data *first few thousands of test_run entries (i.e. 4 >> > thousand) >> > 2. After N thousand test_runs, I'll go through the test_run entries >> *one >> > by one*, and using the data gathered to that point, I will select >> '*running >> > sets*' of *100* *test suites* to run on each test_run entry. (The >> number >> > can be adjusted) >> >> Absolutely, it should be. This is the main parameter we can tune, after >> all. The larger your running set is, the better will be the recall. >> >> May be not now but later, but it would be very useful to see these >> graphs, recall as a function of the running set size. It's important to >> know whether by increasing the running set by 10% we get 1% recall >> increase of 70% recall increase (as you've seen, there's a region when >> recall increases very fast as the running set grows). >> >> > 3. If in this *test_run* entry, the list of *failed tests* contains >> > tests that are *NOT part* of the *running set*, the failure will be >> > ignored, and so the information of this failure will be lost (not >> used as >> > part of the relevancy index). *(See Comment 2)* >> > 4. If the set of *failed tests *in the *test_run* entry intersect >> with >> > the *running_set*, this is better *recall*. This information will be >> > used to continue calculating the *relevancy index*. >> >> Could you explain the terminology you're using? >> What is a "test suite" and what is a "test run"? >> How will you calculate the "recall"? >> >> > According to the results obtained from the simulations, we can adjust >> the >> > algorithm (i.e. to consider *relevancy index by* *platform* and >> *branch*, >> > etc.) >> > >> > Comments about the *relevancy index:* >> > >> > - The methods to calculate the relevancy index are very simple. There >> > are some other useful metrics that could be incorporated >> > - *Time since last run. *With the current methods, if a* >> > test*completely *stops >> > running*, it only* becomes less relevant with time*, and so even >> if >> > it could expose defects, it doesn't get to run because its >> > relevancy index >> > is just going down. Incorporating a function that* increases the >> > relevancy index* as the *time since the last run* *increases* can >> > help solve this issue. I believe this measure will be useful. >> >> Right. I will not comment on this measure now. >> But I absolutely agree that this is an issue that must be solved. >> >> > - *Correlation between test failures*. If two tests tend to fail >> > together, is it better to just run one of them? Incorporating >> > this measure >> > seems difficult, but it is on the table, in case we should >> consider it. >> >> Agree. Taking correlations into account looks very promising, but it >> does seem to be difficult :) >> >> > - As you might have seen, I decided to not consider any data >> concerned >> > with *code changes*. I'll work like this and see if the results are >> > satisfactory. >> >> Right! >> >> > Comments regarding *buildbot infrasturcture:* >> > These comments are out of the scope of this project, but it would be >> very >> > desirable features for the buildbot infrastructure. >> > >> > - Unfortunately, given the data available in the database, it is NOT >> > possible to know *which tests ran* on each *test_run*. This >> information >> > would be very useful, as it would help estimate the *exact failure >> > rate*of a test. I didn't look into the code, but it seems that *class >> > MtrLogObserver*( >> http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py_source.html >> ) >> > contains >> > most of the infrastructure necessary to just add one or two more >> tables ( >> > *test_suite*, and *test_suite_test_run*), some code, and start >> keeping >> > track of this information. >> > - Another problem with the data available in the database is that it >> is >> > not possible to know *how many test suites exist*. It is only >> possible >> > to estimate *how many different test suites have failed*. This would >> > also be helpful information. >> > - Actually, this information would be useful not only for this >> project, >> > but in general for book-keeping of the development of MariaDB. >> >> Regards, >> Sergei >> > >
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : [email protected] Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp

