Elena, thank you very much : ). I just pushed the final commit. I had not realized it had failed last time.
Best,
Pablo

On Tue, Aug 19, 2014 at 4:26 PM, Elena Stepanova <ele...@montyprogram.com> wrote:
> Hi Pablo,
>
> Thanks for the great work.
>
> Just one thing -- in RESULTS.md, the paragraphs "The Fail Frequency algorithm" and "The File-change correlation algorithm" are unfinished. It's not a big deal, but I want to be sure there wasn't anything important in the lost part. Could you please double-check?
>
> Regards,
> Elena
>
> On 17.08.2014 16:32, Pablo Estrada wrote:
>> Hello Elena and all,
>> I have submitted the concluding commit to the project, with a very short 'RESULTS' file that briefly explains the project, the different strategies and the results. It includes a chart with updated results for both strategies and different modes. If you think I should add anything else, please let me know.
>> Here it is: https://github.com/pabloem/Kokiri/blob/master/RESULTS.md
>>
>> Thank you very much.
>> Regards
>> Pablo
>>
>> On 8/13/14, Elena Stepanova <ele...@montyprogram.com> wrote:
>>> Hi Pablo,
>>>
>>> On 10.08.2014 9:31, Pablo Estrada wrote:
>>>> Hello Elena,
>>>> You raise good points. I have just rewritten the save_state and load_state functions. Now they work with a MySQL database and a table that looks like this:
>>>>
>>>> create table kokiri_data (
>>>>   dict varchar(20),
>>>>   labels varchar(200),
>>>>   value varchar(100),
>>>>   primary key (dict, labels)
>>>> );
>>>>
>>>> Since I wanted to store many dicts in the database, I decided to try this format. The 'dict' field names the dictionary that the data belongs to ('upd_count', 'pred_count' or 'test_info'). The 'labels' field holds the space-separated list of labels in the dictionary (for a more detailed explanation, check the README and the code). The 'value' field contains the value of the datum (count of runs, relevance, etc.).
>>>>
>>>> Since the labels are space-separated, this assumes we are not using the mixed mode. If we use mixed mode, we may change the separator (',', '&', '%' or '$' are good alternatives).
>>>>
>>>> Let me know what you think about this strategy for storing into the database. I felt it was the simplest one, while still allowing some querying on the database (like loading only one metric or one 'unit' (platform/branch/mix), etc.). It may also allow storing multiple configurations if necessary.
>>>
>>> Okay, let's have it this way. We can change it later if we want to.
>>>
>>> In the remaining time, you can do the cleanup, check the documentation, and maybe run some last clean experiments with the existing data and different parameters (modes, metrics, etc.), to have the statistical results with the latest code, which we'll use later to decide on the final configuration.
>>>
>>> Regards,
>>> Elena
>>>
>>>> Regards
>>>> Pablo
>>>>
>>>> On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <ele...@montyprogram.com> wrote:
>>>>> Hi Pablo,
>>>>>
>>>>> Thanks for the update. A couple of comments inline.
>>>>>
>>>>> On 08.08.2014 18:17, Pablo Estrada wrote:
>>>>>> Hello Elena,
>>>>>> I just pushed a commit with the following changes:
>>>>>>
>>>>>> 1. Added an internal counter to the kokiri class, and a function to expose it. This function can show how many update-result runs and prediction runs have been run in total, or per unit (a unit being a platform, a branch or a mix of both).
>>>>>> Using this counter, one can decide to add logic for extra learning rounds for new platforms (I added this to the wrapper class as an example).
>>>>>>
>>>>>> 2. Added functions to load and store status in temporary storage. They are very simple - they only serialize to a JSON file, but they can be easily modified to fit the requirements of the implementation. I can add this to the README. If you'd like me to add the capacity to connect to a database and store the data in a table, I can do that too (I think it would be easiest to store the dicts as JSON data in text fields). Let me know if you'd prefer that.
>>>>>
>>>>> Yes, I think we'll have to have it stored in the database. Chances are, the scripts will run on buildbot slaves rather than on the master, so storing data in a file just won't do any good.
>>>>>
>>>>> I don't like the idea of storing the entire dicts as JSON. It doesn't seem to be justified by... well... anything, except for saving a tiny bit of time on writing queries. But that's a one-time effort, while this way we won't be able to [easily] join the statistical data with, let's say, existing buildbot tables; and it generally won't be efficient and easy to read.
>>>>>
>>>>> Besides, keep in mind that for real use, if, let's say, we are running in 'platform' mode, for each call we don't need the whole dict, we only need the part of the dict which relates to this platform, and possibly the standard one. So, there is really no point loading the other 20 platforms' data, which you will almost inevitably do if you store it in a single JSON.
>>>>>
>>>>> The real (not JSON-ed) data structure seems quite suitable for SQL, so it makes sense to store it as such.
>>>>>
>>>>> If you think it will take you long to do that, it's not critical: just create an example interface for connecting to a database and running *some* queries to store/read the data, and we'll tune it later.
>>>>>
>>>>> Regards,
>>>>> Elena
>>>>>
>>>>>> By the way, these functions allow the two parts of the algorithm to be called separately, e.g.:
>>>>>>
>>>>>> Predicting phase (can be run depending on the count of training rounds for the platform, etc.):
>>>>>> 1. Create kokiri instance
>>>>>> 2. Load status (call load_status)
>>>>>> 3. Input test list, get smaller output
>>>>>> 4. Eliminate instance from memory (no need to save state, since nothing changes until results are updated)
>>>>>>
>>>>>> Training phase:
>>>>>> 1. Create kokiri instance
>>>>>> 2. Load status (call load_status)
>>>>>> 3. Feed new information
>>>>>> 4. Save status (call save_status)
>>>>>> 5. Eliminate instance from memory
>>>>>>
>>>>>> I added tests that check the new features to the wrapper. Both features seem to be working okay. Of course, with more prediction rounds for new platforms, the platform mode improves a bit, but not too dramatically, from what I've seen. I'll test it a bit more.
>>>>>>
>>>>>> I will also add these features to the file_change_correlations branch, and document everything in the README file.
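[A note on the two phases quoted above, for whoever wires this into buildbot later: below is a minimal sketch of how a caller could drive them against the MySQL storage discussed in this thread. The class and method names (kokiri, load_state/save_state, predict, update_results) and their signatures are illustrative only; the wrapper class in the repository is the real example.]

    import MySQLdb              # assuming the MySQL-backed storage discussed above
    from kokiri import kokiri   # illustrative import; see the repository layout

    def predicting_phase(db_args, unit, incoming_tests):
        # 1. create instance, 2. load state, 3. shrink the incoming test list.
        # Nothing is saved: state only changes when results are fed back in.
        conn = MySQLdb.connect(**db_args)
        k = kokiri()
        k.load_state(conn, unit)             # reads this unit's kokiri_data rows
        reduced = k.predict(incoming_tests)  # hypothetical prediction entry point
        conn.close()
        return reduced

    def training_phase(db_args, unit, results):
        # 1. create instance, 2. load state, 3. feed new results, 4. save state.
        conn = MySQLdb.connect(**db_args)
        k = kokiri()
        k.load_state(conn, unit)
        k.update_results(results)            # hypothetical training entry point
        k.save_state(conn, unit)             # writes back (dict, labels, value) rows
        conn.close()

In real use the predicting phase would run when a builder is about to start its tests, and the training phase once the results for that test run have been collected.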
>>>>>> Regards
>>>>>> Pablo
>>>>>>
>>>>>> On Wed, Aug 6, 2014 at 8:04 PM, Elena Stepanova <ele...@montyprogram.com> wrote:
>>>>>>> (sorry, forgot the list in my reply, resending)
>>>>>>>
>>>>>>> Hi Pablo,
>>>>>>>
>>>>>>> On 03.08.2014 17:51, Pablo Estrada wrote:
>>>>>>>> Hi Elena,
>>>>>>>>
>>>>>>>>> One thing that I want to see there is a fully developed platform mode. I see that the mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since I only made hasty and crude changes, I don't expect them to be useful.
>>>>>>>>
>>>>>>>> I'm not sure what code you are referring to. Can you be more specific about what seems to be missing? I might have missed something when migrating from the previous architecture...
>>>>>>>
>>>>>>> I was mainly referring to the learning stage. Currently, the learning stage is "global". You go through X test runs, collect data, distribute it between platform-specific queues, and from test run X+1 you start predicting based on whatever platform-specific data you have at the moment.
>>>>>>>
>>>>>>> But this is bound to cause rather sporadic quality of prediction, because it could happen that out of 3000 learning runs, 1000 belong to platform A, while platform B only had 100, and platform C was introduced later, after your learning cycle. So, for platform B the statistical data will be very limited, and for platform C there will be none -- you will simply start randomizing tests from the very beginning (or using data from other platforms as you suggest below, which is still not quite the same as a pure platform-specific approach).
>>>>>>>
>>>>>>> It seems more reasonable, if the platform-specific mode is used, to do the learning per platform too. It is not just about the current investigation activity, but about the real-life implementation too.
>>>>>>>
>>>>>>> Let's suppose tomorrow we start collecting the data and calculating the metrics. Some platforms will run more often than others, so let's say in 2 weeks you will have X test runs on those platforms and can start predicting for them; while other platforms will run less frequently, and it will take 1 month to collect the same amount of data. And 2 months later there will be Ubuntu Utopic Unicorn, which will have no statistical data at all, and it would be cruel to jump into predicting there right away, without any statistical data at all.
>>>>>>>
>>>>>>> It sounds more complicated than it is; in fact, pretty much all you need to add to your algorithm is making 'count' in your run_simulation a dict rather than a constant.
>>>>>>>
>>>>>>> So, I imagine that when you store your metrics after a test run, you will also store the number of test runs per platform, and only start predicting for a particular platform when the count for it reaches the configured number.
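[This per-platform learning threshold is roughly the following in code -- a minimal, self-contained sketch only; the names (runs_per_platform, LEARNING_RUNS) and the example platform string are illustrative, and the real check belongs in the wrapper / run_simulation:]

    from collections import defaultdict

    LEARNING_RUNS = 3000                  # configured per-platform learning threshold
    runs_per_platform = defaultdict(int)  # persisted along with the metrics

    def record_run(platform):
        # Called every time a finished test run for this platform is fed
        # into the metrics, regardless of whether we predicted for it.
        runs_per_platform[platform] += 1

    def should_predict(platform):
        # Predictions start only once this particular platform has seen
        # enough training runs; a brand-new platform keeps running the
        # full suite until then, independently of the other platforms.
        return runs_per_platform[platform] >= LEARNING_RUNS

    record_run('ubuntu-utopic-amd64')             # illustrative platform name
    print(should_predict('ubuntu-utopic-amd64'))  # False until the threshold is met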
>>>>>>>> Of the code that's definitely not there, there are a couple of things that could be added:
>>>>>>>> 1. When we calculate the relevance of a test on a given platform, we might want to set the relevance to 0, or we might want to derive a default relevance from other platforms (an average, the 'standard' one, etc.). Currently, it's just set to 0.
>>>>>>>
>>>>>>> I think you could combine this idea with what was described above. While it makes sense to run *some* full learning cycles on a new platform, it does not have to be thousands, especially since some non-LTS platforms come and go awfully fast. So, we run these not-too-many cycles, get clean platform-specific data, and if necessary enrich it with the other platforms' data.
>>>>>>>
>>>>>>>> 2. We might also, just in case, want to keep the 'standard' queue for when we don't have the data for this platform (related to the previous point).
>>>>>>>
>>>>>>> If we do what's described above, we should always have data for the platform. But if you mean calculating and storing the standard metrics, then yes -- since we are going to store the values rather than re-calculate them every time, there is no reason to be greedy about it. It might even make sense to calculate both metrics that you developed, too. Who knows, maybe one day we'll find out that the other one gives us better results.
>>>>>>>
>>>>>>>>> It doesn't matter in which order they fail/finish; the problem is, when builder2 starts, it doesn't have information about builder1 results, and builder3 doesn't know anything about the first two. So, the metric for test X could not be increased yet.
>>>>>>>>>
>>>>>>>>> But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3.
>>>>>>>>>
>>>>>>>>> It is especially important now, when you use incoming lists, and the running sets might not be identical for builders 1-3 even in standard mode.
>>>>>>>>
>>>>>>>> Right, I see your point. Although, even if test_run 1 catches the error, test_run 2, while it would be using the same data, might not catch the same errors if the running set is such that they are pushed out due to lower relevance. The effect might not be too big, but it definitely has the potential to affect the results.
>>>>>>>>
>>>>>>>>> Over-pessimistic part:
>>>>>>>>>
>>>>>>>>> It is similar to the previous one, but look at the same problem from a different angle. Suppose the push broke test X, and the test started failing on all builders (platforms). So, you have 20 failures, one per test run, for the same push.
>>>>>>>>> Now, suppose you caught it on one platform but not on others. Your statistics will still show 19 failures missed vs. 1 failure caught, and recall will be dreadful (~0.05). But in fact, the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times. So, recall here should be 1.
>>>>>>>>>
>>>>>>>>> It should mainly affect the per-platform approach, but probably the standard one can also suffer if the running sets are not identical for all builders.
>>>>>>>>
>>>>>>>> Right. It seems that solving these two issues is non-trivial (the test_run table does not contain the duration of the test_run, or anything like that). But we can keep these issues in mind.
>>>>>>>
>>>>>>> Right. At this point it doesn't even make sense to solve them -- in a real-life application, the first one will be gone naturally, just because there will be no data from unfinished test runs.
>>>>>>>
>>>>>>> The second one only affects the recall calculation, in other words -- the evaluation of the algorithm. It is interesting from a theoretical point of view, but not critical for real-life application.
>>>>>>>
>>>>>>>> I fixed up the repositories with updated versions of the queries, as well as instructions in the README on how to generate them.
>>>>>>>>
>>>>>>>> Now I am looking a bit at the buildbot code, just to try to suggest some design ideas for adding the statistician and the pythia into the MTR-related classes.
>>>>>>>
>>>>>>> As you know, we have the soft pencils-down in a few days, and the hard one a week later. At this point, there isn't much reason to keep frantically improving the algorithm (which is never perfect), so you are right not to plan on it.
>>>>>>>
>>>>>>> In the remaining time I suggest to:
>>>>>>> - address the points above;
>>>>>>> - make sure that everything that should be configurable is configurable (algorithm, mode, learning set, db connection details);
>>>>>>> - create the structures to store the metrics, and the code for reading them from / writing them to the database;
>>>>>>> - make sure the predicting and the calculating parts can be called separately;
>>>>>>> - update documentation, clean up logging and code in general.
>>>>>>>
>>>>>>> As long as we have these two parts easily callable, we will find a place in buildbot/MTR to put them, so don't waste too much time on it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Elena
>>>>>>>
>>>>>>>> Regards
>>>>>>>> Pablo
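P.S. On the 'over-pessimistic' recall issue discussed above: if we ever want the simulation to report per-push recall, it only takes grouping the failures by push before scoring. A minimal, self-contained sketch with illustrative field names (push_id, test, caught):

    def per_push_recall(failures):
        # failures: one record per failing test per test run, e.g.
        # {'push_id': 42, 'test': 'X', 'caught': False}.
        # A broken test counts as caught for a push if it was caught on
        # at least one builder/test run belonging to that push.
        broken, caught = set(), set()
        for f in failures:
            key = (f['push_id'], f['test'])
            broken.add(key)
            if f['caught']:
                caught.add(key)
        return float(len(caught)) / len(broken) if broken else 1.0

    # Elena's example: the push breaks test X on 20 builders, we catch 1 of them.
    # Per-test-run recall would be 1/20 = 0.05; per-push recall is 1.0.
    sample = [{'push_id': 42, 'test': 'X', 'caught': i == 0} for i in range(20)]
    print(per_push_recall(sample))   # prints 1.0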