Elena, thank you very much : ). I just pushed the final commit. I had not realized it had failed last time.
Best,
Pablo

On Tue, Aug 19, 2014 at 4:26 PM, Elena Stepanova <ele...@montyprogram.com> wrote:
> Hi Pablo,
>
> Thanks for the great work.
>
> Just one thing -- in RESULTS.md, the paragraphs "The Fail Frequency algorithm" and "The File-change correlation algorithm" are unfinished. It's not a big deal, but I want to be sure there wasn't anything important in the lost part. Could you please double-check?
>
> Regards,
> Elena
>
> On 17.08.2014 16:32, Pablo Estrada wrote:
>> Hello Elena and all,
>> I have submitted the concluding commit to the project, with a very short 'RESULTS' file that briefly explains the project, the different strategies and the results. It includes a chart with updated results for both strategies and different modes. If you think I should add anything else, please let me know.
>> Here it is: https://github.com/pabloem/Kokiri/blob/master/RESULTS.md
>>
>> Thank you very much.
>> Regards
>> Pablo
>>
>> On 8/13/14, Elena Stepanova <ele...@montyprogram.com> wrote:
>>> Hi Pablo,
>>>
>>> On 10.08.2014 9:31, Pablo Estrada wrote:
>>>> Hello Elena,
>>>> You raise good points. I have just rewritten the save_state and load_state functions. Now they work with a MySQL database and a table that looks like this:
>>>>
>>>> create table kokiri_data (
>>>>   dict varchar(20),
>>>>   labels varchar(200),
>>>>   value varchar(100),
>>>>   primary key (dict, labels)
>>>> );
>>>>
>>>> Since I wanted to store many dicts in the database, I decided to try this format. The 'dict' field names the dictionary that the data belongs to ('upd_count', 'pred_count' or 'test_info'). The 'labels' field holds the space-separated list of labels in the dictionary (for a more detailed explanation, check the README and the code). The 'value' field contains the value of the datum (count of runs, relevance, etc.).
>>>>
>>>> Since the labels are space-separated, this assumes we are not using the mixed mode. If we use mixed mode, we may change the separator (',', '&', '%' or '$' are good alternatives).
>>>>
>>>> Let me know what you think about this strategy for storing into the database. I felt it was the simplest one, while still allowing some querying on the database (like loading only one metric or one 'unit' (platform/branch/mix), etc.). It may also allow storing multiple configurations if necessary.
>>>
>>> Okay, let's have it this way. We can change it later if we want to.
>>>
>>> In the remaining time, you can do the cleanup, check the documentation, and maybe run some last clean experiments with the existing data and different parameters (modes, metrics, etc.), to have the statistical results with the latest code, which we'll use later to decide on the final configuration.
>>>
>>> Regards,
>>> Elena
>>>
>>>> Regards
>>>> Pablo
>>>>
>>>> On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <ele...@montyprogram.com> wrote:
>>>>> Hi Pablo,
>>>>>
>>>>> Thanks for the update. A couple of comments inline.
>>>>>
>>>>> On 08.08.2014 18:17, Pablo Estrada wrote:
>>>>>> Hello Elena,
>>>>>> I just pushed a commit with the following changes:
>>>>>>
>>>>>> 1. Added an internal counter to the kokiri class, and a function to expose it. This function can show how many update-result runs and prediction runs have been run in total, or per unit (a unit being a platform, a branch or a mix of both).
>>>>>> Using this counter, one can decide to add logic for extra learning rounds for new platforms (I added this to the wrapper class as an example).
>>>>>>
>>>>>> 2. Added functions to load and store status in temporary storage. They are very simple - they only serialize to a JSON file, but they can be easily modified to fit the requirements of the implementation. I can add this to the README. If you'd like me to add the capacity to connect to a database and store the data in a table, I can do that too (I think it would be easiest to store the dicts as JSON data in text fields). Let me know if you'd prefer that.
>>>>>
>>>>> Yes, I think we'll have to have it stored in the database. Chances are, the scripts will run on buildbot slaves rather than on the master, so storing data in a file just won't do any good.
>>>>>
>>>>> I don't like the idea of storing the entire dicts as JSON. It doesn't seem to be justified by... well... anything, except for saving a tiny bit of time on writing queries. But that's a one-time effort, while this way we won't be able to [easily] join the statistical data with, let's say, existing buildbot tables; and it generally won't be efficient and easy to read.
>>>>>
>>>>> Besides, keep in mind that for real use, if, let's say, we are running in 'platform' mode, for each call we don't need the whole dict, we only need the part of the dict which relates to this platform, and possibly the standard one. So, there is really no point loading the other 20 platforms' data, which you will almost inevitably do if you store it in a single JSON.
>>>>>
>>>>> The real (not JSON-ed) data structure seems quite suitable for SQL, so it makes sense to store it as such.
>>>>>
>>>>> If you think it will take you long to do that, it's not critical: just create an example interface for connecting to a database and running *some* queries to store/read the data, and we'll tune it later.
>>>>>
>>>>> Regards,
>>>>> Elena
>>>>>
>>>>>> By the way, these functions allow the two parts of the algorithm to be called separately, e.g.:
>>>>>>
>>>>>> Predicting phase (can be run depending on the count of training rounds for the platform, etc.):
>>>>>> 1. Create kokiri instance
>>>>>> 2. Load status (call load_status)
>>>>>> 3. Input test list, get smaller output
>>>>>> 4. Eliminate instance from memory (no need to save state, since nothing changes until results are updated)
>>>>>>
>>>>>> Training phase:
>>>>>> 1. Create kokiri instance
>>>>>> 2. Load status (call load_status)
>>>>>> 3. Feed new information
>>>>>> 4. Save status (call save_status)
>>>>>> 5. Eliminate instance from memory
>>>>>>
>>>>>> I added tests that check the new features to the wrapper. Both features seem to be working okay. Of course, with more prediction rounds for new platforms, the platform mode improves a bit, but not too dramatically, from what I've seen. I'll test it a bit more.
>>>>>>
>>>>>> I will also add these features to the file_change_correlations branch, and document everything in the README file.
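[A note on the two phases quoted above, for whoever wires this into buildbot later: below is a minimal sketch of how a caller could drive them against the MySQL storage discussed in this thread. The class and method names (kokiri, load_state/save_state, predict, update_results) and their signatures are illustrative only; the wrapper class in the repository is the real example.]

    import MySQLdb              # assuming the MySQL-backed storage discussed above
    from kokiri import kokiri   # illustrative import; see the repository layout

    def predicting_phase(db_args, unit, incoming_tests):
        # 1. create instance, 2. load state, 3. shrink the incoming test list.
        # Nothing is saved: state only changes when results are fed back in.
        conn = MySQLdb.connect(**db_args)
        k = kokiri()
        k.load_state(conn, unit)             # reads this unit's kokiri_data rows
        reduced = k.predict(incoming_tests)  # hypothetical prediction entry point
        conn.close()
        return reduced

    def training_phase(db_args, unit, results):
        # 1. create instance, 2. load state, 3. feed new results, 4. save state.
        conn = MySQLdb.connect(**db_args)
        k = kokiri()
        k.load_state(conn, unit)
        k.update_results(results)            # hypothetical training entry point
        k.save_state(conn, unit)             # writes back (dict, labels, value) rows
        conn.close()

In real use the predicting phase would run when a builder is about to start its tests, and the training phase once the results for that test run have been collected.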
>>>>>> Regards
>>>>>> Pablo
>>>>>>
>>>>>> On Wed, Aug 6, 2014 at 8:04 PM, Elena Stepanova <ele...@montyprogram.com> wrote:
>>>>>>> (sorry, forgot the list in my reply, resending)
>>>>>>>
>>>>>>> Hi Pablo,
>>>>>>>
>>>>>>> On 03.08.2014 17:51, Pablo Estrada wrote:
>>>>>>>> Hi Elena,
>>>>>>>>
>>>>>>>>> One thing that I want to see there is a fully developed platform mode. I see that the mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since I only made hasty and crude changes, I don't expect them to be useful.
>>>>>>>>
>>>>>>>> I'm not sure what code you are referring to. Can you be more specific about what seems to be missing? I might have missed something when migrating from the previous architecture...
>>>>>>>
>>>>>>> I was mainly referring to the learning stage. Currently, the learning stage is "global". You go through X test runs, collect data, distribute it between platform-specific queues, and from test run X+1 you start predicting based on whatever platform-specific data you have at the moment.
>>>>>>>
>>>>>>> But this is bound to cause rather sporadic quality of prediction, because it could happen that out of 3000 learning runs, 1000 belong to platform A, while platform B only had 100, and platform C was introduced later, after your learning cycle. So, for platform B the statistical data will be very limited, and for platform C there will be none -- you will simply start randomizing tests from the very beginning (or using data from other platforms as you suggest below, which is still not quite the same as a pure platform-specific approach).
>>>>>>>
>>>>>>> It seems more reasonable, if the platform-specific mode is used, to do the learning per platform too. It is not just about the current investigation activity, but about the real-life implementation too.
>>>>>>>
>>>>>>> Let's suppose tomorrow we start collecting the data and calculating the metrics. Some platforms will run more often than others, so let's say in 2 weeks you will have X test runs on those platforms and can start predicting for them; while other platforms will run less frequently, and it will take 1 month to collect the same amount of data. And 2 months later there will be Ubuntu Utopic Unicorn, which will have no statistical data at all, and it would be cruel to jump into predicting there right away, without any statistical data at all.
>>>>>>>
>>>>>>> It sounds more complicated than it is; in fact, pretty much all you need to add to your algorithm is making 'count' in your run_simulation a dict rather than a constant.
>>>>>>>
>>>>>>> So, I imagine that when you store your metrics after a test run, you will also store the number of test runs per platform, and only start predicting for a particular platform when the count for it reaches the configured number.
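[This per-platform learning threshold is roughly the following in code -- a minimal, self-contained sketch only; the names (runs_per_platform, LEARNING_RUNS) and the example platform string are illustrative, and the real check belongs in the wrapper / run_simulation:]

    from collections import defaultdict

    LEARNING_RUNS = 3000                  # configured per-platform learning threshold
    runs_per_platform = defaultdict(int)  # persisted along with the metrics

    def record_run(platform):
        # Called every time a finished test run for this platform is fed
        # into the metrics, regardless of whether we predicted for it.
        runs_per_platform[platform] += 1

    def should_predict(platform):
        # Predictions start only once this particular platform has seen
        # enough training runs; a brand-new platform keeps running the
        # full suite until then, independently of the other platforms.
        return runs_per_platform[platform] >= LEARNING_RUNS

    record_run('ubuntu-utopic-amd64')             # illustrative platform name
    print(should_predict('ubuntu-utopic-amd64'))  # False until the threshold is met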
>>>>>>>> Of the code that's definitely not there, there are a couple of things that could be added:
>>>>>>>> 1. When we calculate the relevance of a test on a given platform, we might want to set the relevance to 0, or we might want to derive a default relevance from other platforms (an average, the 'standard' one, etc.). Currently, it's just set to 0.
>>>>>>>
>>>>>>> I think you could combine this idea with what was described above. While it makes sense to run *some* full learning cycles on a new platform, it does not have to be thousands, especially since some non-LTS platforms come and go awfully fast. So, we run these not-too-many cycles, get clean platform-specific data, and if necessary enrich it with the other platforms' data.
>>>>>>>
>>>>>>>> 2. We might also, just in case, want to keep the 'standard' queue for when we don't have the data for this platform (related to the previous point).
>>>>>>>
>>>>>>> If we do what's described above, we should always have data for the platform. But if you mean calculating and storing the standard metrics, then yes -- since we are going to store the values rather than re-calculate them every time, there is no reason to be greedy about it. It might even make sense to calculate both metrics that you developed, too. Who knows, maybe one day we'll find out that the other one gives us better results.
>>>>>>>
>>>>>>>>> It doesn't matter in which order they fail/finish; the problem is, when builder2 starts, it doesn't have information about builder1 results, and builder3 doesn't know anything about the first two. So, the metric for test X could not be increased yet.
>>>>>>>>>
>>>>>>>>> But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3.
>>>>>>>>>
>>>>>>>>> It is especially important now, when you use incoming lists, and the running sets might not be identical for builders 1-3 even in standard mode.
>>>>>>>>
>>>>>>>> Right, I see your point. Although, even if test_run 1 catches the error, test_run 2, while it would be using the same data, might not catch the same errors if the running set is such that they are pushed out due to lower relevance. The effect might not be too big, but it definitely has the potential to affect the results.
>>>>>>>>
>>>>>>>>> Over-pessimistic part:
>>>>>>>>>
>>>>>>>>> It is similar to the previous one, but look at the same problem from a different angle. Suppose the push broke test X, and the test started failing on all builders (platforms). So, you have 20 failures, one per test run, for the same push.
>>>>>>>>> Now, suppose you caught it on one platform but not on others. Your statistics will still show 19 failures missed vs. 1 failure caught, and recall will be dreadful (~0.05). But in fact, the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times. So, recall here should be 1.
>>>>>>>>>
>>>>>>>>> It should mainly affect the per-platform approach, but probably the standard one can also suffer if the running sets are not identical for all builders.
>>>>>>>>
>>>>>>>> Right. It seems that solving these two issues is non-trivial (the test_run table does not contain the duration of the test_run, or anything like that). But we can keep these issues in mind.
>>>>>>>
>>>>>>> Right. At this point it doesn't even make sense to solve them -- in a real-life application, the first one will be gone naturally, just because there will be no data from unfinished test runs.
>>>>>>>
>>>>>>> The second one only affects the recall calculation, in other words -- the evaluation of the algorithm. It is interesting from a theoretical point of view, but not critical for real-life application.
>>>>>>>
>>>>>>>> I fixed up the repositories with updated versions of the queries, as well as instructions in the README on how to generate them.
>>>>>>>>
>>>>>>>> Now I am looking a bit at the buildbot code, just to try to suggest some design ideas for adding the statistician and the pythia into the MTR-related classes.
>>>>>>>
>>>>>>> As you know, we have the soft pencils-down in a few days, and the hard one a week later. At this point, there isn't much reason to keep frantically improving the algorithm (which is never perfect), so you are right not to plan on it.
>>>>>>>
>>>>>>> In the remaining time I suggest to:
>>>>>>> - address the points above;
>>>>>>> - make sure that everything that should be configurable is configurable (algorithm, mode, learning set, db connection details);
>>>>>>> - create the structures to store the metrics, and the code for reading them from / writing them to the database;
>>>>>>> - make sure the predicting and the calculating parts can be called separately;
>>>>>>> - update documentation, clean up logging and code in general.
>>>>>>>
>>>>>>> As long as we have these two parts easily callable, we will find a place in buildbot/MTR to put them, so don't waste too much time on it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Elena
>>>>>>>
>>>>>>>> Regards
>>>>>>>> Pablo
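P.S. On the 'over-pessimistic' recall issue discussed above: if we ever want the simulation to report per-push recall, it only takes grouping the failures by push before scoring. A minimal, self-contained sketch with illustrative field names (push_id, test, caught):

    def per_push_recall(failures):
        # failures: one record per failing test per test run, e.g.
        # {'push_id': 42, 'test': 'X', 'caught': False}.
        # A broken test counts as caught for a push if it was caught on
        # at least one builder/test run belonging to that push.
        broken, caught = set(), set()
        for f in failures:
            key = (f['push_id'], f['test'])
            broken.add(key)
            if f['caught']:
                caught.add(key)
        return float(len(caught)) / len(broken) if broken else 1.0

    # Elena's example: the push breaks test X on 20 builders, we catch 1 of them.
    # Per-test-run recall would be 1/20 = 0.05; per-push recall is 1.0.
    sample = [{'push_id': 42, 'test': 'X', 'caught': i == 0} for i in range(20)]
    print(per_push_recall(sample))   # prints 1.0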