On Sat, 2023-02-25 at 09:15 +0000, Richard Purdie via
lists.openembedded.org wrote:
> On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via
> lists.openembedded.org wrote:
> > Hi Alexis,
> > 
> > Firstly, this looks very much improved, thanks. It is great to start to
> > see some meaningful data from this.
> > 
> > On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
> > lists.openembedded.org wrote:
> > > From: Alexis Lothoré <alexis.loth...@bootlin.com>
> > > 
> > > Hello,
> > > this new series is the follow-up of [1] to make regression reports more
> > > meaningful, by reducing noise and false positives.
> > > 
> > > Changes since v2:
> > > - add filtering on the MACHINE field from the test results configuration:
> > >   the MACHINE should always match
> > > - add a "metadata guessing" mechanism based on Richard's proposal ([2]).
> > >   Until this series is merged, test results stored in git are not enriched
> > >   with OESELFTEST_METADATA. To allow proper test comparison even with those
> > >   tests, try to guess which oe-selftest command line was used to run the
> > >   corresponding tests, and generate OESELFTEST_METADATA accordingly
> > > - add a new tool to ease test results usage: yocto_testresults_query. For
> > >   now the tool only manages regression reports and is a thin layer between
> > >   send-qa-email (in yocto-autobuilder-helper) and resulttool. Its main role
> > >   is to translate regression report arguments (which are tags or branches)
> > >   into fixed revisions and to call resulttool accordingly. Most of its code
> > >   is a transfer from send-qa-email (another series for the autobuilder will
> > >   follow this one to make send-qa-email use this new helper, but this
> > >   current series works independently)
> > >   Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2"
> > >   will replay the regression report generated when the 4.2_M2 release was
> > >   generated.
> > > 
> > > Changes since v1:
> > > - properly configure the "From" field in the series
> > > 
> > > With those improvements, the regression report is significantly reduced
> > > and some useful data starts to emerge from the removed noise:
> > > - with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
> > > - with the OESELFTEST_METADATA enrichment + metadata guessing for older
> > >   tests, the report goes from 627MB to 1.5MB
> > 
> > That is just a bit more readable!
> > 
> > > 
> > > After manual inspection of some entries, the remaining oe-selftest
> > > regressions raised in the report seem valid. There are still some issues
> > > to tackle:
> > > - it seems that one major remaining source of noise is now on the
> > >   "runtime" tests (comparison against tests not run in "target" results)
> > > - when a ptest managed by oe-selftest fails, I guess the remaining tests
> > >   are not run, so when 1 failure is logged we get many "PASSED->None"
> > >   transitions in the regression report; we should probably silence those
> > > - some transitions appear as regressions while they are in fact
> > >   improvements (e.g. "UNRESOLVED->PASSED")
> > 
> > I had a quick play. Firstly, if I try "yocto_testresults_query.py
> > regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
> > instead of poky, it breaks. That isn't surprising but we should either
> > make it work or show a sensible error.
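> > (A sensible error could be as simple as verifying the tag lookup before
> > doing anything else; a rough, hypothetical sketch, not based on the tool's
> > actual structure:)
> > 
> >     import subprocess, sys
> > 
> >     def resolve_tag(repo_path, tag):
> >         # Resolve a release tag (e.g. "4.2_M2") to a revision, and fail
> >         # with a clear message when the repository doesn't carry the tag
> >         # (e.g. an openembedded-core clone rather than poky)
> >         try:
> >             return subprocess.check_output(
> >                 ["git", "rev-parse", "--verify", tag + "^{commit}"],
> >                 cwd=repo_path, text=True, stderr=subprocess.DEVNULL).strip()
> >         except subprocess.CalledProcessError:
> >             sys.exit(f"Tag '{tag}' not found in {repo_path}: "
> >                      "regression-report expects a poky clone with release tags")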
> > 
> > I also took a look at the report and wondered why the matching isn't quite
> > right and why we have these "regressions". If we could remove that
> > noise, I think we'd get down to the real issues. I ended up doing:
> > 
> > resulttool report --commit 4d19594b8bdacde6d809d3f2a25cff7c5a42295e . > /tmp/repa
> > resulttool report --commit 5e249ec855517765f4b99e8039cb888ffa09c211 . > /tmp/repb
> > meld /tmp/rep*
> > 
> > which was interesting as it gave lots of warnings like:
> > 
> > "Warning duplicate ptest result 'acl.test/cp.test' for qemuarm64"
> > 
> > so it looks like we had a couple of different test runs for qemuarm64
> > ptests, which is confusing your new code. I suspect this happened due to
> > some autobuilder glitch during the release build which restarted some
> > of the build pieces. Not sure how to handle that yet; I'll give it some
> > further thought, but I wanted to share what I think is the source of
> > some of the issues. Basically we need to get the regression report
> > looking more like that meld output!
> 
> I was wrong about the duplication; that isn't the issue, or at least I
> found some other more pressing ones. For the ltp issue, I found an easy
> fix:
> 
> diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
> index 1b0c8335a39..9d7c35942a6 100644
> --- a/scripts/lib/resulttool/regression.py
> +++ b/scripts/lib/resulttool/regression.py
> @@ -146,6 +146,7 @@ def can_be_compared(logger, base, target):
>      run with different tests sets or parameters. Return true if tests can be
>      compared
>      """
> +    ret = True
>      base_configuration = base['configuration']
>      target_configuration = target['configuration']
>  
> @@ -165,7 +166,10 @@ def can_be_compared(logger, base, target):
>              logger.debug(f"Enriching {target_configuration['STARTTIME']} with {guess}")
>              target_configuration['OESELFTEST_METADATA'] = guess
>  
> -    return metadata_matches(base_configuration, target_configuration) \
> +    if base_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in base['result']):
> +        ret = target_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in target['result'])
> +
> +    return ret and metadata_matches(base_configuration, target_configuration) \
>          and machine_matches(base_configuration, target_configuration)
>  
>  
> i.e. only compare ltp to ltp. The issue is that we don't use a special
> image name for the ltp test runs, we just extend a standard one, so it
> was comparing ltp to non-ltp.
> 
> We should also perhaps consider a clause in there which only compares
> runs containing ptests against other runs containing ptests? Our test
> matrix won't trigger that, but other usage might in future and it is a
> safe check?
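> Purely as an untested sketch of that clause, following the ltpresult
> pattern in the hunk above and assuming ptest result names all carry the
> "ptestresult." prefix (the helper name is made up):
> 
>     def has_ptest_results(results):
>         # True when this test run contains any ptest results
>         return any(result.startswith("ptestresult.") for result in results['result'])
> 
>     # inside can_be_compared(), next to the ltpresult check:
>     if has_ptest_results(base):
>         ret = ret and has_ptest_results(target)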
> 
> A lot of the rest of the noise is poor test naming for ptests, e.g.:
> 
> ptestresult.lttng-tools.ust/buffers-pid/test_buffers_pid_10_-_Create_session_buffers-pid_in_-o_/tmp/tmp.XXXXXXXXXXrs_pid_trace_path.XTnDY5
> 
> which has a random string at the end. I'm wondering if we should
> pre-filter ptest result names and truncate a known list of them at the
> "-" (lttng-tools, babeltrace, babeltrace2). Curl could also be truncated
> at the ",":
> 
> ptestresult.curl.test_0010__10_out_of_1506,_remaining:_06:44,_took_1.075s,_duration:_00:02_
> 
> We can adjust the ptest generation code to do this at source (we should
> perhaps file a bug for that for the four above?), but that won't fix the
> older results, so we'll probably need some filtering in the code too.
> 
> There is something more going on with the ptest results too; I don't
> understand why quilt/python3 changed, but I suspect we just have to go
> through the issues step by step now.
> 
> I did look into the:
> 
> ptestresult.glibc-user.debug/tst-fortify-c-default-1
> 
> 'regression' and it is because the test was renamed in the new glibc. I
> was therefore thinking a summary of added/removed tests would be useful,
> but only in these cases. Something along the lines of: if tests were only
> added, just summarise "X new tests added" and call it a match. If tests
> were removed and added, list them, show a count summary (X removed,
> Y added) and call it a regression.
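> Roughly something like this (a sketch only; the function name and where it
> would hook into regression.py are made up):
> 
>     def summarise_test_set_changes(base_tests, target_tests):
>         # base_tests/target_tests are the sets of test names from each run
>         added = target_tests - base_tests
>         removed = base_tests - target_tests
>         if removed:
>             # Removals/renames: still a regression, but report counts rather
>             # than pages of "PASSED->None" lines
>             return False, f"{len(removed)} test(s) removed, {len(added)} added"
>         if added:
>             # Only new tests: call it a match and just note the additions
>             return True, f"{len(added)} new test(s) added"
>         return True, "test sets match"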
> 
> I think I might be tempted to merge this series and then we can change
> the code to improve from here as this is clearly a vast improvement on
> where we were! Improvements can be incremental on top of these changes.

This goes a long way to shrinking the report even further. It looks like
the curl test reporting needs some work as the IDs appear to change, but
this at least makes the issue clearer and the real deltas are becoming
much easier to see outside the noise.

diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
index 1b0c8335a39..0d8948f012f 100644
--- a/scripts/lib/resulttool/regression.py
+++ b/scripts/lib/resulttool/regression.py
@@ -243,6 +247,21 @@ def regression_common(args, logger, base_results, target_results):
 
     return 0
 
+def fixup_ptest_names(results, logger):
+    for r in results:
+        for i in results[r]:
+            tests = list(results[r][i]['result'].keys())
+            for test in tests:
+                new = None
+                if test.startswith(("ptestresult.lttng-tools.", "ptestresult.babeltrace.", "ptestresult.babeltrace2")) and "_-_" in test:
+                    new = test.split("_-_")[0]
+                elif test.startswith(("ptestresult.curl.")) and "__" in test:
+                    new = test.split("__")[0]
+                if new:
+                    results[r][i]['result'][new] = results[r][i]['result'][test]
+                    del results[r][i]['result'][test]
+
+
 def regression_git(args, logger):
     base_results = {}
     target_results = {}
@@ -304,6 +323,9 @@ def regression_git(args, logger):
     base_results = resultutils.git_get_result(repo, revs[index1][2])
     target_results = resultutils.git_get_result(repo, revs[index2][2])
 
+    fixup_ptest_names(base_results, logger)
+    fixup_ptest_names(target_results, logger)
+
     regression_common(args, logger, base_results, target_results)
 
     return 0


