On 10.09.2015 16:08, Stefan Westerfeld wrote:
> Hi!
>
> On Wed, Sep 09, 2015 at 10:02:12AM +0200, Tim Janik wrote:
>> I'd like to make another Beast release, but some tests are currently failing.
>> In particular, some of the audio feature tests are breaking and need threshold
>> adjustments to pass.
>>
>> I'd like to get your opinion on the threshold adjustments, so for your
>> convenience I've appended:
>> a) the threshold diff required for a successful release;
>> b) a build log from the feature test dir in case you want a peek at the
>>    feature values.
>>
>> Some tests vary by several percent (syndrum) while e.g. partymonster
>> constantly reaches a solid 100% similarity.
>>
>> FYI, the bse2wav.scm script has been replaced by "bsetool.cc render2wav" for
>> porting reasons, but that's not related to audio processing.
>
> Did you keep the deterministic random that --bse-disable-randomization usually
> provided? This is just a guess, but removing it could cause such problems.
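[For readers following the thread: the point of a deterministic-random mode
like --bse-disable-randomization is that noise sources are seeded with a fixed
value, so two renders of the same .bse file produce bit-identical samples and
can be compared against a stored reference. A minimal sketch of that idea in
Python -- purely illustrative, not BSE's implementation; only the flag name
comes from this thread, every other name here is hypothetical:]

```python
import random

def render_noise_block(n, disable_randomization=False):
    """Render n 'noise' samples in [-1, 1].

    With randomization disabled, seed the generator with a fixed value so
    every run yields identical output -- the idea behind
    --bse-disable-randomization (hypothetical sketch, not BSE code).
    """
    rng = random.Random(0x1234) if disable_randomization else random.Random()
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Two deterministic renders are bit-identical, so a feature comparison
# against a stored reference can demand a similarity score near 100.00.
a = render_noise_block(64, disable_randomization=True)
b = render_noise_block(64, disable_randomization=True)
assert a == b
```

[If the fixed seeding were lost, each render would differ slightly and
feature scores would fluctuate from run to run -- hence the guess above.]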
Yes, --bse-disable-randomization was provided, as can be seen from the log I
had attached.

>> diff --git tests/audio/Makefile.am tests/audio/Makefile.am
>> index 7805187..3c6dbde 100644
>> --- tests/audio/Makefile.am
>> +++ tests/audio/Makefile.am
>> @@ -58,7 +58,7 @@ minisong-test:
>> 	$(BSE2WAV) $(srcdir)/minisong.bse $(@F).wav
>> 	$(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 0 --avg-spectrum --spectrum --avg-energy > $(@F).tmp
>> 	$(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 1 --avg-spectrum --spectrum --avg-energy >> $(@F).tmp
>> -	$(BSEFCOMPARE) $(srcdir)/minisong.ref $(@F).tmp --threshold 99.99
>> +	$(BSEFCOMPARE) $(srcdir)/minisong.ref $(@F).tmp --threshold 98.00
>> 	rm -f $(@F).tmp $(@F).wav
[...]

> The tests should ensure that we don't accidentally break how things sound.
> So ideally our goal is that things sound exactly the same (100.00). This
> cannot always be done, so 99.99 is generally used.

Yes; however, AFAIK the synthesis bits and timing bits weren't touched since
things passed the last time, and now I see random failures.

> However, the thresholds you used are nowhere near 99.99, so most likely things
> don't sound the same, and we should investigate why, to ensure that we didn't
> break things.
>
> To understand why I say things may be broken, and we should check this, it is
> important to know that although scores are between 0.00 and 100.00, only those
> scores very close to 100.00 ensure that things really sound the same. So a
> score of 98.00 already tolerates a lot of difference from the original. I'm
> not sure if the difference in any of the files is audible, but it is
> significant.

I'm aware of all this, but there's no clear way to trace such errors AFAICS.
I.e., I can't currently tell why *some* feature tests sometimes fail and
sometimes don't. As a reminder, partymonster always passes at 100%, so there's
no general (timing) brokenness at play here...

> Cu...
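[To illustrate why only scores very close to 100.00 guarantee near-identical
audio: below is a toy similarity metric over magnitude spectra, using cosine
similarity scaled to 0.00-100.00. The thread does not show bsefcompare's
actual metric, so this is only an assumption-laden sketch; all values are
made up for illustration:]

```python
import math

def spectrum_similarity(ref, test):
    """Score two magnitude spectra between 0.00 and 100.00 using a
    normalized dot product (cosine similarity). Illustrative only --
    bsefcompare's real metric may differ."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in test))
    return 100.0 * dot / norm

ref = [1.0, 0.5, 0.25, 0.125]
# A modest perturbation of two spectral bins already drops the score
# below a strict 99.99 threshold...
test = [1.0, 0.55, 0.25, 0.115]
score = spectrum_similarity(ref, test)
assert score < 99.99
# ...yet still passes a relaxed 98.00 threshold with plenty of room,
# which is why 98.00 admits substantial, possibly audible deviations.
assert score > 98.00
```

[Under this toy metric, a rendering has to be nearly sample-exact to score
99.99, while 98.00 leaves room for much larger spectral differences.]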
>    Stefan

--
Yours sincerely,
Tim Janik

https://testbit.eu/timj/
Free software author and speaker.

_______________________________________________
beast mailing list
[email protected]
https://mail.gnome.org/mailman/listinfo/beast
