Oh wow, these are fantastic features!

On 10 December 2015 at 11:28, Trey Jones <[email protected]> wrote:
>> The result will be not only better tests, but a better impact for
>> users, because we will actually be able to deploy the improvements we
>> have worked on
>
> That is the hope and promise of the relevance lab. And it can test
> more than the ZRR.
>
> We already have the ability to check the rates of change to the top n
> results (ignoring order or not). So, you can see if anything breaks
> into the top 5 results (ignoring shuffling among the top 5), or note
> any change at all in the top 5 (including reshuffling). Mechanically,
> you can see that, for example, a change only affects the top 5
> results for 0.1% of your query corpus, so regardless of whether it is
> good or bad, it isn't super high impact.
>
> The relevance lab report also provides links to diffs of example
> queries that are affected by a change, so you can review them to get
> a sense of whether they are good or bad. It's subjective, but you can
> sometimes get a rough sense of things by looking at a few dozen
> randomly chosen examples. If you look at 25 examples and 90% of them
> are clearly worse because of the change, you know you need to fix
> something, even with the giant confidence interval such a small
> sample entails.
>
> So, even without a gold standard corpus of graded search results, you
> can use the relevance lab to do some pre-testing of a change and get
> a sense of whether it's doing what you want on general queries (and
> not just the handful you were focused on trying to fix).
>
> You can also test impact and effectiveness on a focused query corpus
> of rarer query types (e.g., queries of more than 10 words).
>
> And adding other metrics is pretty straightforward, if anyone has any
> ideas of places to take note of changes. And, on the back burner, I
> have some ideas for improvements that look at other annotations on a
> query and how to incorporate/test those in the relevance lab.
>
> So, I agree with you on the relevance lab side. I'm also looking
> forward to better user acceptance testing, whether through more
> complex click-stream metrics or micro surveys or whatever else works.
> We collectively suffer from the curse of knowledge when it comes to
> search - it's hard to know what users who don't spend a non-trivial
> portion of their professional lives contemplating search will really
> like/want/use.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Thu, Dec 10, 2015 at 10:39 AM, Oliver Keyes <[email protected]>
> wrote:
>>
>> The title is mostly to get your attention - you know I like A/B
>> testing. With that being said:
>>
>> For a quarter and a bit we've been running A/B tests. Doing so has
>> been intensely time-consuming for both engineering and analysis, and
>> at times it's felt like we're pushing changes out just to test them,
>> rather than because we have reason to believe there will be dramatic
>> improvements.
>>
>> These tests have produced, at best, mixed results. Many of the tests
>> have not shown a substantial improvement in the metric we have been
>> testing - the zero results rate. Those that have shown an improvement
>> have not been deployed further, because we cannot, from the ZRR
>> alone, test the _utility_ of the results produced: for that we need
>> to A/B test against clickthroughs, or a satisfaction metric.
>>
>> So where do we go from here?
>>
>> In my mind, the ideal is that we stop A/B testing against the zero
>> results rate.
>>
>> This doesn't mean we stop testing improvements: it means we build
>> the relevance lab up and out and test the zero results rate against
>> /that/. ZRR does not need user participation, it needs the
>> participation of user *queries*: with the relevance lab we can
>> consume user queries and test ideas against them at a fraction of
>> the cost of a full A/B test.
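
(Expanding on "test the zero results rate against the lab", for anyone
new to the idea: once we can replay a saved query corpus, ZRR is just
counting. A minimal sketch - run_query() here is a made-up stand-in for
whatever interface the relevance lab ends up exposing, not actual lab
code:

    # Zero results rate over a saved query corpus.
    # run_query(q) is a hypothetical hook that returns the result
    # list a candidate search configuration produces for query q.
    def zero_results_rate(queries, run_query):
        zero = sum(1 for q in queries if not run_query(q))
        return zero / len(queries)

Compute it once with the baseline configuration and once with the
candidate, and the ZRR delta falls out without a single live user.)
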
>> Instead, we use the A/B tests for the other component: the utility
>> component. If something passes the Relevance Lab ring of fire, we
>> A/B test it against clickthroughs: this will be rarer than "every
>> two weeks", and so we can afford to spend some time making sure the
>> test is A+ scientifically and all our ducks are in a row.
>>
>> The result will be not only better tests, but a better impact for
>> users, because we will actually be able to deploy the improvements
>> we have worked on - something that has thus far escaped us due to
>> attention being focused on deploying More Tests rather than
>> completely validating the ones we have already deployed.
>>
>> Thoughts?
>>
>> --
>> Oliver Keyes
>> Count Logula
>> Wikimedia Foundation
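
The top-n change rate is the part I'm most excited about: it gives us
a cheap impact estimate before anyone has to design an experiment. For
anyone who hasn't poked at it, the computation is roughly the sketch
below, with made-up function names rather than the lab's actual code:
baseline(q) and candidate(q) stand in for running a query against each
search configuration and returning its ordered result list.

    # Fraction of queries whose top-n results differ between two
    # search configurations. With ignore_order=True, reshuffling
    # inside the top n does not count as a change; with
    # ignore_order=False, it does.
    def top_n_change_rate(queries, baseline, candidate, n=5,
                          ignore_order=True):
        changed = 0
        for q in queries:
            a, b = baseline(q)[:n], candidate(q)[:n]
            differs = (set(a) != set(b)) if ignore_order else (a != b)
            if differs:
                changed += 1
        return changed / len(queries)

If that comes back at something like 0.1% over a large corpus, we know
a change is low-impact before we ever argue about whether it's good.
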
--
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
