Oh wow, these are fantastic features!

On 10 December 2015 at 11:28, Trey Jones <[email protected]> wrote:
>> The result will be not only better tests, but a better impact for
>> users, because we will actually be able to deploy the improvements we
>> have worked on
>
>
> That is the hope and promise of the relevance lab. And it can test more than
> the ZRR.
>
> We already have the ability to check the rates of change to the top n
> results (ignoring order or not). So, you can see if anything breaks into the
> top 5 results (ignoring shuffling among the top 5), or note any change at
> all in the top 5 (including reshuffling). Mechanically, you can see that,
> for example, a change only affects the top 5 results for 0.1% of your query
> corpus, so regardless of whether it is good or bad, it isn't super high
> impact.
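>
> For illustration, a check along those lines might look roughly like the
> sketch below (made-up data structures and names, not the relevance
> lab's actual code):
>
> def topn_change_rate(before_results, after_results, n=5, ignore_order=True):
>     """Fraction of queries whose top-n results changed between two runs.
>
>     before_results / after_results map each query to a ranked list of
>     result ids. With ignore_order=True, reshuffling within the top n is
>     not counted as a change.
>     """
>     changed = 0
>     for query, before in before_results.items():
>         after = after_results.get(query, [])
>         if ignore_order:
>             differs = set(before[:n]) != set(after[:n])
>         else:
>             differs = before[:n] != after[:n]
>         if differs:
>             changed += 1
>     return changed / len(before_results)
>
> # A return value of 0.001 would correspond to the "top 5 affected for
> # 0.1% of the query corpus" example above.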
>
> The relevance lab report also provides links to diffs of example queries
> that are affected by a change, so you can review them to get a sense of
> whether they are good or bad. It's subjective, but you can sometimes get a
> rough sense of things by looking at a few dozen randomly chosen examples. If
> you look at 25 examples and 90% of them are clearly worse because of the
> change, you know you need to fix something, even with the giant confidence
> interval such a small sample entails.
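>
> To put a rough number on that "giant confidence interval", here is a
> quick back-of-the-envelope sketch using the standard Wilson score
> interval (nothing relevance-lab-specific):
>
> import math
>
> def wilson_interval(p_hat, n, z=1.96):
>     """Approximate 95% Wilson score interval for an observed proportion."""
>     denom = 1 + z * z / n
>     center = (p_hat + z * z / (2 * n)) / denom
>     half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
>     return center - half, center + half
>
> # ~90% of 25 examples judged worse -> roughly (0.72, 0.97). Wide, but the
> # lower bound is still well above 50%, so the change clearly needs fixing.
> print(wilson_interval(0.9, 25))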
>
> So, even without a gold standard corpus of graded search results, you can
> use the relevance lab to do some pre-testing of a change and get a sense of
> whether it's doing what you want on general queries (and not just the
> handful you were focused on trying to fix).
>
> You can also test impact and effectiveness on a focused query corpus of
> rarer query types (e.g., queries of more than 10 words).
>
> Adding other metrics is also pretty straightforward, if anyone has
> ideas for other places to take note of changes. And, on the back burner, I have some
> ideas for improvements that look at other annotations on a query and how to
> incorporate/test those in the relevance lab.
>
> So, I agree with you on the relevance lab side. I'm also looking forward to
> better user acceptance testing, whether through more complex click-stream
> metrics or micro surveys or whatever else works. We collectively suffer from
> the curse of knowledge when it comes to search—it's hard to know what users
> who don't spend a non-trivial portion of their professional lives
> contemplating search will really like/want/use.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Thu, Dec 10, 2015 at 10:39 AM, Oliver Keyes <[email protected]> wrote:
>>
>> The title is mostly to get your attention; you know I like A/B
>> testing. With that being said:
>>
>> For a quarter and a bit we've been running A/B tests. Doing so has
>> been intensely time-consuming for both engineering and analysis, and
>> at times it's felt like we're pushing changes out just to test them,
>> rather than because we have reason to believe there will be dramatic
>> improvements.
>>
>> These tests have produced, at best, mixed results. Many of the tests
>> have not shown a substantial improvement in the metric we have been
>> testing - the zero results rate. Those that did show an improvement
>> have not been deployed further because we cannot, from the ZRR alone,
>> test the _utility_ of the produced results: for that we need to A/B
>> test against clickthroughs, or a satisfaction metric.
>>
>> So where do we go from here?
>>
>> In my mind, the ideal is that we stop A/B testing against the zero results
>> rate.
>>
>> This doesn't mean we stop testing improvements: this means we build
>> the relevance lab up and out and test the zero results rate against
>> /that/. Measuring the ZRR does not need user participation; it needs
>> user *queries*: with the relevance lab we can consume user queries and
>> test ideas against them at a fraction of the cost of a full A/B test.
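>>
>> (As a concrete sketch of what I mean, assuming the relevance lab gives
>> us some hook for replaying a query against a given configuration; the
>> search call below is hypothetical:)
>>
>> def zero_results_rate(queries, run_query):
>>     """Share of queries in a corpus that return no results at all."""
>>     zero = sum(1 for q in queries if len(run_query(q)) == 0)
>>     return zero / len(queries)
>>
>> # Compare a candidate configuration to the baseline entirely offline:
>> # zrr_base = zero_results_rate(corpus, lambda q: search(q, config="baseline"))
>> # zrr_cand = zero_results_rate(corpus, lambda q: search(q, config="candidate"))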
>>
>> Instead, we use the A/B tests for the other component: the utility
>> component. If something passes the Relevance Lab ring of fire, we A/B
>> test it against clickthroughs: this will be rarer than "every two
>> weeks" and so we can afford to spend some time making sure the test is
>> A+ scientifically, and all our ducks are in a row.
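>>
>> (When we do run those clickthrough A/B tests, the analysis can be as
>> simple as a standard two-proportion z-test; a sketch with made-up
>> numbers, not results from any actual test:)
>>
>> import math
>>
>> def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
>>     """Two-sided z-test for a difference in clickthrough rates."""
>>     p_a, p_b = clicks_a / n_a, clicks_b / n_b
>>     pooled = (clicks_a + clicks_b) / (n_a + n_b)
>>     se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
>>     z = (p_a - p_b) / se
>>     p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
>>     return z, p_value
>>
>> # Hypothetical counts: 5,400 clicks on 20,000 control searches vs.
>> # 5,620 clicks on 20,000 test searches.
>> print(two_proportion_ztest(5400, 20000, 5620, 20000))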
>>
>> The result will be not only better tests, but a better impact for
>> users, because we will actually be able to deploy the improvements we
>> have worked on - something that has thus far escaped us due to
>> attention being focused on deploying More Tests rather than completely
>> validating the ones we have already deployed.
>>
>> Thoughts?
>>
>> --
>> Oliver Keyes
>> Count Logula
>> Wikimedia Foundation



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
