Re: [discovery] [AI] Collecting human labeled relevance judgements for search from readers

Jonathan Morgan Thu, 04 May 2017 16:38:09 -0700

On Thu, May 4, 2017 at 4:27 PM, Trey Jones <[email protected]> wrote:


>> One possible way to give people the context they need to answer the
>> question accurately is to provide them with, say, three of the top search
>> queries that you think are relevant to the result, and ask them to choose
>> which one is *most* relevant.
>
>
> That might be less confusing, but unfortunately I don't think it would
> give us what we want. In this scenario, we'd need up/down votes on all
> three options, and relative ranking among them wouldn't be useful. (I can
> give an example to explain if that's not clear.)
>

"click all that apply"?

Okay, I'm spitballing now :P Happy to talk about this more if you want.
Sounds like I'd need a little more background on your goals and the data
you're working with to be more helpful - J



>
> I agree this falls under (or is at least reasonably similar to)
> experimental design, though, and it'd be great to get help.
>
> (While this was Erik's excellent idea, I'm very excited about it because
> it would mean I could stop feeling guilty about not having done any
> Discernatron queries in months.)
>
>
> On Thu, May 4, 2017 at 7:07 PM, Jonathan Morgan <[email protected]>
> wrote:
>
>> This conversation is exactly what I meant by "experimental design" above.
>>
>> I like Jan's recommendation to keep the prompt simple, and ask people
>> to provide a quick binary judgement. However, I agree that, considering some
>> of the search queries you're showing folks are going to be kind of oddball,
>> you want to give them a little bit of context to help them understand that
>> they're looking at a search query.
>>
>> One possible way to give people the context they need to answer the
>> question accurately is to provide them with, say, three of the top search
>> queries that you think are relevant to the result, and ask them to choose
>> which one is *most* relevant.
>>
>> Without some context, I'm not sure I would be able to give an accurate
>> answer to the question "Is this article about 'hydrostone halifax nova
>> scotia'"?
>>
>> Seeing multiple examples makes decision-making easier. The prompt could
>> be something like "Which set of [search terms/key words/tags] is most
>> relevant to this article?"
>>
>> Adding a "none of the above" option as well would allow you to screen out
>> cases where the responder was either confused by the question, or felt that
>> none of the candidate queries were even remotely relevant.
>>
>> I suggest you loop Aeryn Palmer from Legal in, and add a "why are we
>> asking this?" link into the banner/quicksurvey popup that links to a survey
>> privacy statement page on FoundationWiki
>> <https://wikimediafoundation.org/wiki/Quick_Survey_Privacy_Statement>.
>>
>> Hope that helps,
>> J
>>
>> On Thu, May 4, 2017 at 12:32 PM, Trey Jones <[email protected]> wrote:
>>
>>> Yeah, this is definitely the reverse of Discernatron. Part of the reason
>>> for waiting 60s is that then, hopefully, the reader at least has some idea
>>> what the article is about (another difficulty with Discernatron), so they
>>> only have to spend a little time guessing what the query is about.
>>>
>>> We are going to have to work on the wording of the question. It needs to
>>> be clear and concise.
>>>
>>> I worry that *Is this page about "X"?* might make people reply too
>>> strictly. A page can be reasonably relevant to X without being *about*
>>> X. What about this: *If you searched for X, would this article be a
>>> good result?* I'm not sure normal people think of "results".
>>>
>>>    - *Would someone who searched for X want to read this article?*
>>>    (better)
>>>    - *If someone searched for X, would they want to read this article?*
>>>    (longer, but easier to parse)
>>>    - *If someone searched for X, would they find what they are
>>>    looking for in this article?* (probably too long)
>>>
>>> More brainstorming on this wouldn't hurt, even if it is very early in
>>> the whole process.
>>>
>>> There's also the wording that goes with the request for a judgement.
>>> "Help us make search better!" might get more response than just the
>>> judgement question.
>>>
>>> Folks in fundraising might have good ideas about how to catch people's
>>> attention, and at the very least we could learn from them and actively
>>> A/B test different options and see what kind of response rate we get.
>>>
>>> We might also get cleaner A/B test results if we limited their scope—a
>>> few pages and a few "queries" where we know the answers, so we can gauge
>>> not only response rate, but also engagement, to see if one kind of phrasing
>>> makes people try a little harder.
>>>
>>> We might also want to make "No, thanks" the default button so that it is
>>> easier to bail than to give random input.
>>>
>>> Trey Jones
>>> Software Engineer, Discovery
>>> Wikimedia Foundation
>>>
>>> On Thu, May 4, 2017 at 2:44 PM, Jan Drewniak <[email protected]>
>>> wrote:
>>>
>>>> Hi Erik
>>>>
>>>> From my understanding, it looks like you're looking to collect relevance
>>>> data "in reverse". Typically, for this type of data collection, I would
>>>> assume that you'd present a query with some search results, and ask users
>>>> "which results are relevant to this query" (which is what Discernatron
>>>> does, at a very high effort level).
>>>>
>>>> What I think you're proposing instead is that when a user visits an
>>>> article, we present them with a question that asks "would this search query
>>>> be relevant to the article you are looking at".
>>>>
>>>> I can see this working, provided that the query is controlled and the
>>>> question is *not* phrased like it is above.
>>>>
>>>> I think that for this to work, the question should be phrased in a way
>>>> that elicits a simple "top-level" (maybe "yes" or "no") response. For
>>>> example, the question "*is this page about*: 'hydrostone halifax nova
>>>> scotia' " can be responded to with a thumbs up 👍 or thumbs down 👎, but a
>>>> question like "is this article relevant to the following query: ..." seems
>>>> more complicated 🤔 .
>>>>
>>>>
>>>> On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <
>>>> [email protected]> wrote:
>>>>
>>>>> On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Erik,
>>>>>>
>>>>>> I've been using some similar methods to evaluate Related Article
>>>>>> recommendations
>>>>>> <https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations>
>>>>>> and the source of the trending article card
>>>>>> <https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature>
>>>>>> in the Explore feed on Android. Let me know if you'd like to sit down and
>>>>>> chat about experimental design sometime.
>>>>>>
>>>>>> - J
>>>>>>
>>>>>>
>>>>> This might be useful. I'll see if I can find a time on both our
>>>>> calendars. I should note, though, that this is explicitly not about
>>>>> experimental design. The data is not going to be used for experimental
>>>>> purposes, but rather to feed into a machine learning pipeline that will
>>>>> re-order search results to provide the best results at the top of the
>>>>> list. To ensure the long tail is represented in the training data for
>>>>> this model, I would like to have a few tens of thousands of labels for
>>>>> (query, page) combinations each month. The relevance of pages to a query
>>>>> does have some temporal aspect, so we would likely want to only use the
>>>>> last N months' worth of data (TBD).
>>>>>
>>>>> On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> At our weekly relevance meeting an interesting idea came up about
>>>>>>> how to collect relevance judgements for the long tail of queries, which
>>>>>>> make up around 60% of search sessions.
>>>>>>>
>>>>>>> We are pondering asking questions on the article pages themselves.
>>>>>>> Roughly, we would manually curate a list of queries we want to collect
>>>>>>> relevance judgements for. When a user has spent some threshold of time
>>>>>>> (60s?) on a page we would, for some % of users, check if we have any
>>>>>>> queries we want labeled for this page, and then ask them if the page is
>>>>>>> a relevant result for that query. In this way the amount of work asked of
>>>>>>> individuals is relatively low and hopefully something they can answer
>>>>>>> without too much effort. We know that the average page receives a few
>>>>>>> thousand page views per day, so even with a relatively low response
>>>>>>> rate we could probably collect a reasonable number of judgements over
>>>>>>> some medium-length time period (weeks?).
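
A rough sketch of the check described above (time threshold, then a sample of users, then a lookup of queries wanting labels for the page) might look like the following. Everything named here is hypothetical: `pick_query`, `in_sample`, the `pending` mapping, the session token, and the 1% default rate are illustrative assumptions, not anything that exists in the search stack. Hashing the token is just one way to keep the sampling decision stable for a given reader.

```python
import hashlib


def in_sample(session_token, rate):
    """Deterministically place a session in or out of the sampled fraction.

    Hashing the (hypothetical) session token means the same reader gets
    the same decision on every page view, unlike a fresh random draw.
    """
    digest = hashlib.sha256(session_token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def pick_query(page_id, seconds_on_page, session_token, pending,
               min_seconds=60, sample_rate=0.01):
    """Return a query to ask about, or None if no prompt should be shown.

    `pending` is an assumed mapping of page id -> list of curated queries
    that still need judgements for that page.
    """
    if seconds_on_page < min_seconds:
        return None  # reader has not been on the page long enough
    if not in_sample(session_token, sample_rate):
        return None  # reader is outside the sampled fraction
    queries = pending.get(page_id)
    if not queries:
        return None  # nothing we want labeled for this page
    return queries[0]


pending = {"Hydrostone": ["hydrostone halifax nova scotia"]}
print(pick_query("Hydrostone", 75, "example-session", pending, sample_rate=1.0))
```

Which pending query to ask about first (here simply the first in the list) is left open; choosing the one with the fewest judgements so far would be one alternative.
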
>>>>>>>
>>>>>>> These labels would almost certainly be noisy, so we would need to
>>>>>>> collect the same judgement many times to get any kind of certainty on
>>>>>>> the label. Additionally, we would not be able to really explain the
>>>>>>> nuances of a grading scale with many points, so we would probably have
>>>>>>> to use either a thumbs up/thumbs down approach, or maybe a
>>>>>>> happy/sad/indifferent smiley face.
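
One possible way to turn many noisy thumbs up/down answers into a single label is to count votes per (query, page) pair and only commit to a label once a confidence interval on the "thumbs up" share clears 50%. The sketch below uses the Wilson score interval for that. This is an assumption for illustration, not the method the thread settles on, and the function names, the 0.5 cutoff, and the z value are all hypothetical choices.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(ups, n, z=1.96):
    """Wilson score interval for the share of thumbs-up votes (z=1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = ups / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def aggregate(votes):
    """Collapse (query, page, is_relevant) votes into one verdict per pair."""
    counts = defaultdict(lambda: [0, 0])  # (query, page) -> [ups, total]
    for query, page, is_relevant in votes:
        entry = counts[(query, page)]
        entry[0] += int(is_relevant)
        entry[1] += 1

    labels = {}
    for key, (ups, n) in counts.items():
        low, high = wilson_interval(ups, n)
        if low > 0.5:
            verdict = "relevant"
        elif high < 0.5:
            verdict = "not relevant"
        else:
            verdict = "undecided"  # keep collecting judgements
        labels[key] = (verdict, ups, n)
    return labels


votes = [("hydrostone halifax nova scotia", "Hydrostone", True)] * 9
votes += [("hydrostone halifax nova scotia", "Hydrostone", False)]
print(aggregate(votes))
```

With these settings, 9 thumbs up out of 10 is enough to call a pair relevant, while 2 out of 3 stays undecided, which matches the intuition that a handful of answers is not enough.
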
>>>>>>>
>>>>>>> Does this seem reasonable? Are there other ways we could go about
>>>>>>> collecting the same data? How could we design it in a non-intrusive
>>>>>>> manner that gets results, but doesn't annoy users? Other thoughts?
>>>>>>>
>>>>>>>
>>>>>>> For some background:
>>>>>>>
>>>>>>> * We are currently generating labeled data using statistical
>>>>>>> analysis (clickmodels) against historical click data. This analysis
>>>>>>> requires there to be multiple search sessions with the same query
>>>>>>> presented with similar results to estimate the relevance of those
>>>>>>> results. A manual review of the results showed queries with clicks from
>>>>>>> at least 10 sessions had reasonable but not great labels, queries with
>>>>>>> 35+ sessions looked pretty good, and queries with hundreds of sessions
>>>>>>> were labeled really well.
>>>>>>>
>>>>>>> * An analysis of 80 days' worth of search click logs showed that 35
>>>>>>> to 40% of search sessions are for queries that are repeated more than 10
>>>>>>> times in that 80 day period. Around 20% of search sessions are for
>>>>>>> queries that are repeated more than 35 times in that 80 day period. (
>>>>>>> https://phabricator.wikimedia.org/P5371)
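
For readers who want to reproduce this kind of figure on their own logs, the measurement can be sketched as below. This is not the analysis behind the Phabricator paste. It assumes a simplified input of one normalized query string per search session, and `session_share_above` is a hypothetical helper name.

```python
from collections import Counter


def session_share_above(session_queries, min_repeats):
    """Share of sessions whose query occurs more than `min_repeats` times.

    `session_queries` holds one (already normalized) query string per
    search session, which is a simplifying assumption about the log format.
    """
    total = len(session_queries)
    if total == 0:
        return 0.0
    counts = Counter(session_queries)
    covered = sum(n for n in counts.values() if n > min_repeats)
    return covered / total


# Toy data: 12 sessions for one query, 5 for another, 3 for a third.
sessions = ["a"] * 12 + ["c"] * 5 + ["b"] * 3
print(session_share_above(sessions, 10))  # 12 of 20 sessions
```
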
>>>>>>>
>>>>>>> * Our privacy policy prevents us from keeping more than 90 days'
>>>>>>> worth of data from which to run these clickmodels. Practically, 80 days
>>>>>>> is probably a reasonable cutoff, as we will want to re-use the data
>>>>>>> multiple times before needing to delete it and generate a new set of
>>>>>>> labels.
>>>>>>>
>>>>>>> * We currently collect human relevance judgements with Discernatron (
>>>>>>> https://discernatron.wmflabs.org/). This is useful data for manual
>>>>>>> evaluation of changes, but the data set is much too small (low hundreds
>>>>>>> of queries, with an average of 50 documents per query) to integrate into
>>>>>>> machine learning. The process of judging query/document pairs for the
>>>>>>> community is quite tedious, and it doesn't seem like a great use of
>>>>>>> engineer time for us to do this ourselves.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> AI mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/ai
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jonathan T. Morgan
>>>>>> Senior Design Researcher
>>>>>> Wikimedia Foundation
>>>>>> User:Jmorgan (WMF)
>>>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> discovery mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jan Drewniak
>>>> UX Engineer, Discovery
>>>> Wikimedia Foundation
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Jonathan T. Morgan
>> Senior Design Researcher
>> Wikimedia Foundation
>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>
>>
>>
>>
>
>
>


-- 
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>

