Hi Erik

From my understanding, it looks like you're looking to collect relevance data
"in reverse". Typically, for this type of data collection, I would assume
that you'd present a query with some search results and ask users "which
results are relevant to this query?" (which is what Discernatron does, at a
very high effort level).

What I think you're proposing instead is that when a user visits an article,
we present them with a question that asks "would this search query be
relevant to the article you are looking at?".

I can see this working, provided that the query is controlled and the
question is *not* phrased like it is above.

I think that for this to work, the question should be phrased in a way that
elicits a simple "top-level" (maybe "yes" or "no") response. For example,
the question "*is this page about*: 'hydrostone halifax nova scotia' " can
be responded to with a thumbs up 👍 or thumbs down 👎, but a question like
"is this article relevant to the following query: ..." seems more
complicated 🤔 .
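As a rough sketch (the function names and the 0/1 label encoding here are purely illustrative, not an existing interface), a binary prompt like that maps straight onto a relevance label for a (query, page) pair:

```python
# Illustrative sketch only: a binary "is this page about X?" prompt whose
# thumbs-up/down answers become 0/1 relevance labels for (query, page) pairs.

def build_prompt(query):
    """Phrase the question so a single tap is a complete answer."""
    return f"Is this page about: '{query}'?"

def to_label(query, page_id, thumbs_up):
    """Turn a thumbs-up/down tap into a (query, page, label) triple."""
    return (query, page_id, 1 if thumbs_up else 0)
```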


On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <[email protected]
> wrote:

> On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan <[email protected]>
> wrote:
>
>> Hi Erik,
>>
>> I've been using some similar methods to evaluate Related Article
>> recommendations
>> <https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations>
>> and the source of the trending article card
>> <https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature>
>> in the Explore feed on Android. Let me know if you'd like to sit down and
>> chat about experimental design sometime.
>>
>> - J
>>
>>
> This might be useful. I'll see if I can find a time on both our calendars.
> I should note though this is explicitly not about experimental design. The
> data is not going to be used for experimental purposes, but rather to feed
> into a machine learning pipeline that will re-order search results to
> provide the best results at the top of the list. For the purpose of
> ensuring the long tail is represented in the training data for this model I
> would like to have a few tens of thousands of labels for (query, page)
> combinations each month. The relevance of pages to a query does have some
> temporal aspect, so we would likely want to only use the last N months
> worth of data (TBD).
>
> On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <
>> [email protected]> wrote:
>>
>>> At our weekly relevance meeting an interesting idea came up about how to
>>> collect relevance judgements for the long tail of queries, which make up
>>> around 60% of search sessions.
>>>
>>> We are pondering asking questions on the article pages themselves.
>>> Roughly we would manually curate some list of queries we want to collect
>>> relevance judgements for. When a user has spent some threshold of time
>>> (60s?) on a page we would, for some % of users, check if we have any
>>> queries we want labeled for this page, and then ask them if the page is a
>>> relevant result for that query. In this way the amount of work asked of
>>> individuals is relatively low and hopefully something they can answer
>>> without too much work. We know that the average page receives a few
>>> thousand page views per day, so even with a relatively low response rate we
>>> could probably collect a reasonable number of judgements over some medium
>>> length time period (weeks?).
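A minimal sketch of the gating logic described above (the threshold, sample rate, and lookup table are placeholder values and names, not a real implementation):

```python
import random

# Sketch of the proposed gating logic: after a dwell-time threshold, a
# sampled fraction of users is asked about a curated query for the page
# they are on. All numbers and names here are illustrative only.

DWELL_THRESHOLD_S = 60      # "some threshold of time (60s?)"
SAMPLE_RATE = 0.01          # "some % of users" (placeholder value)

# Manually curated (page -> queries awaiting judgements) mapping.
QUERIES_FOR_PAGE = {
    "Hydrostone": ["hydrostone halifax nova scotia"],
}

def maybe_ask(page_title, dwell_seconds, rng=random):
    """Return a query to ask the user about, or None to show no prompt."""
    if dwell_seconds < DWELL_THRESHOLD_S:
        return None                      # user has not dwelt long enough
    if rng.random() >= SAMPLE_RATE:
        return None                      # user not in the sampled fraction
    queries = QUERIES_FOR_PAGE.get(page_title)
    return queries[0] if queries else None
```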
>>>
>>> These labels would almost certainly be noisy; we would need to collect
>>> the same judgement many times to get any kind of certainty on the label.
>>> Additionally, we would not be able to really explain the nuances of a
>>> grading scale with many points, so we would probably have to use either a
>>> thumbs up/thumbs down approach or maybe a happy/sad/indifferent smiley
>>> face.
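One simple way to handle that noise (a sketch; the minimum-count and agreement thresholds are arbitrary placeholders) is to withhold a label until enough repeated judgements agree:

```python
# Sketch: individual responses are noisy, so only emit a label once enough
# judgements agree. Both thresholds below are placeholder values.

MIN_JUDGEMENTS = 10   # how many responses before we trust anything
MIN_AGREEMENT = 0.8   # fraction that must agree on the majority answer

def aggregate(judgements):
    """judgements: list of 0/1 votes for one (query, page) pair.

    Returns 1 (relevant), 0 (not relevant), or None (still undecided).
    """
    if len(judgements) < MIN_JUDGEMENTS:
        return None                      # not enough data yet
    frac_up = sum(judgements) / len(judgements)
    if frac_up >= MIN_AGREEMENT:
        return 1
    if frac_up <= 1 - MIN_AGREEMENT:
        return 0
    return None                          # too contested to label
```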
>>>
>>> Does this seem reasonable? Are there other ways we could go about
>>> collecting the same data? How to design it in a non-intrusive manner that
>>> gets results, but doesn't annoy users? Other thoughts?
>>>
>>>
>>> For some background:
>>>
>>> * We are currently generating labeled data using statistical analysis
>>> (clickmodels) against historical click data. This analysis requires there
>>> to be multiple search sessions with the same query presented with similar
>>> results to estimate the relevance of those results. A manual review of the
>>> results showed queries with clicks from at least 10 sessions had reasonable
>>> but not great labels, queries with 35+ sessions looked pretty good, and
>>> queries with hundreds of sessions were labeled really well.
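The session-count tiers above amount to a simple filter applied before the clickmodel runs. A sketch of just that bucketing step (not the clickmodel itself; the function name is hypothetical, the thresholds come from the numbers above):

```python
from collections import Counter

# Sketch of the session-count filter implied above: only queries issued in
# enough distinct search sessions yield trustworthy clickmodel labels.

def queries_by_confidence(session_queries, ok=10, good=35):
    """session_queries: list of (session_id, query) pairs.

    Buckets queries by how many distinct sessions issued them.
    """
    # set() so a session counts at most once per query
    counts = Counter(q for _, q in set(session_queries))
    return {
        "good": [q for q, n in counts.items() if n >= good],
        "reasonable": [q for q, n in counts.items() if ok <= n < good],
        "too_sparse": [q for q, n in counts.items() if n < ok],
    }
```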
>>>
>>> * an analysis of 80 days worth of search click logs showed that 35 to
>>> 40% of search sessions are for queries that are repeated more than 10 times
>>> in that 80 day period. Around 20% of search sessions are for queries that
>>> are repeated more than 35 times in that 80 day period. (
>>> https://phabricator.wikimedia.org/P5371)
>>>
>>> * Our privacy policy prevents us from keeping more than 90 days worth of
>>> data from which to run these clickmodels. Practically, 80 days is probably a
>>> reasonable cutoff, as we will want to re-use the data multiple times before
>>> needing to delete it and generate a new set of labels.
>>>
>>> * We currently collect human relevance judgements with Discernatron (
>>> https://discernatron.wmflabs.org/). This is useful data for manual
>>> evaluation of changes, but the data set is much too small (low hundreds of
>>> queries, with an average of 50 documents per query) to integrate into
>>> machine learning. The process of judging query/document pairs for the
>>> community is quite tedious, and it doesn't seem like a great use of
>>> engineer time for us to do this ourselves.
>>>
>>> _______________________________________________
>>> AI mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/ai
>>>
>>>
>>
>>
>> --
>> Jonathan T. Morgan
>> Senior Design Researcher
>> Wikimedia Foundation
>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>
>>
>> _______________________________________________
>> discovery mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>
>>
>


-- 
Jan Drewniak
UX Engineer, Discovery
Wikimedia Foundation
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
