On Thu, May 4, 2017 at 4:27 PM, Trey Jones <[email protected]> wrote:
>> One possible way to give people the context they need to answer the
>> question accurately is to provide them with, say, three of the top search
>> queries that you think are relevant to the result, and ask them to choose
>> which one is *most* relevant.
>
> That might be less confusing, but unfortunately I don't think it would
> give us what we want. In this scenario, we'd need up/down votes on all
> three options, and relative ranking among them wouldn't be useful. (I can
> give an example to explain if that's not clear.)

"click all that apply"? Okay, I'm spitballing now :P

Happy to talk about this more if you want. Sounds like I'd need a little
more background on your goals and the data you're working with to be more
helpful.

- J

> I agree this falls under (or is at least reasonably similar to)
> experimental design, though, and it'd be great to get help.
>
> (While this was Erik's excellent idea, I'm very excited about it because
> it would mean I could stop feeling guilty about not having done any
> Discernatron queries in months.)
>
> On Thu, May 4, 2017 at 7:07 PM, Jonathan Morgan <[email protected]>
> wrote:
>
>> This conversation is exactly what I meant by "experimental design" above.
>>
>> I like Jan's recommendation to keep the prompt simple, and ask people
>> to provide a quick binary judgement. However I agree that, considering some
>> of the search queries you're showing folks are going to be kind of oddball,
>> you want to give them a little bit of context to help them understand that
>> they're looking at a search query.
>>
>> One possible way to give people the context they need to answer the
>> question accurately is to provide them with, say, three of the top search
>> queries that you think are relevant to the result, and ask them to choose
>> which one is *most* relevant.
>>
>> Without some context, I'm not sure I would be able to give an accurate
>> answer to the question "Is this article about 'hydrostone halifax nova
>> scotia'"?
>>
>> Seeing multiple examples makes decision-making easier. The prompt could
>> be something like "Which set of [search terms/key words/tags] is most
>> relevant to this article?"
>>
>> Adding a "none of the above" option as well would allow you to screen out
>> cases where the responder was either confused by the question, or felt that
>> none of the candidate queries were even remotely relevant.
>>
>> I suggest you loop Aeryn Palmer from Legal in, and add a "why are we
>> asking this?" link into the banner/quicksurvey popup that links to a survey
>> privacy statement page on FoundationWiki
>> <https://wikimediafoundation.org/wiki/Quick_Survey_Privacy_Statement>.
>>
>> Hope that helps,
>> J
>>
>> On Thu, May 4, 2017 at 12:32 PM, Trey Jones <[email protected]> wrote:
>>
>>> Yeah, this is definitely the reverse of Discernatron. Part of the reason
>>> for waiting 60s is that then, hopefully, the reader at least has some idea
>>> what the article is about (another difficulty with Discernatron), so they
>>> only have to spend a little time guessing what the query is about.
>>>
>>> We are going to have to work on the wording of the question. It needs to
>>> be clear and concise.
>>>
>>> I worry that *Is this page about "X"?* might make people reply too
>>> strictly. A page can be reasonably relevant to X without being *about*
>>> X. What about this: *If you searched for X, would this article be a
>>> good result?* I'm not sure normal people think of "results".
>>>
>>> - *Would someone who searched for X want to read this article?*
>>>   (better)
>>> - *If someone searched for X, would they want to read this article?*
>>>   (longer, but easier to parse)
>>> - *If someone searched for X, would they find what they are looking
>>>   for in this article?* (probably too long)
>>>
>>> More brainstorming on this wouldn't hurt, even if it is very early in
>>> the whole process.
>>>
>>> There's also the wording that goes with the request for a judgement.
>>> "Help us make search better!" might get more response than just the
>>> judgement question.
>>>
>>> Folks in fundraising might have good ideas about how to catch people's
>>> attention, and at the very least we could learn from them and actively
>>> A/B test different options and see what kind of response rate we get.
>>>
>>> We might also get cleaner A/B test results if we limited their scope to a
>>> few pages and a few "queries" where we know the answers, so we can gauge
>>> not only response rate, but also engagement, to see if one kind of phrasing
>>> makes people try a little harder.
>>>
>>> We might also want to make "No, thanks" the default button so that it is
>>> easier to bail than to give random input.
>>>
>>> Trey Jones
>>> Software Engineer, Discovery
>>> Wikimedia Foundation
>>>
>>> On Thu, May 4, 2017 at 2:44 PM, Jan Drewniak <[email protected]>
>>> wrote:
>>>
>>>> Hi Erik
>>>>
>>>> From my understanding, it looks like you're looking to collect relevance
>>>> data "in reverse". Typically, for this type of data collection, I would
>>>> assume that you'd present a query with some search results, and ask users
>>>> "which results are relevant to this query" (which is what Discernatron
>>>> does, at a very high effort level).
>>>>
>>>> What I think you're proposing instead is that when a user visits an
>>>> article, we present them with a question that asks "would this search query
>>>> be relevant to the article you are looking at".
>>>>
>>>> I can see this working, provided that the query is controlled and the
>>>> question is *not* phrased like it is above.
>>>>
>>>> I think that for this to work, the question should be phrased in a way
>>>> that elicits a simple "top-level" (maybe "yes" or "no") response. For
>>>> example, the question "*is this page about*: 'hydrostone halifax nova
>>>> scotia' " can be responded to with a thumbs up 👍 or thumbs down 👎, but a
>>>> question like "is this article relevant to the following query: ..."
>>>> seems more complicated 🤔.
>>>>
>>>> On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <
>>>> [email protected]> wrote:
>>>>
>>>>> On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Erik,
>>>>>>
>>>>>> I've been using some similar methods to evaluate Related Article
>>>>>> recommendations
>>>>>> <https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations>
>>>>>> and the source of the trending article card
>>>>>> <https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature>
>>>>>> in the Explore feed on Android. Let me know if you'd like to sit down and
>>>>>> chat about experimental design sometime.
>>>>>>
>>>>>> - J
>>>>>>
>>>>> This might be useful. I'll see if I can find a time on both our
>>>>> calendars. I should note though this is explicitly not about experimental
>>>>> design. The data is not going to be used for experimental purposes, but
>>>>> rather to feed into a machine learning pipeline that will re-order search
>>>>> results to provide the best results at the top of the list. For the
>>>>> purpose of ensuring the long tail is represented in the training data for
>>>>> this model I would like to have a few tens of thousands of labels for
>>>>> (query, page) combinations each month. The relevance of pages to a query
>>>>> does have some temporal aspect, so we would likely want to only use the
>>>>> last N months worth of data (TBD).
>>>>>
>>>>>> On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> At our weekly relevance meeting an interesting idea came up about
>>>>>>> how to collect relevance judgements for the long tail of queries, which
>>>>>>> make up around 60% of search sessions.
>>>>>>>
>>>>>>> We are pondering asking questions on the article pages themselves.
>>>>>>> Roughly we would manually curate some list of queries we want to collect
>>>>>>> relevance judgements for. When a user has spent some threshold of time
>>>>>>> (60s?) on a page we would, for some % of users, check if we have any
>>>>>>> queries we want labeled for this page, and then ask them if the page is
>>>>>>> a relevant result for that query. In this way the amount of work asked
>>>>>>> of individuals is relatively low and hopefully something they can answer
>>>>>>> without too much work. We know that the average page receives a few
>>>>>>> thousand page views per day, so even with a relatively low response
>>>>>>> rate we could probably collect a reasonable number of judgements over
>>>>>>> some medium length time period (weeks?).
>>>>>>>
>>>>>>> These labels would almost certainly be noisy; we would need to
>>>>>>> collect the same judgement many times to get any kind of certainty on
>>>>>>> the label. Additionally we would not be able to really explain the
>>>>>>> nuances of a grading scale with many points; we would probably have to
>>>>>>> use either a thumbs up/thumbs down approach, or maybe a
>>>>>>> happy/sad/indifferent smiley face.
>>>>>>>
>>>>>>> Does this seem reasonable? Are there other ways we could go about
>>>>>>> collecting the same data? How to design it in a non-intrusive manner
>>>>>>> that gets results, but doesn't annoy users? Other thoughts?
>>>>>>>
>>>>>>> For some background:
>>>>>>>
>>>>>>> * We are currently generating labeled data using statistical
>>>>>>> analysis (clickmodels) against historical click data. This analysis
>>>>>>> requires there to be multiple search sessions with the same query
>>>>>>> presented with similar results to estimate the relevance of those results.
>>>>>>> A manual review of the results showed queries with clicks from at least
>>>>>>> 10 sessions had reasonable but not great labels, queries with 35+
>>>>>>> sessions looked pretty good, and queries with hundreds of sessions were
>>>>>>> labeled really well.
>>>>>>>
>>>>>>> * An analysis of 80 days worth of search click logs showed that 35
>>>>>>> to 40% of search sessions are for queries that are repeated more than 10
>>>>>>> times in that 80 day period. Around 20% of search sessions are for
>>>>>>> queries that are repeated more than 35 times in that 80 day period.
>>>>>>> (https://phabricator.wikimedia.org/P5371)
>>>>>>>
>>>>>>> * Our privacy policy prevents us from keeping more than 90 days
>>>>>>> worth of data from which to run these clickmodels. Practically 80 days
>>>>>>> is probably a reasonable cutoff, as we will want to re-use the data
>>>>>>> multiple times before needing to delete it and generate a new set of
>>>>>>> labels.
>>>>>>>
>>>>>>> * We currently collect human relevance judgements with Discernatron
>>>>>>> (https://discernatron.wmflabs.org/). This is useful data for manual
>>>>>>> evaluation of changes, but the data set is much too small (low hundreds
>>>>>>> of queries, with an average of 50 documents per query) to integrate into
>>>>>>> machine learning. The process of judging query/document pairs for the
>>>>>>> community is quite tedious, and it doesn't seem like a great use of
>>>>>>> engineer time for us to do this ourselves.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> AI mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/ai
>>>>>>>
>>>>>> --
>>>>>> Jonathan T. Morgan
>>>>>> Senior Design Researcher
>>>>>> Wikimedia Foundation
>>>>>> User:Jmorgan (WMF)
>>>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>>>
>>>>>> _______________________________________________
>>>>>> discovery mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>>>>
>>>> --
>>>> Jan Drewniak
>>>> UX Engineer, Discovery
>>>> Wikimedia Foundation
>>>>
>> --
>> Jonathan T. Morgan
>> Senior Design Researcher
>> Wikimedia Foundation
>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>

--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
_______________________________________________ discovery mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/discovery
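
For anyone following the thread: a rough sketch of the "collect the same judgement many times" step Erik describes, under some assumptions of mine. It pools thumbs up/down votes per (query, page) pair and only commits to a label once a Wilson score interval on the share of "up" votes clears 50%. The function names, the 50% cutoff, the 95% level (z = 1.96) and the sample votes are all hypothetical; nothing here comes from the actual Discovery pipeline.

```python
import math
from collections import defaultdict


def wilson_interval(ups, n, z=1.96):
    """Wilson score interval for the proportion ups / n (z=1.96 is ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no votes yet: we know nothing
    p = ups / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def aggregate(votes, z=1.96):
    """Turn noisy (query, page, is_relevant) votes into one label per pair.

    A pair is labeled only when the whole interval sits on one side of 0.5;
    otherwise it stays "uncertain" and needs more votes.
    """
    counts = defaultdict(lambda: [0, 0])  # (query, page) -> [ups, total]
    for query, page, is_relevant in votes:
        entry = counts[(query, page)]
        entry[0] += int(is_relevant)
        entry[1] += 1

    labels = {}
    for key, (ups, n) in counts.items():
        low, high = wilson_interval(ups, n, z)
        if low > 0.5:
            labels[key] = "relevant"
        elif high < 0.5:
            labels[key] = "not relevant"
        else:
            labels[key] = "uncertain"
    return labels


# Hypothetical sample: 18 of 20 readers say yes, 2 of 20 say yes, 3 of 5 say yes.
sample_votes = (
    [("hydrostone halifax nova scotia", "Page A", True)] * 18
    + [("hydrostone halifax nova scotia", "Page A", False)] * 2
    + [("hydrostone halifax nova scotia", "Page B", True)] * 2
    + [("hydrostone halifax nova scotia", "Page B", False)] * 18
    + [("hydrostone halifax nova scotia", "Page C", True)] * 3
    + [("hydrostone halifax nova scotia", "Page C", False)] * 2
)
sample_labels = aggregate(sample_votes)
```

With only five votes the third pair stays "uncertain", which is the point: a low response rate just means a pair takes longer to get a label, rather than getting a wrong one.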
