Thanks Erik for summarizing the discussion so far. The very last sentence got cut off:
> But yes it's a huge engineering task with a lot of challenges :/ It's also

I think I know what was next: "... a fun engineering task with many new things to learn! :)" Even if that wasn't the next bit, it's still true.

On Fri, Mar 4, 2016 at 8:24 PM, Erik Bernhardson <[email protected]> wrote:

> This thread started off list, but I'm hoping all of you watching along can help us brainstorm and improve search satisfaction. Note that these aren't all my thoughts; they are a conglomeration of thoughts (many copy/pasted from off-list emails) from Trey, David, Mikhail, and me. That's also why this might not all read like one person wrote it.
>
> A few weeks ago I attended ElasticON, and there was a good presentation about search satisfaction by Paul Nelson. One of the things he thought was incredibly important, which we had already been thinking about but hadn't moved forward enough on, was generating an Engine Score. This week Paul held an online webinar, which Trey attended, where he gave the same presentation but without such strict time constraints. You can find my summary of this presentation in last week's email to this list, 'ElasticON notes'.
>
> Some things of note:
>
> - He doesn't like the idea of golden corpora—but his idea is different from Trey's. He imagines a hand-selected set of "important" queries that find "important" documents. I don't like that either (at least not by itself); I always imagine a random selection of queries for a golden corpus.
> - He lumps BM25 in with TF/IDF and calls them ancient, unmotivated, and from the 80s and 90s. David has convinced us that BM25 is a good thing to pursue. Of course, part of Search Technologies' purpose is to drum up business, so they can't say, "hey, just use this in Elasticsearch" or they'd be out of business.
> - He explains the mysterious K factor that got all this started in the first place. It controls how much weight changes far down the results list carry.
> It sounds like he might tune K based on the number of results for every query, but my question about that wasn't answered. In the demo, he's only pulling 25 results, which Erik's click-through data shows is probably enough.
> - He mentions that 25,000 "clicks" is a good enough sized set for measuring a score (and having random noise come out in the wash). It's not clear whether he meant 25K clicks or 25K user sessions, since it came up in the Q&A.
>
> David and Trey talked about this some, and Trey thinks the idea of Paul's metric (Σ power(FACTOR, position) * isRelevant[user, searchResult[Q, position].DocID]) has a lot of appeal. It's based on clicks and user sessions, so we'd have to be able to capture all the relevant information and make it available somewhere to replay in Relevance Forge for assessment. We currently have a reasonable amount of clickthrough data collected from 0.5% of desktop search sessions that we can use for this task. There are some complications, though, because this is PII data and so has to be treated carefully.
>
> Mikhail's goal for our user satisfaction metric is to have a function that maps features, including dwell time, to a user satisfaction ratio (e.g., 10s = 20% likely to be satisfied, 10m = 94% likely to be satisfied, etc.). The predictive model is going to include a variety of features of varying predictive power, such as dwell time, clickthrough rate, engagement (scrolling), etc. One problem with the user satisfaction metric is that it isn't replayable: we can't re-run the queries in vitro and get data on what users think of the new results. However, it does play into Nelson's idea, discussed in the paper and maybe in the video, of gradable relevance. Assigning a user satisfaction score to a given result would allow us to weight various clicks in his metric rather than treating them all as equal (though that works, too, if it's all you have).
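To check my own reading of Paul's formula, here's a minimal sketch of how I understand it. The FACTOR value and the boolean relevance input are my assumptions, not from his talk; relevant results near the top of the list count for more than relevant results far down.

```python
# Sketch of my reading of the engine score, not Paul's actual implementation:
# each session contributes FACTOR^position for every result the user found
# relevant (e.g. clicked), averaged over sessions.

FACTOR = 0.9  # hypothetical decay constant; this is the K-like knob

def engine_score(sessions):
    """sessions: one list of booleans per query/session, where entry i
    says whether the result at position i (0-based) was relevant."""
    total = 0.0
    for results in sessions:
        total += sum(FACTOR ** pos
                     for pos, relevant in enumerate(results) if relevant)
    return total / len(sessions)  # average per-session score

# Two sessions: one relevant result at position 0, one at position 2.
sessions = [[True, False, False], [False, False, True]]
print(engine_score(sessions))  # (0.9**0 + 0.9**2) / 2 ≈ 0.905
```

Replacing the booleans with the per-click satisfaction weights from Mikhail's model would give the graded variant described above, where not all clicks count equally.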
>
> We need to build a system that we are able to tune in an effective way. As Trey pointed out, Cirrus does not allow us to tune the core similarity function params. David tends to think that we need to replace our core similarity function with a new one that is suited for optimization; BM25 allows it, there are certainly others, and we could build our own. But the problem will be: how do we tune these parameters in an effective way?
>
> With BM25 we will have 7 fields with 2 analyzers: 14 internal Lucene fields. BM25 lets us tune 3 params per field: weight, k1, and b.
> - weight is likely to range between 0 and 1, with maybe 2-digit precision steps
> - k1 from 1 to 2
> - b from 0 to 1
>
> And I'm not talking about the query-independent factors like popularity, pagerank & co that we may want to add. It's clear that we will have to tackle hard search performance problems...
>
> David tends to think that we need to apply an optimization algorithm that will search for the optimal combination according to an objective. David doesn't think we can run such an optimization plan with A/B testing, which is why we need a way to replay a set of queries and compute various search engine scores.
>
> We don't know what's the best approach here:
> - extract the metrics from the search satisfaction schema that do not require user intervention (click and result position).
> - build our own set of queries with the tool Erik is building (temporary location: http://portal.wmflabs.org/search/index.php)
> -- Erik thinks we should do both, as they will give us completely different sets of information. The metrics about what our users are doing are a great source of information and provide a good signal. The tool Erik is building comes at the problem from a different direction, sourcing search results from wiki/google/bing/ddg and getting humans to rate which results are relevant/not relevant on a scale of 1 to 4.
> This can be used with other algorithms to generate an independent score. Essentially I think the best Relevance Forge will output a multi-dimensional engine score and not just a single number.
> -- We should set up records of how this engine score changes over days, months, and longer, so we can see a rate of improvement (or lack thereof. But hopefully improvement :)
>
> And in the end, will this (BM25 and/or searching with weights per field) work?
> - Not sure. Maybe the text features we have today are not relevant and we need to spend more time on extracting relevant text features from the mediawiki content model (https://phabricator.wikimedia.org/T128076), but we should be able to say: this field has no impact, or only a bad impact.
>
> The big picture would be:
> - Refactor Cirrus in a way that everything is suited for optimization
> - Search engine score: the objective (Erik added it as a goal)
> - An optimization algorithm to search/tune the system params. Trey has prior experience working within optimization frameworks. Mikhail also has relevant machine learning experience.
> - A/B testing with advanced metrics to confirm that the optimization found a good combination
>
> With a framework like that we could spend more time on big-impact text features (wikitext, synonyms, spelling correction ...).
> But yes it's a huge engineering task with a lot of challenges :/ It's also
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
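One way to picture the tuning problem David describes: with 14 fields and 3 params each, even a coarse grid is far too large to enumerate, so something like random search over a replayable query set seems more realistic. A rough sketch of the loop (the field names and the toy objective are made up; the evaluate function is a stand-in for replaying queries through Relevance Forge):

```python
import random

# Hypothetical sketch of the optimization loop: sample BM25 params per
# field in the ranges from the email, replay a fixed query set to get an
# engine score, and keep the best combination found.

FIELDS = ["title", "opening_text", "text"]  # illustrative subset of the 14

def sample_params():
    # One (weight, k1, b) triple per field.
    return {
        f: {
            "weight": round(random.uniform(0.0, 1.0), 2),  # 2-digit steps
            "k1": random.uniform(1.0, 2.0),
            "b": random.uniform(0.0, 1.0),
        }
        for f in FIELDS
    }

def random_search(evaluate, n_trials=100, seed=0):
    """evaluate(params) -> engine score; higher is better."""
    random.seed(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params()
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective just to exercise the loop; in practice this would be a
# replay of logged queries scored with the engine score.
toy = lambda p: p["title"]["weight"] - abs(p["title"]["b"] - 0.3)
best, score = random_search(toy, n_trials=200)
```

Smarter strategies (grid on a few sensitive params, or a proper black-box optimizer) would plug into the same shape: all that changes is how the next candidate is picked.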
