Thanks Erik for summarizing the discussion so far.

The very last sentence got cut off:

But yes it's a huge engineering task with a lot of challenges :/ It's also


I think I know what was next:

... a fun engineering task with many new things to learn! :)


Even if that wasn't the next bit, it's still true.


On Fri, Mar 4, 2016 at 8:24 PM, Erik Bernhardson <[email protected]> wrote:

> This thread started off list, but I'm hoping all of you watching along can
> help us brainstorm and improve search satisfaction. Note that these aren't
> all my thoughts; they're a conglomeration of thoughts (many copy/pasted
> from off-list emails) from Trey, David, Mikhail, and me. That's also why
> this might not all read like one person wrote it.
>
> A few weeks ago I attended ElasticON, where Paul Nelson gave a good
> presentation about search satisfaction. One of the things he considers
> incredibly important, which we had already been thinking about but hadn't
> moved forward enough on, is generating an Engine Score. This week Paul
> held an online webinar, which Trey attended, where he gave the same
> presentation without such strict time constraints. You can find my summary
> of the presentation in last week's email to this list, 'ElasticON notes'.
>
> Some things of note:
>
>    - He doesn't like the idea of golden corpora—but his idea is different
>    from Trey's. He imagines a hand-selected set of "important" queries that
>    find "important" documents. I don't like that either (at least not by
>    itself). I always imagine a random selection of queries for a golden 
> corpus.
>    - He lumps BM25 in with TF/IDF and calls them ancient, unmotivated,
>    and from the 80s and 90s. David's convinced us that BM25 is a good thing to
>    pursue. Of course, part of Search Technologies' purpose is to drum up
>    business, so they can't say, "hey, just use this in Elasticsearch" or
>    they'd be out of business.
>    - He explains the mysterious K factor that got all this started in the
>    first place. It controls how much weight changes far down the results list
>    carry. It sounds like he might tune K based on the number of results
>    for every query, but my question about that wasn't answered. In the demo,
>    he's only pulling 25 results, which Erik's click-through data shows is
>    probably enough.
>    - He mentions that 25,000 "clicks" is a good enough sized set for
>    measuring a score (and having random noise come out in the wash). Not clear
>    if he meant 25K clicks, or 25K user sessions, since it was in the Q&A.
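
To make the K factor concrete, here's a rough sketch of how a geometric
discount down-weights results far down the list. The K values are my guesses
for illustration, not anything from the talk:

```python
def position_weight(position, k=0.9):
    """Weight carried by a result at a 1-based position, discounted by K."""
    return k ** (position - 1)

# With K = 0.9, a click at position 10 carries about 39% of the weight of a
# click at position 1; with K = 0.5 it carries about 0.2%. Smaller K means
# changes deep in the results list matter less, which is consistent with
# pulling only 25 results being enough.
print(position_weight(10, k=0.9))  # ~0.387
print(position_weight(10, k=0.5))  # ~0.00195
```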
>
>
> David and Trey talked about this some, and Trey thinks the idea of Paul's
> metric (Σ power(FACTOR, position) * isRelevant[user,
> searchResult[Q, position].DocID]) has a lot of appeal. It's based on
> clicks and user sessions, so we'd have to be able to capture all the
> relevant information and make it available somewhere to replay in Relevance
> Forge for assessment. We currently have a reasonable amount of clickthrough
> data collected from 0.5% of desktop search sessions that we can use for
> this task. There are some complications though because this is PII data and
> so has to be treated carefully.
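
To make Paul's metric concrete, here's a minimal sketch of how I imagine an
engine score being computed when we replay queries in Relevance Forge. The
data format, function name, and FACTOR value are all assumptions for
illustration, not his actual implementation:

```python
def engine_score(result_lists, relevant_docs, factor=0.9):
    """Average per-query engine score over replayed result lists.

    result_lists: {query: [doc_id, ...]} as returned by the engine.
    relevant_docs: {query: set of doc_ids users found relevant, e.g. clicks}.
    """
    total = 0.0
    for query, docs in result_lists.items():
        relevant = relevant_docs.get(query, set())
        # Sum FACTOR^position over the positions holding a relevant doc,
        # so relevant results near the top count much more than deep ones.
        total += sum(factor ** pos
                     for pos, doc in enumerate(docs, start=1)
                     if doc in relevant)
    return total / len(result_lists) if result_lists else 0.0
```

The point of structuring it this way is that the score is fully replayable:
swap in a new ranking, keep the same click data, and compare scores.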
>
> Mikhail's goal for our user satisfaction metric is to have a function that
> maps features including dwell time to user satisfaction ratio. (e.g., 10s =
> 20% likely to be satisfied, 10m = 94% likely to be satisfied, etc.). The
> predictive model is going to include a variety of features of varying
> predictive power, such as dwell time, clickthrough rate, engagement
> (scrolling), etc. One problem with the user satisfaction metric is that
> it isn't replayable. We can't re-run the queries in vitro and get data on
> what users think of the new results. However it does play into Nelson's
> idea, discussed in the paper and maybe in the video, of gradable relevance.
> Assigning a user satisfaction score to a given result would allow us to
> weight various clicks in his metric rather than treating them all as equal
> (though that works, too, if it's all you have).
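
As a toy illustration of the kind of function Mikhail describes, dwell time
could feed a logistic curve like the one below. The coefficients are made up,
and the real model would be fit on labelled sessions and use many more
features (clickthrough, scrolling, etc.):

```python
import math

def p_satisfied(dwell_seconds, w=0.05, b=-1.5):
    """Toy logistic model: longer dwell time -> higher probability that
    the user was satisfied. Coefficients w and b are invented for
    illustration; a real model would be fit to labelled session data."""
    return 1.0 / (1.0 + math.exp(-(w * dwell_seconds + b)))

print(p_satisfied(10))   # short dwell: low satisfaction probability
print(p_satisfied(600))  # ten-minute dwell: near-certain satisfaction
```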
>
> We need to build a system that we are able to tune in an effective way.
> As Trey pointed out, Cirrus does not allow us to tune the core similarity
> function params. David tends to think that we need to replace our core
> similarity function with a new one that is suited for optimization. BM25
> allows it; there are certainly others, and we could build our own. But
> the problem will be:
>
> How do we tune these parameters in an effective way? With BM25 we will have
> 7 fields with 2 analyzers: 14 internal Lucene fields. BM25 lets us tune
> 3 params per field: weight, k1, and b
> - weight is likely to range between 0 and 1, in steps of maybe two-decimal
> precision
> - k1 from 1 to 2
> - b from 0 to 1
> And that's not counting the query-independent factors like popularity,
> PageRank & co that we may want to add. It's clear that we will have to
> tackle hard search performance problems...
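
A quick back-of-envelope shows why exhaustively grid-searching that space is
hopeless. The step sizes for k1 and b below are my assumptions; only the
weight precision comes from the numbers above:

```python
fields = 14        # 7 fields x 2 analyzers = 14 internal Lucene fields

weight_steps = 101  # 0.00 .. 1.00 in 0.01 steps
k1_steps = 11       # 1.0 .. 2.0 in assumed 0.1 steps
b_steps = 11        # 0.0 .. 1.0 in assumed 0.1 steps

# Combinations for one field, then across all 14 fields independently.
per_field = weight_steps * k1_steps * b_steps   # 12,221
total = per_field ** fields                     # ~1.6e57 combinations

print(per_field)
print(total)
```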
>
> David tends to think that we need to apply an optimization algorithm that
> will search for the optimal combination according to an objective. David
> doesn't think we can run such an optimization plan with A/B testing, which
> is why we need a way to replay a set of queries and compute various search
> engine scores.
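
One simple offline approach along those lines is random search over parameter
combinations, scoring each candidate by replaying queries. This is only a
sketch; the objective function here is a stand-in for a real engine score
computed by Relevance Forge, and the parameter ranges come from the BM25
discussion above:

```python
import random

def random_search(objective, n_trials=1000, seed=0):
    """Sample random (weight, k1, b) combos for a single field and keep
    the best-scoring one. A real run would sample all 14 fields at once
    and use a smarter optimizer than pure random sampling."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "weight": rng.uniform(0.0, 1.0),
            "k1": rng.uniform(1.0, 2.0),
            "b": rng.uniform(0.0, 1.0),
        }
        score = objective(params)  # e.g. replay queries, compute engine score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The appeal of random search is that it parallelizes trivially and makes no
assumptions about the objective being smooth, which fits a replay-based score.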
> We don't know what the best approach is here:
> - extract the metrics from the search satisfaction schema that do not
> require user intervention (click and result position).
> - build our own set of queries with the tool Erik is building (temporary
> location: http://portal.wmflabs.org/search/index.php)
> -- Erik thinks we should do both, as they will give us completely
> different sets of information. The metrics about what our users are doing
> are a great source of information and provide a good signal. The tool Erik
> is building comes at the problem from a different direction, sourcing search
> results from wiki/google/bing/ddg and getting humans to rate which results
> are relevant/not relevant on a scale of 1 to 4. This can be used with other
> algorithms to generate an independent score. Essentially I think the best
> version of Relevance Forge will output a multi-dimensional engine score, not
> just a single number.
> -- We should keep records of how this engine score changes over days,
> months, and longer, so we can see a rate of improvement (or lack thereof,
> but hopefully improvement :)
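
Those 1-to-4 human ratings could feed a graded-relevance score such as
DCG/nDCG. That's a standard technique I'm suggesting here, not something
settled in this thread:

```python
import math

def dcg(ratings):
    """Discounted cumulative gain for a ranked list of 1-4 ratings:
    higher ratings near the top of the list contribute most."""
    return sum((2 ** r - 1) / math.log2(pos + 1)
               for pos, r in enumerate(ratings, start=1))

def ndcg(ratings):
    """DCG normalized by the ideal (descending) ordering, giving a
    score in [0, 1] that is comparable across queries."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal else 0.0

print(ndcg([4, 3, 2, 1]))  # perfectly ordered results score 1.0
print(ndcg([1, 2, 3, 4]))  # the same ratings reversed score lower
```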
>
> And in the end, will this (BM25 and/or searching with weights per field)
> work?
> - not sure; maybe the text features we have today are not relevant and we
> need to spend more time on extracting relevant text features from the
> MediaWiki content model (https://phabricator.wikimedia.org/T128076),
> but we should at least be able to say: this field has no impact, or only a
> bad one.
>
> The big picture would be:
> - Refactor Cirrus so that everything is suited for optimization
> - Search engine score: the objective (Erik added it as a goal)
> - An optimization algorithm to search/tune the system params. Trey has prior
> experience working within optimization frameworks. Mikhail also has
> relevant machine learning experience.
> - A/B testing with advanced metrics to confirm that the optimization found a
> good combination
>
> With a framework like that we could spend more time on big impact text
> features (wikitext, synonyms, spelling correction ...).
> But yes it's a huge engineering task with a lot of challenges :/ It's also
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>