Re: [discovery] Next steps for language goal

Trey Jones Thu, 05 Nov 2015 08:17:53 -0800

I agree that the top 3 are straightforward and seem likely to be
beneficial. Translation is definitely very hard, and finding
license-compatible libraries that are effective enough and cover enough
languages seems daunting. And speed might also be a big concern. Given the
scope of non-English queries on enwiki, I don't think it's worth it right
now.


The last one is new to me. I kinda like it in theory, though I'll have to
mull it over for a while. If building the combined index is prohibitive,
could we test the idea without it by running intitle queries (or some such)
against the top N likely useful indexes and seeing what hit rate we get?
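
To make the hit-rate test concrete, here is a minimal sketch. It assumes a
hypothetical callable `count_title_hits(wiki, query)` that would wrap an
intitle search against one wiki's existing index; the in-memory title sets
are toy data standing in for real indexes, not anything from production.

```python
from typing import Callable, Iterable, Sequence


def title_hit_rate(
    queries: Iterable[str],
    wikis: Sequence[str],
    count_title_hits: Callable[[str, str], int],
) -> float:
    """Return the fraction of queries that get a title hit on any wiki."""
    queries = list(queries)
    if not queries:
        return 0.0
    hits = sum(
        1
        for q in queries
        if any(count_title_hits(wiki, q) > 0 for wiki in wikis)
    )
    return hits / len(queries)


# Toy stand-in for real intitle searches (hypothetical data).
FAKE_TITLES = {
    "dewiki": {"berlin", "hund"},
    "frwiki": {"paris", "chien"},
}


def fake_count_title_hits(wiki: str, query: str) -> int:
    """Pretend search: 1 if the query is an exact title on that wiki."""
    return 1 if query.lower() in FAKE_TITLES.get(wiki, set()) else 0


rate = title_hit_rate(
    ["Hund", "chien", "xyzzy", "gato"],
    ["dewiki", "frwiki"],
    fake_count_title_hits,
)
print(f"title hit rate: {rate:.2f}")
```

Swapping the fake for a function that issues real intitle queries would give
the hit rate over the top N wikis without building the combined index.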

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Wed, Nov 4, 2015 at 3:07 PM, Erik Bernhardson <[email protected]> wrote:

> Taking the above into consideration and reviewing what we have in the
> brainstorming session, the set of ideas seems to be the following:
>
>    1. Do language detection on more than just zero result queries, e.g.
>    queries that only return 1 or 2 results
>
>    - Seems useful and doable, but will only affect satisfaction and not
>    the zero result rate. Still possibly worthwhile.
>    - This should be relatively easy to test with relevancy lab
>
>    2. Determine the language to search in via something other than
>    language detection (headers, geolocation, etc)
>
>    - Working up a couple of heuristics wouldn't be too hard. The
>    webrequests table in hive has the accept language header and
>    geolocation info as well as the query string, so we could extract a set
>    of queries to test with
>
>    3. Integrate wikidata search
>
>    - This looks to be https://en.wikipedia.org/wiki/MediaWiki:Wdsearch.js
>    - We could integrate that more directly, but it can't be tested by
>    relevancy lab. It is basically just an additional set of results below
>    the existing results.
>    - Would need a significant cleanup to pass code review, but it's not
>    particularly hard to do
>
>    4. Translate the query from the provided language into the language of
>    the wiki being searched on
>
>    - This seems "very hard". Not only do we have to correctly detect the
>    language of the user's input, but then we have to translate it into a
>    second language
>    - The CX service might be able to provide us a translation endpoint
>    that works with whatever they are currently using, but will likely have
>    high latency. Our inability (currently) to do async requests in php makes
>    it harder to hide that latency.
>
>    5. Build an index that contains the titles from all wikis, but not much
>    else. This could be used to suggest the user search on other wikis (or
>    to inform the code that does actual searches on other wikis)
>
>    - This could be somewhat tested in relevancy lab, but first we would
>    have to build something to actually combine all the titles into the same
>    index.
>
>
> I think any of the top three could be worked on; the first and the second
> can be validated through relevancy lab. The third takes a completely
> different approach and is not easily testable outside of production, but
> may be useful. The fourth is "very hard" and I think we should leave it
> alone for now. The fifth and final idea was only put forth once, but is
> interesting. I'm not sure how valuable it would be though.
>
>
> On Tue, Nov 3, 2015 at 3:55 PM, Erik Bernhardson <
> [email protected]> wrote:
>
>> In terms of the user language data we have: the webrequests table in
>> hive has the accept language header and geolocation information. This
>> table also contains the query strings, so we can extract the exact search
>> terms and feed that information into relevancy lab.
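
As an illustration of a header-based heuristic, the sketch below ranks the
languages in an Accept-Language header by q-value and turns them into
candidate wikis. The function names are hypothetical, and mapping a language
code straight to "<code>wiki" is a simplifying assumption that does not hold
for every language.

```python
from typing import List


def parse_accept_language(header: str) -> List[str]:
    """Return primary language subtags ordered by descending q-value."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # default weight when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality <= 0:
            continue
        # Sort by quality first, then by original position in the header.
        ranked.append((-quality, position, tag.split("-")[0]))
    ranked.sort()
    languages: List[str] = []
    for _, _, lang in ranked:
        if lang not in languages:
            languages.append(lang)
    return languages


def candidate_wikis(header: str, current_wiki: str = "enwiki") -> List[str]:
    """Assumed mapping: language code + 'wiki', skipping the current wiki."""
    wikis = [f"{lang}wiki" for lang in parse_accept_language(header)]
    return [w for w in wikis if w != current_wiki]


print(candidate_wikis("de-DE,de;q=0.9,en;q=0.8,fr;q=0.95"))
```

Geolocation could be folded in the same way, as an extra ranked source of
candidate languages.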
>>
>> On Tue, Nov 3, 2015 at 3:29 PM, Kevin Smith <[email protected]> wrote:
>>
>>> So do we think we should favor the "try to guess the user's language(s)"
>>> item over others that would benefit from the relevance lab? Are there steps
>>> we could/should take in advance, such as analyzing whatever user language
>>> data we have, or instrumenting to get more if we don't have enough?
>>>
>>>
>>>
>>> Kevin Smith
>>> Agile Coach, Wikimedia Foundation
>>>
>>>
>>> On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones <[email protected]> wrote:
>>>
>>>> Sorry I didn't respond to this sooner!
>>>>
>>>> I really like the idea of trying to detect what languages the user can
>>>> read, and searching in (a subset of) those. This wouldn't benefit from
>>>> relevance lab testing, though. It'll need to be measured against the user
>>>> satisfaction metric. (BTW, do we have a sense of how many users have info
>>>> we can detect for this?)
>>>>
>>>> I think the biggest problem with language detection is the quality of
>>>> the language detector. The Elasticsearch plugin we tested has a Romanian
>>>> fetish when run on our queries (Erik got about 38% Romanian on 100K enwiki
>>>> searches, which is crazy, and I got 0% accuracy for Romanian on my much
>>>> smaller tagged corpus of failed (zero results) queries to enwiki). Most of
>>>> the time, I would expect queries sent to the wrong wiki to fail (though
>>>> there are some exceptions)—but a query in English that does get hits in
>>>> rowiki is going to just look wrong most of the time.
>>>>
>>>> There are several proposals for improving language detection in the
>>>> etherpad, and we can work on them in parallel, since any given one could be
>>>> better than any other one. (We don't want to make 100 of them, but a few to
>>>> test and compare would be nice—there may also be reasonable speed/accuracy
>>>> tradeoffs to be made, e.g., 2% decrease in accuracy for 2x speed is a good
>>>> deal.)
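
One way to compare detectors on a tagged corpus is per-language precision
and recall, which makes a bias like the Romanian one visible. This is only a
sketch: `detect` is a hypothetical callable (query -> language code), and
the four tagged queries and the toy detector are made-up illustration data.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def per_language_scores(
    tagged: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
) -> Dict[str, Tuple[float, float]]:
    """Return {language: (precision, recall)} for (query, true_lang) pairs."""
    predicted: Counter = Counter()
    actual: Counter = Counter()
    correct: Counter = Counter()
    for query, true_lang in tagged:
        guess = detect(query)
        predicted[guess] += 1
        actual[true_lang] += 1
        if guess == true_lang:
            correct[guess] += 1
    scores = {}
    for lang in set(predicted) | set(actual):
        precision = correct[lang] / predicted[lang] if predicted[lang] else 0.0
        recall = correct[lang] / actual[lang] if actual[lang] else 0.0
        scores[lang] = (precision, recall)
    return scores


# Toy detector that over-predicts Romanian (hypothetical, for illustration).
def biased_detect(query: str) -> str:
    return "ro" if "a" in query else "en"


corpus = [
    ("hello world", "en"),
    ("buna ziua", "ro"),
    ("data", "en"),
    ("casa mea", "es"),
]
scores = per_language_scores(corpus, biased_detect)
for lang, (precision, recall) in sorted(scores.items()):
    print(f"{lang}: precision={precision:.2f} recall={recall:.2f}")
```

Running several detectors through the same function, along with a timing
measurement, would give the speed/accuracy comparison described above.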
>>>>
>>>> We need training and evaluation data. I see a few ways of getting it.
>>>> The easy, lower-quality way is just take queries from a given wiki and
>>>> assume they are in the language in question (i.e., eswiki queries are in
>>>> Spanish). Easy, not 100% accurate, unlimited supply. The hard,
>>>> higher-quality way is to hand annotate a corpus of queries. This is slow,
>>>> but doable. I can do on the order of 1000 queries in a day—more if I were
>>>> less accurate and more willing to toss stuff into the junk pile. I couldn't
>>>> do it for a week straight, though, without going crazy. A possible middle
>>>> of the road approach would be to create a feedback loop and run detectors
>>>> on our training data and review and remove items that are not in the
>>>> desired language (we could also start by filtering things that are not in
>>>> the right character set, like removing all Arabic, Cyrillic, and Chinese
>>>> from enwiki, frwiki, and eswiki queries). If we want thousands of
>>>> hand-annotated queries, we need to get annotating!
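
The character-set filter could look roughly like the sketch below. Taking
the first word of each character's Unicode name as its script is a rough
heuristic assumed here for brevity, and the function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List, Set

# Scripts to drop from training data for Latin-script wikis such as
# enwiki, frwiki, and eswiki.
UNWANTED_SCRIPTS = {"ARABIC", "CYRILLIC", "CJK"}


def scripts_in(text: str) -> Set[str]:
    """Return a rough set of scripts used by the letters in text."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "CYRILLIC SMALL LETTER DE" -> "CYRILLIC"
            scripts.add(name.split(" ")[0])
    return scripts


def keep_for_latin_wiki(query: str) -> bool:
    """True if the query contains none of the unwanted scripts."""
    return not (scripts_in(query) & UNWANTED_SCRIPTS)


def filter_queries(queries: Iterable[str]) -> List[str]:
    return [q for q in queries if keep_for_latin_wiki(q)]


print(filter_queries(["barack obama", "Москва", "北京", "café 2015"]))
```

This only removes the obvious wrong-script items; queries in the wrong
language but the right script would still need the detector feedback loop or
hand review.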
>>>>
>>>> I think we can use the relevance lab to help evaluate a language
>>>> detector (at least with respect to zero results rate). We could run the
>>>> detector against a pile of zero-results queries, then group the queries by
>>>> detected language, and run them against the relevant wiki (if we have room
>>>> in labs for the indexes, and we update the relevance lab tools to support
>>>> choosing a target wiki to search). We wouldn't be comparing "before" and
>>>> "after", but just measuring the zero results rate against the target wiki.
>>>> As with any use of the zero results rate, there's no guarantee that we'll
>>>> be giving good results, just results (e.g., "unix time stamp" queries with
>>>> English words fail on enwiki but sometimes work on zhwiki for some reason,
>>>> but that's not really better).
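
The evaluation loop described above can be sketched as follows. Both
`detect_language` and `count_results` are hypothetical callables standing in
for a detector and for a search against the target wiki's index; the
"<code>wiki" naming is an assumption, and the toy data is made up.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable


def zero_results_rate_by_language(
    queries: Iterable[str],
    detect_language: Callable[[str], str],
    count_results: Callable[[str, str], int],
) -> Dict[str, float]:
    """Group queries by detected language, then measure the zero results
    rate of each group against the matching wiki."""
    grouped = defaultdict(list)
    for query in queries:
        grouped[detect_language(query)].append(query)
    rates = {}
    for lang, group in grouped.items():
        wiki = f"{lang}wiki"  # assumed wiki naming
        zero = sum(1 for q in group if count_results(wiki, q) == 0)
        rates[lang] = zero / len(group)
    return rates


# Toy stand-ins (hypothetical data, not real search results).
TOY_LANG = {"hola mundo": "es", "gato negro": "es", "guten tag": "de"}
TOY_INDEX = {"eswiki": {"hola mundo"}, "dewiki": {"guten tag"}}

rates = zero_results_rate_by_language(
    TOY_LANG.keys(),
    lambda q: TOY_LANG[q],
    lambda wiki, q: 1 if q in TOY_INDEX.get(wiki, set()) else 0,
)
print(rates)
```

As noted, a low rate here only says that something came back from the target
wiki, not that the results were good.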
>>>>
>>>> I'm somewhat worried about being able to reduce the targeted zero
>>>> results rate by 10%. In my test[1], only 12% of non-DOI zero-results
>>>> queries were "in a language", and only about a third got results when
>>>> searched in the "correct" (human-determined) wiki. I didn't filter bots
>>>> other than the DOI bot, and some non-language queries (e.g., names) might
>>>> get results in another wiki, but there may not be enough wiggle room.
>>>> There's a lot of junk in other languages, too, but maybe filtering bots
>>>> will help more than I dare presume.
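
A back-of-the-envelope version of that worry, using only the figures quoted
above. Treating "about a third" as exactly 1/3 is an assumption, and the
measured population (non-DOI zero-results queries) is not quite the
non-automata population that the goal targets.

```python
# Figures quoted above, applied to non-DOI zero-results queries.
in_a_language = 0.12   # share of queries that were "in a language"
gets_results = 1 / 3   # share of those that got results on the "correct" wiki

# Best case with perfect language identification.
rescued = in_a_language * gets_results
target = 0.10          # goal: reduce the zero results rate by 10%

print(f"best-case reduction: {rescued:.1%} (target {target:.0%})")
```

Under these assumptions, even perfect detection would recover roughly 4% of
those queries, well short of the 10% target unless bot filtering changes the
mix.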
>>>>
>>>>
>>>> [1]
>>>> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_Searching#Perfect_identification.2C_ignoring_non-language_queries
>>>>
>>>> Trey Jones
>>>> Software Engineer, Discovery
>>>> Wikimedia Foundation
>>>>
>>>> On Mon, Nov 2, 2015 at 9:03 PM, Erik Bernhardson <
>>>> [email protected]> wrote:
>>>>
>>>>> It measures the zero results rate for 1 in 10 search requests via the
>>>>> CirrusSearchUserTesting log that we used last quarter.
>>>>>
>>>>> On Mon, Nov 2, 2015 at 6:01 PM, Oliver Keyes <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Define this "does it do anything?" test?
>>>>>>
>>>>>> On 2 November 2015 at 19:58, Erik Bernhardson
>>>>>> <[email protected]> wrote:
>>>>>> > Now that we have the feature deployed (behind a feature flag), and have
>>>>>> > an initial "does it do anything?" test going out today, along with an
>>>>>> > upcoming integration with our satisfaction metrics, we need to come up
>>>>>> > with how we will try to further move the needle forward.
>>>>>> >
>>>>>> > For reference these are our Q2 goals:
>>>>>> >
>>>>>> > Run A/B test for a feature that:
>>>>>> >
>>>>>> >    - Uses a library to detect the language of a user's search query.
>>>>>> >    - Adjusts results to match that language.
>>>>>> >
>>>>>> > Determine from A/B test results whether this feature is fit to push to
>>>>>> > production, with the aim to:
>>>>>> >
>>>>>> >    - Improve search user satisfaction by 10% (from 15% to 16.5%).
>>>>>> >    - Reduce zero results rate for non-automata search queries by 10%.
>>>>>> >
>>>>>> > We brainstormed a number of possibilities here:
>>>>>> >
>>>>>> > https://etherpad.wikimedia.org/p/LanguageSupportBrainstorming
>>>>>> >
>>>>>> >
>>>>>> > We now need to decide which of these ideas we should prioritize. We
>>>>>> > might want to take into consideration which of these can be pre-tested
>>>>>> > with our relevancy lab work, such that we can prefer to work on things
>>>>>> > we think will move the needle the most. I'm really not sure which of
>>>>>> > these to push forward on, so let us know which you think can have the
>>>>>> > most impact, or where the expected impact could be measured with
>>>>>> > relevancy lab with minimal work.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > discovery mailing list
>>>>>> > [email protected]
>>>>>> > https://lists.wikimedia.org/mailman/listinfo/discovery
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Oliver Keyes
>>>>>> Count Logula
>>>>>> Wikimedia Foundation
>>>>>>