I agree that the top 3 are straightforward and seem likely to be beneficial. Translation is definitely very hard, and finding license-compatible libraries that are effective enough and cover enough languages seems daunting. And speed might also be a big concern. Given the scope of non-English queries on enwiki, I don't think it's worth it right now.
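(As a rough way to size that scope before committing to anything, a first-pass character-set filter, of the sort mentioned further down the thread for cleaning training data, can flag queries that can't plausibly be English at all. A minimal sketch in Python, standard library only; the function name and the 50% threshold are just placeholders, not anything we've built:)

```python
import unicodedata


def dominant_script_is_latin(query):
    """Rough filter: True if most letters in the query are Latin-script.

    Queries that fail this check (e.g. all-Cyrillic or all-Arabic text)
    cannot be English, so they are cheap true positives for the
    "wrong-language query" bucket. Script is read from the Unicode
    character name, which is coarse but dependency-free.
    """
    letters = [c for c in query if c.isalpha()]
    if not letters:
        # Digits/punctuation only: no evidence either way.
        return True
    latin = sum(1 for c in letters if 'LATIN' in unicodedata.name(c, ''))
    return latin >= len(letters) / 2
```

Running something like this over a day of enwiki zero-results queries would at least bound how many could possibly benefit from cross-language handling.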
The last one is new to me. I kinda like it in theory, though I'll have to
mull it over for a while. If building the combined index is prohibitive,
could we test it by running intitle queries or some such against the top N
likely useful indexes and seeing what hit rate we get?

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Wed, Nov 4, 2015 at 3:07 PM, Erik Bernhardson
<[email protected]> wrote:

> Taking the above into consideration and reviewing what we have from the
> brainstorming session, the set of ideas seems to be the following:
>
> 1. Do language detection on more than just zero-result queries; how about
> queries that only return 1 or 2 results?
>
>    - Seems useful and doable, but will only affect satisfaction and not
>    the zero-results rate. Still possibly worthwhile.
>    - This should be relatively easy to test with relevancy lab.
>
> 2. Determine the language to search in via something other than language
> detection (headers, geolocation, etc.)
>
>    - Working up a couple of heuristics wouldn't be too hard. The
>    webrequests table in hive has the accept-language header and geolocation
>    info, as well as the query string, so we could extract a set of queries
>    to test with.
>
> 3. Integrate wikidata search
>
>    - This looks to be https://en.wikipedia.org/wiki/MediaWiki:Wdsearch.js
>    - We could integrate that more directly; it can't be tested by
>    relevancy lab. It is basically just an additional set of results below
>    the existing results.
>    - Would need a significant cleanup to pass code review, but it's not
>    particularly hard to do.
>
> 4. Translate the query from the provided language into the language of
> the wiki being searched on
>
>    - This seems "very hard". Not only do we have to correctly detect the
>    language the user input, but then we have to translate that into a
>    second language.
>    - The CX service might be able to provide us a translation endpoint
>    that works with whatever they are currently using, but it will likely
>    have high latency. Our inability (currently) to do async requests in
>    PHP makes it harder to hide that latency.
>
> 5. Build an index that contains the titles from all wikis, but not much
> else. This could be used to suggest the user search on other wikis (or to
> inform the code that does actual searches on other wikis).
>
>    - This could be somewhat tested in relevancy lab, but first we would
>    have to build something to actually combine all the titles into the
>    same index.
>
> I think any of the top three could be worked on; the first and the second
> can be validated through relevancy lab. The third takes a completely
> different approach and is not easily testable outside of production, but
> may be useful. The fourth is "very hard" and I think we should leave it
> alone for now. The fifth and final idea was only put forth once, but is
> interesting. I'm not sure how valuable it would be, though.
>
> On Tue, Nov 3, 2015 at 3:55 PM, Erik Bernhardson
> <[email protected]> wrote:
>
>> In terms of user language data, within the webrequests table in hive we
>> have the accept-language header and we have geolocation information. This
>> table also contains the query strings, so we can extract the exact search
>> terms and feed that information into relevancy lab.
>>
>> On Tue, Nov 3, 2015 at 3:29 PM, Kevin Smith <[email protected]> wrote:
>>
>>> So do we think we should favor the "try to guess the user's language(s)"
>>> item over others that would benefit from the relevance lab? Are there
>>> steps we could/should take in advance, such as analyzing whatever user
>>> language data we have, or instrumenting to get more if we don't have
>>> enough?
>>>
>>> Kevin Smith
>>> Agile Coach, Wikimedia Foundation
>>>
>>> On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones <[email protected]> wrote:
>>>
>>>> Sorry I didn't respond to this sooner!
>>>>
>>>> I really like the idea of trying to detect what languages the user can
>>>> read, and searching in (a subset of) those. This wouldn't benefit from
>>>> relevance lab testing, though. It'll need to be measured against the
>>>> user satisfaction metric. (BTW, do we have a sense of how many users
>>>> have info we can detect for this?)
>>>>
>>>> I think the biggest problem with language detection is the quality of
>>>> the language detector. The Elasticsearch plugin we tested has a
>>>> Romanian fetish when run on our queries (Erik got about 38% Romanian on
>>>> 100K enwiki searches, which is crazy, and I got 0% accuracy for
>>>> Romanian on my much smaller tagged corpus of failed (zero-results)
>>>> queries to enwiki). Most of the time, I would expect queries sent to
>>>> the wrong wiki to fail (though there are some exceptions), but a query
>>>> in English that does get hits on rowiki is going to just look wrong
>>>> most of the time.
>>>>
>>>> There are several proposals for improving language detection in the
>>>> etherpad, and we can work on them in parallel, since any given one
>>>> could be better than any other one. (We don't want to make 100 of them,
>>>> but a few to test and compare would be nice. There may also be
>>>> reasonable speed/accuracy tradeoffs to be made; e.g., a 2% decrease in
>>>> accuracy for a 2x speedup is a good deal.)
>>>>
>>>> We need training and evaluation data. I see a few ways of getting it.
>>>> The easy, lower-quality way is to just take queries from a given wiki
>>>> and assume they are in the language in question (i.e., eswiki queries
>>>> are in Spanish). Easy, not 100% accurate, unlimited supply. The hard,
>>>> higher-quality way is to hand-annotate a corpus of queries. This is
>>>> slow, but doable. I can do on the order of 1000 queries in a day, more
>>>> if I were less accurate and more willing to toss stuff into the junk
>>>> pile. I couldn't do it for a week straight, though, without going
>>>> crazy. A possible middle-of-the-road approach would be to create a
>>>> feedback loop: run detectors on our training data, then review and
>>>> remove items that are not in the desired language. (We could also start
>>>> by filtering things that are not in the right character set, like
>>>> removing all Arabic, Cyrillic, and Chinese from enwiki, frwiki, and
>>>> eswiki queries.) If we want thousands of hand-annotated queries, we
>>>> need to get annotating!
>>>>
>>>> I think we can use the relevance lab to help evaluate a language
>>>> detector (at least with respect to zero-results rate). We could run the
>>>> detector against a pile of zero-results queries, then group the queries
>>>> by detected language, and run them against the relevant wiki (if we
>>>> have room in labs for the indexes, and we update the relevance lab
>>>> tools to support choosing a target wiki to search). We wouldn't be
>>>> comparing "before" and "after", but just measuring the zero-results
>>>> rate against the target wiki. As any time we're using the zero-results
>>>> rate, there's no guarantee that we'll be giving good results, just
>>>> results. (E.g., "unix time stamp" queries with English words fail on
>>>> enwiki but sometimes work on zhwiki for some reason, but that's not
>>>> really better.)
>>>>
>>>> I'm somewhat worried about being able to reduce the targeted
>>>> zero-results rate by 10%. In my test [1], only 12% of non-DOI
>>>> zero-results queries were "in a language", and only about a third got
>>>> results when searched on the "correct" (human-determined) wiki. I
>>>> didn't filter bots other than the DOI bot, and some non-language
>>>> queries (e.g., names) might get results in another wiki, but there may
>>>> not be enough wiggle room.
>>>> There's a lot of junk in other languages, too, but maybe filtering
>>>> bots will help more than I dare presume.
>>>>
>>>> [1]
>>>> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_Searching#Perfect_identification.2C_ignoring_non-language_queries
>>>>
>>>> Trey Jones
>>>> Software Engineer, Discovery
>>>> Wikimedia Foundation
>>>>
>>>> On Mon, Nov 2, 2015 at 9:03 PM, Erik Bernhardson
>>>> <[email protected]> wrote:
>>>>
>>>>> It measures the zero-results rate for 1 in 10 search requests via the
>>>>> CirrusSearchUserTesting log that we used last quarter.
>>>>>
>>>>> On Mon, Nov 2, 2015 at 6:01 PM, Oliver Keyes <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Define this "does it do anything?" test?
>>>>>>
>>>>>> On 2 November 2015 at 19:58, Erik Bernhardson
>>>>>> <[email protected]> wrote:
>>>>>> > Now that we have the feature deployed (behind a feature flag), and
>>>>>> > have an initial "does it do anything?" test going out today, along
>>>>>> > with an upcoming integration with our satisfaction metrics, we need
>>>>>> > to come up with how we will try to further move the needle forward.
>>>>>> >
>>>>>> > For reference, these are our Q2 goals:
>>>>>> >
>>>>>> > Run an A/B test for a feature that:
>>>>>> >
>>>>>> > - Uses a library to detect the language of a user's search query.
>>>>>> > - Adjusts results to match that language.
>>>>>> >
>>>>>> > Determine from A/B test results whether this feature is fit to push
>>>>>> > to production, with the aim to:
>>>>>> >
>>>>>> > - Improve search user satisfaction by 10% (from 15% to 16.5%).
>>>>>> > - Reduce the zero-results rate for non-automata search queries by
>>>>>> >   10%.
>>>>>> >
>>>>>> > We brainstormed a number of possibilities here:
>>>>>> >
>>>>>> > https://etherpad.wikimedia.org/p/LanguageSupportBrainstorming
>>>>>> >
>>>>>> > We now need to decide which of these ideas we should prioritize. We
>>>>>> > might want to take into consideration which of these can be
>>>>>> > pre-tested with our relevancy lab work, such that we can prefer to
>>>>>> > work on things we think will move the needle the most. I'm really
>>>>>> > not sure which of these to push forward on, so let us know which
>>>>>> > you think can have the most impact, or where the expected impact
>>>>>> > could be measured with relevancy lab with minimal work.
>>>>>>
>>>>>> --
>>>>>> Oliver Keyes
>>>>>> Count Logula
>>>>>> Wikimedia Foundation
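(One more thought on the header-based heuristic discussed above, i.e. determining the language to search in from the accept-language header: it seems cheap to prototype before touching the webrequests data at all. A minimal sketch, assuming only a raw header string as input; the q-value handling follows the HTTP spec, and nothing here reflects any actual schema or code we have:)

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language header into base language codes, best first.

    Quality values default to 1.0 per the HTTP spec; region subtags
    like "-BR" are dropped so the codes line up with wiki prefixes
    (ptwiki, enwiki, ...). Malformed weights are treated as 0.
    """
    langs = {}
    for part in accept_language.split(','):
        piece = part.strip()
        if not piece:
            continue
        lang, _, q = piece.partition(';q=')
        try:
            weight = float(q) if q else 1.0
        except ValueError:
            weight = 0.0
        code = lang.strip().split('-')[0].lower()
        # Keep the best weight seen for each base code.
        langs[code] = max(langs.get(code, 0.0), weight)
    return [code for code, w in
            sorted(langs.items(), key=lambda kv: -kv[1]) if w > 0]
```

A query with zero results on enwiki could then be retried against the top one or two wikis from this list that differ from the current wiki, and the webrequests table would let us estimate the hit rate offline.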
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
