The more-like code lives in Elasticsearch; https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html gives a decent rundown of the various parameters available. The defaults we currently use are at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/CirrusSearch.php#L450-L483
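To make the parameter rundown a bit more concrete, here is a rough sketch of what a more_like_this query body looks like when built by hand. The parameter names are the ones from the Elasticsearch documentation linked above; the numeric values, the index name and the document id are illustrative placeholders only and are not the CirrusSearch defaults.

```python
import json


def build_mlt_query(index, doc_id, fields):
    """Build an Elasticsearch more_like_this query body for one source document.

    All tuning values below are illustrative placeholders, not the values
    CirrusSearch actually ships with.
    """
    return {
        "query": {
            "more_like_this": {
                # Fields whose text is compared against the source document.
                "fields": fields,
                # The document to find neighbours for, referenced by id.
                "like": [{"_index": index, "_id": doc_id}],
                # Ignore terms that occur fewer times than this in the source.
                "min_term_freq": 2,
                # Ignore terms that appear in fewer documents than this.
                "min_doc_freq": 2,
                # Upper bound on how many terms are selected for the query.
                "max_query_terms": 25,
                # Share of the selected terms a candidate must match.
                "minimum_should_match": "30%",
            }
        }
    }


if __name__ == "__main__":
    # "example_index" and "42" are hypothetical; "opening_text" is the field
    # discussed further down in this thread.
    body = build_mlt_query("example_index", "42", ["opening_text"])
    print(json.dumps(body, indent=2))
```

Restricting "fields" to something short like opening_text is the kind of change being A/B tested in the thread quoted below; the other knobs mostly trade recall against query cost.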
These defaults can be overridden with a custom page on the wiki at MediaWiki:cirrussearch-morelikethis-settings. I can't suggest that editors tune this on their own, though; it requires careful testing to see what the changes do. The same options can also be overridden at query time via a series of internal test-only parameters implemented at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Hooks.php#L230-L252

On Thu, Feb 18, 2016 at 4:00 PM, Jon Katz <[email protected]> wrote:

> Hi,
> Can someone on this list point me to where the more-like code sits? Or,
> better yet, could someone document the rules that govern the
> prioritization of suggestions?
>
> I would like to document the logic for our communities so that we can have
> an open discussion about what variables and weighting we should use to
> suggest articles.
> -J
>
> On Mon, Feb 15, 2016 at 11:26 AM, Dmitry Brant <[email protected]>
> wrote:
>
>> Just a quick note that our latest production release (just published)
>> contains this A/B test, in addition to the other updates.
>> Looking forward to seeing the numbers from this!
>>
>> -Dmitry
>>
>> On Sun, Jan 31, 2016 at 9:35 PM, Dmitry Brant <[email protected]>
>> wrote:
>>
>>> Roger that! I think we could squeeze it in; the change would be pretty
>>> straightforward. We'll be able to release a Beta with this A/B test in
>>> short order, but it will probably be a couple of weeks until our next
>>> production release. I hope that's all right.
>>>
>>> On Sat, Jan 30, 2016 at 1:02 PM, Gabriel Wicke <[email protected]>
>>> wrote:
>>>
>>>> We are also happy to add cached entry points for high-traffic end
>>>> points in the REST API. I commented to that effect at
>>>> https://phabricator.wikimedia.org/T124216#1984206. Let us know if you
>>>> think this would be useful for this use case.
>>>>
>>>> On Sat, Jan 30, 2016 at 8:11 AM, Adam Baso <[email protected]> wrote:
>>>> > Okay. As per https://phabricator.wikimedia.org/T124225#1984080 I
>>>> > think if we're doing near-term experimentation with a controlled A/B
>>>> > test, the Android app is the only logical place to start. Dmitry, can
>>>> > that work for you? It's not required, but I think it would be neat to
>>>> > see if we can move the needle even more. Of course your quarterly
>>>> > goals take top priority...but what do you think?
>>>> >
>>>> > On Sat, Jan 23, 2016 at 5:58 AM, Adam Baso <[email protected]> wrote:
>>>> >>
>>>> >> Hey all, am planning to look at Phabricator tasks and provide a reply
>>>> >> during the upcoming weekdays. Just wanted to acknowledge I saw your
>>>> >> replies!
>>>> >>
>>>> >> On Friday, January 22, 2016, Erik Bernhardson <[email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>> On Thu, Jan 21, 2016 at 1:29 AM, Joaquin Oltra Hernandez
>>>> >>> <[email protected]> wrote:
>>>> >>>>
>>>> >>>> Regarding the caching, we would need to agree between apps and web
>>>> >>>> about the URL and smaxage parameter, as Adam noted, so that the URLs
>>>> >>>> are exactly the same, to avoid bloating Varnish and to reuse the
>>>> >>>> same cached objects across platforms.
>>>> >>>>
>>>> >>>> It is an extremely ad hoc and brittle solution but seems like it
>>>> >>>> would be the greatest win.
>>>> >>>>
>>>> >>>> 20% of the traffic from searches by being only in Android and web
>>>> >>>> beta seems a lot to me, and we should work on reducing it, otherwise
>>>> >>>> when it hits web stable we're going to crush the servers, so caching
>>>> >>>> seems the highest priority.
>>>> >>>>
>>>> >>> To clarify, it's 20% of the load, as opposed to 20% of the traffic.
>>>> >>> But same difference :)
>>>> >>>
>>>> >>>> Let's chime in at https://phabricator.wikimedia.org/T124216 and
>>>> >>>> continue the cache discussion there.
>>>> >>>>
>>>> >>>> Regarding the validity of results with opening text only, how
>>>> >>>> should we proceed? Adam?
>>>> >>>>
>>>> >>> I've put together https://phabricator.wikimedia.org/T124258 to track
>>>> >>> putting together an A/B test that measures the difference in
>>>> >>> click-through rates for the two approaches.
>>>> >>>
>>>> >>>> On Wed, Jan 20, 2016 at 9:34 PM, David Causse <[email protected]>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> Hi,
>>>> >>>>>
>>>> >>>>> Yes, we can combine many factors, from templates (quality but also
>>>> >>>>> disambiguation/stubs), size and others.
>>>> >>>>> Today cirrus uses mostly the number of incoming links, which (imho)
>>>> >>>>> is not very good for morelike.
>>>> >>>>> On enwiki, results will also be scored according to the weights
>>>> >>>>> defined in
>>>> >>>>> https://en.wikipedia.org/wiki/MediaWiki:Cirrussearch-boost-templates.
>>>> >>>>>
>>>> >>>>> I wrote a small bash script to compare results:
>>>> >>>>> https://gist.github.com/nomoa/93c5097e3c3cb3b6ebad
>>>> >>>>> Here are some random results from the list (sometimes better,
>>>> >>>>> sometimes worse):
>>>> >>>>>
>>>> >>>>> $ sh morelike.sh Revolution_Muslim
>>>> >>>>> Defaults
>>>> >>>>> "title": "Chess",
>>>> >>>>> "title": "Suicide attack",
>>>> >>>>> "title": "Zachary Adam Chesser",
>>>> >>>>> =======
>>>> >>>>> Opening text no boost links
>>>> >>>>> "title": "Hungarian Revolution of 1956",
>>>> >>>>> "title": "Muslims for America",
>>>> >>>>> "title": "Salafist Front",
>>>> >>>>>
>>>> >>>>> $ sh morelike.sh Chesser
>>>> >>>>> Defaults
>>>> >>>>> "title": "Chess",
>>>> >>>>> "title": "Edinburgh",
>>>> >>>>> "title": "Edinburgh Corn Exchange",
>>>> >>>>> =======
>>>> >>>>> Opening text no boost links
>>>> >>>>> "title": "Dreghorn Barracks",
>>>> >>>>> "title": "Edinburgh Chess Club",
>>>> >>>>> "title": "Threipmuir Reservoir",
>>>> >>>>>
>>>> >>>>> $ sh morelike.sh Time_%28disambiguation%29
>>>> >>>>> Defaults
>>>> >>>>> "title": "Atlantis: The Lost Empire",
>>>> >>>>> "title": "Stargate",
>>>> >>>>> "title": "Stargate SG-1",
>>>> >>>>> =======
>>>> >>>>> Opening text no boost links
>>>> >>>>> "title": "Father Time (disambiguation)",
>>>> >>>>> "title": "The Last Time",
>>>> >>>>> "title": "Time After Time",
>>>> >>>>>
>>>> >>>>> On 20/01/2016 19:34, Jon Robson wrote:
>>>> >>>>>>
>>>> >>>>>> I'm actually interested to see whether this yields better results
>>>> >>>>>> in certain examples where the algorithm is lacking [1]. If it's
>>>> >>>>>> done as an A/B test we could even measure things such as click
>>>> >>>>>> throughs in the related article feature (whether they go up or
>>>> >>>>>> not).
>>>> >>>>>>
>>>> >>>>>> Out of interest, is it also possible to take article size and type
>>>> >>>>>> into account and not return any morelike results for things like
>>>> >>>>>> disambiguation pages and stubs?
>>>> >>>>>>
>>>> >>>>>> [1] https://www.mediawiki.org/wiki/Topic:Swsjajvdll3pf8ya
>>>> >>>>>>
>>>> >>>>>> On Wed, Jan 20, 2016 at 9:47 AM, Adam Baso <[email protected]>
>>>> >>>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> One thing we could do regarding the quality of the output is
>>>> >>>>>>> check results against a random sample of popular articles
>>>> >>>>>>> (example approach to find some articles) on mdot Wikipedia.
>>>> >>>>>>> Presuming that improves the quality of the recommendations or at
>>>> >>>>>>> least does not degrade them, we should consider adding the
>>>> >>>>>>> enhancement task to a future sprint, with further instrumentation
>>>> >>>>>>> and A/B testing / timeboxed beta test, etc.
>>>> >>>>>>>
>>>> >>>>>>> Joaquin, smaxage (e.g., 24 hour cached responses) does seem a
>>>> >>>>>>> good fix for now for further reduction of client perceived wait,
>>>> >>>>>>> at least for non-cold cache requests, even if we stop beating up
>>>> >>>>>>> the backend. Does anyone know of a compelling reason to not do
>>>> >>>>>>> that for the time being? The main thing that comes to mind as
>>>> >>>>>>> always is growing the Varnish cache object pool - probably not a
>>>> >>>>>>> huge deal while the thing is only in beta, but on the stable
>>>> >>>>>>> channel maybe noteworthy because it would run on probably most
>>>> >>>>>>> pages (but that's what edge caches are for, after all).
>>>> >>>>>>>
>>>> >>>>>>> Erik, from your perspective does use of smaxage relieve the
>>>> >>>>>>> backend sufficiently?
>>>> >>>>>>>
>>>> >>>>>>> If we do smaxage, then Web, Android, iOS should standardize their
>>>> >>>>>>> URLs so we get more cache hits at the edge across all clients.
>>>> >>>>>>> Here's the URL I see being used on the web today from mobile web
>>>> >>>>>>> beta:
>>>> >>>>>>>
>>>> >>>>>>> https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3
>>>> >>>>>>>
>>>> >>>>>>> -Adam
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Jan 20, 2016 at 7:45 AM, Joaquin Oltra Hernandez
>>>> >>>>>>> <[email protected]> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> I'd be up for it if we manage to cram it into a following sprint
>>>> >>>>>>>> and it is worth it.
>>>> >>>>>>>>
>>>> >>>>>>>> We could run a controlled test against production with a long
>>>> >>>>>>>> batch of articles and check median/percentile response times
>>>> >>>>>>>> with repeated runs, and highlight the different results for
>>>> >>>>>>>> human inspection regarding quality.
>>>> >>>>>>>>
>>>> >>>>>>>> It's been noted previously that the results are far from ideal
>>>> >>>>>>>> (which they are because it is just morelike), and I think it
>>>> >>>>>>>> would be a great idea to change the endpoint to a specific one
>>>> >>>>>>>> that is smarter and has some cache (we could do much more to get
>>>> >>>>>>>> relevant results besides text similarity: take into account
>>>> >>>>>>>> links, or "see also" links if there are any, etc.).
>>>> >>>>>>>>
>>>> >>>>>>>> As a note, in mobile web the related articles extension allows
>>>> >>>>>>>> editors to specify articles to show in the section, which would
>>>> >>>>>>>> avoid queries to cirrussearch if it was more used (once rolled
>>>> >>>>>>>> into stable I guess).
>>>> >>>>>>>>
>>>> >>>>>>>> I remember that the performance related task was closed as
>>>> >>>>>>>> resolved (https://phabricator.wikimedia.org/T121254#1907192);
>>>> >>>>>>>> should we reopen it or create a new one?
>>>> >>>>>>>>
>>>> >>>>>>>> I'm not sure if we ended up adding the smaxage parameter (I
>>>> >>>>>>>> think we didn't); should we? To me it seems a no-brainer that we
>>>> >>>>>>>> should be caching these results in Varnish since they don't need
>>>> >>>>>>>> to be completely up to date for this use case.
>>>> >>>>>>>>
>>>> >>>>>>>> On Tue, Jan 19, 2016 at 11:54 PM, Erik Bernhardson
>>>> >>>>>>>> <[email protected]> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Both mobile apps and web are using CirrusSearch's morelike:
>>>> >>>>>>>>> feature, which is showing some performance issues on our end.
>>>> >>>>>>>>> We would like to make a performance optimization to it, but
>>>> >>>>>>>>> before that we would prefer to run an A/B test to see if the
>>>> >>>>>>>>> results are still "about as good" as they are currently.
>>>> >>>>>>>>>
>>>> >>>>>>>>> The optimization is basically: currently more like this takes
>>>> >>>>>>>>> the entire article into account; we would like to change this
>>>> >>>>>>>>> to take only the opening text of an article into account. This
>>>> >>>>>>>>> should reduce the amount of work we have to do on the backend,
>>>> >>>>>>>>> saving both server load and the latency the user sees running
>>>> >>>>>>>>> the query.
>>>> >>>>>>>>>
>>>> >>>>>>>>> This can be triggered by adding these two query parameters to
>>>> >>>>>>>>> the search API request that is being performed:
>>>> >>>>>>>>>
>>>> >>>>>>>>> cirrusMltUseFields=yes&cirrusMltFields=opening_text
>>>> >>>>>>>>>
>>>> >>>>>>>>> The API will give a warning that these parameters do not exist,
>>>> >>>>>>>>> but the warnings are safe to ignore. Would any of you be
>>>> >>>>>>>>> willing to run this test? We would basically want to look at
>>>> >>>>>>>>> user perceived latency along with click-through rates for the
>>>> >>>>>>>>> current default setup along with the restricted setup using
>>>> >>>>>>>>> only opening_text.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Erik B.
>>>>
>>>> --
>>>> Gabriel Wicke
>>>> Principal Engineer, Wikimedia Foundation
>>>
>>> --
>>> Dmitry Brant
>>> Mobile Apps Team (Android)
>>> Wikimedia Foundation
>>> https://www.mediawiki.org/wiki/Wikimedia_mobile_engineering
>>
>> --
>> Dmitry Brant
>> Mobile Apps Team (Android)
>> Wikimedia Foundation
>> https://www.mediawiki.org/wiki/Wikimedia_mobile_engineering

_______________________________________________
Mobile-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mobile-l
