Re: [WikimediaMobile] Similar articles feature performance in CirrusSearch for apps and mobile web

Adam Baso Sat, 23 Jan 2016 05:59:24 -0800

Hey all, am planning to look at Phabricator tasks and provide a reply
during the upcoming weekdays. Just wanted to acknowledge I saw your replies!


On Friday, January 22, 2016, Erik Bernhardson <[email protected]>
wrote:

> On Thu, Jan 21, 2016 at 1:29 AM, Joaquin Oltra Hernandez
> <[email protected]> wrote:
>
>> Regarding the caching, apps and web would need to agree on the URL and the
>> smaxage parameter, as Adam noted, so that the URLs are *exactly* the same.
>> That way we don't bloat Varnish, and the same cached objects are reused
>> across platforms.
>>
>> It is an extremely ad hoc and brittle solution, but it seems like it would
>> be the greatest win.
>>
>> 20% of the traffic from searches, with the feature only in Android and web
>> beta, seems like a lot to me. We should work on reducing it; otherwise,
>> when it hits web stable we're going to crush the servers. So caching seems
>> the highest priority.
>>
> To clarify, it's 20% of the load, as opposed to 20% of the traffic. But
> same difference :)
>
>
>> Let's chime in on https://phabricator.wikimedia.org/T124216 and continue
>> the cache discussion there.
>>
>> Regarding the validity of results with opening text only, how should we
>> proceed? Adam?
>>
> I've created https://phabricator.wikimedia.org/T124258 to track putting
> together an A/B test that measures the difference in click-through rates
> for the two approaches.
>
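One way the click-through comparison from such an A/B test could be evaluated is a two-proportion z-test. The sketch below is only an illustration: the function name, the bucket names, and the choice of test are assumptions, not something the thread or the Phabricator task specifies.

```python
import math


def ctr_z_score(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-score for the click-through rates of two buckets.

    Bucket A is assumed to be the default morelike query and bucket B the
    opening_text-only variant. Assumes both buckets have views and that at
    least one click (and one non-click) was recorded overall.
    """
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both buckets behave the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / std_err
```

An absolute z-score above roughly 1.96 would correspond to a difference that is significant at the conventional 5% level; smaller values would suggest the two variants are "about as good" as each other within the noise of the sample.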
>
>
>> On Wed, Jan 20, 2016 at 9:34 PM, David Causse <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Yes, we can combine many factors: templates (quality, but also
>>> disambiguation/stubs), size and others.
>>> Today Cirrus mostly uses the number of incoming links, which (imho) is
>>> not very good for morelike.
>>> On enwiki, results will also be scored according to the weights defined in
>>> https://en.wikipedia.org/wiki/MediaWiki:Cirrussearch-boost-templates.
>>>
>>> I wrote a small bash script to compare results:
>>> https://gist.github.com/nomoa/93c5097e3c3cb3b6ebad
>>> Here are some random results from the list (sometimes better, sometimes
>>> worse):
>>>
>>> $ sh morelike.sh Revolution_Muslim
>>> Defaults
>>>         "title": "Chess",
>>>         "title": "Suicide attack",
>>>         "title": "Zachary Adam Chesser",
>>> =======
>>> Opening text no boost links
>>>         "title": "Hungarian Revolution of 1956",
>>>         "title": "Muslims for America",
>>>         "title": "Salafist Front",
>>>
>>> $ sh morelike.sh Chesser
>>> Defaults
>>>         "title": "Chess",
>>>         "title": "Edinburgh",
>>>         "title": "Edinburgh Corn Exchange",
>>> =======
>>> Opening text no boost links
>>>         "title": "Dreghorn Barracks",
>>>         "title": "Edinburgh Chess Club",
>>>         "title": "Threipmuir Reservoir",
>>>
>>> $ sh morelike.sh Time_%28disambiguation%29
>>> Defaults
>>>         "title": "Atlantis: The Lost Empire",
>>>         "title": "Stargate",
>>>         "title": "Stargate SG-1",
>>> =======
>>> Opening text no boost links
>>>         "title": "Father Time (disambiguation)",
>>>         "title": "The Last Time",
>>>         "title": "Time After Time",
>>>
>>>
>>>
>>>
>>>
>>> On 20/01/2016 19:34, Jon Robson wrote:
>>>
>>>> I'm actually interested to see whether this yields better results in
>>>> certain examples where the algorithm is lacking [1]. If it's done as
>>>> an A/B test, we could even measure things such as click-throughs in the
>>>> related articles feature (whether they go up or not).
>>>>
>>>> Out of interest, is it also possible to take article size and type into
>>>> account and not return any morelike results for things like
>>>> disambiguation pages and stubs?
>>>>
>>>> [1] https://www.mediawiki.org/wiki/Topic:Swsjajvdll3pf8ya
>>>>
>>>>
>>>> On Wed, Jan 20, 2016 at 9:47 AM, Adam Baso <[email protected]>
>>>> wrote:
>>>>
>>>>> One thing we could do regarding the quality of the output is check results
>>>>> against a random sample of popular articles (example approach to find some
>>>>> articles) on mdot Wikipedia. Presuming that improves the quality of the
>>>>> recommendations or at least does not degrade them, we should consider
>>>>> adding the enhancement task to a future sprint, with further
>>>>> instrumentation and A/B testing / timeboxed beta test, etc.
>>>>>
>>>>> Joaquin, smaxage (e.g., 24 hour cached responses) does seem a good fix for
>>>>> now for further reduction of client perceived wait, at least for non-cold
>>>>> cache requests, even if we stop beating up the backend. Does anyone know
>>>>> of a compelling reason to not do that for the time being? The main thing
>>>>> that comes to mind, as always, is growing the Varnish cache object pool:
>>>>> probably not a huge deal while the thing is only in beta, but on the
>>>>> stable channel maybe noteworthy because it would run on probably most
>>>>> pages (but that's what edge caches are for, after all).
>>>>>
>>>>> Erik, from your perspective does use of smaxage relieve the backend
>>>>> sufficiently?
>>>>>
>>>>> If we do smaxage, then Web, Android, iOS should standardize their URLs so
>>>>> we get more cache hits at the edge across all clients. Here's the URL I
>>>>> see being used on the web today from mobile web beta:
>>>>>
>>>>>
>>>>> https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3
>>>>>
>>>>>
>>>>> -Adam
>>>>>
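Since the edge cache keys on the exact URL, every client has to emit a byte-identical string for the same article. A minimal sketch of one way to do that follows, assuming the parameter set from the URL quoted above, a 24-hour smaxage as in the example given, and alphabetical parameter ordering as the shared convention. The function name and the ordering rule are hypothetical; the thread does not settle on either.

```python
from urllib.parse import urlencode


def canonical_related_url(title, host="en.m.wikipedia.org"):
    """Build the related-articles URL the same way on every client.

    The host is part of the cache key too, so clients would also have to
    agree on it; the mobile web host is used here only as an assumption.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "pageimages|pageterms",
        "piprop": "thumbnail",
        "pithumbsize": 80,
        "wbptterms": "description",
        "pilimit": 3,
        "generator": "search",
        "gsrsearch": "morelike:" + title,
        "gsrnamespace": 0,
        "gsrlimit": 3,
        "smaxage": 86400,  # 24 hours, in seconds
    }
    # Sorting gives one fixed parameter order regardless of how a client
    # happened to assemble the dictionary.
    return "https://%s/w/api.php?%s" % (host, urlencode(sorted(params.items())))
```

Any fixed order would work equally well; what matters is that Web, Android and iOS all follow the same one, and that no client appends extra parameters.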
>>>>> On Wed, Jan 20, 2016 at 7:45 AM, Joaquin Oltra Hernandez
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> I'd be up for it if we manage to fit it into an upcoming sprint and it is
>>>>>> worth it.
>>>>>>
>>>>>> We could run a controlled test against production with a long batch of
>>>>>> articles, check median/percentile response times over repeated runs, and
>>>>>> highlight the differing results for human inspection regarding quality.
>>>>>>
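The timing half of such a controlled test could look roughly like the sketch below. It assumes the caller supplies a `fetch` callable that performs one request and a list of URLs to exercise; both names, the number of runs, and the choice of the 95th and 99th percentiles are illustrative assumptions.

```python
import statistics
import time


def time_calls(fetch, urls, runs=3):
    """Return wall-clock seconds for each fetch(url), repeated `runs` times."""
    samples = []
    for _ in range(runs):
        for url in urls:
            start = time.perf_counter()
            fetch(url)
            samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Median plus 95th and 99th percentile of a list of timings."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "median": statistics.median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Running `summarize(time_calls(fetch, urls))` once for the default query and once for the opening-text variant would give two sets of figures to compare; repeated runs mostly measure warm caches, so the first pass may be worth looking at separately.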
>>>>>> It's been noted previously that the results are far from ideal (which they
>>>>>> are, because it is just morelike), and I think it would be a great idea
>>>>>> to change the endpoint to a specific one that is smarter and has some
>>>>>> cache (we could do much more to get relevant results besides text
>>>>>> similarity: take into account links, or "See also" links if there are
>>>>>> any, etc.).
>>>>>>
>>>>>> As a note, in mobile web the related articles extension allows editors to
>>>>>> specify articles to show in the section, which would avoid queries to
>>>>>> CirrusSearch if it were used more (once rolled into stable, I guess).
>>>>>>
>>>>>> I remember that the performance-related task was closed as resolved
>>>>>> (https://phabricator.wikimedia.org/T121254#1907192). Should we reopen it
>>>>>> or create a new one?
>>>>>>
>>>>>> I'm not sure if we ended up adding the smaxage parameter (I think we
>>>>>> didn't). Should we? To me it seems a no-brainer that we should be caching
>>>>>> these results in Varnish, since they don't need to be completely up to
>>>>>> date for this use case.
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 11:54 PM, Erik Bernhardson
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Both mobile apps and web are using CirrusSearch's morelike: feature,
>>>>>>> which is showing some performance issues on our end. We would like to
>>>>>>> make a performance optimization to it, but first we would prefer to run
>>>>>>> an A/B test to see if the results are still "about as good" as they are
>>>>>>> currently.
>>>>>>>
>>>>>>> The optimization is basically this: currently, more like this takes the
>>>>>>> entire article into account; we would like to change it to take only the
>>>>>>> opening text of an article into account. This should reduce the amount
>>>>>>> of work we have to do on the backend, reducing both server load and the
>>>>>>> latency the user sees when running the query.
>>>>>>>
>>>>>>> This can be triggered by adding these two query parameters to the search
>>>>>>> API request that is being performed:
>>>>>>>
>>>>>>> cirrusMltUseFields=yes&cirrusMltFields=opening_text
>>>>>>>
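As an illustration, a client could build both variants of the request from one helper and switch the two parameters on per test bucket. This is a sketch under assumptions: the helper name and the `opening_text_only` flag are hypothetical, and the base parameters are a trimmed-down version of the mobile web URL quoted elsewhere in this thread.

```python
from urllib.parse import urlencode

API_URL = "https://en.m.wikipedia.org/w/api.php"


def morelike_url(title, opening_text_only=False, limit=3):
    """Return a morelike search URL, optionally restricted to opening text."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": "morelike:" + title,
        "gsrnamespace": 0,
        "gsrlimit": limit,
    }
    if opening_text_only:
        # The two opt-in parameters described above, appended unchanged.
        params["cirrusMltUseFields"] = "yes"
        params["cirrusMltFields"] = "opening_text"
    return API_URL + "?" + urlencode(params)
```

Calling `morelike_url("Chess")` gives the current default request, and `morelike_url("Chess", opening_text_only=True)` gives the restricted one, so the two buckets differ only in those two parameters.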
>>>>>>>
>>>>>>> The API will give a warning that these parameters do not exist, but the
>>>>>>> warning is safe to ignore. Would any of you be willing to run this test?
>>>>>>> We would basically want to look at user-perceived latency along with
>>>>>>> click-through rates for the current default setup and for the restricted
>>>>>>> setup using only opening_text.
>>>>>>>
>>>>>>> Erik B.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Mobile-l mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>>>>>>>
>>>>>>>

_______________________________________________
Mobile-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mobile-l
