Here are summarized meeting notes:

* No-go on default fuzzy searching for mobile apps, so as to not hammer the server - prefix search (title-starts-with) to be used; if approaching default fuzzy search as the technology is refined, add a 300ms delay to iOS like for Android, though
* Possibly try this on beta mobile web, alpha mobile web, or a targeted language Wikipedia on the mobile web (maybe a larger beta mobile web language Wikipedia) to see how performance would go
* Search team to examine returning fewer fields in each search result record by default when the srprop mask is not specified (e.g., don't return snippets unless they're requested)
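The 300ms delay in the first bullet is a classic debounce. A minimal sketch in Python (the class name and API here are invented for illustration; the real apps implement this natively on iOS/Android):

```python
import threading

class Debouncer:
    """Fire `fn` only after `delay` seconds pass with no new calls -
    a sketch of the 300 ms key-tap delay discussed in the notes."""

    def __init__(self, delay, fn):
        self.delay = delay
        self.fn = fn
        self._timer = None

    def call(self, *args):
        # Each new keystroke cancels the pending request and restarts the clock.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.delay, self.fn, args)
        self._timer.start()
```

With a 0.3 s delay, typing "c", "ca", "cat" in quick succession issues a single search for "cat" rather than three.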
-Adam

On Wed, Apr 2, 2014 at 8:11 AM, Nikolas Everett <[email protected]> wrote:

> On Tue, Apr 1, 2014 at 7:19 PM, Adam Baso <[email protected]> wrote:
>
>> +mobile-l
>>
>> mobile-l recipients, if replying, please reply-all in case any people
>> on the CC: line aren't on mobile-l; it would be appreciated.
>>
>> Nik,
>>
>> Thanks for the update. Glad to hear there's even faster performance
>> coming, and also that there's no need to structure too much fallback
>> depending on whether the response time is okay. With any luck, it will
>> be just fast enough. I don't think there'd be too much hammering on the
>> suggest term; only if the result set is insufficient does it seem like
>> it would make sense to orchestrate the client-side (or server-side, for
>> that matter) call. The apps do have a key-tap timer on them to help
>> avoid spurious searching, so that should help. I think I understand the
>> ellipsis-related stuff - parsing the snippet text is no problem, but if
>> there's an even simpler way to get text condensed to the point where
>> there's no work to avoid wrapping on most form factors, cool! ...and if
>> I misunderstood, well, we'll get to the bottom of that on Friday.
>
> I suppose this is close to my heart because I just worked on it, but if
> you chop the snippet on the client side it defeats the logic used to
> pick the "best snippet". That logic isn't that great in Cirrus now, but
> it'll get a whole lot better when we deploy the new highlighter. Right
> now the snippet is always 150 characters or something, +/- 20-ish
> characters on each side to find a word break. We pick the best snippet
> based on hits in the 150-character window. At minimum we should let you
> configure it to something that'll fit better. I suppose the best option
> would be to configure up font widths that matter and then use them to
> chop really accurately.
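Nik's "150 characters, +/- 20-ish to find a word break" rule can be sketched roughly in Python (a toy illustration only, not Cirrus's actual highlighter; the function name and the exact slack handling are my own):

```python
def chop_snippet(text, limit=150, slack=20):
    """Aim for `limit` characters, then back up (within `slack`
    characters) to the nearest word break - the rough figures Nik
    describes for the current Cirrus snippet."""
    if len(text) <= limit + slack:
        return text
    window = text[: limit + slack]
    # Look for a space in the +/- slack band around the limit.
    cut = window.rfind(" ", max(0, limit - slack))
    if cut == -1:
        cut = limit  # no word break nearby; chop hard
    return window[:cut].rstrip() + "…"
```

As the thread notes, doing this on the client discards the server's choice of "best" window, which is why the limit is better made configurable server-side.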
> For the most part that sounds pretty simple and quick to implement, so
> long as we're OK with estimates that ignore stuff like ligatures.
>
> Do the type timers fire the hit on the leading character or on a slight
> hesitation? Prefix search on the site fires on the leading character and
> then <cancels> the request if the user types more. That is silly because
> I can't cancel the request on the backend... If it triggers on a
> hesitation then we should just plow ahead, I think. If it triggers on
> leading characters then we should totally cache requests shorter than N
> characters - 3 or 5 or something.
>
> From the code:
>
> // If we were constraining the namespace set, we would probably use
> // 0|1|2|3|4|5|6|7|12|13|14|15|100|101|108|109|446|447
> // to keep it related more to article, article talk, policy,
> // policy talk, help, and help talk types of resources.
> // The odd-numbered Talk pages could even be withheld, but that's
> // sort of pointless when the number of backlinks to them is
> // likely to be small, meaning they won't turn up too much
> // unless they're a result(set) of last resort, or the user
> // went to the trouble of prefix namespace searching such as Talk:Cats.
> // But realistically, it's probably easier to just stick to
> // not defining a namespace constraint set, and thereby (likely)
> // getting more pre-cached responses, due to other consumers leading
> // or following suit. There's a school of thought, or there could be,
> // that says only namespace 0 should be searched here, as it's
> // the core article content. But users may practically want
> // categories, too. And such logic spirals out from there.
> // If we were instead using the opensearch API and were seeking
> // parity with the desktop and mobile web experience, we should
> // indeed as of 27-March-2014 only be searching namespace 0.
> // But as CirrusSearch will be the norm and server load is expected
> // to handle things just fine (no fallback is necessary per Search team),
> // higher quality search results can now be obtained anyway.
>
> Cirrus searches all wgContentNamespaces by default and it is optimized
> to do so. All non-content namespaces are in another index, so we don't
> have to pay attention to it during the request. We also don't have to
> filter by namespace at all.
>
> Each namespace has a weight factor that influences its position. That
> factor often ends up being more important than links. Links are "score *
> log(incoming_links + 2)" and the weights vary from "score * 1" (MAIN) to
> "score * 0.0025" (TEMPLATE_TALK). Our power users expect these because
> lsearchd did it. Mobile users, who knows.
>
> // With all of this considered, we want a request of the following format
> //
> // en.m.wikipedia.org/w/api.php?action=query&list=search&srsearch=cats&srprop=snippet|sectiontitle&srlimit=15&srbackend=CirrusSearch&format=json
> // Note that MobileFrontend's use of opensearch has its result
> // set limited at 15. Note also that the 'srprop' only keeps 'snippet'
> // and 'sectiontitle', plus the 'title' field which is always implicit.
> // This buys us some additional features once we're ready for them,
> // all the while populating the cache.
> // We probably also will want to add 'srinterwiki=1' in some future
> // state so that users don't have to change their language-to-search
> // setting. As it is, 'srinterwiki' is not yet in place
> // and the format of such results may look a bit different,
> // so it's probably best to hold off on 'srinterwiki=1'. We are
> // not yet using the snippets and section titles, but let's get the
> // cache populated for our sake and everyone else's sake.
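The request described in the comment above could be assembled like so (a sketch: the helper name and the https scheme are assumptions of mine, while the parameters are copied from the thread):

```python
from urllib.parse import urlencode

def cirrus_search_url(term, host="en.m.wikipedia.org"):
    """Build the list=search request discussed in the thread.
    `host` defaults to the mobile English Wikipedia endpoint
    used in the example."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srprop": "snippet|sectiontitle",  # 'title' is always implicit
        "srlimit": 15,                     # matches opensearch's limit
        "srbackend": "CirrusSearch",
        "format": "json",
    }
    return "https://%s/w/api.php?%s" % (host, urlencode(params))
```

Keeping the parameter set identical across consumers is what makes the shared-cache argument in the comments work: the more clients issue byte-identical requests, the more cache hits everyone gets.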
> Interwiki is coming but I'd give it a few months, I think.
>
> // NOTE:
> // Although as of 27-March-2014 it seems that suggestions may not be
> // coming back for CirrusSearch as frequently as for Lucene, that's
> // probably just an artifact of relatively lower training of suggestions.
> // In other words, it's likely that the suggestion pairing will grow.
> // Currently, we're not examining
> // [@"query"][@"searchinfo"][@"suggestion"],
> // but we could. There are two cases for the suggestion.
> // 1. When the result set is of length 0, just fire off a search with
> //    the suggestion. This is the case where the user probably
> //    misspelled something.
> // 2. When the result set is of short length (less than 5?), fire
> //    another search with the suggestion, and then collate those search
> //    results /after/ the first result set.
>
> The suggestion is actually better than you give it credit for: even if
> lots of results show up, providing the suggestion might be useful. It
> comes from redirect and title names, and it'll suggest combinations that
> work. So if the user searches for "picket's charge" it'll suggest
> "pickett's charge" even though there are plenty of results for the first
> term. The results for the second term are better.
>
> The reason you get different results is because the implementations are
> vastly different. The Cirrus implementation has less tuning but is "more
> modern". Whatever that is worth.
>
> Nik
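The two suggestion cases in the NOTE above might be sketched as follows (a hypothetical helper: `search` stands in for whatever issues the API call, and the less-than-5 threshold is the comment's own guess):

```python
def apply_suggestion(results, suggestion, search, min_results=5):
    """Combine a result set with the API's spelling suggestion.

    Case 1: empty result set -> re-search with the suggestion
            (user probably misspelled something).
    Case 2: short result set (< min_results) -> search the suggestion
            and collate its results *after* the originals.
    Otherwise the original results are kept as-is.
    """
    if suggestion is None:
        return results
    if not results:
        return search(suggestion)  # case 1
    if len(results) < min_results:
        extra = [r for r in search(suggestion) if r not in results]
        return results + extra     # case 2: collate after
    return results
```

Nik's point in the reply is that this under-uses the suggestion: even a full result set for "picket's charge" can be worth supplementing, since the suggested "pickett's charge" yields better results.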
_______________________________________________
Mobile-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mobile-l
