Thanks, all. I've moved both tickets to the sprint board, and hopefully the test <https://phabricator.wikimedia.org/T142620> will give us a lot of answers for this ticket <https://phabricator.wikimedia.org/T141216> going forward.
Please let me know if I've missed anything! :)

Cheers,

Deb

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

On Wed, Aug 10, 2016 at 9:04 AM, David Causse <dcau...@wikimedia.org> wrote:

> The problem with French is slightly different but leads to somewhat the
> same problems: there is no ASCII folding configured currently.
>
> This leads to the same intitle problems, where it can't find words with
> diacritics when the diacritics are omitted in the query.
>
> For French, putting ASCII folding before the stemmer is certainly a bad
> idea, and we should IMO do ASCII folding after the stemmer (possibly using
> preserve original).
>
> On 10/08/2016 at 16:52, Trey Jones wrote:
>
> I'm less sure that re-ordering would do the right thing for French.
> Presumably the French stemmer knows about accented characters and uses
> them. We should test and make sure. Maybe we need custom folding only for
> "unusual" accents in any given language (they are all unusual for English).
>
> We can test French the same way we tested English, though, and be sure.
>
> -Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Wed, Aug 10, 2016 at 10:48 AM, David Causse <dcau...@wikimedia.org>
> wrote:
>
>> Thanks Trey!
>>
>> This will certainly greatly improve the intitle keyword, as it uses the
>> field with stems for filtering, and hopefully it will find pages that were
>> ignored because of this filter ordering (e.g. intitle:louys can't find
>> User:Louÿs currently).
>>
>> I think I'll do the same for French, which suffers from the same problem.
>> IMO we should continue to work on this for other languages while we try to
>> switch from asciifolding (Latin letters only) to ICU folding.
>>
>> We may require some guidance on some languages where diacritics removal
>> can be counterproductive, and maybe blacklist some letters (e.g. for
>> Finnish: is it appropriate to fold Ä or Ö?)
>>
>> Note on accent folding: Cirrus tries to always prefer exact matches.
>> Searching for élément should always prefer élément over element. Users who
>> prefer exact matches can always force Cirrus to discard stems by wrapping
>> the word in double quotes, e.g. "élément".
>>
>> On 10/08/2016 at 16:07, Trey Jones wrote:
>>
>> David and I had a discussion about moving ASCII folding to come before
>> stemming on English Wikipedia. It seemed like a good idea, but we decided
>> we should run some tests before implementing it, just to be sure.
>>
>> Turns out it is a good idea!
>>
>> Much more detail:
>> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia
>>
>> We won't deploy it until we deploy BM25 later in the year, since it
>> requires a full re-index of English Wikipedia, as does BM25. That's
>> something we should only do once.
>>
>> -Trey
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
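
For anyone following along, here is a toy sketch of why filter order can matter for a case like intitle:louys vs. User:Louÿs. Everything in it is an assumption made for illustration: `toy_stem` is a hypothetical stemmer that only strips a final "s" after a plain a-z letter (standing in for stemming rules written with unaccented letters in mind), and `fold` uses Unicode decomposition as a rough stand-in for ASCII folding. It is not the CirrusSearch or Elasticsearch implementation, and the real stemmer may fail for different reasons.

```python
import unicodedata


def fold(token: str) -> str:
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical stemmer: drop a final 's' only when it follows a plain a-z letter."""
    if len(token) > 3 and token.endswith("s") and "a" <= token[-2] <= "z":
        return token[:-1]
    return token


def stem_then_fold(word: str) -> str:
    # Ordering where the stemmer sees the accented form first.
    return fold(toy_stem(word.lower()))


def fold_then_stem(word: str) -> str:
    # Ordering where folding runs before the stemmer.
    return toy_stem(fold(word.lower()))


title = "Lou\u00ffs"  # "Louÿs"
query = "louys"

# Stemmer first: the title keeps its 's', the query loses it, so they differ.
print(stem_then_fold(title), stem_then_fold(query))  # louys louy

# Folding first: both sides reach the stemmer in the same form and agree.
print(fold_then_stem(title), fold_then_stem(query))  # louy louy
```

In this toy setup, the title and the query only end up as the same token when folding runs before stemming. Whether that holds for a stemmer that relies on accented characters, as suggested for French above, would need to be tested separately.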
_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery