Thanks, all. I've moved both tickets to the sprint board, and hopefully the
test <https://phabricator.wikimedia.org/T142620> will give us a lot of answers
for this ticket <https://phabricator.wikimedia.org/T141216> going forward.

Please let me know if I've missed anything! :)

Cheers,

Deb

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

On Wed, Aug 10, 2016 at 9:04 AM, David Causse <dcau...@wikimedia.org> wrote:

> The problem with French is slightly different but leads to much the same
> result: there is no ASCII folding configured currently.
>
> This causes the same intitle problem: it can't find words with
> diacritics when the diacritics are omitted from the query.
>
> For French, putting ASCII folding before the stemmer is certainly a bad
> idea, and we should IMO do ASCII folding after the stemmer (possibly using
> preserve original).
>
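
A minimal sketch of the ordering proposed above for French, written as
Elasticsearch analysis settings in a Python dict. The analyzer and filter
names (french_text, french_stem, fold_keep_original) are hypothetical and
are not CirrusSearch's real configuration; the choice of the light_french
stemmer is also an assumption. Only the relative order of the two filters
and the preserve_original flag come from the discussion.

```python
# Hypothetical analysis settings: stem first, then fold, keeping the original token.
french_analysis = {
    "filter": {
        # Assumed stemmer variant; the thread does not say which one is used.
        "french_stem": {"type": "stemmer", "language": "light_french"},
        # asciifolding can emit the unfolded token alongside the folded one.
        "fold_keep_original": {"type": "asciifolding", "preserve_original": True},
    },
    "analyzer": {
        "french_text": {
            "type": "custom",
            "tokenizer": "standard",
            # Filters run in list order: the stemmer still sees the accents.
            "filter": ["lowercase", "french_stem", "fold_keep_original"],
        }
    },
}


def filter_position(analysis, analyzer, name):
    """Return the index of a named filter in an analyzer's filter chain."""
    return analysis["analyzer"][analyzer]["filter"].index(name)
```

With preserve_original set, the accented token stays in the index next to
its folded form, so exact matches remain possible.
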
> On 10/08/2016 at 16:52, Trey Jones wrote:
>
> I'm less sure that re-ordering would do the right thing for French.
> Presumably the French stemmer knows about accented characters and uses
> them. We should test and make sure. Maybe we need custom folding only for
> "unusual" accents in any given language (they are all unusual for English).
>
> We can test French the same way we tested English, though, and be sure.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Wed, Aug 10, 2016 at 10:48 AM, David Causse <dcau...@wikimedia.org>
> wrote:
>
>> Thanks Trey!
>>
>> This will certainly greatly improve the intitle keyword, since it uses the
>> field with stems for filtering, and hopefully it will find pages that were
>> ignored because of this filter ordering (e.g. intitle:louys can't find
>> User:Louÿs currently).
>>
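
A toy illustration of how filter ordering could produce the louys / Louÿs
miss described above. This is one plausible mechanism, not a trace of the
real analysis chain: toy_stem is a hypothetical two-rule stand-in for an
ASCII-oriented English stemmer, and fold approximates ASCII folding by
stripping combining marks.

```python
import unicodedata


def fold(token):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Hypothetical stemmer: drop a plural 's', then turn a final 'y' into 'i'.

    It only knows ASCII letters, so a final 'ÿ' is not treated as 'y'.
    """
    if token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    if token.endswith("y") and any(ch in "aeiou" for ch in token[:-1]):
        token = token[:-1] + "i"
    return token


def stem_then_fold(token):
    return fold(toy_stem(token.lower()))


def fold_then_stem(token):
    return toy_stem(fold(token.lower()))


title, query = "Lou\u00ffs", "louys"  # "Louÿs" written with an escape
```

Under these assumptions, stem_then_fold(title) gives "louy" while
stem_then_fold(query) gives "loui", so the filter misses the page. With
fold_then_stem, both sides reduce to "loui" and the title is found.
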
>> I think I'll do the same for French, which suffers from the same problem.
>> IMO we should continue to work on this for other languages while we try to
>> switch from asciifolding (Latin letters only) to ICU folding.
>>
>> We may need some guidance for languages where removing diacritics
>> can be counterproductive, and maybe blacklist some letters (e.g. for
>> Finnish: is it appropriate to fold Ä or Ö?)
>>
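
A small sketch of the per-language letter blacklist floated above: fold
everything except a configurable set of protected letters. The function
name fold_except and the FINNISH_KEEP set are hypothetical. Whether Ä and
Ö should actually be protected for Finnish is the open question; the code
only shows the mechanism.

```python
import unicodedata

# Hypothetical set of letters to leave unfolded for Finnish.
FINNISH_KEEP = frozenset("\u00e4\u00f6\u00c4\u00d6")  # ä ö Ä Ö


def fold_except(token, keep=FINNISH_KEEP):
    """Strip diacritics from every character that is not in `keep`."""
    out = []
    # Normalize to NFC first so protected letters are single code points.
    for ch in unicodedata.normalize("NFC", token):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

With the default set, "pää" is left alone while "café" still folds to
"cafe"; passing an empty set folds "pää" to "paa".
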
>> Note on accent folding: Cirrus always tries to prefer exact matches.
>> Searching for élément should always prefer élément over element. Users who
>> want only exact matches can force Cirrus to discard stems by wrapping
>> the word in double quotes, e.g. "élément".
>>
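
A rough sketch of the "prefer exact matches" behaviour described above,
assuming a simple two-tier score. The weights and the score and rank
helpers are hypothetical and are not how Cirrus computes relevance; they
only show that a folded match can still be found while ranking below an
exact one.

```python
import unicodedata


def fold(token):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def score(query, word, exact_weight=2.0, folded_weight=1.0):
    """Score an exact (case-insensitive) match above a folded-only match."""
    q, w = query.lower(), word.lower()
    if q == w:
        return exact_weight
    if fold(q) == fold(w):
        return folded_weight
    return 0.0


def rank(query, words):
    """Return the matching words, best score first."""
    hits = [w for w in words if score(query, w) > 0]
    return sorted(hits, key=lambda w: score(query, w), reverse=True)
```

For the query "élément", rank puts "élément" ahead of "element" and
drops unrelated words; for the query "element" the order is reversed.
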
>> On 10/08/2016 at 16:07, Trey Jones wrote:
>>
>> David and I had a discussion about moving ascii-folding to come before
>> stemming on English Wikipedia. It seemed like a good idea, but we decided
>> we should run some tests before implementing it, just to be sure.
>>
>> Turns out it is a good idea!
>>
>> Much more detail:
>>     https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia
>>
>> We won't deploy it until we deploy BM25 later in the year, since it
>> requires a full re-index of English Wikipedia, as does BM25. That's
>> something we should only do once.
>>
>> —Trey
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>>
>> _______________________________________________
>> discovery mailing list
>> discovery@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/discovery
>
