Greetings Deb/Trey/Erik,

I'd enjoy joining the discussions on these hackathon topics as well.

Specifically, I'd like to see if I can help improve WMF's search relevance
using additional machine learning techniques/ML-packages.

Thanks,
--justin

On Wed, May 2, 2018 at 8:53 AM, Deborah Tankersley <
[email protected]> wrote:

> Nice stuff!
>
> Should we set up a meeting to talk more in depth about this, as we're
> about 2 weeks out from the Hackathon right now?
>
> Cheers,
>
> Deb
>
> --
>
> deb tankersley
>
> Program Manager, Engineering
>
> Wikimedia Foundation
>
> On Wed, May 2, 2018 at 8:39 AM, Trey Jones <[email protected]> wrote:
>
>> I've got my own list of more language-focused not-necessarily-great
>> ideas, in order of my current desire to work on them:
>>
>>    - Mirandese (mwl) analysis plugin built from Portuguese and French
>>    parts, plus a stop list provided by an mwl editor
>>    - plugin to merge high surrogates and low surrogates that get split
>>    up by the Chinese analyzer
>>    - plugin to do automatic homoglyph corrections
>>    - plugin to do transliteration for languages where it is relatively
>>    easy (Serbian was on the list, but it’s already done!—and for very simple
>>    mappings this is just a char map)
>>    - look into ways of automatically generating a stemmer from
>>    Wiktionary conjugation/declension data (maybe start with Estonian?)
>>    - compare the analyzers for the top 5-10 wiki languages by volume,
>>    and look for ways to increase consistency among them
>>    - develop a different statistical approach to detect wrong keyboard
>>    typing and build a search-only filter to generate alternative tokens—for
>>    Russian/English, Hebrew/English, OR one hand on wrong home row
>>    - update RelForge with some additional metrics I’ve been collecting
>>    - project Wordnet or other thesaurus/ontology onto short strings
>>    (e.g., Commons descriptions, Wikipedia titles, etc.) to determine useful
>>    thesaurus terms and prune the rest
>>    - recheck differences in unpacked vs monolithic analyzers
>>    (eliminating our automatic upgrades, which are 98% likely to have caused the
>>    diffs)
>>    - “Bollywood detector”—identify and map Bollywood movie names into
>>    multiple scripts
>>
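For the simple-mapping case in the transliteration item above ("just a char map"), the idea can be sketched in a few lines of Python. This is a minimal illustration only, not the approach used in any existing plugin: the table covers lowercase Serbian Cyrillic to Latin, and the names `CYR_TO_LAT` and `transliterate` are hypothetical.

```python
# Illustrative assumption: lowercase Serbian Cyrillic -> Latin, one entry per letter.
# Some letters map to two Latin characters (lj, nj, dž), so the targets are strings.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = [
    "a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i",
    "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
    "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š",
]

# str.maketrans accepts a dict of single characters to replacement strings.
CYR_TO_LAT = str.maketrans(dict(zip(CYRILLIC, LATIN)))


def transliterate(text: str) -> str:
    """Replace each mapped character; anything not in the table passes through."""
    return text.translate(CYR_TO_LAT)


if __name__ == "__main__":
    print(transliterate("београд"))
```

A one-to-one table like this only works where the mapping does not depend on context; languages that need context-sensitive rules would need more than a char map.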
>> I was planning to work on the Mirandese analysis plugin and maybe one of
>> the next three on the list. But if anyone wants to collaborate on any of
>> the others, I'm happy to do so.
>>
>> Trey Jones
>> Sr. Software Engineer, Search Platform
>> Wikimedia Foundation
>>
>> On Tue, May 1, 2018 at 6:14 PM, Erik Bernhardson <
>> [email protected]> wrote:
>>
>>> With the hackathon coming up I thought we could ponder what could be
>>> done while there. I've been constructing a list of horrible ideas over the
>>> last couple weeks:
>>>
>>>
>> _______________________________________________
>> Discovery mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>
>>
>
