Greetings Deb/Trey/Erik, I'd enjoy joining the discussions on these hackathon topics also.
Specifically, I'd like to see if I can help improve WMF's search relevance using additional machine learning techniques and ML packages.

Thanks,
--justin

On Wed, May 2, 2018 at 8:53 AM, Deborah Tankersley <[email protected]> wrote:

> Nice stuff!
>
> Should we set up a meeting to talk more in depth about this, as we're
> about 2 weeks out from the Hackathon right now?
>
> Cheers,
>
> Deb
>
> --
> deb tankersley
> Program Manager, Engineering
> Wikimedia Foundation
>
> On Wed, May 2, 2018 at 8:39 AM, Trey Jones <[email protected]> wrote:
>
>> I've got my own list of more language-focused not-necessarily-great
>> ideas, in order of my current desire to work on them:
>>
>> - Mirandese (mwl) analysis plugin built from Portuguese and French
>>   parts, plus a stop list provided by an mwl editor
>> - plugin to merge high surrogates and low surrogates that get split
>>   up by the Chinese analyzer
>> - plugin to do automatic homoglyph corrections
>> - plugin to do transliteration for languages where it is relatively
>>   easy (Serbian was on the list, but it's already done! And for very
>>   simple mappings this is just a char map)
>> - look into ways of automatically generating a stemmer from
>>   Wiktionary conjugation/declension data (maybe start with Estonian?)
>> - compare the analyzers for the top 5-10 wiki languages by volume,
>>   and look for ways to increase consistency among them
>> - develop a different statistical approach to detect wrong keyboard
>>   typing and build a search-only filter to generate alternative
>>   tokens, for Russian/English, Hebrew/English, OR one hand on wrong
>>   home row
>> - update RelForge with some additional metrics I've been collecting
>> - project Wordnet or other thesaurus/ontology onto short strings
>>   (e.g., Commons descriptions, Wikipedia titles, etc.) to determine
>>   useful thesaurus terms and prune the rest
>> - recheck differences in unpacked vs monolithic analyzers
>>   (eliminating our automatic upgrades, which are 98% likely to have
>>   caused the diffs)
>> - "Bollywood detector": identify and map Bollywood movie names into
>>   multiple scripts
>>
>> I was planning to work on the Mirandese analysis plugin and maybe one of
>> the next three on the list. But if anyone wants to collaborate on any of
>> the others, I'm happy to do so.
>>
>> Trey Jones
>> Sr. Software Engineer, Search Platform
>> Wikimedia Foundation
>>
>> On Tue, May 1, 2018 at 6:14 PM, Erik Bernhardson <[email protected]> wrote:
>>
>>> With the hackathon coming up I thought we could ponder what could be
>>> done while there. I've been constructing a list of horrible ideas over the
>>> last couple weeks:
>>>
_______________________________________________
Discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
