I added a short note on word embeddings to the talk page (beyond the word2vec already mentioned on the page). Just to extol fastText's virtues: training a fastText model is extremely easy.

Thanks,
--justin

On Wed, Jul 18, 2018 at 12:05 PM, Trey Jones <[email protected]> wrote:

> Hi everyone,
>
> I've got an update on the NLP project selection. We've narrowed things
> down to a handful of projects we could work on with a consultant, and a
> handful we could work on internally.
>
> David, Erik, and I reviewed a selection of the most promising-seeming
> and/or most interesting projects and gave them a very rough cost estimate
> based on how big of a relative impact they would have, technologically how
> hard they would be, and how difficult the UI aspect would be. The scores
> are not definitive, but helped guide the discussion. You can see the list
> of projects we looked at and more details of the scoring on MediaWiki
> <https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search#Current_Recommendations>.
>
> For the possibility of working with an outside consultant, we also
> considered how easily separated each project would be from our overall
> system (making it easier for someone new to get up to speed), how projects
> feed into each other, how easily we could work on projects ourselves (like,
> we know pretty much what to do, we just have to do it), etc.
>
> Our current *recommendation for an outside consultant* would be to start
> with (1) *spelling correction/did you mean improvements,* with an option
> to extend the project to include either (2) *"more like" suggestion
> improvements,* or (3) *query reformulation mining,* specifically for typo
> corrections.
>
> For spelling correction (#1), we are envisioning an approach that
> integrates generic intra-word and inter-word statistical models, optional
> language-specific features, and explicit weighted corrections. We believe
> we could mine redirects flagged as typo correction for explicit
> corrections, and the query reformulation mining (#3) would also provide
> frequency-weighted explicit corrections. Our hope is that a system built
> initially for English would be readily applicable to other alphabetic
> languages, most probably other Indo-European languages, based on statistics
> available from Elastic; and that some elements of the system could be
> applied to other non-alphabetic languages and languages that are
> typologically <https://en.wikipedia.org/wiki/Morphological_typology>
> dissimilar to Indo-European languages.
>
> Looking at the rest of the list, (a) *wrong keyboard detection* seems
> like something we should work on internally, since we already have a few
> good ideas on how to approach it. (b) *Acronym support* is a pet peeve
> for several members of the team, and seems to be straightforward to
> improve. (c) *Automatic stemmer building* and (d) *automatic stop word*
> generation aren't so much projects we should work on as things we should
> research to see if there are already tools or lists out there we could use
> to make the projects much easier.
>
> Comments and questions here or on the talk page are welcome.
>
> Cheers,
> —Trey
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
> On Tue, May 15, 2018 at 11:30 AM, Trey Jones <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I just finished putting together an annotated list of potential
>> applications of natural language processing to on-wiki search
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search>.
>> There are dozens and dozens of ideas there—including many that are
>> interesting but probably not practical. If you have any additional ideas,
>> questions, suggestions, recommendations, or preferences, please
>> share!—either on the mailing list or on the talk page.
>>
>> The goal is to narrow it down to one or two things to pursue over the
>> next two to four quarters, along with other projects we are working on.
>>
>> Thanks!
>> —Trey
>>
>> Trey Jones
>> Sr. Software Engineer, Search Platform
>> Wikimedia Foundation
>>
>
> _______________________________________________
> Discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
