Re: [discovery] NLP for on-wiki search

Justin Ormont Wed, 18 Jul 2018 14:46:53 -0700

I added a slight bit on Word Embeddings to the talk page (beyond the
word2vec mentioned in the page). Just to extol its virtues, training a
fastText model is extremely easy.
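
To give a sense of how little is involved, here is a minimal sketch using
the fastText Python bindings. The helper names (write_corpus,
train_embeddings), the tokenization, and the parameter values are
illustrative assumptions, not anything from the talk page; only
fasttext.train_unsupervised is the library's own call.

```python
import re


def write_corpus(texts, path):
    """Write one lowercased, whitespace-tokenized text per line.

    fastText's unsupervised trainer reads a plain UTF-8 text file, so this
    hypothetical helper only normalizes and dumps the input. Returns the
    number of non-empty lines written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for text in texts:
            tokens = re.findall(r"\w+", text.lower())
            if tokens:
                out.write(" ".join(tokens) + "\n")
                count += 1
    return count


def train_embeddings(corpus_path, dim=100, min_count=5):
    """Train skip-gram embeddings on the corpus file (values are assumptions)."""
    # Imported lazily so write_corpus can be used without fastText installed.
    import fasttext

    return fasttext.train_unsupervised(
        str(corpus_path), model="skipgram", dim=dim, minCount=min_count
    )
```

With a corpus file in place, train_embeddings("corpus.txt") returns a model
whose get_word_vector("serach") still yields a vector for an unseen
misspelling, because fastText builds word vectors from character n-grams.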


Thanks,
--justin

On Wed, Jul 18, 2018 at 12:05 PM, Trey Jones <[email protected]> wrote:

> Hi everyone,
>
> I've got an update on the NLP project selection. We've narrowed things
> down to a handful of projects we could work on with a consultant, and a
> handful we could work on internally.
>
> David, Erik, and I reviewed a selection of the most promising-seeming
> and/or most interesting projects and gave them a very rough cost estimate
> based on how big of a relative impact they would have, technologically how
> hard they would be, and how difficult the UI aspect would be. The scores
> are not definitive, but helped guide the discussion. You can see the list
> of projects we looked at and more details of the scoring on MediaWiki
> <https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search#Current_Recommendations>.
>
> For the possibility of working with an outside consultant, we also
> considered how easily separated each project would be from our overall
> system (making it easier for someone new to get up to speed), how projects
> feed into each other, how easily we could work on projects ourselves (like,
> we know pretty much what to do, we just have to do it), etc.
>
> Our current *recommendation for an outside consultant* would be to start
> with (1) *spelling correction/did you mean improvements,* with an option
> to extend the project to include either (2) *"more like" suggestion
> improvements,* or (3) *query reformulation mining,* specifically for typo
> corrections.
>
> For spelling correction (#1), we are envisioning an approach that
> integrates generic intra-word and inter-word statistical models, optional
> language-specific features, and explicit weighted corrections. We believe
> we could mine redirects flagged as typo correction for explicit
> corrections, and the query reformulation mining (#3) would also provide
> frequency-weighted explicit corrections. Our hope is that a system built
> initially for English would be readily applicable to other alphabetic
> languages, most probably other Indo-European languages, based on statistics
> available from Elastic; and that some elements of the system could be
> applied to non-alphabetic languages and languages that are
> typologically <https://en.wikipedia.org/wiki/Morphological_typology>
> dissimilar to Indo-European languages.
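
As a rough illustration of how explicit weighted corrections could be
combined with a generic intra-word model, here is a minimal sketch. It is
not the team's design: the single-edit candidate generator, the toy
frequency table, the EXPLICIT mapping, and the explicit_weight value are
all assumptions made up for the example, and inter-word (context)
statistics are left out entirely.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Toy data, invented for illustration only.
WORD_FREQ = {"search": 500, "starch": 40, "wikipedia": 900}
# Explicit corrections, e.g. as might be mined from typo redirects or
# query reformulations: misspelling -> {correction: observed count}.
EXPLICIT = {"serach": {"search": 3}, "wikapeida": {"wikipedia": 2}}


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, word_freq, explicit, explicit_weight=10.0):
    """Return the best-scoring correction, or None if there is none.

    Known words are left alone. Candidates one edit away score by their
    frequency; explicit corrections add explicit_weight * count on top.
    """
    if word in word_freq:
        return None
    scores = Counter()
    for cand in edits1(word):
        if cand in word_freq:
            scores[cand] += word_freq[cand]
    for cand, count in explicit.get(word, {}).items():
        scores[cand] += explicit_weight * count
    if not scores:
        return None
    # Break ties on the candidate string so the result is deterministic.
    return max(scores, key=lambda c: (scores[c], c))
```

Here suggest("wikapeida", WORD_FREQ, EXPLICIT) is two edits from any known
word, so only the explicit correction can supply a candidate; that is the
kind of gap the mined corrections would be filling.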
>
> Looking at the rest of the list, (a) *wrong keyboard detection* seems
> like something we should work on internally, since we already have a few
> good ideas on how to approach it. (b) *Acronym support* is a pet peeve
> for several members of the team, and seems to be straightforward to
> improve. (c) *Automatic stemmer building* and (d) *automatic stop word*
> generation aren't so much projects we should work on as things we should
> research to see if there are already tools or lists out there we could use
> to make the projects much easier.
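
For anyone unfamiliar with wrong keyboard detection (a), one common way to
approach it is to remap the keystrokes to another layout and check whether
the result looks like real words. The sketch below assumes a Russian
ЙЦУКЕН / US QWERTY pair and a hypothetical known_words set; it is only an
illustration, not the approach the team has in mind.

```python
# Same physical keys, row by row, on US QWERTY and Russian ЙЦУКЕН.
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"

TO_LATIN = str.maketrans(JCUKEN, QWERTY)
TO_CYRILLIC = str.maketrans(QWERTY, JCUKEN)


def wrong_keyboard_candidates(query):
    """Return the query re-typed on the other layout, where that changes it."""
    q = query.lower()
    remapped = (q.translate(TO_LATIN), q.translate(TO_CYRILLIC))
    return [cand for cand in remapped if cand != q]


def suggest_layout_fix(query, known_words):
    """Suggest a remapped query only if the original has unknown tokens
    and every token of the remapped query is a known word."""
    q = query.lower()
    if all(tok in known_words for tok in q.split()):
        return None
    for cand in wrong_keyboard_candidates(q):
        tokens = cand.split()
        if tokens and all(tok in known_words for tok in tokens):
            return cand
    return None
```

For example, with known_words = {"wikipedia", "привет"}, the query
"ghbdtn" (typed on a QWERTY layout) remaps to "привет". A real system
would need per-language layout tables and a better test than exact
vocabulary lookup.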
>
> Comments and questions here or on the talk page are welcome.
>
> Cheers,
> —Trey
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
> On Tue, May 15, 2018 at 11:30 AM, Trey Jones <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I just finished putting together an annotated list of potential
>> applications of natural language processing to on-wiki search
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search>.
>> There are dozens and dozens of ideas there—including many that are
>> interesting but probably not practical. If you have any additional ideas,
>> questions, suggestions, recommendations, or preferences, please
>> share!—either on the mailing list or on the talk page.
>>
>> The goal is to narrow it down to one or two things to pursue over the
>> next two to four quarters, along with other projects we are working on.
>>
>> Thanks!
>> —Trey
>>
>> Trey Jones
>> Sr. Software Engineer, Search Platform
>> Wikimedia Foundation
>>
>>
>
> _______________________________________________
> Discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>
