Re: [discovery] Blog post about searching wikipedia dumps

Erik Bernhardson Thu, 07 Jul 2016 10:42:33 -0700

There is another similar article where they tested a different search
engine: http://www.searchtechnologies.com/querying-indexing-cloudsearch

Some takeaways:
* Considers longer articles more important
* Considers shorter titles more important (e.g. Germany vs List of German
Corps in World War II)
* Some hand tweaking ended up with the formula: text_relevance +
40.0*log10(content_size) - 15.0*log10(title_size)
* Defined a per-document boost from 0 to 10 based on which namespace
the page belongs to.
* Tweaked the formula into: text_relevance + (log10(content_size)*(doc_boost ==
1 ? 25.0 : 40.0)) - (log10(title_size)*15)
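
For anyone who wants to play with the numbers, here is a rough Python sketch of
the tweaked formula above. The function name rank_score, the argument names,
and the clamping of sizes to at least 1 are my own assumptions for
illustration; the article does not say what units content_size and title_size
are in, nor how a zero size is handled.

```python
import math


def rank_score(text_relevance, content_size, title_size, doc_boost=0):
    """Sketch of the tweaked ranking expression from the takeaways above.

    Hypothetical helper: the name, signature, and clamping are assumptions,
    not anything the article specifies.
    """
    # Assumption: clamp sizes to 1 so log10 never sees zero or a negative.
    content_size = max(content_size, 1)
    title_size = max(title_size, 1)

    # doc_boost == 1 ? 25.0 : 40.0
    content_weight = 25.0 if doc_boost == 1 else 40.0

    return (text_relevance
            + math.log10(content_size) * content_weight
            - math.log10(title_size) * 15.0)


if __name__ == "__main__":
    # Made-up sizes, only to show the direction of each term.
    print(rank_score(10.0, content_size=50000, title_size=7))
    print(rank_score(10.0, content_size=50000, title_size=36))
    print(rank_score(10.0, content_size=50000, title_size=7, doc_boost=1))
```

With text_relevance held equal, a longer body raises the score and a longer
title lowers it, which lines up with the first two takeaways.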

On Thu, Jul 7, 2016 at 10:29 AM, Erik Bernhardson <
[email protected]> wrote:

> Semi-interesting post from Search Technologies (aka Paul Score) about
> indexing Wikipedia data:
> http://www.searchtechnologies.com/wikipedia-azure-search
>
> Takeaways:
> * Automated entity detection, categorizing into person/place/organization
> * Offers search facets by wikipedia category and by entity detection
> * Multiple scoring profiles offered which change the weight between title
> and description (content? not clear)
>

_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
