On Sun, Jan 10, 2016 at 8:05 AM, Edison Nica wrote:

> Dario Taraborelli <dtaraborelli@...> writes:
>
> >
> > what Greg said, Common Crawl is an excellent data source to answer
> > these questions, see:
> >
> > http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/
> > http://blog.commoncrawl.org/2015/02/wikireverse-visualizing-reverse-links-with-open-data/
> >
> > for aggregate stats about referrals to individual articles by traffic
> > and aggregated at domain level, you may also be interested in this
> > dataset:
> >
> > http://figshare.com/articles/Wikipedia_Clickstream/1305770
> >
> > > On Dec 2, 2015, at 8:06 AM, Greg Lindahl <lindahl <at> pbm.com> wrote:
> > >
> > > On Tue, Dec 01, 2015 at 07:50:23PM +0100, Federico Leva (Nemo) wrote:
> > >> Edison Nica, 29/11/2015 16:56:
> > >>> how many non-wikipedia pages point to a certain wikipedia page
> > >>
> > >> I guess the only way we have to know this (other than grepping
> > >> request logs for referrers, which would be quite a nightmare) is to
> > >> access the Google Webmaster account for wikipedia.org (to which a
> > >> couple employees had access, IIRC).
> > >
> > > There are a couple of other ways to figure out inlinks:
> > >
> > > * Common Crawl
> > > * Commercial SEO services like Moz or Ahrefs
> > >
> > > In the medium term the Internet Archive is going to be generating
> > > this kind of link data as part of the Wayback Machine search engine
> > > effort.
> > >
> > > And finally, Edison, counting the number of inlinks without
> > > considering their rank or popularity will probably leave you
> > > vulnerable to people orchestrating googlebombs. And you might want
> > > to also know the anchortext, that's extremely valuable for search
> > > indexing.
> > >
> > > -- greg
> > >
> > >
> > >
> > > _______________________________________________
> > > Analytics mailing list
> > > Analytics <at> lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> > Dario Taraborelli  Head of Research, Wikimedia Foundation
> > wikimediafoundation.org • nitens.org •  <at> readermeter
> >
> >
>
> Thank you all for your replies, and I apologize for my improper usage of
> the English language (see 'no offence').
>
> I built my first Wikipedia Search App a while ago. It is a test bed for
> my Offline Search Engine, and it contains only medical-related
> information for now.
>
> https://play.google.com/store/apps/details?id=com.zeropii.publish.txt.medical
> (BTW, this app requires no permissions and does not track what the user
> is searching for. Watch out: the APK is 78MB, if you plan on installing
> it.)
>
> I am now building the second version, which will extend to full
> Wikipedia.
>
> If everything works right, I have another 3-6 months until I need
> the Analytics to improve the search.
>
>
> BTW, if this is public information, what Search Engine do you use?
>

We use an Elasticsearch cluster to power search. The full index, not
including replicas, is just under 3TB.

> Do you use a custom one?
>

The search engine is not custom. We do use a custom MediaWiki extension to
turn user queries into Elasticsearch queries:
https://www.mediawiki.org/wiki/Extension:CirrusSearch
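
For illustration only, here is a minimal sketch of what "turning a user
query into an Elasticsearch query" can look like. This is not CirrusSearch's
code. The field names `title` and `text`, the title boost, and the
`build_query` helper are all assumptions made up for the example. It only
builds the JSON body, so it runs without a cluster.

```python
import json
import re

# A token is either a "quoted phrase" or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')

# Hypothetical field names; a real index defines its own mapping.
FIELDS = ["title^3", "text"]


def build_query(user_query, size=10):
    """Translate a raw user query string into an Elasticsearch query body."""
    must = []
    for phrase, word in TOKEN.findall(user_query):
        if phrase:
            # Quoted text must match as a phrase, in order.
            must.append({"multi_match": {"query": phrase,
                                         "type": "phrase",
                                         "fields": FIELDS}})
        elif word.startswith("intitle:") and len(word) > len("intitle:"):
            # An intitle: prefix restricts the term to the title field.
            must.append({"match": {"title": word[len("intitle:"):]}})
        else:
            # Plain words are searched in all fields, title weighted higher.
            must.append({"multi_match": {"query": word, "fields": FIELDS}})
    return {"size": size, "query": {"bool": {"must": must}}}


if __name__ == "__main__":
    body = build_query('"heart attack" intitle:aspirin dosage')
    print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the body of a search request. The
point is only that the user-facing syntax and the engine's query language
are separate layers.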


> Do you use the Analytics to refine search?
>

Currently no. We are working on using analytics to generate a popularity
score based on page view data to improve search. We also have a stretch
goal to calculate PageRank within wikis to replace our current use of
incoming wikilink count as part of the scoring algorithm.
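
To make the difference between an incoming link count and PageRank concrete,
here is a small sketch on a toy link graph. It is a textbook power-iteration
PageRank, not our implementation. The graph, the damping factor of 0.85, and
the function names are assumptions chosen for the example.

```python
def incoming_link_counts(links):
    """Count how many links point at each page."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n
                    for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its rank equally among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: C and D each have two incoming links.
    toy = {"A": ["C"], "B": ["C"], "C": ["D"], "D": [], "E": ["D"]}
    print(incoming_link_counts(toy))
    print(pagerank(toy))
```

In the toy graph, C and D both have two incoming links, so a plain count
cannot separate them. PageRank scores D higher because one of its links
comes from C, which is itself well linked.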


> My goal is to understand if Analytics could substantially improve my
> (Wikipedia) search engine or not.
>
> Thank you again for your answers and pointers!
>
> Edison Nica
> www.0Pii.com
> Edisonn at 0pii dot com