On Sun, Jan 10, 2016 at 8:05 AM, Edison Nica <[email protected]> wrote:
> Dario Taraborelli <dtaraborelli@...> writes:
>
> >
> > What Greg said: Common Crawl is an excellent data source to answer
> > these questions, see:
> >
> > http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/
> > http://blog.commoncrawl.org/2015/02/wikireverse-visualizing-reverse-links-with-open-data/
> >
> > For aggregate stats about referrals to individual articles by traffic,
> > aggregated at domain level, you may also be interested in this dataset:
> >
> > http://figshare.com/articles/Wikipedia_Clickstream/1305770
> >
> > > On Dec 2, 2015, at 8:06 AM, Greg Lindahl <lindahl <at> pbm.com> wrote:
> > >
> > > On Tue, Dec 01, 2015 at 07:50:23PM +0100, Federico Leva (Nemo) wrote:
> > >> Edison Nica, 29/11/2015 16:56:
> > >>> how many non-wikipedia pages point to a certain wikipedia page
> > >>
> > >> I guess the only way we have to know this (other than grepping
> > >> request logs for referrers, which would be quite a nightmare) is to
> > >> access the Google Webmaster account for wikipedia.org (to which a
> > >> couple employees had access, IIRC).
> > >
> > > There are a couple of other ways to figure out inlinks:
> > >
> > > * Common Crawl
> > > * Commercial SEO services like Moz or Ahrefs
> > >
> > > In the medium term the Internet Archive is going to be generating this
> > > kind of link data as part of the Wayback Machine search engine effort.
> > >
> > > And finally, Edison, counting the number of inlinks without
> > > considering their rank or popularity will probably leave you
> > > vulnerable to people orchestrating googlebombs. And you might want to
> > > also know the anchor text, which is extremely valuable for search
> > > indexing.
> > >
> > > -- greg
> > >
> > > _______________________________________________
> > > Analytics mailing list
> > > Analytics <at> lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> > Dario Taraborelli
> > Head of Research, Wikimedia Foundation
> > wikimediafoundation.org • nitens.org • <at> readermeter
> >
>
> Thank you all for your replies, and I apologize for my improper usage of
> the English language (see 'no offence').
>
> I built my first Wikipedia Search App a while ago. It is a test bed for
> my Offline Search Engine, and it contains only medical information
> for now.
>
> https://play.google.com/store/apps/details?id=com.zeropii.publish.txt.medical
> (BTW, this app requires no permissions and does no tracking of what the
> user is searching. Watch out: the APK is 78MB, if you plan on installing it.)
>
> I am now building the second version, which will extend to the full
> Wikipedia.
>
> If everything works right, I have another 3-6 months until I will need
> the Analytics to improve the search.
>
> BTW, if this is public information, what Search Engine do you use?

We use an Elasticsearch cluster to power search. The full index, not including replicas, is just under 3TB.

> Do you use a custom one?

The search engine is not custom. We do use a custom MediaWiki extension to turn user queries into Elasticsearch queries.

https://www.mediawiki.org/wiki/Extension:CirrusSearch

> Do you use the Analytics to refine search?

Currently no. We are working on using analytics to generate a popularity score based on page view data to improve search. We also have a stretch goal to calculate page rank within wikis to replace our current use of incoming wikilink count as part of the scoring algorithm.
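
To illustrate how page rank differs from a raw incoming-link count as a scoring signal (the same concern Greg raised about googlebombs), here is a minimal Python sketch. It is not the CirrusSearch scoring code: the toy link graph, the function names `inlink_counts` and `pagerank`, and the damping factor of 0.85 are all assumptions made for illustration.

```python
from collections import defaultdict


def inlink_counts(links):
    """Count incoming links per page from a {source: [targets]} mapping."""
    counts = defaultdict(int)
    for source, targets in links.items():
        counts[source] += 0  # make sure every source page appears
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return dict(counts)


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a {source: [targets]} mapping."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = [t for t in set(links.get(page, [])) if t != page]
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy wiki: "Hub" is linked from a page that is itself linked,
# while "Spam" is linked from three pages that nothing links to.
toy_links = {
    "Main": ["Hub"],
    "Hub": ["Main"],
    "S1": ["Spam"],
    "S2": ["Spam"],
    "S3": ["Spam"],
    "Spam": [],
}
counts = inlink_counts(toy_links)
ranks = pagerank(toy_links)
```

In this toy graph "Spam" has three incoming links and "Hub" has only one, so a pure link count ranks "Spam" higher, while the page rank of "Hub" comes out higher because its single link comes from a page that is itself well linked.
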
> My goal is to understand if Analytics could substantially improve my
> (Wikipedia) search engine or not.
>
> Thank you again for your answers and pointers!
>
> Edison Nica
> www.0Pii.com
> Edisonn at 0pii dot com
