Yes, of course *processing* the entire history (even with text) has been
done before - but perhaps not storing or indexing it.

BTW is anyone still using "Wikihadoop"?
https://blog.wikimedia.org/2011/11/21/do-it-yourself-analytics-with-wikipedia/
https://github.com/whym/wikihadoop

On Wed, May 18, 2016 at 3:09 AM, Dan Andreescu <[email protected]>
wrote:

> Hi Tilman, thanks for pointing to this research. We have indeed worked on
> this kind of project, for both ORES and the WikiCredit system. There are
> many challenges like memory and processing time. Loading the entire history
> without text is what we're working on right now for our Wikistats 2.0
> project. Even this has many challenges.
>
> As far as I can tell right now, any simple attempt to handle all the data
> in one way or one place is going to run into some sort of limit. If anybody
> finds otherwise, it would be useful to our work.
>
> *From: *Tilman Bayer
> *Sent: *Tuesday, May 17, 2016 02:54
> *To: *A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics.
> *Reply To: *A mailing list for the Analytics Team at WMF and everybody
> who has an interest in Wikipedia and analytics.
> *Cc: *A public mailing list about Wikimedia Search and Discovery projects
> *Subject: *[Analytics] University project to make entire English
> Wikipedia history searchable on Hadoop using Solr
>
> Detailed technical report on an undergraduate student project at Virginia
> Tech (work in progress) to import the entire English Wikipedia history dump
> into the university's Hadoop cluster and index it using Apache Solr, to
> "allow researchers and developers at Virginia Tech to benchmark
> configurations and big data analytics software":
>
> Steven Stulga, "English Wikipedia on Hadoop Cluster"
> https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
>
> IIRC this has rarely or never been attempted due to the large size of the
> dataset - 10 TB uncompressed. And it looks like the author here encountered
> an out-of-memory error that he wasn't able to solve before the end of
> term...
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB