Yes, of course *processing* the entire history (even with text) has been done before - but perhaps not storing or indexing it.
BTW is anyone still using "Wikihadoop"?
https://blog.wikimedia.org/2011/11/21/do-it-yourself-analytics-with-wikipedia/
https://github.com/whym/wikihadoop

On Wed, May 18, 2016 at 3:09 AM, Dan Andreescu <[email protected]> wrote:

> Hi Tilman, thanks for pointing to this research. We have indeed worked on
> this kind of project, for both ORES and the WikiCredit system. There are
> many challenges like memory and processing time. Loading the entire history
> without text is what we're working on right now for our Wikistats 2.0
> project. Even this has many challenges.
>
> As far as I can tell right now, any simple attempt to handle all the data
> in one way or one place is going to run into some sort of limit. If anybody
> finds otherwise, it would be useful to our work.
>
> *From: *Tilman Bayer
> *Sent: *Tuesday, May 17, 2016 02:54
> *To: *A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics.
> *Reply To: *A mailing list for the Analytics Team at WMF and everybody
> who has an interest in Wikipedia and analytics.
> *Cc: *A public mailing list about Wikimedia Search and Discovery projects
> *Subject: *[Analytics] University project to make entire English
> Wikipedia history searchable on Hadoop using Solr
>
> Detailed technical report on an undergraduate student project at Virginia
> Tech (work in progress) to import the entire English Wikipedia history dump
> into the university's Hadoop cluster and index it using Apache Solr, to
> "allow researchers and developers at Virginia Tech to benchmark
> configurations and big data analytics software":
>
> Steven Stulga, "English Wikipedia on Hadoop Cluster"
> https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
>
> IIRC this has rarely or never been attempted due to the large size of the
> dataset - 10TB uncompressed. And it looks like the author here encountered
> an out of memory error that he wasn't able to solve before the end of
> term...
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
>
> --
> Sent from Gmail Mobile
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
