@Tilman: Yes, some people are using Wikihadoop to convert dumps from XML to JSON. I have recently added code to it and I maintain it.
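
For anyone curious what that conversion involves, here is a rough, hypothetical sketch (not the actual Wikihadoop code; the export namespace version and the output field names below are assumptions and vary by dump) of turning a pages-meta-history XML dump into one JSON record per revision, written in Python in the style of a Hadoop Streaming mapper reading the dump on stdin:

#!/usr/bin/env python
"""Hypothetical sketch: stream a MediaWiki pages-meta-history XML dump from
stdin and print one JSON object per revision (e.g. as a Hadoop Streaming
mapper). Field names and the namespace version are illustrative only."""
import json
import sys
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the exact version varies between dumps.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'

def child_text(elem, tag):
    """Text of a direct child element, or None if it is missing."""
    child = elem.find(NS + tag)
    return child.text if child is not None else None

def revisions(stream):
    """Yield (page_title, page_id, <revision> element) triples without
    loading the whole dump into memory."""
    title, page_id = None, None
    for _, elem in ET.iterparse(stream):
        if elem.tag == NS + 'title':
            title = elem.text
        elif elem.tag == NS + 'id' and page_id is None:
            page_id = elem.text              # first <id> in a <page> is the page id
        elif elem.tag == NS + 'revision':
            yield title, page_id, elem
            elem.clear()                     # free the revision text as we go
        elif elem.tag == NS + 'page':
            title, page_id = None, None
            elem.clear()                     # (a real job would also prune the root)

if __name__ == '__main__':
    for title, page_id, rev in revisions(sys.stdin.buffer):
        contributor = rev.find(NS + 'contributor')
        record = {
            'page_title': title,
            'page_id': page_id,
            'rev_id': child_text(rev, 'id'),
            'timestamp': child_text(rev, 'timestamp'),
            'comment': child_text(rev, 'comment'),
            'user': child_text(contributor, 'username') if contributor is not None else None,
            'text': child_text(rev, 'text'),
        }
        sys.stdout.write(json.dumps(record) + '\n')

On a real cluster you would of course split the dump across many mappers rather than stream it through one process, which is the part a tool like Wikihadoop helps with.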
On Wed, May 18, 2016 at 4:22 PM, Tilman Bayer <[email protected]> wrote:

> Yes, of course *processing* the entire history (even with text) has been
> done before - but perhaps not storing or indexing it.
>
> BTW is anyone still using "Wikihadoop"?
>
> https://blog.wikimedia.org/2011/11/21/do-it-yourself-analytics-with-wikipedia/
> https://github.com/whym/wikihadoop
>
> On Wed, May 18, 2016 at 3:09 AM, Dan Andreescu <[email protected]> wrote:
>
>> Hi Tilman, thanks for pointing to this research. We have indeed worked on
>> this kind of project, for both ORES and the WikiCredit system. There are
>> many challenges like memory and processing time. Loading the entire
>> history without text is what we're working on right now for our
>> Wikistats 2.0 project. Even this has many challenges.
>>
>> As far as I can tell right now, any simple attempt to handle all the data
>> in one way or one place is going to run into some sort of limit. If
>> anybody finds otherwise, it would be useful to our work.
>>
>> *From:* Tilman Bayer
>> *Sent:* Tuesday, May 17, 2016 02:54
>> *To:* A mailing list for the Analytics Team at WMF and everybody who has
>> an interest in Wikipedia and analytics.
>> *Reply To:* A mailing list for the Analytics Team at WMF and everybody
>> who has an interest in Wikipedia and analytics.
>> *Cc:* A public mailing list about Wikimedia Search and Discovery projects
>> *Subject:* [Analytics] University project to make entire English
>> Wikipedia history searchable on Hadoop using Solr
>>
>> Detailed technical report on an undergraduate student project at Virginia
>> Tech (work in progress) to import the entire English Wikipedia history
>> dump into the university's Hadoop cluster and index it using Apache Solr,
>> to "allow researchers and developers at Virginia Tech to benchmark
>> configurations and big data analytics software":
>>
>> Steven Stulga, "English Wikipedia on Hadoop Cluster"
>> https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
>>
>> IIRC this has rarely or never been attempted due to the large size of the
>> dataset - 10TB uncompressed. And it looks like the author here
>> encountered an out of memory error that he wasn't able to solve before
>> the end of term...
>>
>> --
>> Tilman Bayer
>> Senior Analyst
>> Wikimedia Foundation
>> IRC (Freenode): HaeB
>>
>> --
>> Sent from Gmail Mobile
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
