That would be wonderful. Is it something that you just found or do you actually know them?
Maybe they could consider starting from a smaller language. If their
software is not good at parsing languages other than English, even the
Simple English Wikipedia would be more manageable.

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces, I want to live in peace.” – T. Moore

2016-05-17 9:54 GMT+03:00 Tilman Bayer <[email protected]>:

> Detailed technical report on an undergraduate student project at Virginia
> Tech (work in progress) to import the entire English Wikipedia history
> dump into the university's Hadoop cluster and index it using Apache Solr,
> to "allow researchers and developers at Virginia Tech to benchmark
> configurations and big data analytics software":
>
> Steven Stulga, "English Wikipedia on Hadoop Cluster"
> https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
>
> IIRC this has rarely or never been attempted due to the large size of the
> dataset - 10 TB uncompressed. And it looks like the author encountered an
> out-of-memory error that he wasn't able to solve before the end of the
> term...
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
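For anyone curious what the indexing step might look like in practice,
below is a minimal sketch (not the report's actual code) that streams a
MediaWiki XML dump and posts pages to Solr's JSON update endpoint in
batches. The Solr URL, the core name "wikipedia", the dump path, and the
assumption that the schema has id/title/text fields are all placeholders.
Streaming with iterparse and clearing each element keeps memory roughly
flat, which is one common way to sidestep the kind of out-of-memory
failure mentioned above.

#!/usr/bin/env python3
# Minimal sketch: stream a MediaWiki XML dump into Apache Solr.
# Assumptions (hypothetical, not from the report): Solr core named
# 'wikipedia' with id/title/text fields; a local dump file path.
import xml.etree.ElementTree as ET
import requests

SOLR_URL = "http://localhost:8983/solr/wikipedia/update?commit=true"
DUMP = "enwiki-pages-articles.xml"  # hypothetical local path
BATCH = 1000

def local(tag):
    # Dump elements carry a version-specific XML namespace;
    # compare local names only so the script works across dump versions.
    return tag.rsplit("}", 1)[-1]

docs = []
for event, elem in ET.iterparse(DUMP, events=("end",)):
    if local(elem.tag) != "page":
        continue
    doc = {}
    for child in elem.iter():
        name = local(child.tag)
        if name == "title":
            doc["title"] = child.text
        elif name == "id" and "id" not in doc:
            # The first <id> inside <page> is the page id;
            # later ones are revision ids.
            doc["id"] = child.text
        elif name == "text":
            doc["text"] = child.text or ""
    docs.append(doc)
    elem.clear()  # free the parsed subtree so memory stays bounded
    if len(docs) >= BATCH:
        # Solr's /update handler accepts a JSON array of documents.
        requests.post(SOLR_URL, json=docs).raise_for_status()
        docs = []
if docs:
    requests.post(SOLR_URL, json=docs).raise_for_status()

Batching the POSTs (rather than one request per page, or one giant
request for the whole dump) is the knob that trades indexing throughput
against peak memory on both ends.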
