That would be wonderful.

Is this something that you just found, or do you actually know the people
behind it?

Maybe they could consider starting with a smaller language. And if their
software is not good at parsing languages other than English, the Simple
English Wikipedia would still be more manageable.
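
Incidentally, at that smaller scale the indexing step doesn't need Hadoop
at all. Here is a minimal Python sketch, purely illustrative and not taken
from the report below: the dump filename, the Solr URL, and the core name
"wikipedia" are my own assumptions.

import bz2
import requests
import xml.etree.ElementTree as ET

DUMP = "simplewiki-latest-pages-articles.xml.bz2"  # hypothetical filename
SOLR = "http://localhost:8983/solr/wikipedia/update?commit=true"  # assumed core
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace varies by dump version

def pages(path):
    """Stream (title, text) pairs from the dump without building the whole tree."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()  # drop the parsed subtree so memory stays flat

batch = []
for title, text in pages(DUMP):
    # Using the page title as Solr's unique id is a simplification for this sketch.
    batch.append({"id": title, "title": title, "text": text})
    if len(batch) >= 500:
        requests.post(SOLR, json=batch).raise_for_status()
        batch = []
if batch:
    requests.post(SOLR, json=batch).raise_for_status()

Loading the whole XML tree into memory is exactly how one runs into the
kind of out-of-memory error mentioned in the report; streaming with
iterparse and clearing each <page> element keeps memory roughly constant
regardless of dump size.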


--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬

2016-05-17 9:54 GMT+03:00 Tilman Bayer <[email protected]>:

> Detailed technical report on an undergraduate student project at Virginia
> Tech (work in progress) to import the entire English Wikipedia history dump
> into the university's Hadoop cluster and index it using Apache Solr, to
> "allow researchers and developers at Virginia Tech to benchmark
> configurations and big data analytics software":
>
> Steven Stulga, "English Wikipedia on Hadoop Cluster"
> https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
>
> IIRC this has rarely or never been attempted before, due to the sheer size
> of the dataset (10 TB uncompressed). And it looks like the author
> encountered an out-of-memory error that he wasn't able to solve before the
> end of term...
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
