@Tilman: Yes, some people are using Wikihadoop to convert dumps from XML to JSON. I have recently added code to it and I maintain it.
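
For anyone curious what that conversion involves, here is a rough, hypothetical sketch (not the actual Wikihadoop code; the export namespace version and the output field names below are assumptions and vary by dump) of turning a pages-meta-history XML dump into one JSON record per revision, written in Python in the style of a Hadoop Streaming mapper reading the dump on stdin:

#!/usr/bin/env python
"""Hypothetical sketch: stream a MediaWiki pages-meta-history XML dump from
stdin and print one JSON object per revision (e.g. as a Hadoop Streaming
mapper). Field names and the namespace version are illustrative only."""
import json
import sys
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the exact version varies between dumps.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'

def child_text(elem, tag):
    """Text of a direct child element, or None if it is missing."""
    child = elem.find(NS + tag)
    return child.text if child is not None else None

def revisions(stream):
    """Yield (page_title, page_id, <revision> element) triples without
    loading the whole dump into memory."""
    title, page_id = None, None
    for _, elem in ET.iterparse(stream):
        if elem.tag == NS + 'title':
            title = elem.text
        elif elem.tag == NS + 'id' and page_id is None:
            page_id = elem.text              # first <id> in a <page> is the page id
        elif elem.tag == NS + 'revision':
            yield title, page_id, elem
            elem.clear()                     # free the revision text as we go
        elif elem.tag == NS + 'page':
            title, page_id = None, None
            elem.clear()                     # (a real job would also prune the root)

if __name__ == '__main__':
    for title, page_id, rev in revisions(sys.stdin.buffer):
        contributor = rev.find(NS + 'contributor')
        record = {
            'page_title': title,
            'page_id': page_id,
            'rev_id': child_text(rev, 'id'),
            'timestamp': child_text(rev, 'timestamp'),
            'comment': child_text(rev, 'comment'),
            'user': child_text(contributor, 'username') if contributor is not None else None,
            'text': child_text(rev, 'text'),
        }
        sys.stdout.write(json.dumps(record) + '\n')

On a real cluster you would of course split the dump across many mappers rather than stream it through one process, which is the part a tool like Wikihadoop helps with.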
On Wed, May 18, 2016 at 4:22 PM, Tilman Bayer <[email protected]> wrote:

> Yes, of course *processing* the entire history (even with text) has been
> done before - but perhaps not storing or indexing it.
>
> BTW is anyone still using "Wikihadoop"?
>
> https://blog.wikimedia.org/2011/11/21/do-it-yourself-analytics-with-wikipedia/
> https://github.com/whym/wikihadoop
>
> On Wed, May 18, 2016 at 3:09 AM, Dan Andreescu <[email protected]> wrote:
>
>> Hi Tilman, thanks for pointing to this research. We have indeed worked on
>> this kind of project, for both ORES and the WikiCredit system. There are
>> many challenges like memory and processing time. Loading the entire
>> history without text is what we're working on right now for our
>> Wikistats 2.0 project. Even this has many challenges.
>>
>> As far as I can tell right now, any simple attempt to handle all the data
>> in one way or one place is going to run into some sort of limit. If
>> anybody finds otherwise, it would be useful to our work.
>>
>> *From:* Tilman Bayer
>> *Sent:* Tuesday, May 17, 2016 02:54
>> *To:* A mailing list for the Analytics Team at WMF and everybody who has
>> an interest in Wikipedia and analytics.
>> *Reply To:* A mailing list for the Analytics Team at WMF and everybody
>> who has an interest in Wikipedia and analytics.
>> *Cc:* A public mailing list about Wikimedia Search and Discovery projects
>> *Subject:* [Analytics] University project to make entire English
>> Wikipedia history searchable on Hadoop using Solr
>>
>> Detailed technical report on an undergraduate student project at Virginia
>> Tech (work in progress) to import the entire English Wikipedia history
>> dump into the university's Hadoop cluster and index it using Apache Solr,
>> to "allow researchers and developers at Virginia Tech to benchmark
>> configurations and big data analytics software":
>>
>> Steven Stulga, "English Wikipedia on Hadoop Cluster"
>> https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
>>
>> IIRC this has rarely or never been attempted due to the large size of the
>> dataset - 10TB uncompressed. And it looks like the author here
>> encountered an out of memory error that he wasn't able to solve before
>> the end of term...
>>
>> --
>> Tilman Bayer
>> Senior Analyst
>> Wikimedia Foundation
>> IRC (Freenode): HaeB
>>
>> --
>> Sent from Gmail Mobile
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
