Thank you, Houcemeddine, for your answer. At the moment the project is already funded by a Project Grant from the WMF. Nuria had referred to a formal collaboration as a sort of framework for accessing the Hadoop resources.
Thanks, Nuria, for your recommendation on importing the data and computing somewhere else. I will do some tests and estimate the time it might take, along with the rest of the computations, as everything needs to run on a monthly basis. This is something I definitely need to verify.

Best regards,
Marc

Message from Houcemeddine A. Turki <[email protected]> on Tue, 9 Jul 2019 at 17:20:

> Dear Sir,
> I thank you for your efforts. The link to H2020 is https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
> Yours Sincerely,
> Houcemeddine Turki
> ------------------------------
> *From:* Analytics <[email protected]> on behalf of Houcemeddine A. Turki <[email protected]>
> *Sent:* Tuesday, 9 July 2019 16:12
> *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
> *Subject:* Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases
>
> Dear Sir,
> I thank you for your efforts. When we were at WikiIndaba 2018, it was interesting to see your research work. The project is interesting, particularly because there are many cultures across the world that are underrepresented on the Internet and mainly on Wikipedia. Concerning the formal collaboration, I think that if your team can apply for an H2020 grant, this will be useful. This worked for the Scholia project and can work for you as well.
> Yours Sincerely,
> Houcemeddine Turki
> ------------------------------
> *From:* Analytics <[email protected]> on behalf of Nuria Ruiz <[email protected]>
> *Sent:* Tuesday, 9 July 2019 16:00
> *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
> *Subject:* Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases
>
> Marc:
>
> >We'd like to start the formal process to have an active collaboration, as it seems there is no other solution available
>
> Given that formal collaborations are somewhat hard to obtain (the research team has only so many resources), my recommendation would be to import the public data into another computing platform that is not as constrained as Labs in terms of space, and do your calculations there.
>
> Thanks,
>
> Nuria
>
> On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel <[email protected]> wrote:
>
> Thanks for your clarification, Nuria.
>
> The categorylinks table is working better lately. Computing counts on the pagelinks table is critical; I'm afraid there is no solution for this one.
>
> I thought about creating a temporary pagelinks table with data from the dumps for each language edition. But replicating the pagelinks database on the server's local disk would be very costly in terms of time and space. The enwiki pagelinks table alone must be more than 50GB, and the entire process would run for many days considering the other language editions too.
>
> Another count I need to do is the number of editors per article, which also gets stuck on the revision table. For the rest of the data, as you said, it is more about retrieval, and I can use alternatives.
>
> The queries to obtain counts from pagelinks are something that worked before with the database replicas, and a more powerful system like Hadoop would handle them with relative ease. The problem is a mixture of both retrieval and computing power.
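As a concrete starting point for the import-and-compute-elsewhere route discussed above, here is a minimal sketch (Python) that counts incoming links per article by streaming the public pagelinks SQL dump rather than replicating the table into a local database. The dump filename, the (pl_from, pl_namespace, pl_title, pl_from_namespace) column order and the namespace filter are assumptions to verify against the CREATE TABLE statement at the top of the dump.

"""Sketch: count incoming links per article by streaming the public
pagelinks SQL dump, instead of replicating the whole table into a local
database.  Assumes the usual <lang>wiki-latest-pagelinks.sql.gz layout with
tuples (pl_from, pl_namespace, pl_title, pl_from_namespace); check the
column order against the CREATE TABLE statement at the top of the dump."""

import gzip
import re
from collections import Counter

# One VALUES tuple: (pl_from, pl_namespace, 'pl_title', pl_from_namespace)
ROW_RE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)',(\d+)\)")


def count_inlinks(dump_path, namespace=0):
    """Return a Counter of incoming-link counts per target title."""
    counts = Counter()
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue
            for _pl_from, pl_ns, pl_title, _pl_from_ns in ROW_RE.findall(line):
                if int(pl_ns) == namespace:
                    counts[pl_title] += 1
    return counts


if __name__ == "__main__":
    # Placeholder filename; for enwiki expect tens of millions of keys,
    # so shard by title prefix if memory becomes a problem.
    inlinks = count_inlinks("cawiki-latest-pagelinks.sql.gz")
    print(inlinks.most_common(10))

This trades database space for one long sequential read per language edition, which is easier to schedule monthly than keeping full local copies of the tables.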
> We'd like to start the formal process to have an active collaboration, as it seems there is no other solution available and we cannot be stuck and not deliver the work promised. I'll let you know when I have more info.
>
> Thanks again.
> Best,
>
> Marc Miquel
>
> Message from Nuria Ruiz <[email protected]> on Tue, 9 Jul 2019 at 1:44:
>
> >Will there be a release for these two tables?
> No, sorry, there will not be. The dataset release is about pages and users. To be extra clear though, it is not tables but a denormalized reconstruction of the edit history.
>
> >Could I connect to Hadoop to see if the queries on pagelinks and categorylinks run faster?
> It is a bit more complicated than just "connecting", but I do not think we have to dwell on that, because, as far as I know, there is no categorylinks info in Hadoop.
>
> Hadoop has the set of data from MediaWiki that we use to create the dataset I pointed you to, https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history, and a bit more.
>
> Is it possible to extract some of this information from the XML dumps? Perhaps somebody on the list has other ideas?
>
> Thanks,
>
> Nuria
>
> P.S. Just so you know, in order to facilitate access to our computing resources and private data (there is no way for us to give access to only "part" of the data we hold in Hadoop), we require an active collaboration with our research team. We cannot support ad-hoc access to Hadoop for community members. Here is some info: https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>
> On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel <[email protected]> wrote:
>
> Hello Nuria,
>
> This seems like an interesting alternative for some of the data (page, users, revision). It can really help and make some processes faster (at the moment we have given up on running the revision queries again, as the new user_agent change made them slower as well). So we will take a look at it as soon as it is ready.
>
> However, the scripts are struggling with other tables: pagelinks and the category graph.
>
> For instance, we need to count the percentage of links an article directs to other pages, or the percentage of links it receives from a group of pages. Likewise, we need to run down the category graph starting from a specific group of categories. At the moment, the query that uses pagelinks is not really working for these counts, whether we pass parameters for the entire table or for specific parts (using batches).
>
> Will there be a release for these two tables? Could I connect to Hadoop to see if the queries on pagelinks and categorylinks run faster?
>
> If there is any other alternative, we'd be happy to try it, as we have not been able to progress for several weeks.
> Thanks again,
>
> Marc
>
> Message from Nuria Ruiz <[email protected]> on Tue, 9 Jul 2019 at 0:56:
>
> Hello,
>
> From your description it seems that your problem (well, your main problem) is not one of computation but rather data extraction. The Labs replicas are not meant for big data-extraction jobs, as you have just found out. Neither is Hadoop. Now, our team will soon be releasing a dataset of denormalized edit data that you can probably use. It is still up for discussion whether the data will be released as a JSON dump or in another format, but basically it is a denormalized version of all the data held in the replicas, and it will be created monthly.
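On the question of extracting some of this from the XML dumps: for the editors-per-article count specifically, one possible stop-gap until the denormalized dataset is released is to stream a stub-meta-history dump. Below is a sketch using the mwxml library; the library choice, the dump filename and the main-namespace filter are assumptions rather than anything the thread prescribes.

"""Sketch: count distinct editors per article by streaming a
stub-meta-history XML dump, as a possible stop-gap until the denormalized
mediawiki_history dataset is released.  Uses the mwxml library
(pip install mwxml); the dump filename is a placeholder, and holding one
set of editors per article needs several GB of RAM for the larger wikis."""

import gzip
from collections import defaultdict

import mwxml


def editors_per_article(dump_path):
    """Map article title -> number of distinct editors (registered or IP)."""
    editors = defaultdict(set)
    with gzip.open(dump_path, "rb") as fh:
        dump = mwxml.Dump.from_file(fh)
        for page in dump:
            if page.namespace != 0:          # main namespace only
                continue
            for revision in page:
                # revision.user can be None when the username was suppressed
                if revision.user is not None and revision.user.text is not None:
                    editors[page.title].add(revision.user.text)
    return {title: len(users) for title, users in editors.items()}


if __name__ == "__main__":
    counts = editors_per_article("cawiki-latest-stub-meta-history.xml.gz")
    print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])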
> Please take a look at the documentation of the dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>
> This is the phab ticket: https://phabricator.wikimedia.org/T208612
>
> So, to sum up, once this dataset is out (we hope late this quarter or early next), you can probably build your own datasets from it, thus rendering your usage of the replicas obsolete. Hopefully this makes sense.
>
> Thanks,
>
> Nuria
>
> On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel <[email protected]> wrote:
>
> To whom it may concern,
>
> I am writing in regard to the project *Cultural Diversity Observatory* and the data we are collecting. In short, this project aims at bridging the content gaps between language editions that relate to cultural and geographical aspects. For this we need to retrieve data from all language editions and Wikidata, and run some scripts that crawl down the category and link graphs, in order to create some datasets and statistics.
>
> The reason I am writing is that we are stuck: we cannot automate the scripts that retrieve data from the Replicas. We could create the datasets a few months ago, but during the past months it has been impossible.
>
> We are concerned because it is one thing to create the dataset once for research purposes and another to create it on a monthly basis. This is what we promised in the project grant <https://meta.wikimedia.org/wiki/Grants:Project/WCDO/Culture_Gap_Monthly_Monitoring> details, and now we cannot do it because of the infrastructure. It is important to do it on a monthly basis because the data visualizations and statistics Wikipedia communities will receive need to be updated.
>
> Lately there have been some changes in the Replicas databases, and queries that used to take several hours now get stuck completely. We tried to code them in multiple ways: a) using complex queries, b) doing the joins in code and in memory, c) downloading the parts of the tables that we require and storing them in a local database. *None is an option now* considering the current performance of the replicas.
>
> Bryan Davis suggested that this might be the moment to consult the Analytics team, considering the Hadoop environment is designed to run long, complex queries and has massively more compute power than the Wiki Replicas cluster. We would certainly be relieved if you considered letting us connect to these Analytics databases (Hadoop).
>
> Let us know if you need more information on the specific queries or the processes we are running. The server we are using is wcdo.eqiad.wmflabs. We will be happy to explain in detail anything you require.
>
> Thanks.
> Best regards,
>
> Marc Miquel
>
> PS: You can read about the method we follow to retrieve data and create the dataset here:
>
> * Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media (ICWSM). AAAI. ISSN 2334-0770. www.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
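For the category-graph part of the pipeline, one way to work within the current replica limits is to replace a single long-running join with many small batched queries, walking the graph breadth first. Here is a rough sketch under those assumptions; the analytics replica host name, database name, credential file, batch size and seed category are placeholders following the usual Cloud VPS conventions, and category titles are expected with underscores and without the "Category:" prefix.

"""Sketch: crawl the category graph breadth-first with many small batched
queries instead of one long-running join over categorylinks.  Connection
details are assumptions following the usual Cloud VPS conventions."""

import os

import pymysql

BATCH = 200  # keep each IN (...) list short so no single query runs long


def direct_subcategories(cur, categories):
    """Return the set of direct subcategory titles of the given categories."""
    subcats = set()
    cats = list(categories)
    for i in range(0, len(cats), BATCH):
        chunk = cats[i:i + BATCH]
        placeholders = ",".join(["%s"] * len(chunk))
        cur.execute(
            "SELECT p.page_title "
            "FROM categorylinks cl "
            "JOIN page p ON p.page_id = cl.cl_from "
            "WHERE cl.cl_type = 'subcat' AND cl.cl_to IN (" + placeholders + ")",
            chunk,
        )
        for (title,) in cur.fetchall():
            # varbinary columns come back as bytes on the replicas
            subcats.add(title.decode("utf-8") if isinstance(title, bytes) else title)
    return subcats


def crawl_categories(seed_categories, max_depth=5):
    """Breadth-first walk; returns every category reachable from the seeds."""
    conn = pymysql.connect(
        host="cawiki.analytics.db.svc.eqiad.wmflabs",   # assumed analytics replica host
        db="cawiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )
    seen = set(seed_categories)
    frontier = set(seed_categories)
    try:
        with conn.cursor() as cur:
            for _ in range(max_depth):
                frontier = direct_subcategories(cur, frontier) - seen
                if not frontier:
                    break
                seen |= frontier
    finally:
        conn.close()
    return seen


if __name__ == "__main__":
    # Placeholder seed category
    print(len(crawl_categories({"Cultura_de_Catalunya"})))

Keeping each query small means no individual statement runs long enough to be killed, at the cost of more round trips; the depth limit also guards against cycles and runaway branches in the category graph.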
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
