Thank you, Houcemeddine, for your answer. At the moment the project is already funded by a Project Grant from the WMF. Nuria had referred to a formal collaboration as a sort of framework for accessing the Hadoop resources.
Thanks, Nuria, for your recommendation on importing the data and computing somewhere else. I will do some tests and estimate the time it might take, along with the rest of the computations, as everything needs to run on a monthly basis. This is something I definitely need to verify.

Best regards,
Marc

Message from Houcemeddine A. Turki <[email protected]> on Tue, 9 Jul 2019 at 17:20:

> Dear Sir,
> I thank you for your efforts. The link to H2020 is https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
> Yours Sincerely,
> Houcemeddine Turki
> ------------------------------
> *From:* Analytics <[email protected]> on behalf of Houcemeddine A. Turki <[email protected]>
> *Sent:* Tuesday, 9 July 2019 16:12
> *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
> *Subject:* Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases
>
> Dear Sir,
> I thank you for your efforts. When we were at WikiIndaba 2018, it was interesting to see your research work. The project is interesting, particularly because there are many cultures across the world that are underrepresented on the Internet and mainly on Wikipedia. Concerning the formal collaboration, I think that if your team can apply for an H2020 grant, this will be useful. This worked for the Scholia project and can work for you as well.
> Yours Sincerely,
> Houcemeddine Turki
> ------------------------------
> *From:* Analytics <[email protected]> on behalf of Nuria Ruiz <[email protected]>
> *Sent:* Tuesday, 9 July 2019 16:00
> *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
> *Subject:* Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases
>
> Marc:
>
> >We'd like to start the formal process to have an active collaboration, as it seems there is no other solution available
>
> Given that formal collaborations are somewhat hard to obtain (the research team has only so many resources), my recommendation would be to import the public data into another computing platform that is not as constrained as Labs in terms of space, and do your calculations there.
>
> Thanks,
>
> Nuria
>
> On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel <[email protected]> wrote:
>
> Thanks for your clarification, Nuria.
>
> The categorylinks table is working better lately. Computing counts on the pagelinks table is critical; I'm afraid there is no solution for this one.
>
> I thought about creating a temporary pagelinks table with data from the dumps for each language edition. But replicating the pagelinks database on the server's local disk would be very costly in terms of time and space. The enwiki pagelinks table alone must be more than 50GB, and the entire process would run for many days considering the other language editions too.
>
> Another count I need to do is the number of editors per article, which also gets stuck on the revision table. For the rest of the data, as you said, it is more about retrieval, and I can use alternatives.
>
> The queries to obtain counts from pagelinks are something that worked before with the database replicas, and a more powerful system like Hadoop would handle them with relative ease. The problem is a mixture of both retrieval and computing power.
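As a concrete starting point for the import-and-compute-elsewhere route discussed above, here is a minimal sketch (Python) that counts incoming links per article by streaming the public pagelinks SQL dump rather than replicating the table into a local database. The dump filename, the (pl_from, pl_namespace, pl_title, pl_from_namespace) column order and the namespace filter are assumptions to verify against the CREATE TABLE statement at the top of the dump.

"""Sketch: count incoming links per article by streaming the public
pagelinks SQL dump, instead of replicating the whole table into a local
database.  Assumes the usual <lang>wiki-latest-pagelinks.sql.gz layout with
tuples (pl_from, pl_namespace, pl_title, pl_from_namespace); check the
column order against the CREATE TABLE statement at the top of the dump."""

import gzip
import re
from collections import Counter

# One VALUES tuple: (pl_from, pl_namespace, 'pl_title', pl_from_namespace)
ROW_RE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)',(\d+)\)")


def count_inlinks(dump_path, namespace=0):
    """Return a Counter of incoming-link counts per target title."""
    counts = Counter()
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue
            for _pl_from, pl_ns, pl_title, _pl_from_ns in ROW_RE.findall(line):
                if int(pl_ns) == namespace:
                    counts[pl_title] += 1
    return counts


if __name__ == "__main__":
    # Placeholder filename; for enwiki expect tens of millions of keys,
    # so shard by title prefix if memory becomes a problem.
    inlinks = count_inlinks("cawiki-latest-pagelinks.sql.gz")
    print(inlinks.most_common(10))

This trades database space for one long sequential read per language edition, which is easier to schedule monthly than keeping full local copies of the tables.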
> We'd like to start the formal process to have an active collaboration, as it seems there is no other solution available and we cannot be stuck and not deliver the work promised. I'll let you know when I have more info.
>
> Thanks again.
> Best,
>
> Marc Miquel
>
> Message from Nuria Ruiz <[email protected]> on Tue, 9 Jul 2019 at 1:44:
>
> >Will there be a release for these two tables?
> No, sorry, there will not be. The dataset release is about pages and users. To be extra clear though, it is not tables but a denormalized reconstruction of the edit history.
>
> >Could I connect to Hadoop to see if the queries on pagelinks and categorylinks run faster?
> It is a bit more complicated than just "connecting", but I do not think we have to dwell on that, because, as far as I know, there is no categorylinks info in Hadoop.
>
> Hadoop has the set of data from MediaWiki that we use to create the dataset I pointed you to, https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history, and a bit more.
>
> Is it possible to extract some of this information from the XML dumps? Perhaps somebody on the list has other ideas?
>
> Thanks,
>
> Nuria
>
> P.S. Just so you know, in order to facilitate access to our computing resources and private data (there is no way for us to give access to only "part" of the data we hold in Hadoop), we require an active collaboration with our research team. We cannot support ad-hoc access to Hadoop for community members. Here is some info: https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>
> On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel <[email protected]> wrote:
>
> Hello Nuria,
>
> This seems like an interesting alternative for some of the data (page, users, revision). It can really help and make some processes faster (at the moment we have given up on running the revision queries again, as the new user_agent change made them slower as well). So we will take a look at it as soon as it is ready.
>
> However, the scripts are struggling with other tables: pagelinks and the category graph.
>
> For instance, we need to count the percentage of links an article directs to other pages, or the percentage of links it receives from a group of pages. Likewise, we need to run down the category graph starting from a specific group of categories. At the moment, the query that uses pagelinks is not really working for these counts, whether we pass parameters for the entire table or for specific parts (using batches).
>
> Will there be a release for these two tables? Could I connect to Hadoop to see if the queries on pagelinks and categorylinks run faster?
>
> If there is any other alternative, we'd be happy to try it, as we have not been able to progress for several weeks.
> Thanks again,
>
> Marc
>
> Message from Nuria Ruiz <[email protected]> on Tue, 9 Jul 2019 at 0:56:
>
> Hello,
>
> From your description it seems that your problem (well, your main problem) is not one of computation but rather data extraction. The Labs replicas are not meant for big data-extraction jobs, as you have just found out. Neither is Hadoop. Now, our team will soon be releasing a dataset of denormalized edit data that you can probably use. It is still up for discussion whether the data will be released as a JSON dump or in another format, but basically it is a denormalized version of all the data held in the replicas, and it will be created monthly.
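On the question of extracting some of this from the XML dumps: for the editors-per-article count specifically, one possible stop-gap until the denormalized dataset is released is to stream a stub-meta-history dump. Below is a sketch using the mwxml library; the library choice, the dump filename and the main-namespace filter are assumptions rather than anything the thread prescribes.

"""Sketch: count distinct editors per article by streaming a
stub-meta-history XML dump, as a possible stop-gap until the denormalized
mediawiki_history dataset is released.  Uses the mwxml library
(pip install mwxml); the dump filename is a placeholder, and holding one
set of editors per article needs several GB of RAM for the larger wikis."""

import gzip
from collections import defaultdict

import mwxml


def editors_per_article(dump_path):
    """Map article title -> number of distinct editors (registered or IP)."""
    editors = defaultdict(set)
    with gzip.open(dump_path, "rb") as fh:
        dump = mwxml.Dump.from_file(fh)
        for page in dump:
            if page.namespace != 0:          # main namespace only
                continue
            for revision in page:
                # revision.user can be None when the username was suppressed
                if revision.user is not None and revision.user.text is not None:
                    editors[page.title].add(revision.user.text)
    return {title: len(users) for title, users in editors.items()}


if __name__ == "__main__":
    counts = editors_per_article("cawiki-latest-stub-meta-history.xml.gz")
    print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])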
> Please take a look at the documentation of the dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>
> This is the phab ticket: https://phabricator.wikimedia.org/T208612
>
> So, to sum up, once this dataset is out (we hope late this quarter or early next), you can probably build your own datasets from it, thus rendering your usage of the replicas obsolete. Hopefully this makes sense.
>
> Thanks,
>
> Nuria
>
> On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel <[email protected]> wrote:
>
> To whom it may concern,
>
> I am writing in regard to the project *Cultural Diversity Observatory* and the data we are collecting. In short, this project aims at bridging the content gaps between language editions that relate to cultural and geographical aspects. For this we need to retrieve data from all language editions and Wikidata, and run some scripts that crawl down the category and link graphs, in order to create some datasets and statistics.
>
> The reason I am writing is that we are stuck: we cannot automate the scripts that retrieve data from the Replicas. We could create the datasets a few months ago, but during the past months it has been impossible.
>
> We are concerned because it is one thing to create the dataset once for research purposes and another to create it on a monthly basis. This is what we promised in the project grant <https://meta.wikimedia.org/wiki/Grants:Project/WCDO/Culture_Gap_Monthly_Monitoring> details, and now we cannot do it because of the infrastructure. It is important to do it on a monthly basis because the data visualizations and statistics Wikipedia communities will receive need to be updated.
>
> Lately there have been some changes in the Replicas databases, and queries that used to take several hours now get stuck completely. We tried to code them in multiple ways: a) using complex queries, b) doing the joins in code and in memory, c) downloading the parts of the tables that we require and storing them in a local database. *None is an option now* considering the current performance of the replicas.
>
> Bryan Davis suggested that this might be the moment to consult the Analytics team, considering the Hadoop environment is designed to run long, complex queries and has massively more compute power than the Wiki Replicas cluster. We would certainly be relieved if you considered letting us connect to these Analytics databases (Hadoop).
>
> Let us know if you need more information on the specific queries or the processes we are running. The server we are using is wcdo.eqiad.wmflabs. We will be happy to explain in detail anything you require.
>
> Thanks.
> Best regards,
>
> Marc Miquel
>
> PS: You can read about the method we follow to retrieve data and create the dataset here:
>
> * Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media (ICWSM). AAAI. ISSN 2334-0770. www.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
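For the category-graph part of the pipeline, one way to work within the current replica limits is to replace a single long-running join with many small batched queries, walking the graph breadth first. Here is a rough sketch under those assumptions; the analytics replica host name, database name, credential file, batch size and seed category are placeholders following the usual Cloud VPS conventions, and category titles are expected with underscores and without the "Category:" prefix.

"""Sketch: crawl the category graph breadth-first with many small batched
queries instead of one long-running join over categorylinks.  Connection
details are assumptions following the usual Cloud VPS conventions."""

import os

import pymysql

BATCH = 200  # keep each IN (...) list short so no single query runs long


def direct_subcategories(cur, categories):
    """Return the set of direct subcategory titles of the given categories."""
    subcats = set()
    cats = list(categories)
    for i in range(0, len(cats), BATCH):
        chunk = cats[i:i + BATCH]
        placeholders = ",".join(["%s"] * len(chunk))
        cur.execute(
            "SELECT p.page_title "
            "FROM categorylinks cl "
            "JOIN page p ON p.page_id = cl.cl_from "
            "WHERE cl.cl_type = 'subcat' AND cl.cl_to IN (" + placeholders + ")",
            chunk,
        )
        for (title,) in cur.fetchall():
            # varbinary columns come back as bytes on the replicas
            subcats.add(title.decode("utf-8") if isinstance(title, bytes) else title)
    return subcats


def crawl_categories(seed_categories, max_depth=5):
    """Breadth-first walk; returns every category reachable from the seeds."""
    conn = pymysql.connect(
        host="cawiki.analytics.db.svc.eqiad.wmflabs",   # assumed analytics replica host
        db="cawiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )
    seen = set(seed_categories)
    frontier = set(seed_categories)
    try:
        with conn.cursor() as cur:
            for _ in range(max_depth):
                frontier = direct_subcategories(cur, frontier) - seen
                if not frontier:
                    break
                seen |= frontier
    finally:
        conn.close()
    return seen


if __name__ == "__main__":
    # Placeholder seed category
    print(len(crawl_categories({"Cultura_de_Catalunya"})))

Keeping each query small means no individual statement runs long enough to be killed, at the cost of more round trips; the depth limit also guards against cycles and runaway branches in the category graph.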
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
