Thanks Aaron and Oliver! Strategy 2 sounds like the right way to go.
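
In case it helps anyone else on the list, the rough shape of what I have in
mind is below. It is an untested sketch based on the example in the
mw.xml_dump documentation Aaron linked below, so treat the attribute names
as something to verify there rather than as gospel:

    # Sketch of strategy 2: walk an (already decompressed) XML dump and
    # yield the text of every revision. Assumes the mediawiki-utilities
    # package ("pip install mediawiki-utilities"); see the mw.xml_dump
    # docs Aaron links below for the real interface.
    from mw.xml_dump import Iterator

    def iter_revision_texts(path):
        dump = Iterator.from_file(open(path))
        for page in dump:
            for revision in page:
                # revision.text is the raw wikitext of this revision
                yield page.title, revision.id, revision.text

    for title, rev_id, text in iter_revision_texts("enwiki-dump.xml"):
        print(title, rev_id, len(text or ""))

Keeping it as a generator should keep memory flat even on the full-history
dumps.
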
By the way, I wrote a document [1] that describes the features to search
for when trying to estimate whether a page was translated or originally
written. Your comments are highly appreciated.

[1] How_to_detect_translated_articles
<https://www.mediawiki.org/w/index.php?title=Wikipedia_article_translation_metrics/How_to_detect_translated_articles&redirect=no>

Cheers,
Neta

On Mon, Jan 26, 2015 at 5:23 AM, Oliver Keyes <[email protected]> wrote:
> Yup. For context: because of the scale of Wikimedia's MediaWiki
> instances, we actually store revision contents in their own cluster,
> not in the pertinent field within the MediaWiki database schema - that
> field instead acts as a pointer to where the content really lives. One
> of the consequences of this is that even the R&D analysts don't have
> direct access :/. If you're operating in Python, I'd thoroughly
> recommend Aaron's proposed utility; it's probably my favourite way to
> process the dumps.
>
> On 25 January 2015 at 19:18, Aaron Halfaker <[email protected]> wrote:
> > Neta,
> >
> > There are two ways to get revision text.
> >
> > 1. Query the API. See
> > https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brevisions
> > Take special note of the "content" value of the rvprop parameter. This
> > strategy is good when you want to process only a few revisions.
> >
> > 2. Process the XML dumps: http://dumps.wikimedia.org/backup-index.html
> > If you are working in Python, I have some nice utilities for processing
> > the XML dump files. See
> > http://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump
> > This strategy is good when you want to process the entire history of a
> > wiki.
> >
> > -Aaron
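
(Interjecting here for the archive, since others may land on this thread:
strategy 1 boils down to a single action=query request with prop=revisions
and rvprop=content. A minimal sketch with the requests library follows; the
parameter names come from the help page Aaron links above, but the response
handling is from memory, so verify it against the live API.)

    # Sketch of strategy 1: fetch the current wikitext of one page via the
    # API. Parameters follow api.php?action=help&modules=query+revisions;
    # error handling and continuation are omitted.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_current_text(title):
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        # With rvprop=content, the wikitext sits under the "*" key
        return page["revisions"][0]["*"]

    print(fetch_current_text("Coffee")[:200])

For more than a handful of pages you would have to batch titles and follow
continuation, at which point the dumps (strategy 2) quickly become the
better deal.
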
> > On Sun, Jan 25, 2015 at 2:24 PM, Neta Livneh <[email protected]> wrote:
> >> Hi,
> >>
> >> I'm trying to reach the text table (for read-only purposes), but it
> >> seems that it is not available to me (it does not show up when I run
> >> SHOW TABLES).
> >>
> >> Does anybody know why I don't have access, and whether I can get it?
> >> It is crucial for my research, as I need to analyse the text.
> >>
> >> Thanks,
> >> Neta
> >>
> >> On Thu, Jan 15, 2015 at 7:36 PM, Neta Livneh <[email protected]> wrote:
> >>> Yeah, I do have access - thanks!
> >>> I already used ssh, and also used the Quarry tool for smaller quick
> >>> queries.
> >>>
> >>> Cheers,
> >>> Neta
> >>>
> >>> On Thu, Jan 15, 2015 at 7:35 PM, Neta Livneh <[email protected]> wrote:
> >>>>
> >>>> On Thu, Jan 15, 2015 at 4:42 PM, Dan Andreescu <[email protected]> wrote:
> >>>>> Sorry, old thread, but I wanted to point out that
> >>>>> http://quarry.wmflabs.org seems like a good tool for this use case.
> >>>>>
> >>>>> On Wednesday, December 24, 2014, Leila Zia <[email protected]> wrote:
> >>>>>> Hi Neta,
> >>>>>>
> >>>>>> On Wed, Dec 24, 2014 at 7:19 AM, Neta Livneh <[email protected]> wrote:
> >>>>>>> Actually, this is a great opportunity to say that I would love to
> >>>>>>> get you guys involved, or at least hear insights from the
> >>>>>>> analytics team regarding the project's direction.
> >>>>>>
> >>>>>> Feel free to keep me in the loop for the latter.
> >>>>>>
> >>>>>> Best,
> >>>>>> Leila
> >>>>>>
> >>>>>>> On Wed, Dec 24, 2014 at 4:39 PM, Aaron Halfaker <[email protected]> wrote:
> >>>>>>>> Here's the instructions that Christian gave, with some
> >>>>>>>> screenshots and discussion:
> >>>>>>>> https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_Labs
> >>>>>>>>
> >>>>>>>> If you're just looking to run a few queries, you might consider
> >>>>>>>> http://quarry.wmflabs.org, which requires no shell access -- just
> >>>>>>>> a Wikimedia sites account.
> >>>>>>>>
> >>>>>>>> -Aaron
> >>>>>>>>
> >>>>>>>> On Wed, Dec 24, 2014 at 7:22 AM, Christian Aistleitner
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>> Hi Neta,
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 24, 2014 at 11:28:33AM +0200, Neta Livneh wrote:
> >>>>>>>>> > For my project, we will need to run SQL queries on current
> >>>>>>>>> > Wikipedia data (mostly the revision history table).
> >>>>>>>>> >
> >>>>>>>>> > I already have a Gerrit account. Can I get SSH access for
> >>>>>>>>> > running such queries?
> >>>>>>>>>
> >>>>>>>>> It sounds like the redacted labs databases would nicely fit your
> >>>>>>>>> use case. The easiest way to get access there is to apply for
> >>>>>>>>> Tool Labs [1].
> >>>>>>>>>
> >>>>>>>>> To get access, please file a request through
> >>>>>>>>>
> >>>>>>>>> https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request
> >>>>>>>>>
> >>>>>>>>> (Many parts around the WMF are currently getting migrated to
> >>>>>>>>> phabricator.wikimedia.org, so if someone knows a phabricator
> >>>>>>>>> procedure for that, please chime in!)
> >>>>>>>>>
> >>>>>>>>> Once you've got Tool Labs [1] access, you can ssh to
> >>>>>>>>>
> >>>>>>>>>     tools-login.wmflabs.org
> >>>>>>>>>
> >>>>>>>>> and running
> >>>>>>>>>
> >>>>>>>>>     sql enwiki
> >>>>>>>>>
> >>>>>>>>> on that host connects you to labsdb's enwiki database, where you
> >>>>>>>>> can run your queries (similarly for other wikis).
> >>>>>>>>>
> >>>>>>>>> Have fun,
> >>>>>>>>> Christian
> >>>>>>>>>
> >>>>>>>>> [1] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs
> >>>>>>>>> has more information and links about Tool Labs.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
> >>>>>>>>> Companies' registry: 360296y in Linz
> >>>>>>>>> Christian Aistleitner
> >>>>>>>>> Kefermarkterstrasze 6a/3     Email: [email protected]
> >>>>>>>>> 4293 Gutau, Austria          Phone: +43 7946 / 20 5 81
> >>>>>>>>>                              Fax: +43 7946 / 20 5 81
> >>>>>>>>>                              Homepage: http://quelltextlich.at/
> >>>>>>>>> ---------------------------------------------------------------
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
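
P.S. For anyone who would rather script against the labs replicas than use
the interactive "sql enwiki" shell from Christian's instructions above,
something along the lines of the sketch below should work. It is untested
and assumes the usual Tool Labs conventions (a ~/replica.my.cnf credentials
file, the enwiki.labsdb host, and the enwiki_p database), so check the
Help:Tool Labs page if your setup differs. Note that, as Oliver explains
above, the revision text itself is not in these databases; this only gets
you revision metadata.

    # Untested sketch: pull revision metadata for one page from the labs
    # replica. Assumes Tool Labs conventions (~/replica.my.cnf,
    # enwiki.labsdb, enwiki_p) and the pymysql package; the page id is a
    # hypothetical placeholder.
    import os
    import pymysql

    conn = pymysql.connect(
        host="enwiki.labsdb",
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                """SELECT rev_id, rev_timestamp, rev_len
                   FROM revision
                   WHERE rev_page = %s
                   ORDER BY rev_timestamp
                   LIMIT 10""",
                (12345,),  # hypothetical page id
            )
            for rev_id, rev_timestamp, rev_len in cur.fetchall():
                print(rev_id, rev_timestamp, rev_len)
    finally:
        conn.close()
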
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
