TL;DR: I would like to access Wikipedia articles' metadata (such as edit counts, 
pageviews, etc.). I need to retrieve a large volume of records in order to train 
and maintain an online classifier, and the API does not seem sustainable for 
this. I was wondering which tool is most appropriate for the task.

Hello everyone, 

This is my first time posting to this mailing list, so I would be happy to 
receive feedback on how to better interact with the community :)
I crossposted this message to Wiki-research-l as well.

I am trying to access Wikipedia metadata in a streaming and time/resource- 
sustainable manner. By metadata I mean many of the items that can be found in 
the statistics of a wiki article, such as the number of edits, the list of 
editors, page views, etc.
I would like to do this for an online-classifier kind of setup: retrieve the 
data from a large number of wiki pages at regular intervals and use it as input 
for predictions.

I tried to use the Wikipedia API, but it is expensive in time and resources, 
both for me and for Wikipedia.
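One thing that helps with API cost, in case you haven't tried it: the MediaWiki Action API lets you batch many titles into a single request (up to 50 per call for normal accounts), which cuts the number of round trips considerably. A minimal sketch, using only the standard library; the title list and User-Agent string are just placeholders:

```python
# Sketch: batching page-metadata requests against the MediaWiki Action API.
# One call with prop=info returns basic metadata (page length, last-touched
# timestamp, etc.) for up to 50 titles at once.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_query_url(titles):
    """Build one Action API request covering many titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "info",              # basic per-page metadata
        "titles": "|".join(titles),  # up to 50 titles per request
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_metadata(titles):
    req = urllib.request.Request(
        build_query_url(titles),
        # Wikimedia's API etiquette asks for a descriptive User-Agent.
        headers={"User-Agent": "my-classifier/0.1 (contact: you@example)"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["query"]["pages"]
```

Even batched, though, polling many thousands of pages this way will still be slow, so your instinct to look at the replicas is reasonable.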

My preferred option now would be to query the relevant tables in the Wikipedia 
database directly, the same way this is done through the Quarry tool. The 
problem with Quarry is that I would like to build a standalone script, without 
having to depend on a user interface like Quarry. Do you think this is 
possible? I am still fairly new to all of this and I don't know exactly which 
direction is best.
I saw [1] that I could access the Wiki Replicas through both Toolforge and 
PAWS, but I didn't understand which one would serve me better; could I ask you 
for some feedback?
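For a standalone script, Toolforge is usually the better fit of the two, since you can run it as a scheduled job rather than inside a notebook. A sketch of what a replica query could look like from a Toolforge tool, assuming the documented host-naming scheme (<wiki>.analytics.db.svc.wikimedia.cloud), the standard MediaWiki table layout, and the ~/replica.my.cnf credentials file that Toolforge provides; the title is an example:

```python
# Sketch: counting the edits of one article via the Wiki Replicas.
# The revision table is joined to page to resolve the title; namespace 0
# is the main (article) namespace. Titles use underscores, not spaces.
EDIT_COUNT_SQL = """
    SELECT COUNT(*) AS edits
    FROM revision
    JOIN page ON rev_page = page_id
    WHERE page_namespace = 0
      AND page_title = %s
"""

def fetch_edit_count(title):
    # Imported inside the function so the sketch can be read and tested
    # off-Toolforge; pymysql is available on Toolforge itself.
    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",  # Toolforge-provided credentials
    )
    try:
        with conn.cursor() as cur:
            cur.execute(EDIT_COUNT_SQL, (title,))  # e.g. "Alan_Turing"
            (edits,) = cur.fetchone()
            return edits
    finally:
        conn.close()
```

Note that the replicas carry the wiki's own tables (revisions, page info, links); pageview counts live in the analytics data and dumps, not in these tables.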

Also, as far as I understand [2], accessing the database directly through Hive 
is more technical than what I need, right? Especially since it seems I would 
need an account with production shell access, and I honestly don't think I 
would be granted that. Besides, I am not interested in accessing sensitive or 
private data.

The last resort is parsing the analytics dumps, but this seems a less organic 
way of retrieving and cleaning the data. It would also be heavily 
decentralised and tied to my physical machine, unless I upload the cleaned 
data somewhere every time.
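For what it's worth, the hourly pageviews dumps are quite mechanical to parse, so this path may be less painful than it looks. A sketch, assuming the documented four-column line format of the files under dumps.wikimedia.org/other/pageviews/ (domain code, page title, view count, response bytes); the sample domain filter "en" is just an example:

```python
# Sketch: parsing one hourly pageviews dump file (gzipped, one line per
# page per domain: "domain_code page_title view_count response_bytes").
import gzip

def parse_pageviews(lines, domain="en"):
    """Yield (title, views) pairs for one domain from dump lines."""
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines
        domain_code, title, views, _bytes = parts
        if domain_code == domain:
            yield title, int(views)

def parse_pageviews_file(path, domain="en"):
    # Dump files are gzip-compressed; decode as UTF-8, tolerating noise.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        yield from parse_pageviews(f, domain)
```

The decentralisation concern could then reduce to where you store the parsed output, which is a storage question rather than a data-access one.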

Sorry for the long message, but I thought it was better to give you a clear 
picture (hoping this is clear enough). Any hints would be highly appreciated.

Best,
Cristina
_______________________________________________
Analytics mailing list -- [email protected]
To unsubscribe send an email to [email protected]
