Hello there,

For research purposes, I would like to retrieve information such as article text 
and full revision histories (revision content, timestamps, usernames) for 
English Wikipedia articles under certain categories (including sub-categories), 
or perhaps for a set of randomly selected articles, but not necessarily for the 
whole English Wikipedia. 

I tried the Special:Export page 
(https://en.wikipedia.org/w/index.php?title=Special:Export&action=submit), but 
it limits the number of revisions to 1000 per request, and it generates an XML 
document as output.
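To illustrate the kind of request I have been making, here is roughly how I build the Export URL in Java. The `history` and `limit` parameters reflect my understanding of the export form's fields, so please correct me if I have them wrong:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ExportUrl {
    // Builds a Special:Export URL for one page. My understanding is that
    // "history" requests the revision history and "limit" caps it at 1000
    // revisions per request -- this is an assumption on my part.
    public static String exportUrl(String title) throws UnsupportedEncodingException {
        return "https://en.wikipedia.org/w/index.php?title=Special:Export"
            + "&pages=" + URLEncoder.encode(title, "UTF-8")
            + "&history=1&limit=1000";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exportUrl("Albert Einstein"));
    }
}
```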

I have been reading some information online but still don't have a very clear 
picture. I know there are downloadable database dumps in compressed XML format, 
and it appears the same content can also be imported into a MySQL database.

I am familiar with Java, and have some experience with MySQL, XML, PHP, and 
HTML.

My questions are:
What is the best way for me to get the information I need? Please be 
specific. 

For example, if I download the data in XML format, should I use MediaWiki (PHP) 
to extract the information from those XML documents, or is there a good Java 
XML parser for Wikipedia dumps that can produce my desired results?
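To make the XML question concrete, here is a small self-contained Java sketch of the kind of parsing I have in mind, using the JDK's built-in DOM parser on a fragment shaped like the Special:Export output (I have omitted the export XML namespace for brevity):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ExportParser {
    // Tiny inline sample shaped like Special:Export output; real dumps
    // declare the MediaWiki export namespace, which is omitted here.
    static final String SAMPLE =
        "<mediawiki><page><title>Example</title>"
        + "<revision><timestamp>2013-01-01T00:00:00Z</timestamp>"
        + "<contributor><username>Alice</username></contributor>"
        + "<text>Hello wiki</text></revision></page></mediawiki>";

    // Returns the text content of the first element with the given tag name.
    public static String firstTag(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(SAMPLE.getBytes("UTF-8")));
        System.out.println(firstTag(doc, "title") + " @ "
            + firstTag(doc, "timestamp") + " by "
            + firstTag(doc, "username"));
    }
}
```

For a full dump I assume I would need a streaming parser (SAX or StAX) rather than DOM, since the files are too large to hold in memory.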

If the whole content can be imported into a MySQL database on my local 
computer, can I write a Java program with SQL queries to get my desired results 
from the database, or is MediaWiki better suited to retrieving the results from 
the database?
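For the MySQL route, this is the sort of Java/JDBC program I was imagining. The table and column names (page, revision, text, rev_text_id, rev_user_text, etc.) are my guess from reading about the schema, so they may well not match the actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RevisionQuery {
    // Hypothetical query joining revisions to their text; assumes a schema
    // where revision text is reached via rev_text_id -> text.old_id.
    static final String QUERY =
        "SELECT p.page_title, r.rev_timestamp, r.rev_user_text, t.old_text "
        + "FROM revision r "
        + "JOIN page p ON r.rev_page = p.page_id "
        + "JOIN text t ON r.rev_text_id = t.old_id "
        + "WHERE p.page_title = ?";

    public static void main(String[] args) throws Exception {
        // Placeholder connection details; a locally imported dump is assumed.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/wikidb", "user", "password");
             PreparedStatement st = conn.prepareStatement(QUERY)) {
            st.setString(1, args[0]);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("rev_timestamp")
                        + "\t" + rs.getString("rev_user_text"));
                }
            }
        }
    }
}
```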

Thank you,
Ming


_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
