I want to write a tool that processes the entire English Wikipedia dump in order to generate a packaged Wikipedia in a new format (specifically .epub, for ebook readers). Currently, the mwlib tools appear to be oriented around using APIs to fetch small segments of Wikipedia directly, not around processing entire dumps.
It looks like my best option is to devise my own DB format (for example, extracting the articles from the XML dump and writing them to a BDB database), write a tool that converts a dump into this format, and write a WikiDBBase subclass that implements that DB format. I could then iterate over all the articles I want to include, parse and render them, and write them out in my eventual format. Am I correct in this, or is there an easier way?
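For the first step, here is a rough sketch of what I have in mind for the dump-conversion tool. I'm using Python's stdlib `dbm` module in place of BDB just to keep the example self-contained, and the element names (`page`, `title`, `text`) are taken from the MediaWiki XML export schema; this is only an illustration of the approach, not something I've run against a full dump:

```python
import dbm
import xml.etree.ElementTree as ET

def localname(tag):
    # Strip the XML namespace: "{http://...}page" -> "page".
    return tag.rsplit('}', 1)[-1]

def dump_to_db(xml_source, db_path):
    """Stream a MediaWiki XML dump and store title -> wikitext in a dbm file.

    iterparse plus elem.clear() keeps memory bounded, which matters for a
    multi-gigabyte dump.
    """
    db = dbm.open(db_path, 'n')
    title = None
    for event, elem in ET.iterparse(xml_source, events=('end',)):
        name = localname(elem.tag)
        if name == 'title':
            title = elem.text
        elif name == 'text':
            # <text> sits inside <revision>; the page's <title> has
            # already been seen by the time we get here.
            if title is not None and elem.text is not None:
                db[title.encode('utf-8')] = elem.text.encode('utf-8')
        elif name == 'page':
            title = None
            elem.clear()  # free the subtree for this finished page
    db.close()
```

A WikiDBBase subclass would then just look articles up in that dbm file by title.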
