I noticed a bit of a shortfall in documentation, so I figured I
would make some notes while I worked it out for myself.
Hope this will be useful for any newcomers who want to use mwlib
inside their own scripts.
        cheers, Shane Norris

PS if anyone has more notes to add, please share

Programming with mwlib 101.
-------------------------------------------
[the following assumes you have successfully installed mwlib in your
environment already]

1. Unless you already have a local mediawiki set up (I'm still sore
from trying to get en:wikipedia imported onto a local install) you're
going to want to convert your xml dump files into a CDB database.
From the command line:

        mw-buildcdb --input=/xml/dump/file.bz2 --output=/my/output/directory

this gives you three files in the specified directory:
wikiidx.cdb - the articles index file.
wikidata.bin - the wikitext for the articles.
wikiconf.txt - config file for mwlib (you will need to update this if
you move the files later).
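
A quick sanity check before moving on can save some head scratching. The
sketch below uses only the standard library and assumes the three file
names listed above; missing_cdb_files is a made-up helper name, not
something mwlib provides.

```python
import os

# File names as listed above (assumption: your mwlib version writes the same ones).
EXPECTED_FILES = ("wikiidx.cdb", "wikidata.bin", "wikiconf.txt")


def missing_cdb_files(output_dir):
    """Return the expected mw-buildcdb output files that are absent from output_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(output_dir, name))]
```

Calling missing_cdb_files('/my/output/directory') should give an empty
list when all three files are in place.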

2. Next, inside your python code you're going to want access to the
articles database:

        from mwlib import wiki
        env = wiki.makewiki('/location/of/wikiconf.txt')  # wiki Environment object

3. now to access the parse tree of an article:

        a = env.wiki.getParseArticle('name-of-article')

The result is a parse tree of nodes (starting with an Article node in
this case).
[see mwlib/parser/nodes.py for the different possible node types along
with their attributes]
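
If you just want a rough picture of what came back, something like the
following might help. It is only a sketch: outline is a hypothetical
helper of my own naming, and it relies on nothing more than nodes being
iterable over their immediate children.

```python
def outline(node, depth=0, max_depth=2):
    """Return indented class names for node and its descendants, down to max_depth."""
    lines = ["  " * depth + type(node).__name__]
    if depth < max_depth:
        # Assumption: iterating a node yields its immediate children.
        for child in node:
            lines.extend(outline(child, depth + 1, max_depth))
    return lines
```

Joining the result with newlines, e.g. '\n'.join(outline(a)), gives a
small indented listing of the top of the tree; raise max_depth to see
further down.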

4. To walk the tree, iterate over a node: each node yields its immediate children.

        from mwlib.parser import nodes

        for section in a:
                if section.__class__ == nodes.Section:
                        print section.firstchild.asText()  # first child is the section's caption
                        # ... do stuff for this section

or you can access all its descendants in document order with
node.allchildren():

        for child in a.allchildren():
                if child.__class__ == nodes.Text:
                        pass # ... do stuff
                elif child.__class__ == nodes.Item:
                        pass # ... do other stuff

or if you prefer there is a node.filter(..expr..) method.

        parts = []
        for t in a.filter(lambda x: x.__class__ == nodes.Text
                          or isinstance(x, nodes.Link)):
            if isinstance(t, nodes.Link):
                if t.firstchild is None:
                    # link doesn't have any replacement text so use target
                    parts.append(t.target)
                # otherwise don't print, the replacement text will show up
                # next iteration
            else:
                parts.append(t.asText())
        # still has some cruft such as inline references and language link
        # content that needs removing before it's useful for NLP, but you
        # get the picture
        print ''.join(parts)
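
To see which node types actually turn up in an article, a tally by class
name can be handy. This is a minimal sketch, assuming only that the root
node has the allchildren() method mentioned above; count_node_types is a
hypothetical helper, not part of mwlib.

```python
from collections import Counter


def count_node_types(root):
    """Tally the nodes yielded by root.allchildren() by class name."""
    # Assumption: allchildren() yields every descendant once, in document order.
    return Counter(type(node).__name__ for node in root.allchildren())
```

With the article from step 3, count_node_types(a).most_common(10) would
list the ten most frequent node types.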

from there you just need to figure out which nodes you are interested in
and work your magic, good luck!
