The basic parse tree provided by mwlib gives each node a children property, which is a list of that node's child nodes. This is what gives the tree its structure, and it ultimately lets you do whatever you like, albeit at a low level of processing. Each node also has a find method, which allows calls such as tree.find(mwlib.parser.Paragraph). (By the way, I assume you're using uparser to get a parse tree back.)
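
As a rough sketch of that kind of low-level walk: the classes below are hypothetical stand-ins rather than mwlib's own, so that the example runs without mwlib installed. The only thing assumed about real nodes is the children list described above, and walk and outline are names I made up for illustration.

```python
# Hypothetical stand-in node classes (NOT mwlib's); the only feature
# borrowed from mwlib is that every node exposes a `children` list.
class Node(object):
    def __init__(self, *children):
        self.children = list(children)

class Article(Node): pass
class Section(Node): pass
class Paragraph(Node): pass

def walk(node, depth=0):
    """Yield (depth, node) for node and every descendant, depth-first."""
    yield depth, node
    for child in node.children:
        for item in walk(child, depth + 1):
            yield item

def outline(root):
    """Return one indented line per node, showing the node's type name."""
    return ["  " * depth + type(node).__name__ for depth, node in walk(root)]

tree = Article(Section(Paragraph(), Paragraph()), Section(Paragraph()))
print("\n".join(outline(tree)))
```

Checking type(node) or isinstance(node, SomeClass) inside the loop is how you would determine the node type as you go; with a real parse tree you would test against the classes in mwlib.parser instead of these stand-ins.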

Applying the transformations in mwlib.advtree enables navigation and modification of the tree in a more declarative manner, e.g. being able to move between siblings and parents. It also allows you to more easily inspect some attributes such as HTML class, and to extract all display text (node.getAllDisplayText(); unfortunately, mwlib stores the display text in different fields depending on the node type).
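
To illustrate what moving between siblings and parents means on a tree that only stores children, here is a minimal sketch. It is not mwlib.advtree's implementation or API; the Node class, the link function and the parent, previous and next attribute names are all assumptions made for the example.

```python
# Hypothetical stand-in node class (NOT mwlib's); attribute names are
# illustrative assumptions, not taken from mwlib.advtree.
class Node(object):
    def __init__(self, *children):
        self.children = list(children)
        self.parent = None
        self.previous = None
        self.next = None

def link(node, parent=None):
    """Annotate node and all descendants with parent and sibling links."""
    node.parent = parent
    prev = None
    for child in node.children:
        child.previous = prev
        child.next = None
        if prev is not None:
            prev.next = child
        link(child, node)  # recurse so grandchildren are linked too
        prev = child
    return node

first, second = Node(), Node()
root = link(Node(first, second))
```

Once the links exist, you can start from any node and step sideways or upwards without re-walking the tree from the root.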

mwlib.treecleaner builds on this with a number of passes that clean the tree, removing some nodes that wouldn't generally be useful in a printed edition. It is engineered towards printing, not machine processing (which is what I imagine you want it for), so it doesn't do things like remove references.

However, these last two post-parsing steps might do work you don't need for your purpose, and might do it more slowly than is worthwhile if, say, you're processing the whole of English Wikipedia. So it might be better to modify these tools, or devise your own walk through the tree to extract text from only the nodes you want.
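
A custom walk of that kind might look roughly like the following. Again the node classes are hypothetical stand-ins, not mwlib's, and the text attribute is an assumption: real mwlib nodes keep their display text in different fields depending on the node type, so you would need to adapt how the text is read.

```python
# Hypothetical stand-in node classes (NOT mwlib's). The `text` attribute
# is an assumption; real nodes store display text in type-specific fields.
class Node(object):
    def __init__(self, *children, **kw):
        self.children = list(children)
        self.text = kw.get("text", "")

class Section(Node): pass
class Paragraph(Node): pass
class Reference(Node): pass
class Text(Node): pass

def collect_text(node, skip=(Reference,)):
    """Join the text of node and its descendants, pruning skipped subtrees."""
    if isinstance(node, skip):
        return ""
    parts = [node.text]
    for child in node.children:
        parts.append(collect_text(child, skip))
    return "".join(parts)

def paragraphs(root, skip=(Reference,)):
    """Return the text of each Paragraph under root, in document order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, skip):
            continue
        if isinstance(node, Paragraph):
            found.append(collect_text(node, skip))
        else:
            # reversed() keeps left-to-right order when popping from the end
            stack.extend(reversed(node.children))
    return found

tree = Section(
    Paragraph(Text(text="Cats purr."), Reference(Text(text="[1]"))),
    Paragraph(Text(text="Dogs bark.")),
)
print(paragraphs(tree))
```

Because skipped node types are pruned before their children are visited, the walk never descends into references at all, which matters if you are processing a very large number of articles.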

All the best,

- Joel


On Mon, 06 Sep 2010 22:26:14 +1000, Nick Ruiz <[email protected]> wrote:

Hello,

This is my first time using mwlib. I was wondering if there was any
documentation on how to iterate through a parse tree. I am having
trouble finding the right sections of code to demonstrate how you can
incrementally (and recursively) iterate through the children and
determine the node type.

Ultimately, I would like to use this to extract section headers and
paragraphs from articles (without links, references, and other markup,
for the time being)

Thanks for your help,
Nick
