Hi Travis,
Firstly, from the sound of things you seem to have missed the need to
perform template expansion prior to parsing. MediaWiki, and mwlib, don't
try to interpret what is displayed until all templates are expanded. See
mwlib.expander.Expander, or just pass an appropriate wiki to
uparser.parseString.
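A sketch of that expand-then-parse flow (hedged: the names Expander,
expandTemplates and uparser.parseString follow this thread, but exact
signatures may differ between mwlib versions, and the Expander needs a
wikidb to resolve template sources):

```python
# Hedged sketch of expand-then-parse; guarded so it reads without mwlib.
try:
    from mwlib.expander import Expander
    from mwlib import uparser
    HAVE_MWLIB = True
except ImportError:
    HAVE_MWLIB = False  # mwlib not installed

def expand_then_parse(title, raw, wikidb=None):
    """Expand templates first, then parse the expanded wikitext."""
    if not HAVE_MWLIB:
        raise RuntimeError("mwlib is not installed")
    # Without a wikidb the Expander cannot fetch template sources,
    # so {{...}} transclusions would be left unresolved.
    expanded = Expander(raw, pagename=title, wikidb=wikidb).expandTemplates()
    return uparser.parseString(title=title, raw=expanded)
```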
Article.asText() is a very naive approach; it seems to be designed for
parser debugging. You generally want to ignore certain parts of the text.
See advtree.AdvancedNode.getAllDisplayText(). Note too that AdvancedTree
construction involves some cleaning up before that function is called.
I don't use advtree because I wanted something efficient and more tightly
designed for my needs. But I do have a similar recursive parse tree
descent that does not traverse nodes of certain types.
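As an illustration (a stand-in with toy classes, not mwlib's actual
node types), such a type-filtered descent might look like:

```python
# Illustrative stand-in, not mwlib code: a recursive descent that
# collects display text but never enters nodes of excluded types.
class Node:
    def __init__(self, kind, text="", children=None):
        self.kind = kind
        self.text = text
        self.children = children or []

NON_PRINT = {"Comment", "Ref"}  # hypothetical set of types to skip

def display_text(node, out=None):
    if out is None:
        out = []
    if node.kind in NON_PRINT:
        return out  # do not descend into skipped subtrees
    if node.text:
        out.append(node.text)
    for child in node.children:
        display_text(child, out)
    return out

tree = Node("Article", children=[
    Node("Text", "Foo Fighters is an"),
    Node("Comment", " Awards don't belong here "),
    Node("Text", " American band"),
])
```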
You say you want to identify section headings by name and extract the
text prior to that heading. Perhaps this would be better described as
traversing the article to extract all display text, but stopping upon
reaching a particular heading.
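A toy version of that traversal, over a simplified flat list of
(kind, text) nodes rather than mwlib's real tree:

```python
# Walk nodes in document order, collecting display text until a
# Section whose title matches the target heading is reached.
def text_before_heading(nodes, heading):
    out = []
    for kind, text in nodes:
        if kind == "Section" and text == heading:
            break  # everything from here on follows that heading
        if kind == "Text":
            out.append(text)
    return " ".join(out)

nodes = [
    ("Text", "Foo Fighters is an American alternative rock band."),
    ("Section", "History"),
    ("Text", "The band was formed in 1994."),
    ("Section", "External links"),
    ("Text", "* Official website"),
]
```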
However, if you actually want to find the Section heading nodes, try:
root_node.find(parser.Section)
The caption may not be stored directly on the Section node, so you
might need a function like getAllDisplayText() to get its full text.
My own extraction framework involves:
* A template-handling process which uses one Expander for multiple
  documents (clearing the template cache every 10000 docs) to:
  * parse included templates
  * extract information about template key-value pair arguments
  * expand templates
* Parsing and cleaning using the following pipeline:
  * mwlib.compat.parse_txt
  * mwlib.old_uparser.postprocessors
  * remove non-print nodes
  * remove redundant nodes
  * some link text and target normalisation (e.g. resolving redirects;
    making in-page links absolute)
  * identify sentence boundaries over paragraph text
  * tokenise all text (i.e. splitting off punctuation) and number all
    paragraph tokens
* Then I am able to extract the following information:
  * the tokenised paragraph text
  * token offset spans corresponding to each paragraph/sentence
  * outgoing links and their token offsets
  * section headings and their token offsets
  * categories
  * language links
  * etc.
* Most textual operations can then be performed using token offsets.
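To illustrate the standoff-annotation idea (a toy example; the real
framework stores many record types against one token list):

```python
# Toy standoff annotation: every extracted structure is stored as
# (start, end) token offsets into a single token list, so surface text
# is recovered by slicing rather than duplicated per annotation.
tokens = "Foo Fighters is an American alternative rock band .".split()

annotations = {
    "link:alternative rock": (5, 7),  # hypothetical extracted span
    "sentence:0": (0, 9),
}

def span_text(tokens, span):
    start, end = span
    return " ".join(tokens[start:end])
```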
- Joel
On Fri, 27 Apr 2012 00:15:01 +1000, Travis Briggs <[email protected]>
wrote:
Hi Joel,
My needs are pretty simple. The basic 'algorithm' of what I want to do is
identify section headers with their names:
if isinstance(node, Section) and node.name == "External Links":
    finish_node = node
Then, given the location in the document of a section header with a
given name, I want to take all the data in the document up to that
point as plain text. So, for:
"""
{{About|the rock band|their debut album|Foo Fighters (album)|the aerial
phenomenon|foo fighter}}
{{pp-move-indef}}
{{pp-semi|small=yes}}
{{Infobox musical artist
| name = Foo Fighters
| image = Foo Fighters 2007.jpg
| [...]}}
'''Foo Fighters''' is an<!--Awards don't belong here--> American
[[alternative rock]] band
"""
It becomes:
"Foo Fighters is an American alternative rock band"
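This particular transformation can be approximated crudely with a few
regexes (a sketch only; it will break on nested markup that a real
parser like mwlib's handles properly):

```python
import re

def strip_simple(text):
    """Crude wikitext stripper: comments, {{templates}}, '''bold''',
    and [[links]]. A sketch only; real pages need a real parser."""
    text = re.sub(r'<!--.*?-->', '', text, flags=re.S)
    prev = None
    while prev != text:  # peel templates from the innermost braces out
        prev = text
        text = re.sub(r'\{\{[^{}]*\}\}', '', text)
    text = text.replace("'''", "")
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)
    return ' '.join(text.split())

sample = """{{About|the rock band|their debut album|Foo Fighters (album)|the aerial phenomenon|foo fighter}}
{{pp-move-indef}}
{{pp-semi|small=yes}}
{{Infobox musical artist
| name = Foo Fighters
| image = Foo Fighters 2007.jpg
| [...]}}
'''Foo Fighters''' is an<!--Awards don't belong here--> American
[[alternative rock]] band"""
```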
I tried using uparser.simple_parse, but the results of Article.asText()
calls were very disappointing. mwlib.refine.compat.parse_text seems to
give much better results, but the infobox and other templates are still
stuck in the text.
And of course, my pseudocode is wrong; I still need to figure out how
to identify Sections with a certain name, and then collect the nodes
between the start of the document and that node.
All help is greatly appreciated, thanks,
-Travis
On 25 April 2012 19:43, Joel Nothman <[email protected]>
wrote:
I have been using mwlib for exactly that since 2008, but I haven't
checked if my scripts work with a more recent version of mwlib. (I
mostly use mwlib.refine.compat.parse_text.)
I and others may be able to help you with more detail if you give us
some idea of what you would like to get out of the parse. For instance,
I needed
standard structured Wikipedia features (category links, template
information, etc.) as well as tokenised sentences with outgoing links as
standoff annotations.
- Joel
On Thu, 26 Apr 2012 02:35:44 +1000, Travis Briggs <[email protected]>
wrote:
Hello,
Is there a way to get an abstract syntax tree from wikitext input
using mwlib? The documentation seems to only cover creating PDF or
some other documents.
Thanks,
-Travis
--
You received this message because you are subscribed to the Google Groups
"mwlib" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/mwlib?hl=en.