Re: [mwlib] Ignore Redirects?

Lars Jørgen Solberg Tue, 27 Mar 2012 01:17:51 -0700

Hi

This seems to work for me:


wiki_env.wiki.nshandler.redirect_matcher(wiki_env.wiki.reader[page_title])

You could also look at page.names. It contains the names of any redirects get_page had to follow, plus the article name, so for a redirect this will be true: len(page.names) > 1.
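
The two checks above could be wrapped up roughly like this. It is only a sketch: the helper names (is_redirect_by_source, is_redirect_by_names, non_redirect_titles) are made up for illustration and are not part of mwlib. It assumes that redirect_matcher returns something falsy for text that is not a redirect, and that get_page may return None for a missing page.

```python
def is_redirect_by_source(wiki, title):
    """Check the stored wikitext directly, before any redirect is followed."""
    raw = wiki.reader[title]
    # Assumption: redirect_matcher returns a falsy value for non-redirect text.
    return bool(wiki.nshandler.redirect_matcher(raw))


def is_redirect_by_names(wiki, title):
    """Check whether get_page had to follow at least one redirect."""
    page = wiki.get_page(title)
    # Assumption: a missing page comes back as None; treat it as "not a redirect".
    return page is not None and len(page.names) > 1


def non_redirect_titles(wiki):
    """Yield only the titles from wiki.articles() that are not redirects."""
    for title in wiki.articles():
        if not is_redirect_by_names(wiki, title):
            yield title
```

With the setup from the original question, that would be used as "for title in non_redirect_titles(wiki_env.wiki): ...". The source-based check avoids loading and resolving the full page, so it may be the cheaper one when walking a whole dump.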


--ljs

On 26. mars 2012 22:01, UltraNurd wrote:

I'm using mwlib in Python to iterate over a Wikipedia dump. I want to
ignore redirects and just look at page contents with the actual full
title. I've already run mw-buildcdb, and I'm loading that:

wiki_env = wiki.makewiki(wiki_conf_file)

When I loop over wiki_env.wiki.articles(), the strings appear to
contain redirect titles (I've checked this on a couple of samples
against Wikipedia). I don't see an accessor that skips these, and
wiki_env.wiki.redirects is an empty dictionary, so I can't check which
article titles are actually just redirects that way.

I've tried looking through the mwlib code, but if I use

page = wiki_env.wiki.get_page(page_title)
wiki_env.wiki.nshandler.redirect_matcher(page.rawtext)

the page.rawtext appears to already be redirected (containing the full
page content, and no indication that there is a title mismatch).
Similarly the Article node returned by getParsedArticle() does not
appear to contain the "true" title to check against.

Anyone know how to do this? Do I need to run mw-buildcdb in a way to
not store redirects? As far as I can tell that command just takes an
input dump file and an output CDB, with no other options.



--
The moth of wrath goads the rat on! The rat goes berserk!
