Re: [mwlib] Re: Templates from wikipedia

Joel Nothman Sun, 19 Sep 2010 16:36:42 -0700

Right, so your issue is actually that *all* templates are expanding to nil, right?

DictDB describes itself as "for testing" for a reason. It isn't necessarily a sufficient implementation for what you want.

When you are expanding templates (using mwlib.expander.Expander or whatever), the DB it has access to needs to be able to provide it with the raw MediaWiki markup of each template through the method normalize_and_get_page. This method takes the arguments (title, defaultns). title is the title as written in the Wikipedia page (e.g. "Main article" or "main article"); for template expansion, defaultns=NS_TEMPLATE (i.e. 10).

Note that defaultns only gives the default namespace. You can have templates in other namespaces, most commonly user space, or main space for sub-pages, so you might see something like {{User:Foo/Bar/Template}}, which has to be expanded to a page outside of the template namespace.

So you need to be able to recognise prefixes to determine the correct namespace, normalise the title, and then look up the appropriate namespace and title in your database (resolving redirects as necessary).


To recognise prefixes, you should use an nshandler object:

nsnumber, title, fully_qualified_title = nshandler.splitname(title, defaultns=defaultns)
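
To make the three steps concrete (split the prefix, normalise the title, look up with redirects), here is a minimal sketch of a DB that could answer normalize_and_get_page. It deliberately does not import mwlib, so it runs on its own. Everything named here is hypothetical: toy_splitname is only a stand-in with the same return shape as nshandler.splitname, the PREFIXES table is hard-coded rather than read from siteinfo, TemplateAwareDB and Page are made-up names, and the rawtext attribute is an assumption about what the expander reads from the returned page, so check it against your mwlib version. The sample pages and the redirect are invented test data.

```python
NS_MAIN, NS_USER, NS_TEMPLATE = 0, 2, 10  # standard MediaWiki namespace numbers

# Hypothetical prefix tables; a real wiki's prefixes come from its siteinfo.
PREFIXES = {"user": NS_USER, "template": NS_TEMPLATE}
NAMES = {NS_MAIN: "", NS_USER: "User", NS_TEMPLATE: "Template"}


def toy_splitname(title, defaultns=NS_MAIN):
    """Stand-in for nshandler.splitname: (nsnumber, title, full title)."""
    title = title.strip().replace("_", " ")
    ns = defaultns
    if ":" in title:
        prefix, rest = title.split(":", 1)
        prefix = prefix.strip().lower()
        if prefix in PREFIXES:  # unknown prefixes stay part of the title
            ns, title = PREFIXES[prefix], rest.strip()
    # Normalise: only the first letter is case-insensitive.
    title = title[:1].upper() + title[1:]
    full = NAMES[ns] + ":" + title if NAMES[ns] else title
    return ns, title, full


class Page(object):
    """Hypothetical page object; 'rawtext' is an assumed attribute name."""

    def __init__(self, rawtext):
        self.rawtext = rawtext


class TemplateAwareDB(object):
    def __init__(self, pages, redirects=None, splitname=toy_splitname):
        self.pages = pages                # {(nsnumber, title): raw wikitext}
        self.redirects = redirects or {}  # {(ns, title): (ns, title)}
        self.splitname = splitname        # swap in nshandler.splitname here

    def normalize_and_get_page(self, title, defaultns):
        ns, name, _full = self.splitname(title, defaultns=defaultns)
        key = (ns, name)
        seen = set()
        # Follow redirects, guarding against redirect loops.
        while key in self.redirects and key not in seen:
            seen.add(key)
            key = self.redirects[key]
        raw = self.pages.get(key)
        return None if raw is None else Page(raw)


# Invented sample data, for illustration only.
db = TemplateAwareDB(
    pages={
        (NS_TEMPLATE, "Main"): "Main article: [[{{{1}}}]]",
        (NS_USER, "Foo/Bar/Template"): "a user-space template",
    },
    redirects={(NS_TEMPLATE, "Main article"): (NS_TEMPLATE, "Main")},
)

page = db.normalize_and_get_page("main article", NS_TEMPLATE)
print(page.rawtext)
```

With a real nshandler you would pass its splitname in place of toy_splitname; the toy version ignores things like a leading colon, localised prefixes and namespace aliases, which is exactly why the real one is worth using.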


I hope that helps...

~J

On Mon, 20 Sep 2010 06:13:00 +1000, Peter W <[email protected]> wrote:

Hi Joel,

Thanks for the quick reply; I had been using the latest release, so I
deleted that and installed following the instructions here
(http://code.pediapress.com/wiki/wiki/mwlib-install) to use the git
repository. To make sure it worked, I checked that
mwlib/templ/magics.py had the changes indicated here

http://code.pediapress.com/git/mwlib/?p=mwlib;a=blobdiff;f=mwlib/templ/magics.py;h=b64c00a7d7aa5b243b9177aa4ff8123c446e6b16;hp=ca284d437eb250bbdb6a7e591fdf6cd47b831b67;hb=2e72ccdfd085a3fa69f51c1bc28a767bc25d89f3;hpb=b84fcb106b1dc6f3f55e2c6f9bde6419128e9fba

(which it does).

I re-ran the operation, but the result was exactly the same as
before.

Do I need to instruct the expander to be aware of the templates that
are specific to that namespace (i.e. to be aware of all the templates
defined in each namespace's equivalent of
http://en.wikipedia.org/wiki/Category:Wikipedia_template_categories )?
Is mwlib designed to be able to access that massive list of
templates?

I'm not 100% sure that I'm asking the right question, but the symptoms
are that:

1. [links] become <a>elements</a> (this is good/what I'd expect)
2a. <ref>references</ref> become super/subscripts. (this is good/what
I'd expect)
2b. However, the references section doesn't have any of the
corresponding references (this is bad/not what I'd expect)
3. Wikitext like {{Main article|Article A|Article B|Article C|Article D}}
doesn't have any representation in the generated xhtml (this is
bad/not what I'd expect)

If it'd be helpful, I can post the code + wikiText that I'm testing
against.

Thanks again,

Peter


On Sep 19, 6:00 am, "Joel Nothman" <[email protected]>
wrote:

Are you using the Git repository HEAD, or the latest release?

Some recent changes fixed some of the "magic templates", including ones that would cause namespace sensitivity.

In particular, I committed the patches after code like this (appearing in Wikipedia's {{asbox}}) didn't work:


{{#switch:{{FULLPAGENAME:{{{name|}}}}} ...

So try getting the Git HEAD first...

~J

On Sun, 19 Sep 2010 12:01:49 +1000, Peter W <[email protected]> wrote:

> Hi there,

> === Background ===

> I'm doing some research involving parsing many revisions of some
> articles in wikipedia across namespaces (i.e. parsing the English
> version and the German version and even the be-x-old version). I have
> all of the revisions stored initially in xml dumps from Wikipedia, but
> I've already parsed those dumps into a Django database.

> === Actual Question ===
> I set up mwlib to get an XHTML version of the raw text of the revision
> I send it; I mocked up one of the XHTML tests to do this. I don't have
> MediaWiki installed, so I sub-classed DictDB and added methods to
> "getURL". I also downloaded all of the siteinfos using
> mwlib.siteinfo.fetch_siteinfo and made the DictDB "get" the
> appropriate one.

> All of that done, I can't figure out how to make mwlib aware of
> namespace-specific templates: for example, {{too long}}
> {{neutrality}}
> {{Infobox Military Conflict [...] }}

> etc. When I run the parser currently, the xhtml simply deletes all of
> those templates.

> Is there a way to get mwlib to parse those templates into xhtml?
> If so, what do I need to do?

> Thanks so much for the help,

> Peter


