Hmmm... it's been a while since I've used custom extensions of mwlib to do some of the things you're talking about below, so I'll be a bit vague.

* mw-buildcdb should build a CDB from a MediaWiki dump. It simply stores a mapping from title to MediaWiki markup. I don't know how it handles full revision histories.

* In the case of redirects, they are indicated in Wikipedia pages by markup like #REDIRECT [[target]]. I'm not sure how they're officially handled in mwlib nowadays, but I'm sure they are... there's certainly stuff related to recognising them in nshandling and dumpparser, though strangely neither actually accounts for multilingual variation (which nshandling should).

* As far as I can gather, nuwiki was designed as a way to refactor and add features to the generic wiki object where each implementation (CDB, Zip, etc.) had previously done it itself. nuwiki.adapt does this transformation.
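As a rough sketch of recognising that redirect markup (the regex and its leniency toward whitespace and case are my own assumptions, not mwlib's actual implementation, and it shares the English-only limitation mentioned above):

```python
import re

# Matches e.g. "#REDIRECT [[Target page]]" at the start of the page text.
# Only handles the English keyword; localised redirect keywords would need
# the per-language data that nshandling should (but doesn't) account for.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)

def get_redirect_target(rawtext):
    """Return the redirect target title, or None if the page is not a redirect."""
    m = REDIRECT_RE.match(rawtext)
    return m.group(1).strip() if m else None
```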

~J


On Tue, 21 Sep 2010 13:02:21 +1000, Peter W <[email protected]> wrote:

Hi again Joel,

Your answer was very helpful in that it revealed that I was quite far
out of my depth. You're correct, all templates are expanding to nil.

Besides DictDB, I could find three other classes defined that seem to
have similar characteristics: DummyDB (which is even less implemented),
cdbwiki.WikiDB, and nuwiki. If I try to create either of the latter two,
I need a cdb file, which is a constant database. Here's where I'm
confused. What does this database hold?

My inference is that it holds a collection of objects that define an
abstract representation of various templates. Each of these objects is
associated with a language (e.g. "en") and a namespace (for example,
"Template" or "Users", since you suggest that some templates are
defined in the User namespace). If this is right, how do I get a hold
of such a set of objects from Wikipedia and construct a cdb to hold
it? Is there a mwlib method for this, or does it need to be done
somewhat manually?

Then, in usage, when the parser finds a template it needs to expand,
it passes it to the normalize_and_get_page method of the database.
This returns a page object, whose 'rawtext' has been retrieved from
the database and whose 'names' is a list of all the places it's been
redirected to. I'm not certain how the redirects work, but presumably
some of the objects in the databases are just aliases of others. Is my
understanding even remotely on-target?

I hope that makes sense and I'm not completely wrong.

Thanks so much,

Peter


On Sep 19, 7:36 pm, "Joel Nothman" <[email protected]>
wrote:
Right, so your issue is actually that *all* templates are expanding to nil, right?

DictDB describes itself as "for testing" for a reason: it's not necessarily a sufficient implementation to do what you want.

When you are expanding templates (using mwlib.expander.Expander or whatever), the DB it has access to needs to be able to provide it with the raw mediawiki markup of each template through the function normalize_and_get_page. This method takes arguments (title, defaultns). Title is the title as indicated in the Wikipedia page (e.g. "Main article" or "main article"); for template expansion, defaultns=NS_TEMPLATE (i.e. 10).
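To make that concrete, here's a minimal toy database exposing normalize_and_get_page. Everything beyond the method name and the (title, defaultns) signature described above — the page class, the normalisation rules, the storage — is my own simplified assumption, not mwlib's actual code:

```python
NS_TEMPLATE = 10  # default namespace number for template expansion

class Page:
    def __init__(self, rawtext):
        self.rawtext = rawtext

class SimpleTemplateDB:
    """Toy stand-in for a wiki DB; maps (ns, canonical title) -> wikitext."""

    def __init__(self, pages):
        # pages: dict of (namespace_number, canonical_title) -> raw markup
        self.pages = pages

    def normalize_and_get_page(self, title, defaultns):
        # Very rough normalisation: underscores to spaces, first letter
        # uppercased. Real MediaWiki title rules are more involved.
        title = title.replace("_", " ").strip()
        title = title[:1].upper() + title[1:]
        rawtext = self.pages.get((defaultns, title))
        return Page(rawtext) if rawtext is not None else None

db = SimpleTemplateDB({(NS_TEMPLATE, "Main article"): "''Main article: {{{1}}}''"})
page = db.normalize_and_get_page("main_article", defaultns=NS_TEMPLATE)
```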

Note that defaultns only notes the default namespace. You can have templates in other namespaces, most commonly user space, or main space for sub-pages, so you might see something like {{User:Foo/Bar/Template}}, which has to be expanded to an object outside of the template namespace.

So you need to be able to recognise prefixes to determine the correct namespace, normalise the title, and then look up the appropriate namespace and title in your database (resolving redirects as necessary).

To recognise prefixes, you should use an nshandler object:

nsnumber, title, fully_qualified_title = nshandler.splitname(title, defaultns=defaultns)
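As a rough illustration of the kind of splitting involved — this is a simplified stand-in written from the description above, not mwlib's nshandler; the prefix table and normalisation are assumptions (the real nshandler builds its tables from siteinfo, including localised prefixes and aliases):

```python
NS_MAIN, NS_USER, NS_TEMPLATE = 0, 2, 10

# Assumed English-only prefix table; real siteinfo has many more entries.
PREFIXES = {"user": NS_USER, "template": NS_TEMPLATE}
NS_NAMES = {NS_USER: "User", NS_TEMPLATE: "Template"}

def splitname(title, defaultns=NS_TEMPLATE):
    """Return (nsnumber, title, fully_qualified_title)."""
    if ":" in title:
        prefix, rest = title.split(":", 1)
        ns = PREFIXES.get(prefix.strip().lower())
        if ns is not None:
            rest = rest.strip()
            rest = rest[:1].upper() + rest[1:]
            return ns, rest, "%s:%s" % (NS_NAMES[ns], rest)
    # No recognised prefix: fall back to the default namespace.
    title = title.strip()
    title = title[:1].upper() + title[1:]
    if defaultns == NS_MAIN:
        return defaultns, title, title
    return defaultns, title, "%s:%s" % (NS_NAMES[defaultns], title)

splitname("User:Foo/Bar/Template")  # -> (2, "Foo/Bar/Template", "User:Foo/Bar/Template")
splitname("main article")           # -> (10, "Main article", "Template:Main article")
```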

I hope that helps...

~J

On Mon, 20 Sep 2010 06:13:00 +1000, Peter W <[email protected]> wrote:
> Hi Joel,

> Thanks for the quick reply; I had been using the latest release, so I
> deleted that and installed following the instructions here
> (http://code.pediapress.com/wiki/wiki/mwlib-install) to use the git
> repository. To make sure it worked, I checked that mwlib/templ/magics.py
> had the changes indicated here

> http://code.pediapress.com/git/mwlib/?p=mwlib;a=blobdiff;f=mwlib/temp...

> (which it does).

> I re-ran the operation, but the result was exactly the same as
> before.

> Do I need to instruct the expander to be aware of the templates that
> are specific to that namespace (i.e. to be aware of all the templates
> defined in each namespace's equivalent of
> http://en.wikipedia.org/wiki/Category:Wikipedia_template_categories)?
> Is mwlib designed to be able to access that massive list of
> templates?

> I'm not 100% sure that I'm asking the right question, but the symptoms
> are that:

> 1. [links] become <a>elements</a> (this is good/what I'd expect)
> 2a. <ref>references</ref> become super/subscripts. (this is good/what
> I'd expect)
> 2b. However, the references section doesn't have any of the
> corresponding references (this is bad/not what I'd expect)
> 3. Wikitext like {{Main article|Article A|Article B|Article C|Article
> D}} doesn't have any representation in the generated xhtml (this is
> bad/not what I'd expect)

> If it'd be helpful, I can post the code + wikiText that I'm testing
> against.

> Thanks again,

> Peter

> On Sep 19, 6:00 am, "Joel Nothman" <[email protected]>
> wrote:
>> Are you using the Git repository HEAD, or the latest release?

>> Some recent changes fixed some of the "magic templates", including ones
>> that would cause namespace sensitivity.

>> In particular, I committed the patches after code like this (appearing
>> in Wikipedia's {{asbox}}) didn't work:

>> {{#switch:{{FULLPAGENAME:{{{name|}}}}} ...

>> So try getting the Git HEAD first...

>> ~J

>> On Sun, 19 Sep 2010 12:01:49 +1000, Peter W <[email protected]> wrote:
>> > Hi there,

>> > === Background ===

>> > I'm doing some research involving parsing many revisions of some
>> > articles in wikipedia across namespaces (i.e. parsing the English
>> > version and the German version and even the be-x-old version). I have
>> > all of the revisions stored initially in xml dumps from Wikipedia, but
>> > I've already parsed those dumps into a Django database.

>> > === Actual Question ===
>> > I set up mwlib to get an XHTML version of the raw text of the revision
>> > I send it; I mocked up one of the XHTML tests to do this. I don't have
>> > MediaWiki installed, so I sub-classed DictDB and added methods to
>> > "getURL". I also downloaded all of the siteinfos using
>> > mwlib.siteinfo.fetch_siteinfo and made the DictDB "get" the
>> > appropriate one.

>> > All of that done, I can't figure out how to make mwlib aware of
>> > namespace-specific templates: for example, {{too long}}
>> > {{neutrality}}
>> > {{Infobox Military Conflict [...] }}

>> > etc. When I run the parser currently, the xhtml simply deletes all of
>> > those templates.

>> > Is there a way to get mwlib to parse those templates into xhtml?
>> > If so, what do I need to do?

>> > Thanks so much for the help,

>> > Peter



--
You received this message because you are subscribed to the Google Groups 
"mwlib" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/mwlib?hl=en.