Hmmm... it's been a while since I've used custom extensions of mwlib to do some of the things you're talking about below, so I'll be a bit vague.

* mw-buildcdb should build a CDB from a MediaWiki dump. It simply stores a mapping from title to MediaWiki markup. I don't know how it handles full revision histories.

* In the case of redirects, they are indicated in Wikipedia pages by markup like #REDIRECT [[target]]. I'm not sure how they're officially handled in mwlib nowadays, but I'm sure they are... there's certainly stuff related to recognising them in nshandling and dumpparser, though strangely neither actually accounts for multilingual variation (which nshandling should).

* As far as I can gather, nuwiki was designed as a way to refactor and add features to the generic wiki object where each implementation (CDB, Zip, etc.) had previously done it itself. nuwiki.adapt does this transformation.
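As a rough sketch of recognising that redirect markup (the regex and its leniency toward whitespace and case are my own assumptions, not mwlib's actual implementation, and it shares the English-only limitation mentioned above):

```python
import re

# Matches e.g. "#REDIRECT [[Target page]]" at the start of the page text.
# Only handles the English keyword; localised redirect keywords would need
# the per-language data that nshandling should (but doesn't) account for.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)

def get_redirect_target(rawtext):
    """Return the redirect target title, or None if the page is not a redirect."""
    m = REDIRECT_RE.match(rawtext)
    return m.group(1).strip() if m else None
```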

~J


On Tue, 21 Sep 2010 13:02:21 +1000, Peter W <[email protected]> wrote:

Hi again Joel,

Your answer was very helpful in that it revealed that I was quite far
out of my depth. You're correct, all templates are expanding to nil.

Besides DictDB, I could find three other classes defined that seem to
have similar characteristics: DummyDB (which is even less implemented),
cdbwiki.WikiDB, and nuwiki. If I try to create either of the latter two,
I need a cdb file, which is a constant database. Here's where I'm
confused. What does this database hold?

My inference is that it holds a collection of objects that define an
abstract representation of various templates. Each of these objects is
associated with a language (e.g. "en") and a namespace (for example,
"Template" or "Users", since you suggest that some templates are
defined in the User namespace). If this is right, how do I get a hold
of such a set of objects from Wikipedia and construct a cdb to hold
it? Is there a mwlib method for this, or does it need to be done
somewhat manually?

Then, in usage, when the parser finds a template it needs to expand,
it passes it to the normalize_and_get_page method of the database.
This returns a page object, whose 'rawtext' has been retrieved from
the database and whose 'names' is a list of all the places it's been
redirected to. I'm not certain how the redirects work, but presumably
some of the objects in the databases are just aliases of others. Is my
understanding even remotely on-target?

I hope that makes sense and I'm not completely wrong.

Thanks so much,

Peter


On Sep 19, 7:36 pm, "Joel Nothman" <[email protected]>
wrote:
Right, so your issue is actually that *all* templates are expanding to nil, right?

DictDB describes itself as "for testing" for a reason: it's not necessarily a sufficient implementation to do what you want.

When you are expanding templates (using mwlib.expander.Expander or whatever), the DB it has access to needs to be able to provide it with the raw mediawiki markup of each template through the function normalize_and_get_page. This method takes arguments (title, defaultns). Title is the title as indicated in the Wikipedia page (e.g. "Main article" or "main article"); for template expansion, defaultns=NS_TEMPLATE (i.e. 10).
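To make that concrete, here's a minimal toy database exposing normalize_and_get_page. Everything beyond the method name and the (title, defaultns) signature described above — the page class, the normalisation rules, the storage — is my own simplified assumption, not mwlib's actual code:

```python
NS_TEMPLATE = 10  # default namespace number for template expansion

class Page:
    def __init__(self, rawtext):
        self.rawtext = rawtext

class SimpleTemplateDB:
    """Toy stand-in for a wiki DB; maps (ns, canonical title) -> wikitext."""

    def __init__(self, pages):
        # pages: dict of (namespace_number, canonical_title) -> raw markup
        self.pages = pages

    def normalize_and_get_page(self, title, defaultns):
        # Very rough normalisation: underscores to spaces, first letter
        # uppercased. Real MediaWiki title rules are more involved.
        title = title.replace("_", " ").strip()
        title = title[:1].upper() + title[1:]
        rawtext = self.pages.get((defaultns, title))
        return Page(rawtext) if rawtext is not None else None

db = SimpleTemplateDB({(NS_TEMPLATE, "Main article"): "''Main article: {{{1}}}''"})
page = db.normalize_and_get_page("main_article", defaultns=NS_TEMPLATE)
```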

Note that defaultns only notes the default namespace. You can have templates in other namespaces, most commonly user space, or main space for sub-pages, so you might see something like {{User:Foo/Bar/Template}}, which has to be expanded to an object outside of the template namespace.

So you need to be able to recognise prefixes to determine the correct namespace, normalise the title, and then look up the appropriate namespace and title in your database (resolving redirects as necessary).

To recognise prefixes, you should use an nshandler object:

nsnumber, title, fully_qualified_title = nshandler.splitname(title, defaultns=defaultns)
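As a rough illustration of the kind of splitting involved — this is a simplified stand-in written from the description above, not mwlib's nshandler; the prefix table and normalisation are assumptions (the real nshandler builds its tables from siteinfo, including localised prefixes and aliases):

```python
NS_MAIN, NS_USER, NS_TEMPLATE = 0, 2, 10

# Assumed English-only prefix table; real siteinfo has many more entries.
PREFIXES = {"user": NS_USER, "template": NS_TEMPLATE}
NS_NAMES = {NS_USER: "User", NS_TEMPLATE: "Template"}

def splitname(title, defaultns=NS_TEMPLATE):
    """Return (nsnumber, title, fully_qualified_title)."""
    if ":" in title:
        prefix, rest = title.split(":", 1)
        ns = PREFIXES.get(prefix.strip().lower())
        if ns is not None:
            rest = rest.strip()
            rest = rest[:1].upper() + rest[1:]
            return ns, rest, "%s:%s" % (NS_NAMES[ns], rest)
    # No recognised prefix: fall back to the default namespace.
    title = title.strip()
    title = title[:1].upper() + title[1:]
    if defaultns == NS_MAIN:
        return defaultns, title, title
    return defaultns, title, "%s:%s" % (NS_NAMES[defaultns], title)

splitname("User:Foo/Bar/Template")  # -> (2, "Foo/Bar/Template", "User:Foo/Bar/Template")
splitname("main article")           # -> (10, "Main article", "Template:Main article")
```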

I hope that helps...

~J

On Mon, 20 Sep 2010 06:13:00 +1000, Peter W <[email protected]> wrote:
> Hi Joel,

> Thanks for the quick reply; I had been using the latest release, so I
> deleted that and installed following the instructions here
> (http://code.pediapress.com/wiki/wiki/mwlib-install) to use the git
> repository. To make sure it worked, I checked that mwlib/templ/magics.py
> had the changes indicated here

> http://code.pediapress.com/git/mwlib/?p=mwlib;a=blobdiff;f=mwlib/temp...

> (which it does).

> I re-ran the operation, but the result was exactly the same as
> before.

> Do I need to instruct the expander to be aware of the templates that
> are specific to that namespace (i.e. to be aware of all the templates
> defined in each namespace's equivalent of
> http://en.wikipedia.org/wiki/Category:Wikipedia_template_categories)?
> Is mwlib designed to be able to access that massive list of
> templates?

> I'm not 100% sure that I'm asking the right question, but the symptoms
> are that:

> 1. [links] become <a>elements</a> (this is good/what I'd expect)
> 2a. <ref>references</ref> become super/subscripts. (this is good/what
> I'd expect)
> 2b. However, the references section doesn't have any of the
> corresponding references (this is bad/not what I'd expect)
> 3. Wikitext like {{Main article|Article A|Article B|Article C|Article
> D}} doesn't have any representation in the generated xhtml (this is
> bad/not what I'd expect)

> If it'd be helpful, I can post the code + wikiText that I'm testing
> against.

> Thanks again,

> Peter

> On Sep 19, 6:00 am, "Joel Nothman" <[email protected]>
> wrote:
>> Are you using the Git repository HEAD, or the latest release?

>> Some recent changes fixed some of the "magic templates", including ones
>> that would cause namespace sensitivity.

>> In particular, I committed the patches after code like this (appearing
>> in Wikipedia's {{asbox}}) didn't work:

>> {{#switch:{{FULLPAGENAME:{{{name|}}}}} ...

>> So try getting the Git HEAD first...

>> ~J

>> On Sun, 19 Sep 2010 12:01:49 +1000, Peter W <[email protected]> wrote:
>> > Hi there,

>> > === Background ===

>> > I'm doing some research involving parsing many revisions of some
>> > articles in wikipedia across namespaces (i.e. parsing the English
>> > version and the German version and even the be-x-old version). I have
>> > all of the revisions stored initially in xml dumps from Wikipedia, but
>> > I've already parsed those dumps into a Django database.

>> > === Actual Question ===
>> > I set up mwlib to get an XHTML version of the raw text of the revision
>> > I send it; I mocked up one of the XHTML tests to do this. I don't have
>> > MediaWiki installed, so I sub-classed DictDB and added methods to
>> > "getURL". I also downloaded all of the siteinfos using
>> > mwlib.siteinfo.fetch_siteinfo and made the DictDB "get" the
>> > appropriate one.

>> > All of that done, I can't figure out how to make mwlib aware of
>> > namespace-specific templates: for example, {{too long}}
>> > {{neutrality}}
>> > {{Infobox Military Conflict [...] }}

>> > etc. When I run the parser currently, the xhtml simply deletes all of
>> > those templates.

>> > Is there a way to get mwlib to parse those templates into xhtml?
>> > If so, what do I need to do?

>> > Thanks so much for the help,

>> > Peter



--
You received this message because you are subscribed to the Google Groups 
"mwlib" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/mwlib?hl=en.