Hi.
What you are trying to do is probably fairly complicated, and
unfortunately the lack of documentation does not make it any easier.
I hope the following sample code will get you started:
import sys

from mwlib import wiki
from mwlib.parser import show

zip_fn = '/home/volker/test/test.zip'
wiki_obj = wiki.makewiki(zip_fn)

for item in wiki_obj.metabook.walk():
    if item.type == 'article':
        parse_tree = item.wiki.getParsedArticle(title=item.title,
                                                revision=item.revision)
        show(sys.stdout, parse_tree)
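If opening an archive fails with a KeyError like the one in your traceback, it can help to look at what the zip actually contains before handing it to mwlib. The following is only a minimal sketch using the standard zipfile module; the helper names list_zip_members and has_member are made up for this example and are not part of mwlib.

```python
import zipfile


def list_zip_members(zip_path):
    """Return the sorted member names stored in a zip archive."""
    zf = zipfile.ZipFile(zip_path)
    try:
        return sorted(zf.namelist())
    finally:
        zf.close()


def has_member(zip_path, name):
    """Tell whether the archive contains a member with exactly this name."""
    return name in list_zip_members(zip_path)
```

For example, has_member('tmp.zip', 'content.json') tells you whether the file that mwlib.zipwiki.Wiki tries to read is present at all.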
If you want to manipulate the parse tree, you should do yourself a favor
and transform it with "buildAdvancedTree" from mwlib.advtree.
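To give a rough idea of what working with the tree can look like, here is a small sketch that tallies node classes by walking the tree recursively. The function name count_node_types is hypothetical, and the code assumes only that each node exposes its children as a list in a "children" attribute. The commented lines at the end show where buildAdvancedTree would come in, assuming it modifies the tree in place; please check that against your mwlib version.

```python
def count_node_types(node, counts=None):
    """Recursively count nodes in a tree, keyed by class name.

    Assumes every node keeps its children in a list attribute named
    `children`; nodes without that attribute are treated as leaves.
    """
    if counts is None:
        counts = {}
    name = type(node).__name__
    counts[name] = counts.get(name, 0) + 1
    for child in getattr(node, 'children', []):
        count_node_types(child, counts)
    return counts


# Hypothetical usage with a parse tree obtained as in the sample above
# (assumption: buildAdvancedTree changes the tree in place):
#
#   from mwlib.advtree import buildAdvancedTree
#   buildAdvancedTree(parse_tree)
#   print(count_node_types(parse_tree))
```

A plain dict is used instead of collections.Counter so that the helper also runs on Python 2.6.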
I hope this helps,
Volker
On 05/08/2011 09:23 AM, Matthew Honnibal wrote:
Hi,
What's the recommended way to get parse-trees in Python for a list of
articles?
I'm having trouble using mw-zip and mwlib.zipwiki. I'm trying to do
mwlib.zipwiki.Wiki('tmp.zip') with a zip created on the commandline
using "mw-zip -m -c :en -o tmp.zip Sun". However, mwlib.zipwiki.Wiki
expects a content.json file in tmp.zip, which mw-zip hasn't created:
$ mw-zip -c :en -o tmp.zip Sun Moon Stars
creating nuwiki in u'tmpuMGaLk/nuwiki'
2011-05-08T17:13:23 mwlib.utils.info>> fetching 'http://en.wikipedia.org/w/index.php?title=Help:Books/License&action=raw&templates=expand'
256/256 80.65 48.39s
>>> import mwlib.zipwiki
>>> w = mwlib.zipwiki.Wiki('tmp.zip')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/mwlib-0.12.14-py2.6-linux-x86_64.egg/mwlib/zipwiki.py", line 48, in __init__
    content = json.loads(unicode(self.zf.read('content.json'), 'utf-8'))
  File "/usr/lib/python2.6/zipfile.py", line 834, in read
    return self.open(name, "r", pwd).read()
  File "/usr/lib/python2.6/zipfile.py", line 857, in open
    zinfo = self.getinfo(name)
  File "/usr/lib/python2.6/zipfile.py", line 824, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'content.json' in the archive"
$ ls -la
total 20260
drwxr-xr-x 3 matt matt 4096 2011-05-08 17:16 .
drwxr-xr-x 3 matt matt 4096 2011-05-08 17:16 ..
-rw-r--r-- 1 matt matt 4926482 2011-05-08 17:14 edits.json
-rw-r--r-- 1 matt matt 1276 2011-05-08 17:14 excluded.json
-rw-r--r-- 1 matt matt 53979 2011-05-08 17:14 imageinfo.json
drwxr-xr-x 2 matt matt 4096 2011-05-08 17:16 images
-rw-r--r-- 1 matt matt 25543 2011-05-08 17:14 licenses.json
-rw-r--r-- 1 matt matt 919 2011-05-08 17:13 metabook.json
-rw-r--r-- 1 matt matt 149 2011-05-08 17:14 nfo.json
-rw-r--r-- 1 matt matt 1632927 2011-05-08 17:14 parsed_html.json
-rw-r--r-- 1 matt matt 452 2011-05-08 17:14 redirects.json
-rw-r--r-- 1 matt matt 1492682 2011-05-08 17:14 revisions-1.txt
-rw-r--r-- 1 matt matt 126884 2011-05-08 17:13 siteinfo.json
-rw------- 1 matt matt 12402072 2011-05-08 17:14 tmp.zip
--
volker haas brainbot technologies ag
fon +49 6131 2116394 boppstraße 64
fax +49 6131 2116392 55118 mainz
[email protected] http://www.brainbot.com/