Hi Volker,

Thanks for that. I think I see what to do.

> the lack of documentation does not make it easier unfortunately.

Yes, it's a pity, as mwlib is a nice project. I'll try to be a good citizen
and write up a tutorial when I'm done with this, as I think what I want to do
might be a common use case.

On Fri, May 20, 2011 at 8:31 PM, Volker Haas <[email protected]> wrote:

> Hi.
>
> What you are trying to do is probably pretty complicated - the lack of
> documentation does not make it easier unfortunately.
>
> I hope the following sample code will get you started:
>
> import sys
> from mwlib import wiki
> from mwlib.parser import show
>
> zip_fn = '/home/volker/test/test.zip'
>
> wiki_obj = wiki.makewiki(zip_fn)
>
> for item in wiki_obj.metabook.walk():
>     if item.type == 'article':
>         parse_tree = item.wiki.getParsedArticle(title=item.title,
>                                                 revision=item.revision)
>         show(sys.stdout, parse_tree)
>
> If you want to manipulate the parsetree you should do yourself a favor and
> transform the parsetree with "buildAdvancedTree" in mwlib.advtree
>
> I hope this helps,
> Volker
>
> On 05/08/2011 09:23 AM, Matthew Honnibal wrote:
>
>> Hi,
>> What's the recommended way to get parse-trees in Python for a list of
>> articles?
>>
>> I'm having trouble using mw-zip and mwlib.zipwiki. I'm trying to do
>> mwlib.zipwiki.Wiki('tmp.zip') with a zip created on the commandline
>> using "mw-zip -m -c :en -o tmp.zip Sun". However, mwlib.zipwiki.Wiki
>> expects a content.json file in tmp.zip, which mw-zip hasn't created:
>>
>> $ mw-zip -c :en -o tmp.zip Sun Moon Stars
>> creating nuwiki in u'tmpuMGaLk/nuwiki'
>> 2011-05-08T17:13:23 mwlib.utils.info>> fetching 'http://en.wikipedia.org/w/index.php?title=Help:Books/License&action=raw&templates=expand'
>> 256/256 80.65 48.39s
>>
>> >>> import mwlib.zipwiki
>> >>> w = mwlib.zipwiki.Wiki('tmp.zip')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "/usr/local/lib/python2.6/dist-packages/mwlib-0.12.14-py2.6-linux-x86_64.egg/mwlib/zipwiki.py", line 48, in __init__
>>     content = json.loads(unicode(self.zf.read('content.json'), 'utf-8'))
>>   File "/usr/lib/python2.6/zipfile.py", line 834, in read
>>     return self.open(name, "r", pwd).read()
>>   File "/usr/lib/python2.6/zipfile.py", line 857, in open
>>     zinfo = self.getinfo(name)
>>   File "/usr/lib/python2.6/zipfile.py", line 824, in getinfo
>>     'There is no item named %r in the archive' % name)
>> KeyError: "There is no item named 'content.json' in the archive"
>>
>> $ ls -la
>> total 20260
>> drwxr-xr-x 3 matt matt     4096 2011-05-08 17:16 .
>> drwxr-xr-x 3 matt matt     4096 2011-05-08 17:16 ..
>> -rw-r--r-- 1 matt matt  4926482 2011-05-08 17:14 edits.json
>> -rw-r--r-- 1 matt matt     1276 2011-05-08 17:14 excluded.json
>> -rw-r--r-- 1 matt matt    53979 2011-05-08 17:14 imageinfo.json
>> drwxr-xr-x 2 matt matt     4096 2011-05-08 17:16 images
>> -rw-r--r-- 1 matt matt    25543 2011-05-08 17:14 licenses.json
>> -rw-r--r-- 1 matt matt      919 2011-05-08 17:13 metabook.json
>> -rw-r--r-- 1 matt matt      149 2011-05-08 17:14 nfo.json
>> -rw-r--r-- 1 matt matt  1632927 2011-05-08 17:14 parsed_html.json
>> -rw-r--r-- 1 matt matt      452 2011-05-08 17:14 redirects.json
>> -rw-r--r-- 1 matt matt  1492682 2011-05-08 17:14 revisions-1.txt
>> -rw-r--r-- 1 matt matt   126884 2011-05-08 17:13 siteinfo.json
>> -rw------- 1 matt matt 12402072 2011-05-08 17:14 tmp.zip
>
> --
> volker haas                 brainbot technologies ag
> fon +49 6131 2116394        boppstraße 64
> fax +49 6131 2116392        55118 mainz
> [email protected]           http://www.brainbot.com/
>
> --
> You received this message because you are subscribed to the Google Groups
> "mwlib" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/mwlib?hl=en.
