On Thu, Jul 30, 2009 at 6:30 PM, Merlijn van Deen <[email protected]> wrote:

> On Thu, July 30, 2009 9:11 pm, Stig Meireles Johansen wrote:
> > I hacked a little on some old perl-code I had laying around which I once
> [..snip..]
> > here it is:
> http://toolserver.no/~stigmj/tools/src/xml-search.pl.txt
> >
> > /Stigmj
>
> Suggestion: pywikipediabot has good built-in support. My attempt at
> building a simple parser
> (http://arctus.nl/~valhallasw/pulldom.py) is
> about 10 times slower than just using four (much more readable) lines of
> code:
>

That may be, but when I tried your code on
http://download.wikimedia.org/nowiki/20090729/nowiki-20090729-pages-articles.xml.bz2
(after unpacking of course) I got this:

Traceback (most recent call last):
  File "search.py", line 5, in <module>
    print page.title
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 1: ordinal not in range(128)

While my code ran like this:

$ time ./xml-search.pl nowiki-20090729-pages-articles.xml "\{\|" 0 > t.t

real    1m16.511s
user    1m15.657s
sys     0m0.856s

$ grep ^Searched t.t
Searched through 407565 articles and found 20889 matches

Give me some working code and I'll do a comparison.. :)

/Stig
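
For reference, the traceback above is what happens when a unicode title containing a non-ASCII character (here u'\xe6', "æ") is implicitly encoded with the 'ascii' codec, which is typical when output is redirected to a file. A minimal sketch of one common workaround is to encode each title explicitly as UTF-8 before writing it. The helper names (encode_title, write_title, ascii_fails) and the sample title are hypothetical and are not taken from search.py or pywikipediabot:

```python
import io


def encode_title(title, encoding="utf-8"):
    """Return the title as bytes in the given encoding (UTF-8 by default)."""
    return title.encode(encoding)


def write_title(stream, title):
    """Write one title to a binary stream as UTF-8, followed by a newline."""
    stream.write(encode_title(title) + b"\n")


def ascii_fails(title):
    """Return True if the title cannot be encoded with the 'ascii' codec."""
    try:
        title.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


if __name__ == "__main__":
    # Hypothetical sample title with u'\xe6' at position 1, as in the traceback.
    sample = u"B\xe6rum"
    out = io.BytesIO()  # stand-in for a binary output stream
    write_title(out, sample)
    print(ascii_fails(sample))
```

In the Python 2 script from the traceback, the equivalent change would presumably be to write `print page.title.encode('utf-8')` instead of `print page.title`; that assumes page.title is a unicode object, which the error message suggests but the thread does not show.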
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
