On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
>> is about 10 times slower than just using four (much more readable)
>> lines of code:
>>
>> (..snip..)
>
> That may be, but when I tried your code on
> http://download.wikimedia.org/nowiki/20090729/nowiki-20090729-pages-articles.xml.bz2
> (after unpacking of course) I got this:
> Traceback (most recent call last):
>   File "search.py", line 5, in <module>
>     print page.title
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 1: ordinal not in range(128)
Yes, it breaks. To mimic the behaviour of your script (which blindly
ignores the encoding and therefore happens to work), use
page.title.encode('utf-8') instead; that should work fine.
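For anyone puzzled by the traceback: it comes from Python implicitly encoding the unicode title with the ascii codec when printing. A minimal sketch of the codec behaviour (in Python 3 syntax, with a made-up title containing U+00E6; the original script is Python 2, where a bare `print page.title` triggers the implicit ascii encode):

```python
# Hypothetical example title containing the character u'\xe6' (ae ligature),
# as in the Norwegian dump. The name is invented for illustration.
title = "N\xe6ringsliv"

# Encoding with the ascii codec fails exactly like the traceback above:
try:
    title.encode("ascii")
except UnicodeEncodeError as err:
    print(err)

# Encoding explicitly as UTF-8 succeeds and yields printable bytes:
print(title.encode("utf-8"))  # b'N\xc3\xa6ringsliv'
```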
Additionally, xmlreader actually supports reading the bzip2-ed xml
directly (which is probably faster than unpacking it first, and possibly
even faster than running on the plain xml, depending on processor speed
and disk speed):
import xmlreader
for page in xmlreader.XmlDump('/home/valhallasw/download/nowiki-20090729-pages-articles.xml.bz2').parse():
    if '{|' in page.text:
        print page.title.encode('utf-8')
valhall...@elladan:~/pywikipedia/trunk/pywikipedia$ python stig.py > results
valhall...@elladan:~/pywikipedia/trunk/pywikipedia$ wc -l results
20890 results
(which includes one line 'Reading XML dump...', so that is the same result).
-Merlijn van Deen
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l