On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
>> is about 10 times slower than just using four (much more readable)
>> lines of code:
>>
>> (..snip..)
>
> That may be, but when I tried your code on
> http://download.wikimedia.org/nowiki/20090729/nowiki-20090729-pages-articles.xml.bz2
> (after unpacking, of course) I got this:
> Traceback (most recent call last):
>   File "search.py", line 5, in <module>
>     print page.title
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 1: ordinal not in range(128)

Yes, it breaks. To mimic the behaviour of your script (which blindly
ignores the encoding and therefore happens to work), use
page.title.encode('utf-8') instead; that should work fine.
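For illustration, the failure and the fix can be sketched as follows (a
minimal sketch in modern Python; the title string is a hypothetical example
containing the u'\xe6' character from the traceback, and the implicit ASCII
encoding is what Python 2's `print` does on a non-unicode-aware stream):

```python
# Hypothetical page title containing 'æ' (U+00E6), the character
# reported in the traceback above.
title = u'Blåbær'

# Python 2's `print title` implicitly encodes to ASCII, which raises
# UnicodeEncodeError for any non-ASCII character:
try:
    title.encode('ascii')
except UnicodeEncodeError:
    pass  # 'ascii' codec can't encode character u'\xe6'

# Explicitly encoding to UTF-8 always succeeds and yields plain bytes
# that can be written to any byte stream:
encoded = title.encode('utf-8')
```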

Additionally, xmlreader actually supports reading bzip2-ed xml (which is
probably faster than unzipping and running, and possibly even faster than
running it on the plain xml, depending on processor speed and disk speed):

import xmlreader

dump = xmlreader.XmlDump('/home/valhallasw/download/nowiki-20090729-pages-articles.xml.bz2')
for page in dump.parse():
    if '{|' in page.text:
        print page.title.encode('utf-8')
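The bzip2 support presumably comes down to dispatching on the file
extension when opening the dump. A minimal sketch of that idea (the
function name and dispatch logic are my own illustration, not
xmlreader's actual implementation, and it uses Python 3's bz2.open):

```python
import bz2

def open_dump(path):
    """Open an XML dump file, transparently decompressing .bz2 files.

    Sketch only: xmlreader's real implementation may differ.
    """
    if path.endswith('.bz2'):
        # bz2.open in text mode decompresses on the fly, so the dump
        # never needs to be unpacked to disk first.
        return bz2.open(path, 'rt', encoding='utf-8')
    return open(path, 'r', encoding='utf-8')
```

Decompressing on the fly trades some CPU for far less disk I/O, which
is why it can be faster than parsing the unpacked XML on a slow disk.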

valhall...@elladan:~/pywikipedia/trunk/pywikipedia$ python stig.py > results
valhall...@elladan:~/pywikipedia/trunk/pywikipedia$ wc -l results         
     20890 results

(which includes one line 'Reading XML dump...', so that is the same result).

-Merlijn van Deen



_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
