Hi, thanks for the quick response, but I have a question: why are deleted
pages included in the dump? Also, the page that triggers the error has not
been deleted from the wiki.[1]

[1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
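
For reference, a minimal sketch of how I would expect the page title and
page id to be read next to each revision. This is not the code from
xmlreader.py; it uses only the standard library and assumes the usual
<page>/<revision> layout of the export XML. The names iter_revisions and
SAMPLE are hypothetical, and the sample page title and page id are made up
(only the username, timestamp and revision id come from the failing row).
One thing to watch is that <id> appears under <page>, <revision> and
<contributor>, so the lookup has to be restricted to direct children.

```python
import io
import xml.etree.ElementTree as ET


def iter_revisions(source):
    """Yield [title, page_id, username, timestamp, rev_id] per revision."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace-uri}page"; split off the namespace.
        ns, _, local = elem.tag.rpartition("}")
        if local != "page":
            continue
        ns = ns + "}" if ns else ""

        # A bare tag name only matches direct children, so this is the
        # page's own <id>, not a revision or contributor <id>.
        title = elem.findtext(ns + "title")
        page_id = elem.findtext(ns + "id")

        for rev in elem.findall(ns + "revision"):
            contributor = rev.find(ns + "contributor")
            username = None
            if contributor is not None:
                username = contributor.findtext(ns + "username")
            yield [
                title,
                page_id,
                username,
                rev.findtext(ns + "timestamp"),
                rev.findtext(ns + "id"),
            ]

        # Drop the finished page's children to limit memory use.
        elem.clear()


# Made-up sample: title and page id are placeholders, not real dump data.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>1</id>
    <revision>
      <id>4267</id>
      <timestamp>2004-10-10T04:24:14Z</timestamp>
      <contributor>
        <username>QuartierLatin1968</username>
        <id>99</id>
      </contributor>
      <text>...</text>
    </revision>
  </page>
</mediawiki>"""

rows = list(iter_revisions(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

With the real dump, the decompressed stream from 7za could be passed in
place of the BytesIO object. Note that this sketch holds one whole <page>
in memory at a time, which may be heavy for pages with very long histories.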

2010/9/30 Dmitry Chichkov <[email protected]>

> Hi Emijrp,
>
> That's "normal". Page id/title can be None/empty for deleted pages.
>
> -- Regards, Dmitry
>
>
> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
>
>> Hi all;
>>
>> I think that there is an error in xmlreader.py. When parsing a full
>> revision XML dump (in this case[1]) using this code[2] (look at the
>> try/except, it writes a line when parsing fails), I correctly get the
>> username, timestamp and revision id, but sometimes the page title and the
>> page id are None or an empty string.
>>
>> The first error is:
>> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>>
>> But if we do:
>> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
>> egrep -i '2004-10-10T04:24:14Z' -C20
>>
>> We get this[3], which is OK: the page title and the page id are available
>> in the XML, but they are not parsed correctly. And this is not the only
>> page title and page id that fails.
>>
>> Perhaps I have missed something, because I'm still learning to parse XML.
>> Sorry in that case.
>>
>> Regards,
>> emijrp
>>
>> [1]
>> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>> [2] http://pastebin.ca/1951930
>> [3] http://pastebin.ca/1951937
>>
>> _______________________________________________
>> Pywikipedia-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>>
>>
>
>
>
