Furthermore, the chunk of the dump that I posted does contain the page
title and page id, but the parser doesn't pick them up.
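
One way to cross-check this independently of xmlreader.py would be a small streaming parser. The following is only a minimal sketch, not the xmlreader.py implementation: it assumes the usual MediaWiki export layout (a <page> holding <title>, <id> and one or more <revision> elements), and the names iter_revisions, local, child and SAMPLE, the namespace URI, and all sample values are made up for illustration. It keeps the title and id of the current page and attaches them to every revision, in the same field order as the failing row quoted below.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit('}', 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None


def iter_revisions(stream):
    """Yield [title, page_id, username, timestamp, rev_id] per revision."""
    path = []                      # local names of the currently open elements
    title = page_id = None
    for event, elem in ET.iterparse(stream, events=('start', 'end')):
        name = local(elem.tag)
        if event == 'start':
            path.append(name)
            if name == 'page':     # new page: forget the previous title/id
                title = page_id = None
            continue
        path.pop()
        parent = path[-1] if path else None
        if parent == 'page' and name == 'title':
            title = elem.text
        elif parent == 'page' and name == 'id':
            # Only the <id> directly under <page> is the page id; revisions
            # and contributors carry their own <id> elements.
            page_id = elem.text
        elif name == 'revision':
            rev_id = child(elem, 'id')
            timestamp = child(elem, 'timestamp')
            contributor = child(elem, 'contributor')
            user = child(contributor, 'username') if contributor is not None else None
            yield [
                title,
                page_id,
                user.text if user is not None else None,
                timestamp.text if timestamp is not None else None,
                rev_id.text if rev_id is not None else None,
            ]
            elem.clear()           # free the revision once it has been reported
        elif name == 'page':
            elem.clear()


# Hypothetical sample data; the namespace URI and all values are illustrative.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2004-10-10T04:24:14Z</timestamp>
      <contributor><username>ExampleUser</username><id>7</id></contributor>
    </revision>
  </page>
</mediawiki>"""

for row in iter_revisions(io.BytesIO(SAMPLE)):
    print(row)
```

On the real dump, the output of 7za e -so could be piped into such a script by passing sys.stdin.buffer as the stream. If the rows it produces carry a title and id where xmlreader.py reports None or '', that would suggest the problem lies in the parser rather than in the dump.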

2010/10/1 emijrp <[email protected]>

> Hi, thanks for your quick response, but I have a question: why are deleted
> pages included in the dump? Also, the page that triggers the error has not
> been deleted on the wiki.[1]
>
> [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
>
> 2010/9/30 Dmitry Chichkov <[email protected]>
>
>> Hi Emijrp,
>>
>> That's "normal". Page id/title can be None/empty for deleted pages.
>>
>> -- Regards, Dmitry
>>
>>
>> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
>>
>>> Hi all;
>>>
>>> I think there is an error in xmlreader.py. When parsing a full revision
>>> history XML dump (in this case [1]) with this code [2] (look at the
>>> try/except, which writes a line whenever parsing fails), I correctly get
>>> the username, timestamp and revision id, but sometimes the page title and
>>> the page id are None or an empty string.
>>>
>>> The first error is:
>>> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>>>
>>> But if we do:
>>> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
>>> egrep -i '2004-10-10T04:24:14Z' -C20
>>>
>>> We get this [3], which is OK: the page title and the page id are
>>> available in the XML, but they are not parsed correctly. And this is not
>>> the only page title and page id that fail.
>>>
>>> Perhaps I have missed something, because I'm still learning to parse XML.
>>> Sorry if that is the case.
>>>
>>> Regards,
>>> emijrp
>>>
>>> [1]
>>> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>>> [2] http://pastebin.ca/1951930
>>> [3] http://pastebin.ca/1951937
>>>
>>> _______________________________________________
>>> Pywikipedia-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>>>
>>>
>>
>