I see. Strange... That indeed looks like a parser bug.
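
One way to cross-check this independently of xmlreader.py is to read the same dump with the standard library and compare the rows. The following is only a minimal sketch, not the xmlreader.py code: the function name iter_revisions, the helper local, and the SAMPLE fragment are made up for illustration, and the sketch assumes the usual <page>/<revision> layout of a MediaWiki export.

```python
import io
import xml.etree.ElementTree as ET

# Made-up fragment in the shape of a MediaWiki export (not taken from the kwwiki dump).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>7</id>
    <revision>
      <id>42</id>
      <timestamp>2004-10-10T04:24:14Z</timestamp>
      <contributor><username>ExampleUser</username><id>3</id></contributor>
      <text>hello</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def iter_revisions(stream):
    """Yield [title, page_id, username, timestamp, rev_id] for each revision."""
    path = []                      # stack of open tag names, to know each element's parent
    title = page_id = None
    for event, elem in ET.iterparse(stream, events=('start', 'end')):
        name = local(elem.tag)
        if event == 'start':
            path.append(name)
            if name == 'page':     # new page: forget the previous title and id
                title = page_id = None
            continue
        path.pop()
        parent = path[-1] if path else None
        if parent == 'page' and name == 'title':
            title = elem.text
        elif parent == 'page' and name == 'id':
            page_id = elem.text    # only the <id> directly under <page>
        elif name == 'revision':
            fields = {local(child.tag): child for child in elem}
            user = None
            contributor = fields.get('contributor')
            if contributor is not None:
                for child in contributor:
                    if local(child.tag) in ('username', 'ip'):
                        user = child.text
            yield [title, page_id, user,
                   fields['timestamp'].text, fields['id'].text]
            elem.clear()           # drop the revision text to keep memory low


for row in iter_revisions(io.BytesIO(SAMPLE)):
    print(row)
```

To run it against the real dump, the decompressed stream from 7za could be passed in as sys.stdin.buffer instead of the sample. The parent check matters because <id> appears under <page>, <revision> and <contributor>, and only the first of those is the page id.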

-- Dmitry


On Thu, Sep 30, 2010 at 3:37 PM, emijrp <[email protected]> wrote:

> Furthermore, if you look at the chunk of the dump that I posted, the page
> title and page id are there, but the parser doesn't pick them up.
>
> 2010/10/1 emijrp <[email protected]>
>
>> Hi, thanks for your quick response, but I have a question. Why are deleted
>> pages included in the dump? Also, the page that triggers the error is not
>> deleted on the wiki. [1]
>>
>> [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
>>
>> 2010/9/30 Dmitry Chichkov <[email protected]>
>>
>>> Hi Emijrp,
>>>
>>> That's "normal". Page id/title can be None/empty for deleted pages.
>>>
>>> -- Regards, Dmitry
>>>
>>>
>>> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
>>>
>>>> Hi all;
>>>>
>>>> I think there is an error in xmlreader.py. When parsing a full-revision
>>>> XML dump (in this case [1]) with this code [2] (look at the try/except
>>>> block; it writes when it fails), I correctly get the username, timestamp
>>>> and revision id, but sometimes the page title and the page id are None or
>>>> an empty string.
>>>>
>>>> The first error is:
>>>> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>>>>
>>>> But if we do:
>>>> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
>>>> egrep -i '2004-10-10T04:24:14Z' -C20
>>>>
>>>> We get this [3], which looks fine: the page title and the page id are
>>>> available in the XML, but they are not parsed correctly. And this is not
>>>> the only page title and page id that fails.
>>>>
>>>> Perhaps I have missed something, because I'm still learning to parse XML.
>>>> Sorry in that case.
>>>>
>>>> Regards,
>>>> emijrp
>>>>
>>>> [1]
>>>> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>>>> [2] http://pastebin.ca/1951930
>>>> [3] http://pastebin.ca/1951937
>>>>
>>>> _______________________________________________
>>>> Pywikipedia-l mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>>>>
>>>>