Hi Emijrp,

That's "normal". Page id/title can be None/empty for deleted pages.
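
If it helps, here is a minimal sketch of how a script could set such entries aside instead of treating them as parse failures. The `Entry` tuple is only a hypothetical stand-in for whatever the parser yields (the field names are an assumption, not the xmlreader.py API), and the second sample entry is made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed revision entry; field names are assumed.
Entry = namedtuple("Entry", "title id username timestamp revisionid")


def split_entries(entries):
    """Separate entries that have a usable title and id from those that do not."""
    complete, incomplete = [], []
    for entry in entries:
        # Both None and the empty string are falsy, so one check covers both cases.
        if entry.title and entry.id:
            complete.append(entry)
        else:
            incomplete.append(entry)
    return complete, incomplete


sample = [
    # Values taken from the failing entry reported in the original message.
    Entry("", None, "QuartierLatin1968", "2004-10-10T04:24:14Z", "4267"),
    # Made-up entry, for illustration only.
    Entry("Example page", "12", "SomeUser", "2005-01-01T00:00:00Z", "5000"),
]

complete, incomplete = split_entries(sample)
```

Keeping the incomplete entries in their own list, rather than dropping them, makes it easy to check them against the raw XML afterwards.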

-- Regards, Dmitry


On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:

> Hi all;
>
> I think there is an error in xmlreader.py. When parsing a full-revision
> XML dump (in this case [1]) with this code [2] (see the try/except block,
> which writes out the entry when it fails), I get the username, timestamp,
> and revision id correctly, but sometimes the page title and the page id
> are None or an empty string.
>
> The first error is:
> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>
> But if we do:
> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep
> -i '2004-10-10T04:24:14Z' -C20
>
> We get this [3], which looks fine: the page title and the page id are
> present in the XML but are not parsed correctly. And this is not the only
> page title and page id that fails.
>
> Perhaps I have missed something, because I'm still learning to parse XML.
> Sorry in that case.
>
> Regards,
> emijrp
>
> [1]
> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
> [2] http://pastebin.ca/1951930
> [3] http://pastebin.ca/1951937
>
> _______________________________________________
> Pywikipedia-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>
>
