Hi Emijrp,

That's "normal". The page id and title can be None or empty for deleted pages.
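If you only want to skip such entries instead of treating them as parse errors, here is a minimal sketch. It assumes your rows look like the list in your mail (title, page id, username, timestamp, revision id). The helper name has_page_info and the second sample row are made up for illustration; nothing here comes from xmlreader.py itself.

```python
def has_page_info(entry):
    """Return True if the row carries a usable page title and page id."""
    title, page_id = entry[0], entry[1]
    # Both None and the empty string count as "missing".
    return bool(title) and bool(page_id)


rows = [
    # The row reported in the original mail.
    ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267'],
    # Hypothetical row with complete page information.
    ['Example page', '12', 'ExampleUser', '2005-01-01T00:00:00Z', '5000'],
]

usable = [row for row in rows if has_page_info(row)]
skipped = [row for row in rows if not has_page_info(row)]
```

Keeping the skipped rows in their own list rather than dropping them makes it easy to check later whether they really belong to deleted pages.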
--
Regards, Dmitry

On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
> Hi all;
>
> I think that there is an error in xmlreader.py. When parsing a full
> revision XML (in this case[1]), using this code[2] (look at the try-catch,
> it writes when it fails) I correctly get the username, timestamp and
> revision id, but sometimes the page title and the page id are None or an
> empty string.
>
> The first error is:
> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>
> But if we do:
> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep
> -i '2004-10-10T04:24:14Z' -C20
>
> We get this[3], which is OK: the page title and the page id are available
> in the XML, but not correctly parsed. And this is not the only page title
> and page id that fails.
>
> Perhaps I have missed something, because I'm still learning to parse XML.
> Sorry in that case.
>
> Regards,
> emijrp
>
> [1]
> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
> [2] http://pastebin.ca/1951930
> [3] http://pastebin.ca/1951937
>
> _______________________________________________
> Pywikipedia-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
