Hi, thanks for the quick response, but I have a question: why are deleted pages included in the dump? Also, the page that triggers the error has not been deleted on the wiki.[1]
[1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497

2010/9/30 Dmitry Chichkov <[email protected]>

> Hi Emijrp,
>
> That's "normal". Page id/title can be None/empty for deleted pages.
>
> --
> Regards,
> Dmitry
>
>
> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
>
>> Hi all;
>>
>> I think that there is an error in xmlreader.py. When parsing a full
>> revision XML (in this case[1]), using this code[2] (look at the try-catch,
>> it writes when it fails) I correctly get the username, timestamp and
>> revision id, but sometimes the page title and the page id are None or an
>> empty string.
>>
>> The first error is:
>> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>>
>> But if we do:
>> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
>> egrep -i '2004-10-10T04:24:14Z' -C20
>>
>> We get this[3], which is OK: the page title and the page id are available
>> in the XML, but they are not correctly parsed. And this is not the only
>> page title and page id that fail.
>>
>> Perhaps I have missed something, because I'm still learning to parse XML.
>> Sorry in that case.
>>
>> Regards,
>> emijrp
>>
>> [1]
>> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>> [2] http://pastebin.ca/1951930
>> [3] http://pastebin.ca/1951937
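To cross-check the title and id independently of xmlreader.py, something like the following could be used. It is only a minimal sketch built on the standard library's `xml.etree.ElementTree.iterparse`, not how xmlreader.py works; the names `iter_revisions`, `local` and `child_text` are made up for this example, and it assumes the usual dump layout in which `<title>` and `<id>` come before the `<revision>` elements inside each `<page>`.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, *names):
    """Follow a chain of child tag names and return the text of the last one."""
    for name in names:
        elem = next((c for c in elem if local(c.tag) == name), None)
        if elem is None:
            return None
    return elem.text


def iter_revisions(source):
    """Yield [title, page_id, username, timestamp, rev_id] for each revision.

    `source` is a filename or a binary file object containing dump XML,
    e.g. sys.stdin.buffer when the output of `7za e -so ...` is piped in.
    """
    path = []                  # local tag names from the root to the current element
    title = page_id = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local(elem.tag))
            if path[-1] == "page":
                title = page_id = None      # new page: forget the previous one
            continue
        tag = path.pop()
        parent = path[-1] if path else None
        if parent == "page" and tag == "title":
            title = elem.text
        elif parent == "page" and tag == "id":
            # Only the <id> directly under <page>; revisions and contributors
            # have their own <id> elements, which must not overwrite it.
            page_id = elem.text
        elif tag == "revision":
            yield [title, page_id,
                   child_text(elem, "contributor", "username"),
                   child_text(elem, "timestamp"),
                   child_text(elem, "id")]
            elem.clear()                    # drop the revision text to save memory
```

If this prints a title and id for the revision with timestamp 2004-10-10T04:24:14Z, the data is in the dump and the problem is on the parsing side.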
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
