Furthermore, if you look at the chunk of the dump that I posted, the page title and page id are there, but the parser doesn't pick them up.
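
To cross-check this independently of xmlreader.py, here is a minimal sketch (it is not the xmlreader.py code) that streams a dump with the standard library's xml.etree.ElementTree.iterparse and carries the page-level title and id down to every revision. The names iter_revisions and local, and the sample XML, are made up for illustration; the namespace URI in the sample is only a placeholder, since the code strips whatever namespace is present.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like a pages-meta-history dump (not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>7</id>
    <revision>
      <id>101</id>
      <timestamp>2010-01-01T00:00:00Z</timestamp>
      <contributor><username>ExampleUser</username><id>3</id></contributor>
      <text>first</text>
    </revision>
    <revision>
      <id>102</id>
      <timestamp>2010-01-02T00:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text>second</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield [title, page_id, username, timestamp, rev_id] for each revision."""
    title = page_id = None
    path = []  # local names of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local(elem.tag))
            continue
        name = path.pop()
        parent = path[-1] if path else None
        if parent == "page" and name == "title":
            title = elem.text
        elif parent == "page" and name == "id":
            # Only the <id> directly under <page>; revision and
            # contributor ids have a different parent and are skipped here.
            page_id = elem.text
        elif name == "revision":
            fields = {local(child.tag): child for child in elem}
            user = None
            if "contributor" in fields:
                for child in fields["contributor"]:
                    if local(child.tag) in ("username", "ip"):
                        user = child.text
            timestamp = fields["timestamp"].text if "timestamp" in fields else None
            rev_id = fields["id"].text if "id" in fields else None
            yield [title, page_id, user, timestamp, rev_id]
            elem.clear()  # drop the revision text to keep memory bounded
        elif name == "page":
            title = page_id = None
            elem.clear()


if __name__ == "__main__":
    for row in iter_revisions(io.BytesIO(SAMPLE)):
        print(row)
```

Any binary file object should work as the source, so the output of 7za -so could presumably be piped in through sys.stdin.buffer. If a revision still comes out with None for the title or id with this approach, the data is probably missing from the dump itself rather than lost by the parser.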

2010/10/1 emijrp <[email protected]>

> Hi, thanks for your quick response, but I have a question. Why are deleted
> pages included in the dump? Also, the page with the error is not deleted in
> the wiki.[1]
>
> [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
>
> 2010/9/30 Dmitry Chichkov <[email protected]>
>
>> Hi Emijrp,
>>
>> That's "normal". Page id/title can be None/empty for deleted pages.
>>
>> -- Regards, Dmitry
>>
>>
>> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
>>
>>> Hi all;
>>>
>>> I think that there is an error in xmlreader.py. When parsing a full
>>> revision XML (in this case[1]), using this code[2] (look at the try-catch,
>>> it writes when it fails), I correctly get the username, timestamp and
>>> revision id, but sometimes the page title and the page id are None or an
>>> empty string.
>>>
>>> The first error is:
>>> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>>>
>>> But if we do:
>>> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
>>> egrep -i '2004-10-10T04:24:14Z' -C20
>>>
>>> We get this[3], which is OK: the page title and the page id are available
>>> in the XML, but not correctly parsed. And this is not the only page title
>>> and page id that fails.
>>>
>>> Perhaps I have missed something, because I'm still learning to parse XML.
>>> Sorry in that case.
>>>
>>> Regards,
>>> emijrp
>>>
>>> [1]
>>> http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>>> [2] http://pastebin.ca/1951930
>>> [3] http://pastebin.ca/1951937
>>>
>>> _______________________________________________
>>> Pywikipedia-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
