I see. Strange... That indeed looks like a parser bug.

-- Dmitry
On Thu, Sep 30, 2010 at 3:37 PM, emijrp <[email protected]> wrote:

> Furthermore, if you look at the chunk of the dump that I posted, the page
> title and page id are there. But the parser doesn't get them.
>
> 2010/10/1 emijrp <[email protected]>
>
>> Hi, thanks for your quick response, but I have a question. Why are deleted
>> pages included in the dump? Also, the page of the error is not deleted in
>> the wiki.[1]
>>
>> [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
>>
>> 2010/9/30 Dmitry Chichkov <[email protected]>
>>
>>> Hi Emijrp,
>>>
>>> That's "normal". Page id/title can be None/empty for deleted pages.
>>>
>>> -- Regards, Dmitry
>>>
>>> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
>>>
>>>> Hi all;
>>>>
>>>> I think that there is an error in xmlreader.py. When parsing a full
>>>> revision XML (in this case[1]), using this code[2] (look at the try-catch,
>>>> it writes when it fails), I correctly get the username, timestamp and
>>>> revision id, but sometimes the page title and the page id are None or an
>>>> empty string.
>>>>
>>>> The first error is:
>>>> ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>>>>
>>>> But if we do:
>>>> 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
>>>> egrep -i '2004-10-10T04:24:14Z' -C20
>>>>
>>>> We get this[3], which is OK: the page title and the page id are
>>>> available in the XML, but not correctly parsed. And this is not the only
>>>> page title and page id that fails.
>>>>
>>>> Perhaps I have missed something, because I'm still learning to parse XML.
>>>> Sorry in that case.
>>>>
>>>> Regards,
>>>> emijrp
>>>>
>>>> [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>>>> [2] http://pastebin.ca/1951930
>>>> [3] http://pastebin.ca/1951937
>>>>
>>>> _______________________________________________
>>>> Pywikipedia-l mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
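For anyone who wants to cross-check the dump independently of xmlreader.py, here is a minimal sketch using only the Python standard library. It is not the xmlreader.py code and not the script from the pastebin; the helper names (local, iter_revisions) and the SAMPLE document are made up for illustration, and the page title and page id in SAMPLE are placeholders, not values from the kwwiki dump. The idea is simply to remember the enclosing page's title and id while streaming, so that every revision row can carry them.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield [title, page_id, username, timestamp, rev_id] per revision.

    Hypothetical helper: it tracks the open-element stack so that the
    <id> directly under <page> is not confused with the <id> elements
    under <revision> or <contributor>.
    """
    stack = []
    title = page_id = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            stack.append(name)
            continue
        stack.pop()
        parent = stack[-1] if stack else None
        if name == "title" and parent == "page":
            title = elem.text
        elif name == "id" and parent == "page":
            page_id = elem.text
        elif name == "revision":
            fields = {local(child.tag): child for child in elem}
            user = None
            contributor = fields.get("contributor")
            if contributor is not None:
                for child in contributor:
                    if local(child.tag) in ("username", "ip"):
                        user = child.text
            timestamp = fields["timestamp"].text if "timestamp" in fields else None
            rev_id = fields["id"].text if "id" in fields else None
            yield [title, page_id, user, timestamp, rev_id]
            elem.clear()  # free the revision text; title/id are already saved
        elif name == "page":
            title = page_id = None
            elem.clear()


# Placeholder document: the title and page id below are invented for the example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Sample title</title>
    <id>42</id>
    <revision>
      <id>4267</id>
      <timestamp>2004-10-10T04:24:14Z</timestamp>
      <contributor>
        <username>QuartierLatin1968</username>
        <id>7</id>
      </contributor>
      <text>...</text>
    </revision>
  </page>
</mediawiki>"""

for row in iter_revisions(io.BytesIO(SAMPLE)):
    print(row)
```

To run it against the real dump, the output of the 7za command above could be piped in and sys.stdin.buffer passed instead of the BytesIO object. Any row whose first two fields come out as '' or None even though the XML has them would point at the parser rather than at the dump.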
