The dump doesn't include deleted pages or revisions. The dump has the values but the parser doesn't parse them.
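For anyone who wants to cross-check this without xmlreader.py, here is a minimal sketch using only the standard library's xml.etree.ElementTree.iterparse. It is not the pywikipedia parser and does not show where its bug is. It only shows that the title and id of the enclosing <page> can be carried along to each <revision>. The helper names (local, child_text, iter_revisions) and the SAMPLE document are made up for illustration. SAMPLE is hand-written, not a real chunk of the dump, and its page id is a placeholder.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit('}', 1)[-1]


def child_text(elem, *names):
    """Follow a chain of child tag names and return the text, or None."""
    for name in names:
        elem = next((c for c in elem if local(c.tag) == name), None)
        if elem is None:
            return None
    return elem.text


def iter_revisions(source):
    """Yield (title, page_id, username, timestamp, rev_id) per revision."""
    path = []                      # stack of currently open tag names
    title = page_id = None
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        name = local(elem.tag)
        if event == 'start':
            path.append(name)
            if name == 'page':     # new page: forget the previous one
                title = page_id = None
            continue
        path.pop()
        parent = path[-1] if path else None
        # <id> also occurs under <revision> and <contributor>, so only
        # take the one whose direct parent is <page>.
        if parent == 'page' and name == 'title':
            title = elem.text
        elif parent == 'page' and name == 'id':
            page_id = elem.text
        elif name == 'revision':
            yield (title, page_id,
                   child_text(elem, 'contributor', 'username'),
                   child_text(elem, 'timestamp'),
                   child_text(elem, 'id'))
            elem.clear()           # free the revision text
        elif name == 'page':
            elem.clear()


# Hand-written sample; the page id "1" is a placeholder.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Rebellyans Kernow 1497</title>
    <id>1</id>
    <revision>
      <id>4267</id>
      <timestamp>2004-10-10T04:24:14Z</timestamp>
      <contributor><username>QuartierLatin1968</username><id>7</id></contributor>
      <text>...</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == '__main__':
    for row in iter_revisions(io.BytesIO(SAMPLE)):
        print(row)
```

iterparse accepts a filename or any binary file object, so the same function should also work on the output of 7za piped in through sys.stdin.buffer. Anonymous edits have no <username>, so that field comes back as None for them.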
2010/10/1 Dr. Trigon <[email protected]>

> Maybe I am wrong, but xqt told me once that the PreloadingGenerator
> has problems with the API. I myself had problems due to deleted (and
> redirect) pages when loading multiple pages at once through the API too.
>
> So my assumption is that this XML parser indeed has a problem parsing
> deleted (and maybe redirect) pages and thus fails to return them all,
> and so the PreloadingGenerator does not work with the API.
> If I am right about this, the solution to the problem mentioned here
> can also solve the preloading-with-API problem. This would be very
> nice! But to be sure, I would appreciate a comment by xqt on this ;))
>
> Just some thoughts...
>
> Greetings
> DrTrigon
>
>
> On 01.10.2010 00:52, Dmitry Chichkov wrote:
> > I see. Strange... That indeed looks like a parser bug.
> >
> > -- Dmitry
> >
> >
> > On Thu, Sep 30, 2010 at 3:37 PM, emijrp <[email protected]> wrote:
> >
> >     Furthermore, if you look at the chunk of the dump that I have
> >     posted, the page title and page id are there. But the parser
> >     doesn't get them.
> >
> >     2010/10/1 emijrp <[email protected]>
> >
> >         Hi, thanks for your quick response, but I have a question. Why
> >         are deleted pages included in the dump? Also, the page of the
> >         error is not deleted in the wiki.[1]
> >
> >         [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
> >
> >         2010/9/30 Dmitry Chichkov <[email protected]>
> >
> >             Hi Emijrp,
> >
> >             That's "normal". Page id/title can be None/empty for
> >             deleted pages.
> >
> >             -- Regards, Dmitry
> >
> >
> >             On Thu, Sep 30, 2010 at 9:50 AM, emijrp <[email protected]> wrote:
> >
> >                 Hi all;
> >
> >                 I think that there is an error in xmlreader.py. When
> >                 parsing a full-revision XML dump (in this case [1])
> >                 using this code [2] (look at the try/except, it writes
> >                 a line when it fails), I correctly get the username,
> >                 timestamp and revision id, but sometimes the page
> >                 title and the page id are None or an empty string.
> >
> >                 The first error is:
> >                 ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z',
> >                 '4267']
> >
> >                 But if we do:
> >                 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z
> >                 2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20
> >
> >                 We get this [3], which is OK: the page title and the
> >                 page id are available in the XML, but not correctly
> >                 parsed. And this is not the only page title and page
> >                 id that fail.
> >
> >                 Perhaps I have missed something, because I'm still
> >                 learning to parse XML. Sorry in that case.
> >
> >                 Regards,
> >                 emijrp
> >
> >                 [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
> >                 [2] http://pastebin.ca/1951930
> >                 [3] http://pastebin.ca/1951937
_______________________________________________ Pywikipedia-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
