I think the problem is in the xmlreader.py module. I don't know why, but I think it sometimes clears the title, user, or other variables before it has finished the entire list of revisions for a page. So when you read a revision, these values have already disappeared in some cases.
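To illustrate the kind of failure I mean, here is a minimal sketch. It is not the xmlreader.py code: it uses xml.etree.ElementTree from the standard library, and the sample XML, the `revisions` function and the `clear_early` flag are all made up for the example. It only shows that, if the page element is cleared after each revision instead of at the end of the page, the title survives for the first revision and is None for the rest.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump: one page with three revisions.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
    <revision><id>12</id></revision>
  </page>
</mediawiki>"""


def revisions(xml_bytes, clear_early):
    """Yield (page title, revision id) for every revision in the dump."""
    page = None
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start" and elem.tag == "page":
            page = elem  # remember the page we are currently inside
        elif event == "end" and elem.tag == "revision":
            yield page.findtext("title"), elem.findtext("id")
            if clear_early:
                # Suspected bug: this drops <title> and <id> of the page
                # while more revisions of the same page are still coming.
                page.clear()
            else:
                # Frees only the revision that was just handled.
                elem.clear()
        elif event == "end" and elem.tag == "page":
            page.clear()  # safe: the whole page has been processed


if __name__ == "__main__":
    for title, revid in revisions(SAMPLE, clear_early=True):
        print(repr(title), revid)
```

With `clear_early=False` every revision keeps its title, which would also explain why nothing goes wrong when only one revision per page is parsed.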
2010/11/7 emijrp <[email protected]>

> You didn't replicated the exact case. You must use:
> xmlreader.XmlDump(dumpfilename, allrevisions=True). I guess you parsed only
> one revision (the last?) for every page, so, it shows 4711. But you skipped
> the errors which happen when parsing the whole dump.
>
> 2010/10/5 Russell Blau <[email protected]>
>
>> "emijrp" <[email protected]> wrote in message
>> news:[email protected]...
>>
>> > I think that there is an error in xmlreader.py. When parsing a full
>> > revision XML (in this case[1]), using this code[2] (look at the
>> > try-catch, it writes when fails) I get correctly username,
>> > timestamp and revisionid, but sometimes, the page title and the page
>> > id are None or empty string.
>> >
>> > [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>> > [2] http://pastebin.ca/1951930
>> > [3] http://pastebin.ca/1951937
>>
>> I have been completely unable to replicate this supposed error. I
>> downloaded the same kwwiki dump file that you referenced. I loaded it with
>> xmlreader.XmlDump, ran it through the parser, and counted the number of
>> XMLEntry objects it generated: 4711. Then as a test I opened the same dump
>> as a text file and counted the number of lines that contain the string
>> "<page>": 4711. So the parser is correctly returning one object per page
>> item found in the file.
>>
>> Next I ran the parser again with a script that would print out a message if
>> any XMLEntry object had a missing title (None or empty string); no
>> messages.
>>
>> Then I searched for the specific page entry you showed in your pastebin
>> item [3]. The result of this test is shown at [4]. In short, it found
>> exactly the page title you said was missing.
>>
>> I cannot explain why your results are different than mine, unless perhaps
>> you have a corrupted copy of the dump file, or are not using the current
>> version of xmlreader.py.
>>
>> Russ
>>
>> [4] http://pastebin.ca/1955170
_______________________________________________ Pywikipedia-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
