Re: [Pywikipedia-l] XMLreader.py

emijrp Sat, 06 Nov 2010 17:18:34 -0700

I think that the problem is in the xmlreader.py module. I don't know why,
but I think that it sometimes clears the title, user, or other variables
before it has finished the entire list of revisions for a page. So when you
read a revision, these values have disappeared in some cases.
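
One way to tell a parser problem from a damaged dump is to walk the file
independently of xmlreader.py. The following is a minimal sketch using only
the Python standard library; it is not the xmlreader.py code and cannot
reproduce the suspected bug. The names check_dump, local and SAMPLE, and the
tiny sample document, are made up for illustration. It keeps the page title
until the closing </page> tag, so every revision of a page is checked against
the same title.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature export, only to exercise the function below.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Kernow</title>
    <id>1</id>
    <revision><id>10</id><text>first</text></revision>
    <revision><id>11</id><text>second</text></revision>
  </page>
  <page>
    <title></title>
    <id>2</id>
    <revision><id>20</id><text>orphan</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix that the export schema adds to tags."""
    return tag.rsplit("}", 1)[-1]


def check_dump(source):
    """Count pages and revisions; list revision ids whose page has no title.

    `source` is a file name or a binary file object holding decompressed XML.
    """
    pages = revisions = 0
    missing = []
    title = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            if name == "page":
                title = None  # reset only when a NEW page begins
            continue
        if name == "title":
            title = (elem.text or "").strip() or None
        elif name == "revision":
            revisions += 1
            if title is None:
                # The first direct <id> child is the revision id.
                rev_id = next(
                    (c.text for c in elem if local(c.tag) == "id"), None
                )
                missing.append(rev_id)
            elem.clear()  # free the revision text, keep the page title
        elif name == "page":
            pages += 1
            elem.clear()
    return pages, revisions, missing


if __name__ == "__main__":
    print(check_dump(io.BytesIO(SAMPLE)))
```

If this reports no missing titles for the decompressed kwwiki dump while
xmlreader.XmlDump(dumpfilename, allrevisions=True) still yields entries
without a title, that would point at the parser rather than at the file.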


2010/11/7 emijrp <[email protected]>

> You didn't replicate the exact case. You must use
> xmlreader.XmlDump(dumpfilename, allrevisions=True). I guess you parsed only
> one revision (the last?) for every page, so it shows 4711. But you skipped
> the errors that happen when parsing the whole dump.
>
> 2010/10/5 Russell Blau <[email protected]>
>
> "emijrp" <[email protected]> wrote in message
>> news:[email protected]...
>>
>> > I think that there is an error in xmlreader.py. When parsing a full
>> > revision XML dump (in this case [1]) using this code [2] (look at the
>> > try/except; it writes a message when it fails), I correctly get the
>> > username, timestamp and revision id, but sometimes the page title and
>> > the page id are None or an empty string.
>>
>> > [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
>> > [2] http://pastebin.ca/1951930
>> > [3] http://pastebin.ca/1951937
>>
>> I have been completely unable to replicate this supposed error.  I
>> downloaded the same kwwiki dump file that you referenced.  I loaded it with
>> xmlreader.XmlDump, ran it through the parser, and counted the number of
>> XMLEntry objects it generated: 4711.  Then as a test I opened the same dump
>> as a text file and counted the number of lines that contain the string
>> "<page>": 4711.  So the parser is correctly returning one object per page
>> item found in the file.
>>
>> Next I ran the parser again with a script that would print out a message if
>> any XMLEntry object had a missing title (None or empty string); no messages.
>>
>> Then I searched for the specific page entry you showed in your pastebin item
>> [3]. The result of this test is shown at [4]. In short, it found exactly the
>> page title you said was missing.
>>
>> I cannot explain why your results are different than mine, unless perhaps
>> you have a corrupted copy of the dump file, or are not using the current
>> version of xmlreader.py.
>>
>> Russ
>>
>> [4] http://pastebin.ca/1955170
>>
>> _______________________________________________
>> Pywikipedia-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>>
>
>

