Revision: 6608
Author: cosoleto
Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
-----------
Fix a CPU overload problem introduced by the recent changes to the
PageGenerators module, which now always apply DuplicateFilterPageGenerator
(probably a bad idea). The filter used a 'list' to check for duplicate 'Page'
objects and stored the 'Page' objects themselves, so every membership test had
to scan the list and compare whole 'Page' objects, which is expensive.
A 'set' looks more appropriate here, as it is hashed, and storing only the
title with the interwiki link for comparison should be enough. This also
reduces allocated memory a lot compared with the previous revision (an
estimated 60-65% with a fixed title length of 14 chars).
For such a simple task, this commit reduces CPU usage on my five- or
six-year-old system from 99% to 30%.
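
To illustrate why the 'list' membership test is costly, here is a rough sketch
that counts equality comparisons for both containers. CountingKey and
count_comparisons are hypothetical names invented for this illustration; they
are not part of pywikipedia, and the counts say nothing about the real 'Page'
comparison cost, only about how often a comparison is triggered.

```python
class CountingKey:
    """Hypothetical key type that counts how often == is evaluated."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return self.value == other.value

    def __hash__(self):
        # Hash on the wrapped value so equal keys land in the same bucket.
        return hash(self.value)


def count_comparisons(container_factory, add, n=200):
    """Run a 'seen' filter over n distinct keys and return the == call count."""
    CountingKey.comparisons = 0
    seen = container_factory()
    for i in range(n):
        key = CountingKey(i)
        if key not in seen:
            add(seen, key)
    return CountingKey.comparisons


# A list compares the new key against every stored element; a set looks the
# key up by hash and only falls back to == when hashes match.
list_cost = count_comparisons(list, list.append)
set_cost = count_comparisons(set, set.add)
print("list comparisons:", list_cost)
print("set comparisons:", set_cost)
```

The list count grows quadratically with the number of distinct pages, while
the set count stays near zero as long as the hashes differ.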
Modified Paths:
--------------
trunk/pywikipedia/pagegenerators.py
Modified: trunk/pywikipedia/pagegenerators.py
===================================================================
--- trunk/pywikipedia/pagegenerators.py 2009-04-15 08:28:21 UTC (rev 6607)
+++ trunk/pywikipedia/pagegenerators.py 2009-04-15 17:53:59 UTC (rev 6608)
@@ -705,10 +705,11 @@
     Wraps around another generator. Yields all pages, but prevents
     duplicates.
     """
-    seenPages = []
+    seenPages = set()
     for page in generator:
-        if page not in seenPages:
-            seenPages.append(page)
+        _page = page.aslink(forceInterwiki = True)[2:-2]
+        if _page not in seenPages:
+            seenPages.add(_page)
             yield page
 
 def RegexFilterPageGenerator(generator, regex):
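
For readers who want to try the new filter logic outside the framework, the
following is a minimal standalone sketch. FakePage is a hypothetical stand-in
written for this example, not the real pywikipedia 'Page' class; its aslink()
is assumed to return text of the form '[[code:Title]]', which is what the
[2:-2] slice in the diff relies on to strip the surrounding brackets.

```python
class FakePage:
    """Hypothetical stand-in for a pywikipedia Page; not the real class."""

    def __init__(self, site_code, title):
        self.site_code = site_code
        self.title = title

    def aslink(self, forceInterwiki=False):
        # Assumption: link text is wrapped in '[[' and ']]', with the site
        # code prepended when forceInterwiki is set.
        if forceInterwiki:
            return "[[%s:%s]]" % (self.site_code, self.title)
        return "[[%s]]" % self.title


def duplicate_filter(generator):
    """Yield each page once, keyed on its interwiki link text."""
    seen = set()
    for page in generator:
        # Strip the leading '[[' and trailing ']]' to get a plain string key.
        key = page.aslink(forceInterwiki=True)[2:-2]
        if key not in seen:
            seen.add(key)
            yield page


pages = [
    FakePage("en", "Python"),
    FakePage("de", "Python"),
    FakePage("en", "Python"),  # duplicate of the first entry
    FakePage("en", "Perl"),
]
unique_keys = [p.aslink(forceInterwiki=True)[2:-2]
               for p in duplicate_filter(pages)]
print(unique_keys)
```

Keying on a short string rather than on the page object keeps the set small
and avoids invoking the object's own comparison logic.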
_______________________________________________
Pywikipedia-l mailing list
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l