2009/4/16 <[email protected]>: > Revision: 6608 > Author: cosoleto > Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009) > > Log Message: > ----------- > Correction for a CPU overload problem introduced with the recent changes in > PageGenerators module that would use the DuplicateFilterPageGenerator always > (probably a bad idea). This filter was using a 'list' object to check for > duplicated 'Page' and was storing 'Page' objects making the comparision > process much more complicated... > > 'set' looks here more appropriate, as it is hashed; and storing for > comparision the title and the interwiki link should be enough. This also > reduces allocated memory a lot compared with the previous revision (60-65% > estimated with a fixed title length of 14 chars). > > This commit reduces CPU usage for a so simple task on my five/six years old > system from 99% to 30%. >
Good, very nice catch =) A small note here: set is not really meant to be used on incremental .add(), because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally construct them. When I need performance for such kind of lookups, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :) -- Nicolas Dumazet — NicDumZ [ nɪk.d̪ymz ] _______________________________________________ Pywikipedia-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
