pagegenerators.py

Nicolas Dumazet Wed, 15 Apr 2009 11:06:42 -0700

2009/4/16  <[email protected]>:
> Revision: 6608
> Author:   cosoleto
> Date:     2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
>
> Log Message:
> -----------
> Correction for a CPU overload problem introduced with the recent changes in 
> PageGenerators module that would use the DuplicateFilterPageGenerator always 
> (probably a bad idea). This filter was using a 'list' object to check for 
> duplicated 'Page' and was storing 'Page' objects making the comparision 
> process much more complicated...
>
> 'set' looks here more appropriate, as it is hashed; and storing for 
> comparision the title and the interwiki link should be enough. This also 
> reduces allocated memory a lot compared with the previous revision (60-65% 
> estimated with a fixed title length of 14 chars).
>
> This commit reduces CPU usage for a so simple task on my five/six years old 
> system from 99% to 30%.
>


Good, very nice catch =)

A small note here: set is not really meant to be used on incremental
.add(), because sets are frozen (not mutable), and add() instantiates
a new set on each .add() action. Sets are useful for set operations
(union, intersection), but are not really helpful when it comes to
incrementally construct them. When I need performance for such kind of
lookups, a simple dictionary is usually way faster than sets :) I
would suggest using a dictionary here :)

-- 
Nicolas Dumazet — NicDumZ [ nɪk.d̪ymz ]

_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l

Re: [Pywikipedia-l] SVN: [6608] trunk/pywikipedia/pagegenerators.py

Reply via email to