There is another issue with that generator: it always checks for replacements 
but does not apply them, which means the replacements are always done twice, 
and that might slow down the run too. I think we should open a Phabricator 
task for it.
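
For the namespace part quoted below, here is a minimal sketch of what 
filtering during the scan could look like. It is not pywikibot code: 
DumpEntry, scan_dump and the field names are made up for illustration, and I 
simply assume that each dump entry exposes its namespace number.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for one page record read from an XML dump.
DumpEntry = namedtuple('DumpEntry', ['title', 'ns', 'text'])


def scan_dump(entries, replacements, namespaces=None):
    """Yield (title, new_text) for entries changed by the replacements.

    entries:      iterable of DumpEntry
    replacements: list of (compiled regex, replacement string) pairs
    namespaces:   set of namespace numbers to keep, or None for all
    """
    for entry in entries:
        # Skip unwanted namespaces before any (possibly slow) regex runs.
        if namespaces is not None and entry.ns not in namespaces:
            continue
        new_text = entry.text
        for pattern, repl in replacements:
            new_text = pattern.sub(repl, new_text)
        if new_text != entry.text:
            yield entry.title, new_text


# The replacement from the quoted mail, written as a Python 3 raw string.
replacements = [(re.compile(r'(\d)\s*%'), r'\1%')]
entries = [
    DumpEntry('Article', 0, 'about 50 % of cases'),
    DumpEntry('Talk:Article', 1, 'maybe 50 % is wrong'),
    DumpEntry('Other', 0, 'nothing to fix'),
]
# Only namespace 0 is scanned; the talk page never reaches the regex.
matches = list(scan_dump(entries, replacements, namespaces={0}))
```

The sketch also yields the already replaced text. Whether that result could be 
reused is a separate question, since the live page may differ from the dump.
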
Best
Xqt

> On 16.09.2018 at 22:03, Bináris <[email protected]> wrote:
> 
> Hi folks,
> 
> I still use trunk/compat for many reasons, but judging by the new code at 
> https://github.com/wikimedia/pywikibot/blob/master/scripts/replace.py, the 
> core version must suffer from the same problem.
> 
> If we use -namespace for namespace filtering, class 
> XmlDumpReplacePageGenerator will go through ALL pages, and only THEN is the 
> result filtered by a namespace generator. This may MULTIPLY the running time 
> in some cases, and it may cost hours or even days for a fix that uses 
> complicated, slow regexes.
> I have just checked that the dump does contain namespace information. So why 
> don't we filter during the scan?
> 
> I made an experiment. I modified my copy to display the count of articles 
> and the count of matching pages. The replacement was: 
> (ur'(\d)\s*%', ur'\1%'),
> which seems pretty slow. :-(
> The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole dump, 
> I used -xmlstart.) It went through 820 thousand pages and found 240+ matches 
> (I displayed every 10th match).
> Then the bot worked a further 30-40 minutes to check the actual pages from 
> the live wiki, this time with namespace filtering on. (I don't replace in 
> this phase, just save the list, so no human interaction is involved in this 
> time.)
> Guess the result! 62 out of 240 remained. This means that the bigger part of 
> these 14 hours went into /dev/null.
> Now I realize how much time I wasted in the past 10 years. :-(
> 
> I am sure that passing namespaces to XmlDumpReplacePageGenerator is worth it.
> 
> -- 
> Bináris
> _______________________________________________
> pywikibot mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/pywikibot