There is another issue with that generator: it always checks whether the replacements match but does not apply them, so every replacement is effectively done twice (once during the dump scan and again on the live page), which might slow down the run too. I think we should open a Phabricator task for it.

Best
Xqt

> On 16.09.2018 at 22:03, Bináris <[email protected]> wrote:
>
> Hi folks,
>
> I still use trunk/compat for many reasons, but as far as I can see from the
> new code at
> https://github.com/wikimedia/pywikibot/blob/master/scripts/replace.py, the
> core version must suffer from the same problem.
>
> If we use -namespace for namespace filtering, class
> XmlDumpReplacePageGenerator will go through ALL pages, and only THEN is the
> result filtered by a namespace generator. This may MULTIPLY the running time
> in some cases, and it may cost hours or even days for a fix that uses
> complicated, slow regexes.
> I have just checked that the dump does contain namespace information. So why
> don't we filter during the scan?
>
> I made an experiment. I modified my copy to display the count of articles and
> the count of matching pages. The replacement was:
> (ur'(\d)\s*%', ur'\1%'),
> which seems pretty slow. :-(
> The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole dump,
> I used -xmlstart.) It went through 820 thousand pages and found 240+ matches
> (I displayed every 10th match).
> Then the bot worked a further 30-40 minutes to check the actual pages from
> the live wiki, this time with namespace filtering on. (I don't replace in
> this phase, I just save the list, so no human interaction is involved in
> this time.)
> Guess the result! 62 out of 240 remained. This means that the bigger part of
> these 14 hours went into /dev/null.
> Now I realize how much time I have wasted in the past 10 years. :-(
>
> I am sure that passing namespaces to XmlDumpReplacePageGenerator is worth it.
>
> --
> Bináris
> _______________________________________________
> pywikibot mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/pywikibot
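
For illustration, here is a rough, standalone sketch of what "filter during the scan" could look like. It is not pywikibot code and does not use pywikibot's dump reader: matching_titles, local_name and SAMPLE are made-up names, and it only assumes that each <page> in the dump carries an <ns> element, as noted in the quoted message. The point is simply that the namespace check happens before the (possibly slow) regex is run.

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{uri}' prefix ElementTree adds to namespaced tags."""
    return tag.rsplit('}', 1)[-1]


def matching_titles(dump, pattern, namespaces=None):
    """Yield titles of dump pages whose text matches pattern.

    namespaces is a set of integer namespace ids, or None for all pages.
    (Hypothetical helper, not part of pywikibot.)
    """
    regex = re.compile(pattern)
    for _event, elem in ET.iterparse(dump, events=('end',)):
        if local_name(elem.tag) != 'page':
            continue
        title = text = ''
        ns = 0
        for child in elem.iter():
            name = local_name(child.tag)
            if name == 'title':
                title = child.text or ''
            elif name == 'ns':
                ns = int(child.text)
            elif name == 'text':
                text = child.text or ''
        elem.clear()  # drop the page's children to keep memory use low
        # Cheap namespace test first, so the regex never runs on
        # pages that would be thrown away afterwards anyway.
        if namespaces is not None and ns not in namespaces:
            continue
        if regex.search(text):
            yield title


# Tiny made-up dump: two main-namespace pages and one talk page.
SAMPLE = b"""<mediawiki>
  <page><title>Foo</title><ns>0</ns><revision><text>10 %</text></revision></page>
  <page><title>Vita:Foo</title><ns>1</ns><revision><text>20 %</text></revision></page>
  <page><title>Bar</title><ns>0</ns><revision><text>no match</text></revision></page>
</mediawiki>"""

print(list(matching_titles(io.BytesIO(SAMPLE), r'(\d)\s*%', namespaces={0})))
```

With namespaces={0} the talk page is skipped before the regex is tried; with namespaces=None every page is scanned, which corresponds to the current behaviour described above.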
